Bioconductor: Difference between revisions

From 太極
Jump to navigation Jump to search
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Project =
= Project =
[https://journal.r-project.org/news/RJ-2022-4-bioconductor/ Bioconductor Notes] published in The R Journal, Nov 2022.
== Release News ==
== Release News ==
* [http://bioconductor.org/news/bioc_3_9_release/ 3.9] release
* [http://bioconductor.org/news/bioc_3_9_release/ 3.9] release
Line 71: Line 73:


=== Upgrading installed Bioconductor packages ===
=== Upgrading installed Bioconductor packages ===
CAlling '''BiocManager::install()''' without any arguments will attempt to update all installed packages in your R environment. It checks for updates on Bioconductor, CRAN, and GitHub, and if newer versions of the packages are available, it will download and install them.
When I run it on Docker, it will also install '''BiocVersion''' package from Bioconductor.
<pre>
<pre>
BiocManager::install()
BiocManager::install()
Line 78: Line 83:
[https://bioconductor.org/packages/release/bioc/html/BiocVersion.html BiocVersion].  
[https://bioconductor.org/packages/release/bioc/html/BiocVersion.html BiocVersion].  


* The BiocManager::install() function in R automatically installs the BiocVersion package during its first use. The BiocVersion package provides repository information for the appropriate version of Bioconductor. It helps BiocManager determine the version of Bioconductor in use. If BiocVersion has not yet been installed, the version is determined by code in base R. This ensures that all Bioconductor packages work best when they are all from the same release. See the document [https://bioconductor.github.io/BiocManager/reference/BiocManager-package.html Install or update Bioconductor, CRAN, or GitHub packages] from '''BiocManager'''.
* This package was suggested (not a default option by install.packages()) by BiocManager and more importantly '''imported''' by AnnotationHub.
* This package was suggested (not a default option by install.packages()) by BiocManager and more importantly '''imported''' by AnnotationHub.
* Not sure the purpose of this package. It seems BiocVersion gives the patch version if we like to get that information. For example the current Bioc is 3.14 but I have Bioconductor version 3.13.1.
* It seems BiocVersion gives the patch version if we like to get that information. For example the current Bioc is 3.14 but I have Bioconductor version 3.13.1.
<pre>
<pre>
R> BiocManager::version()
R> BiocManager::version()
Line 102: Line 108:


= Package source =
= Package source =
* Bioconductor Code & [https://code.bioconductor.org/ Git Browser]
* [[R_repository|R > create Bioconductor repository]]
* [[R_repository|R > create Bioconductor repository]]
* https://bioconductor.org/developers/how-to/git/. [https://bioconductor.org/developers/how-to/git/create-local-repository/ Create a local repository for private use]  
* https://bioconductor.org/developers/how-to/git/. [https://bioconductor.org/developers/how-to/git/create-local-repository/ Create a local repository for private use]  
Line 111: Line 118:


= Resource =
= Resource =
[http://rafalab.github.io/pages/teaching.html Teaching resources], [https://rafalab.github.io/pages/harvardx.html HarvardX  Biomedical Data Science Open Online Training] from rafalab
* [http://rafalab.github.io/pages/teaching.html Teaching resources], [https://rafalab.github.io/pages/harvardx.html HarvardX  Biomedical Data Science Open Online Training] from rafalab
* [https://gallantries.github.io/video-library/modules/bioconductor Module: Bioconductor R analyses] from GTN Video Library


= [https://cran.r-project.org/web/packages/BiocManager/index.html BiocManager] from CRAN =
= [https://cran.r-project.org/web/packages/BiocManager/index.html BiocManager] from CRAN =

Latest revision as of 13:45, 17 April 2024

Project

Bioconductor Notes published in The R Journal, Nov 2022.

Release News

Annual reports

http://bioconductor.org/about/annual-reports/

Download stats

From the director of the project

useR 2019 & youtube

Community blog

https://bioconductor.github.io/biocblog/

Publications

https://www.bioconductor.org/help/publications/

Bioconductor User Support

https://support.bioconductor.org/

Bioconductor Packages: Development, Maintenance, and Peer Review

https://contributions.bioconductor.org/index.html

Submit packages

Create A Package from Bioc2020

Update to newer version

http://bioconductor.org/install/. Suppose I have version 3.11 and I like to upgrade to version 3.12.

BiocManager::install(version = "3.12") # Upgrade all my installed packages
BiocManager::install(ask = FALSE) # To use the latest version of Bioconductor for your version of R

BiocManager::version()
# [1] ‘3.12’

Check if my current Bioconductor version is outdated

Below is an example that shows I am using an outdated Bioconductor.

require(BiocManager)
# Loading required package: BiocManager
# Bioconductor version 3.15 (BiocManager 1.30.20), R 4.2.2
#   Patched (2022-11-10 r83330)
# Bioconductor version '3.15' is out-of-date; the current
#   release version '3.16' is available with R version
#   '4.2'; see https://bioconductor.org/install

Upgrade to the latest Bioconductor version

Using Bioconductor

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.12") # the latest version can be obtained from the 
                                       # output of require(BiocManager) OR
                                       # checking the Bioconductor website 
# Upgrade 53 packages to Bioconductor version '3.12'? [y/n]: y

Upgrading installed Bioconductor packages

CAlling BiocManager::install() without any arguments will attempt to update all installed packages in your R environment. It checks for updates on Bioconductor, CRAN, and GitHub, and if newer versions of the packages are available, it will download and install them.

When I run it on Docker, it will also install BiocVersion package from Bioconductor.

BiocManager::install()

BiocVersion

BiocVersion.

  • The BiocManager::install() function in R automatically installs the BiocVersion package during its first use. The BiocVersion package provides repository information for the appropriate version of Bioconductor. It helps BiocManager determine the version of Bioconductor in use. If BiocVersion has not yet been installed, the version is determined by code in base R. This ensures that all Bioconductor packages work best when they are all from the same release. See the document Install or update Bioconductor, CRAN, or GitHub packages from BiocManager.
  • This package was suggested (not a default option by install.packages()) by BiocManager and more importantly imported by AnnotationHub.
  • It seems BiocVersion gives the patch version if we like to get that information. For example the current Bioc is 3.14 but I have Bioconductor version 3.13.1.
R> BiocManager::version()
[1] ‘3.13’
R> packageVersion("BiocVersion")
[1] ‘3.13.1’

Devel version of Bioconductor

BiocManager::install(version = "devel")

Mirrors

https://www.bioconductor.org/about/mirrors/

Github mirror

Japan

http://bioconductor.jp

Package source

Code search

http://search.bioconductor.jp/

code.bioconductor.org

https://code.bioconductor.org/

Resource

BiocManager from CRAN

The reason for using BiocManager instead of biocLite() is mostly to stop sourcing an R script from URL which isn’t so safe. So biocLite() should not be recommended anymore.

It allows to have multiple versions of Bioconductor installed on the same computer. For example, R 3.5 works with Bioconductor 3.7 and 3.8.

On the other hand, setRepositories(ind=1:4) and install.packages() still lets you install Bioconductor packages.

Hacking Bioconductor

BiocPkgTools

BioCExplorer

Explore Bioconductor packages more nicely

source("https://bioconductor.org/biocLite.R")
biocLite("BiocUpgrade")
biocLite("biocViews")
devtools::install_github("seandavi/BiocPkgTools")
devtools::install_github("shians/BioCExplorer")
library(BioCExplorer)
bioc_explore()

BiocExplorer.png

BiocViews

  • Software
  • AnnotationData
    • ChipManufacturer
    • ChipName
    • CustomArray
    • CustomCDF
    • CustomDBSchema
    • FunctionalAnnotation
    • Organism
    • PackageType
    • SequenceAnnotation
  • ExperimentData
    • AssayDomainData
    • DiseaseModel
    • OrganismData
    • PackageTypeData
    • RepositoryData
    • ReproducibleResearch
    • SpecimenSource
    • TechnologyData
  • Workflow
    • AnnotationWorkflow
    • BasicWorkflow
    • DifferentialSplicingWorkflow
    • EpigeneticsWorkflow
    • GeneExpressionWorkflow
    • GenomicVariantsWorkflow
    • ImmunoOncologyWorkflow
    • ProteomicsWorkflow
    • ResourceQueryingWorkflow
    • SingleCellWorkflow

Annotation packages

  • Introduction to Bioconductor annotation resources 2020 Bioc workshop
    • Package type Example
      ChipDb hgu133plus2.db
      OrgDb org.Hs.eg.db
      TxDb/EnsDb TxDb.Hsapiens.UCSC.hg19.knownGene; EnsDb.Hsapiens.v75
      OrganismDb Homo.sapiens
      BSgenome BSgenome.Hsapiens.UCSC.hg19
      Others GO.db; KEGG.db
      biomaRt
    • Database concept was used. So there are some terms like keys, select, ..., borrowed from there.
    • How do you know what the central keys are? If it's ChipDb (e.g. hugene20sttranscriptcluster.db), the central key are the manufactorer's probe IDs. If it's in the name - org.Hs.eg.db, where 'eg' means Entrez gene ID. You can see examples using e.g., head(keys(annopkg)), and infer from that.
    • Functions (select, mapIds, transcripts accessors work on ChipDb, TxDb, OrgDb, ...). See the blog Bioconductor annotation packages
      AnnotationDbi::keys(hgu133plus2.db, keytype="PROBEID")
      AnnotationDbi::keytypes(hgu133plus2.db)  # columns(hgu133plus2.db)
      AnnotationDbi::select(annopkg, keys, columns, keytype): keytype is optional
      AnnotationDbi::mapIds(annopkg, keys, column, keytype, multiVals): 
                     keytype is required, multiVals is used to handle duplicates.
      
    • Packages
      TxDb packages - eg TxDb.Hsapiens.UCSC.hg19.knownGene. It contains positional information.
      EnsDb packages - eg EnsDb.Hsapiens.v79
      
    • Examples: ChipDb
      library(hgu133plus2.db)  # Loading required package: org.Hs.eg.db
      keytypes(hgu133plus2.db)
      # [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
      # [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
      #[11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
      #[16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
      #[21] "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
      #[26] "UNIGENE"      "UNIPROT"  
      ls("package:hgu133plus2.db")
      
      # Get all probe IDs
      length(hgu133plus2ACCNUM)
      # [1] 54675
      x <- hgu133plus2ACCNUM 
      # hgu133plus2ACCNUM is an R object that contains mappings between 
      # a manufacturer’s identifiers and manufacturers accessions.
      mapped_probes <- mappedkeys(x)
      str(mapped_probes)
      # chr [1:54675] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" ...
      length(unique(mapped_probes))
      # [1] 54675
      
      length(hgu133plus2ENTREZID)
      # [1] 54675
      x <- hgu133plus2ENTREZID
      mapped_probes <- mappedkeys(x)
      str(mapped_probes)
      # chr [1:41905] "1053_at" "117_at" "121_at" "1255_g_at" "1316_at" "1320_at" ...
      length(unique(mapped_probes))
      # [1] 41905
      
      # Using select() to map from probe ID to others
      # Note select() will return 1:many mapping
      library(hugene20sttranscriptcluster.db)
      ids <- c('16737401','16657436' ,'16678303')
      select(hugene20sttranscriptcluster.db, ids, c("SYMBOL","ALIAS", "ENSEMBL", "ENTREZID"))
      ## 'select()' returned 1:many mapping between keys and columns
      #     PROBEID   SYMBOL    ALIAS         ENSEMBL  ENTREZID
      # 1  16737401    TRAF6 MGC:3310 ENSG00000175104      7189
      # 2  16737401    TRAF6    RNF85 ENSG00000175104      7189
      # 3  16737401    TRAF6    TRAF6 ENSG00000175104      7189
      # 4  16657436  DDX11L1  DDX11L1 ENSG00000223972 100287102
      # 5  16657436 DDX11L17 DDX11L17 ENSG00000223972 102725121
      # 6  16657436  DDX11L2  DDX11L2 ENSG00000223972     84771
      # 7  16657436  DDX11L9  DDX11L9 ENSG00000248472 100288486
      # 8  16657436  DDX11L9  DDX11L9 ENSG00000288170 100288486
      # 9  16657436 DDX11L10  DDX11L1 ENSG00000233614 100287029
      # 10 16657436 DDX11L10   DDX11P ENSG00000233614 100287029
      # 11 16657436 DDX11L10 DDX11L10 ENSG00000233614 100287029
      # 12 16657436  DDX11L5  DDX11L5 ENSG00000223972 100287596
      # 13 16657436  DDX11L5  DDX11L5 ENSG00000248472 100287596
      # 14 16657436  DDX11L5  DDX11L5 ENSG00000288170 100287596
      # 15 16657436 DDX11L16 DDX11L16 ENSG00000223972    727856
      # 16 16657436 DDX11L16 DDX11L16 ENSG00000248472    727856
      # 17 16657436 DDX11L16 DDX11L16 ENSG00000233614    727856
      # 18 16657436 DDX11L16 DDX11L16 ENSG00000288170    727856
      # 19 16678303     ARF1    PVNH8 ENSG00000143761       375
      # 20 16678303     ARF1     ARF1 ENSG00000143761       375
      
      keytypes(hugene20sttranscriptcluster.db)
      ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
      ##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
      ## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
      ## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
      ## [21] "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
      ## [26] "UNIGENE"      "UNIPROT"
      
      # mapIds
      # Note that an additional argument, multiVals can be used to control duplicates
      mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID") # only select the 1st one
      ##  16737401  16657436  16678303 
      ##   "TRAF6" "DDX11L1"    "ARF1"
      mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "list")  # list
      mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "CharacterList") # IRanges
      mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "filter") # character vector
      
      # OrganismDb packages
      library(Homo.sapiens)
      Homo.sapiens
      head(genes(Homo.sapiens, columns = c("ENTREZID", "ALIAS", "UNIPROT")), 4)
      

Gene centric

library(hgu133a.db)
library(AnnotationDbi)
k <- head(keys(hgu133a.db, keytype="PROBEID"))
k
# [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"
# then call select
select(hgu133a.db, keys=k, columns=c("SYMBOL","GENENAME"), keytype="PROBEID")

# 'select()' returned 1:many mapping between keys and columns
#     PROBEID  SYMBOL                                     GENENAME
# 1 1007_s_at    DDR1  discoidin domain receptor tyrosine kinase 1
# 2 1007_s_at MIR4640                                microRNA 4640
# 3   1053_at    RFC2               replication factor C subunit 2
# 4    117_at   HSPA6 heat shock protein family A (Hsp70) member 6
# 5    121_at    PAX8                                 paired box 8
# 6 1255_g_at  GUCA1A               guanylate cyclase activator 1A
# 7   1294_at    UBA7  ubiquitin like modifier activating enzyme 7
# 8   1294_at MIR5193                                microRNA 5193

ExpressionSets objects

For example: load(file.path(system.file(package = "Bioc2020Anno", "extdata"), "eset.Rdata")

  • exprs(eset)
  • pData(phenoData(eset))
  • pData(featureData(eset))

Genomic centric

  • GRanges()
  • GrangesList()

GRanges() and GRangesLists act like data.frames and lists, and can be subsetted using the [ function.

SummarizedExperiment objects

SummarizedExperiment objects are like ExpressionSets, but the row-wise annotations are GRanges, so you can subset by genomic locations.

SummarizedExperiment objects are like ExpressionSets, but the row-wise annotations are GRanges, so you can subset by genomic locations.

SummarizedExperiment objects are popular objects for representing expression data and other rectangular data (feature x sample data).

  • array() or assays(): genes x samples
  • colData(): samples x sample features
  • rowData(): genes x gene features. The gene features could be result from running analyses; see EnrichmentBrowser vignette.

Examples: GSEABenchmarkeR package.

biomaRt

library(biomaRt)
listMarts(host = "useast.ensembl.org")

# biomaRt data sets
mart <- useEnsembl("ensembl")
head(listDatasets(mart))

# biomaRt queries
mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")

# biomaRt attributes and filters
atrib <- listAttributes(mart)
filts <- listFilters(mart)
head(atrib)
head(filts)

# biomaRt query
afyids <- c("1000_at","1001_at","1002_f_at","1007_s_at")
getBM(c("affy_hg_u95av2", "hgnc_symbol"), c("affy_hg_u95av2"), afyids, mart)

AnnotationHub

library(AnnotationHub)
hub <- AnnotationHub()
hub
# Querying AnnotationHub
names(mcols(hub))

# AnnotationHub Data providers
unique(hub$dataprovider)

# AnnotationHub Data classes
unique(hub$rdataclass)

# AnnotationHub Species
head(unique(hub$species))
length(unique(hub$species))

# AnnotationHub Data sources
unique(hub$sourcetype)

# AnnotationHub query
qry <- query(hub, c("granges","homo sapiens","ensembl"))
qry
qry$sourceurl

# Selecting AnnotationHub resource
whatIwant <- qry[["AH80077"]]
GRCh38TxDb <- makeTxDbFromGRanges(whatIwant)
GRCh38TxDb

Map Ensembl IDs to Entrez IDs

See the vignette from SingscoreAMLMutations.

Multimapped Ensembl IDs are replaced by NAs, then discarded to enforce unique mapping. Similarly, Entrez IDs that map to multiple Ensembl IDs are identified from the mapping, and discarded. Only features with unique Ensembl ID to Entrez ID mappings remain.

Gene name errors from Excel

Gene name errors from Excel. Gene names such as MAR1, DEC1, OCT4 and SEPT9 are now reformatted as dates.

Example

Meet NOTCH2NL, the human-specific genes that may have given us our big brains

library(hgu133plus2.db)
select(hgu133plus2.db, keys="NOTCH2NL", columns=c("SYMBOL","GENENAME"), keytype="SYMBOL")
# Error in .testForValidKeys(x, keys, keytype, fks) : 
#   None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method 
#   to see a listing of valid arguments.
k <- keys(hgu133plus2.db, keytype="SYMBOL")
k <- sort(k)
grep("NOTCH", k)
# [1] 41224 41225 41226 41227 41228 41229 41230 41231 41232
k[grep("NOTCH", k) ]
[1] "NOTCH1"    "NOTCH2"    "NOTCH2NLA" "NOTCH2NLB" "NOTCH2NLC" "NOTCH2NLR" "NOTCH2P1" 
[8] "NOTCH3"    "NOTCH4" 

Workflow

Using Bioconductor for Sequence Data

CyTOF

CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets

Some packages

Biobase, GEOquery and limma

How to create an ExpressionSet object from scratch? Here we use the code from GEO2R to help to do this task.

library(Biobase)
library(GEOquery)
library(limma)

# Load series and platform data from GEO

gset <- getGEO("GSE32474", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
# save(gset, file = "~/Downloads/gse32474_gset.rda")
# load("~/Downloads/gse32474_gset.rda")
table(pData(gset)[, "cell line:ch1"])
pData(gset)

# Create an ExpressionSet object from scratch
# We take a shortcut to obtain the pheno data and feature data matrices 
# from the output of getGEO()
phenoDat <- new("AnnotatedDataFrame",
                 data=pData(gset))
featureDat <- new("AnnotatedDataFrame",
                  data=fData(gset))
exampleSet <- ExpressionSet(assayData=exprs(gset),
                            phenoData=phenoDat,
                            featureData=featureDat,
                            annotation="hgu133plus2")
gset <- exampleSet

# Make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
gsms <- paste0("00000000111111111XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
               "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
               "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
               "XXXXXXXXXXXXXXXXXXXXXXXX")
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }

# Subset an ExpressionSet by eliminating samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# Decide if it is necessary to do a log2 transformation
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 && qx[2] > 0) ||
  (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }

# Set up the data and proceed with analysis
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

# Display the result with selected columns
tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
tT[1:2, ]
#                  ID  adj.P.Val      P.Value        t        B    logFC Gene.symbol
# 209108_at 209108_at 0.08400054 4.438757e-06 6.686977 3.786222 3.949088      TSPAN6
# 204975_at 204975_at 0.08400054 6.036355e-06 6.520775 3.550036 2.919995        EMP2
#                             Gene.title
# 209108_at                 tetraspanin 6
# 204975_at epithelial membrane protein 2

Biostrings

library(Biostrings) 
library(BSgenome.Hsapiens.UCSC.hg19) 
vmatchPattern("GCGATCGC", Hsapiens)

plyranges

http://bioconductor.org/packages/devel/bioc/vignettes/plyranges/inst/doc/an-introduction.html

Misc

Package release history

https://support.bioconductor.org/p/69657/

Search the DESCRIPTION file (eg. VariantAnnotation package) in github and the release information can be found there.

Papers/Overview

Cloud

Bioconductor On The Cloud

Conferences