Bioconductor: Difference between revisions

Revision as of 14:59, 25 January 2019

Release News

3.7 release
3.6 release
3.5 release

Github mirror

https://support.bioconductor.org/p/68824/ Announcement (Update: it is dead)

Package source

R > create Bioconductor repository

BiocManager from CRAN

The reason for using BiocManager instead of biocLite() is mostly to stop sourcing an R script from URL which isn’t so safe. So biocLite() should not be recommended anymore.

It allows to have multiple versions of Bioconductor installed on the same computer. For example, R 3.5 works with Bioconductor 3.7 and 3.8.

On the other hand, setRepositories(ind=1:4) and install.packages() still lets you install Bioconductor packages.

Hacking Bioconductor

BioCExplorer

Explore Bioconductor packages more nicely

source("https://bioconductor.org/biocLite.R")
biocLite("BiocUpgrade")
biocLite("biocViews")
devtools::install_github("seandavi/BiocPkgTools")
devtools::install_github("shians/BioCExplorer")
library(BioCExplorer)
bioc_explore()

Annotation packages

http://bioconductor.org/help/course-materials/2012/SeattleFeb2012/Annotation.pdf
https://bioconductor.org/help/course-materials/2017/CSAMA/lectures/1-monday/lecture-04-a-annotation-intro/lecture-04a-annotation-intro.html
Making and Utilizing TxDb Objects
Genomic Annotation Resources Introduction to using gene, pathway, gene ontology, homology annotations and the AnnotationHub. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.
- AnnotationHub
- OrgDb
- TxDb
- OrganismDb
- BSgenome
- biomaRt
http://genomicsclass.github.io/book/pages/bioc1_annoCheat.html
ensembldb package

Gene centric

Genomic centric

Web based

Workflow

Using Bioconductor for Sequence Data

Some packages

Biobase, GEOquery and limma

How to create an ExpressionSet object from scratch? Here we use the code from GEO2R to help to do this task.

library(Biobase)
library(GEOquery)
library(limma)

# Load series and platform data from GEO

gset <- getGEO("GSE32474", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
# save(gset, file = "~/Downloads/gse32474_gset.rda")
# load("~/Downloads/gse32474_gset.rda")
table(pData(gset)[, "cell line:ch1"])
pData(gset)

# Create an ExpressionSet object from scratch
# We take a shortcut to obtain the pheno data and feature data matrices 
# from the output of getGEO()
phenoDat <- new("AnnotatedDataFrame",
                 data=pData(gset))
featureDat <- new("AnnotatedDataFrame",
                  data=fData(gset))
exampleSet <- ExpressionSet(assayData=exprs(gset),
                            phenoData=phenoDat,
                            featureData=featureDat,
                            annotation="hgu133plus2")
gset <- exampleSet

# Make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
gsms <- paste0("00000000111111111XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
               "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
               "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
               "XXXXXXXXXXXXXXXXXXXXXXXX")
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }

# Subset an ExpressionSet by eliminating samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# Decide if it is necessary to do a log2 transformation
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 && qx[2] > 0) ||
  (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }

# Set up the data and proceed with analysis
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

# Display the result with selected columns
tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
tT[1:2, ]
#                  ID  adj.P.Val      P.Value        t        B    logFC Gene.symbol
# 209108_at 209108_at 0.08400054 4.438757e-06 6.686977 3.786222 3.949088      TSPAN6
# 204975_at 204975_at 0.08400054 6.036355e-06 6.520775 3.550036 2.919995        EMP2
#                             Gene.title
# 209108_at                 tetraspanin 6
# 204975_at epithelial membrane protein 2

Biostrings

Find the location of a particular sequence. ?vmatchPattern
https://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/BiostringsBSgenomeOverview.pdf

library(Biostrings) 
library(BSgenome.Hsapiens.UCSC.hg19) 
vmatchPattern("GCGATCGC", Hsapiens)

plyranges

http://bioconductor.org/packages/devel/bioc/vignettes/plyranges/inst/doc/an-introduction.html

Misc

Package release history

https://support.bioconductor.org/p/69657/

Search the DESCRIPTION file (eg. VariantAnnotation package) in github and the release information can be found there.

@@ Line 46: / Line 46: @@
 ** biomaRt
 * http://genomicsclass.github.io/book/pages/bioc1_annoCheat.html
+* [https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz031/5301311?rss=1 ensembldb] package
 === Gene centric ===