Bioconductor: Difference between revisions
(70 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
== [https:// | = Project = | ||
[https://journal.r-project.org/news/RJ-2022-4-bioconductor/ Bioconductor Notes] published in The R Journal, Nov 2022. | |||
== [http:// | == Release News == | ||
* [http://bioconductor.org/news/bioc_3_9_release/ 3.9] release | |||
* [http://bioconductor.org/news/bioc_3_8_release/ 3.8] release | |||
* [http://bioconductor.org/news/bioc_3_7_release/ 3.7] release | |||
* [https://bioconductor.org/news/bioc_3_6_release/ 3.6] release | |||
* [https://bioconductor.org/news/bioc_3_5_release/ 3.5] release | |||
== Annual reports == | |||
http://bioconductor.org/about/annual-reports/ | |||
== Download stats == | |||
* See the overview vignette of [http://bioconductor.org/packages/release/bioc/html/BiocPkgTools.html BiocPkgTools] | |||
* Download stats for Bioconductor | |||
** [http://bioconductor.org/packages/stats/ Download stats for Bioconductor software packages] | |||
* bioconductor.riken.jp mirror in Japan | |||
** [https://bioconductor.riken.jp/packages/stats/index.html Download stats for Bioconductor software packages] | |||
** [http://bioconductor.riken.jp/packages/stats/data-annotation.html Download stats for Bioconductor annotation packages] | |||
* [https://github.com/lgatto/biocpkgs biocpkg] package in github | |||
== From the director of the project == | |||
[https://t.co/rfKF3ABFJp?amp=1 useR 2019] & [https://youtu.be/YEQ5xFewbdA?t=891 youtube] | |||
== Community blog == | |||
https://bioconductor.github.io/biocblog/ | |||
== Publications == | |||
https://www.bioconductor.org/help/publications/ | |||
== Bioconductor User Support == | |||
https://support.bioconductor.org/ | |||
== Bioconductor Packages: Development, Maintenance, and Peer Review == | |||
https://contributions.bioconductor.org/index.html | |||
== Submit packages == | |||
[https://kayla-morrell.github.io/CreateAPackage/ Create A Package] from [https://bioc2020.bioconductor.org/workshops.html Bioc2020] | |||
== Update to newer version == | |||
http://bioconductor.org/install/. Suppose I have version 3.11 and I like to upgrade to version 3.12. | |||
<pre> | <pre> | ||
BiocManager::install(version = "3.12") # Upgrade all my installed packages | |||
BiocManager::install(ask = FALSE) # To use the latest version of Bioconductor for your version of R | |||
BiocManager::version() | |||
# [1] ‘3.12’ | |||
</pre> | </pre> | ||
=== Check if my current Bioconductor version is outdated === | |||
Below is an example that shows I am using an outdated Bioconductor. | |||
<syntaxhighlight lang='rsplus'> | |||
require(BiocManager) | |||
# Loading required package: BiocManager | |||
# Bioconductor version 3.15 (BiocManager 1.30.20), R 4.2.2 | |||
# Patched (2022-11-10 r83330) | |||
# Bioconductor version '3.15' is out-of-date; the current | |||
# release version '3.16' is available with R version | |||
# '4.2'; see https://bioconductor.org/install | |||
</syntaxhighlight> | |||
=== Upgrade to the latest Bioconductor version === | |||
[https://www.bioconductor.org/install/ Using Bioconductor] | |||
<pre> | <pre> | ||
if (!require("BiocManager", quietly = TRUE)) | |||
install.packages("BiocManager") | |||
BiocManager::install(version = "3.12") # the latest version can be obtained from the | |||
# output of require(BiocManager) OR | |||
# checking the Bioconductor website | |||
# Upgrade 53 packages to Bioconductor version '3.12'? [y/n]: y | |||
</pre> | </pre> | ||
and the | |||
=== Upgrading installed Bioconductor packages === | |||
CAlling '''BiocManager::install()''' without any arguments will attempt to update all installed packages in your R environment. It checks for updates on Bioconductor, CRAN, and GitHub, and if newer versions of the packages are available, it will download and install them. | |||
When I run it on Docker, it will also install '''BiocVersion''' package from Bioconductor. | |||
<pre> | <pre> | ||
BiocManager::install() | |||
</pre> | </pre> | ||
=== BiocVersion === | |||
[https://bioconductor.org/packages/release/bioc/html/BiocVersion.html BiocVersion]. | |||
* The BiocManager::install() function in R automatically installs the BiocVersion package during its first use. The BiocVersion package provides repository information for the appropriate version of Bioconductor. It helps BiocManager determine the version of Bioconductor in use. If BiocVersion has not yet been installed, the version is determined by code in base R. This ensures that all Bioconductor packages work best when they are all from the same release. See the document [https://bioconductor.github.io/BiocManager/reference/BiocManager-package.html Install or update Bioconductor, CRAN, or GitHub packages] from '''BiocManager'''. | |||
* This package was suggested (not a default option by install.packages()) by BiocManager and more importantly '''imported''' by AnnotationHub. | |||
* It seems BiocVersion gives the patch version if we like to get that information. For example the current Bioc is 3.14 but I have Bioconductor version 3.13.1. | |||
<pre> | <pre> | ||
R> BiocManager::version() | |||
[1] ‘3.13’ | |||
R> packageVersion("BiocVersion") | |||
[1] ‘3.13.1’ | |||
</pre> | </pre> | ||
== Devel version of Bioconductor == | |||
<pre> | <pre> | ||
BiocManager::install(version = "devel") | |||
</pre> | </pre> | ||
== [https://github.com/Shians/BioCExplorer BioCExplorer] | = Mirrors = | ||
https://www.bioconductor.org/about/mirrors/ | |||
== [https://github.com/Bioconductor-mirror Github mirror] == | |||
* https://support.bioconductor.org/p/68824/ Announcement (Update: it is dead) | |||
== Japan == | |||
http://bioconductor.jp | |||
= Package source = | |||
* Bioconductor Code & [https://code.bioconductor.org/ Git Browser] | |||
* [[R_repository|R > create Bioconductor repository]] | |||
* https://bioconductor.org/developers/how-to/git/. [https://bioconductor.org/developers/how-to/git/create-local-repository/ Create a local repository for private use] | |||
== Code search == | |||
http://search.bioconductor.jp/ | |||
== code.bioconductor.org == | |||
https://code.bioconductor.org/ | |||
= Resource = | |||
* [http://rafalab.github.io/pages/teaching.html Teaching resources], [https://rafalab.github.io/pages/harvardx.html HarvardX Biomedical Data Science Open Online Training] from rafalab | |||
* [https://gallantries.github.io/video-library/modules/bioconductor Module: Bioconductor R analyses] from GTN Video Library | |||
= [https://cran.r-project.org/web/packages/BiocManager/index.html BiocManager] from CRAN = | |||
The reason for using BiocManager instead of biocLite() is mostly to stop sourcing an R script from URL which isn’t so safe. So biocLite() should not be recommended anymore. | |||
It allows to have multiple versions of Bioconductor installed on the same computer. For example, R 3.5 works with Bioconductor 3.7 and 3.8. | |||
On the other hand, setRepositories(ind=1:4) and install.packages() still lets you install Bioconductor packages. | |||
[https://www.jumpingrivers.com/blog/security-r-hacking-bioconductor/ Hacking Bioconductor] | |||
= BiocPkgTools = | |||
* [http://bioconductor.org/packages/release/bioc/html/BiocPkgTools.html BiocPkgTools]: Collection of simple tools for learning about Bioc Packages | |||
* https://www.biorxiv.org/content/10.1101/642132v1 | |||
* [https://kevinrue.github.io/post/biocpkgtools/ GitHub Action executed as a CRON job to update the website daily]. Note that the 'bioc' version needs to be updated once a new version of Bioconductor has been released. | |||
= [https://github.com/Shians/BioCExplorer BioCExplorer] = | |||
Explore Bioconductor packages more nicely | Explore Bioconductor packages more nicely | ||
Line 50: | Line 148: | ||
</pre> | </pre> | ||
== Annotation packages | [[File:BiocExplorer.png|350px]] | ||
= [http://www.bioconductor.org/packages/release/BiocViews.html BiocViews] = | |||
* Software | |||
** AssayDomain | |||
** [https://bioconductor.org/packages/release/BiocViews.html#___BiologicalQuestion BiologicalQuestion] | |||
** [https://bioconductor.org/packages/release/BiocViews.html#___Infrastructure Infrastructure] | |||
** [https://bioconductor.org/packages/release/BiocViews.html#___ResearchField ResearchField] | |||
** [https://bioconductor.org/packages/release/BiocViews.html#___StatisticalMethod StatisticalMethod] | |||
** [https://bioconductor.org/packages/release/BiocViews.html#___Technology Technology] | |||
** [https://bioconductor.org/packages/release/BiocViews.html#___WorkflowStep WorkflowStep] | |||
* AnnotationData | |||
** ChipManufacturer | |||
** ChipName | |||
** CustomArray | |||
** CustomCDF | |||
** CustomDBSchema | |||
** FunctionalAnnotation | |||
** Organism | |||
** PackageType | |||
** SequenceAnnotation | |||
* ExperimentData | |||
** AssayDomainData | |||
** DiseaseModel | |||
** OrganismData | |||
** PackageTypeData | |||
** RepositoryData | |||
** ReproducibleResearch | |||
** SpecimenSource | |||
** TechnologyData | |||
* Workflow | |||
** AnnotationWorkflow | |||
** BasicWorkflow | |||
** DifferentialSplicingWorkflow | |||
** EpigeneticsWorkflow | |||
** GeneExpressionWorkflow | |||
** GenomicVariantsWorkflow | |||
** ImmunoOncologyWorkflow | |||
** ProteomicsWorkflow | |||
** ResourceQueryingWorkflow | |||
** [http://www.bioconductor.org/packages/release/workflows/html/simpleSingleCell.html SingleCellWorkflow] | |||
= Annotation packages = | |||
* http://bioconductor.org/help/course-materials/2012/SeattleFeb2012/Annotation.pdf | * http://bioconductor.org/help/course-materials/2012/SeattleFeb2012/Annotation.pdf | ||
* https://bioconductor.org/help/course-materials/2017/CSAMA/lectures/1-monday/lecture-04-a-annotation-intro/lecture-04a-annotation-intro.html | * https://bioconductor.org/help/course-materials/2017/CSAMA/lectures/1-monday/lecture-04-a-annotation-intro/lecture-04a-annotation-intro.html | ||
* [http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf Making and Utilizing TxDb Objects] | * [http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf Making and Utilizing TxDb Objects] | ||
* [https:// | * [https://bioconductor.org/packages/release/workflows/html/annotation.html Genomic Annotation Resources] (2018) Introduction to using gene, pathway, gene ontology, homology annotations and the AnnotationHub. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources. | ||
** AnnotationHub | ** AnnotationHub allows us to query and download many different annotation objects without having to explicitly install them. | ||
** OrgDb | ** OrgDb | ||
** TxDb | ** TxDb | ||
** OrganismDb | ** OrganismDb packages are meta-packages that contain an OrgDb, a TxDb, and a GO.db packages and allow cross-queries between those packages. | ||
** BSgenome | ** BSgenome packages contain sequence information for a given species/build. | ||
** biomaRt | ** biomaRt allows queries to an Ensembl Biomart server. | ||
* http://genomicsclass.github.io/book/pages/bioc1_annoCheat.html | * http://genomicsclass.github.io/book/pages/bioc1_annoCheat.html | ||
* [https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz031/5301311?rss=1 ensembldb] package | |||
<ul> | |||
<li>[https://github.com/jmacdon/Bioc2020Anno Introduction to Bioconductor annotation resources] 2020 Bioc workshop | |||
<ul> | |||
<li> | |||
{| class="wikitable" | |||
|- | |||
! Package type | |||
! Example | |||
|- | |||
| ChipDb | |||
| [https://www.bioconductor.org/packages/release/data/annotation/html/hgu133plus2.db.html hgu133plus2.db] | |||
|- | |||
| OrgDb | |||
| org.Hs.eg.db | |||
|- | |||
| TxDb/EnsDb | |||
| TxDb.Hsapiens.UCSC.hg19.knownGene; EnsDb.Hsapiens.v75 | |||
|- | |||
| OrganismDb | |||
| Homo.sapiens | |||
|- | |||
| BSgenome | |||
| BSgenome.Hsapiens.UCSC.hg19 | |||
|- | |||
| Others | |||
| GO.db; KEGG.db | |||
|- | |||
| biomaRt | |||
| | |||
|} | |||
</li> | |||
<li>Database concept was used. So there are some terms like '''keys''', '''select''', ..., borrowed from there. </li> | |||
<li>How do you know what the central '''keys''' are? If it's ChipDb (e.g. hugene20sttranscriptcluster.db), the central key are the manufactorer's probe IDs. If it's in the name - org.Hs.eg.db, where 'eg' means '''Entrez''' gene ID. You can see examples using e.g., head(keys(annopkg)), and infer from that. | |||
</li> | |||
<li>Functions (select, mapIds, transcripts accessors work on ChipDb, TxDb, OrgDb, ...). See the blog [https://davetang.org/muse/2013/12/16/bioconductor-annotation-packages/ Bioconductor annotation packages] | |||
{{Pre}} | |||
AnnotationDbi::keys(hgu133plus2.db, keytype="PROBEID") | |||
AnnotationDbi::keytypes(hgu133plus2.db) # columns(hgu133plus2.db) | |||
AnnotationDbi::select(annopkg, keys, columns, keytype): keytype is optional | |||
AnnotationDbi::mapIds(annopkg, keys, column, keytype, multiVals): | |||
keytype is required, multiVals is used to handle duplicates. | |||
</pre> | |||
</li> | |||
<li>Packages | |||
<pre> | |||
TxDb packages - eg TxDb.Hsapiens.UCSC.hg19.knownGene. It contains positional information. | |||
EnsDb packages - eg EnsDb.Hsapiens.v79 | |||
</pre> | |||
</li> | |||
<li>Examples: ChipDb | |||
{{Pre}} | |||
library(hgu133plus2.db) # Loading required package: org.Hs.eg.db | |||
keytypes(hgu133plus2.db) | |||
# [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" | |||
# [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" | |||
#[11] "GO" "GOALL" "IPI" "MAP" "OMIM" | |||
#[16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID" | |||
#[21] "PROBEID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" | |||
#[26] "UNIGENE" "UNIPROT" | |||
ls("package:hgu133plus2.db") | |||
# Get all probe IDs | |||
length(hgu133plus2ACCNUM) | |||
# [1] 54675 | |||
x <- hgu133plus2ACCNUM | |||
# hgu133plus2ACCNUM is an R object that contains mappings between | |||
# a manufacturer’s identifiers and manufacturers accessions. | |||
mapped_probes <- mappedkeys(x) | |||
str(mapped_probes) | |||
# chr [1:54675] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" ... | |||
length(unique(mapped_probes)) | |||
# [1] 54675 | |||
length(hgu133plus2ENTREZID) | |||
# [1] 54675 | |||
x <- hgu133plus2ENTREZID | |||
mapped_probes <- mappedkeys(x) | |||
str(mapped_probes) | |||
# chr [1:41905] "1053_at" "117_at" "121_at" "1255_g_at" "1316_at" "1320_at" ... | |||
length(unique(mapped_probes)) | |||
# [1] 41905 | |||
# Using select() to map from probe ID to others | |||
# Note select() will return 1:many mapping | |||
library(hugene20sttranscriptcluster.db) | |||
ids <- c('16737401','16657436' ,'16678303') | |||
select(hugene20sttranscriptcluster.db, ids, c("SYMBOL","ALIAS", "ENSEMBL", "ENTREZID")) | |||
## 'select()' returned 1:many mapping between keys and columns | |||
# PROBEID SYMBOL ALIAS ENSEMBL ENTREZID | |||
# 1 16737401 TRAF6 MGC:3310 ENSG00000175104 7189 | |||
# 2 16737401 TRAF6 RNF85 ENSG00000175104 7189 | |||
# 3 16737401 TRAF6 TRAF6 ENSG00000175104 7189 | |||
# 4 16657436 DDX11L1 DDX11L1 ENSG00000223972 100287102 | |||
# 5 16657436 DDX11L17 DDX11L17 ENSG00000223972 102725121 | |||
# 6 16657436 DDX11L2 DDX11L2 ENSG00000223972 84771 | |||
# 7 16657436 DDX11L9 DDX11L9 ENSG00000248472 100288486 | |||
# 8 16657436 DDX11L9 DDX11L9 ENSG00000288170 100288486 | |||
# 9 16657436 DDX11L10 DDX11L1 ENSG00000233614 100287029 | |||
# 10 16657436 DDX11L10 DDX11P ENSG00000233614 100287029 | |||
# 11 16657436 DDX11L10 DDX11L10 ENSG00000233614 100287029 | |||
# 12 16657436 DDX11L5 DDX11L5 ENSG00000223972 100287596 | |||
# 13 16657436 DDX11L5 DDX11L5 ENSG00000248472 100287596 | |||
# 14 16657436 DDX11L5 DDX11L5 ENSG00000288170 100287596 | |||
# 15 16657436 DDX11L16 DDX11L16 ENSG00000223972 727856 | |||
# 16 16657436 DDX11L16 DDX11L16 ENSG00000248472 727856 | |||
# 17 16657436 DDX11L16 DDX11L16 ENSG00000233614 727856 | |||
# 18 16657436 DDX11L16 DDX11L16 ENSG00000288170 727856 | |||
# 19 16678303 ARF1 PVNH8 ENSG00000143761 375 | |||
# 20 16678303 ARF1 ARF1 ENSG00000143761 375 | |||
keytypes(hugene20sttranscriptcluster.db) | |||
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" | |||
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" | |||
## [11] "GO" "GOALL" "IPI" "MAP" "OMIM" | |||
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID" | |||
## [21] "PROBEID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" | |||
## [26] "UNIGENE" "UNIPROT" | |||
# mapIds | |||
# Note that an additional argument, multiVals can be used to control duplicates | |||
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID") # only select the 1st one | |||
## 16737401 16657436 16678303 | |||
## "TRAF6" "DDX11L1" "ARF1" | |||
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "list") # list | |||
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "CharacterList") # IRanges | |||
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "filter") # character vector | |||
# OrganismDb packages | |||
library(Homo.sapiens) | |||
Homo.sapiens | |||
head(genes(Homo.sapiens, columns = c("ENTREZID", "ALIAS", "UNIPROT")), 4) | |||
</pre> | |||
<li> | |||
</ul> | |||
</ul> | |||
== Gene centric == | |||
* [http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf#page=5 AnnotationDbi]: Introduction To Bioconductor Annotation Packages | |||
<syntaxhighlight lang='rsplus'> | |||
library(hgu133a.db) | |||
library(AnnotationDbi) | |||
k <- head(keys(hgu133a.db, keytype="PROBEID")) | |||
k | |||
# [1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" "1294_at" | |||
# then call select | |||
select(hgu133a.db, keys=k, columns=c("SYMBOL","GENENAME"), keytype="PROBEID") | |||
# 'select()' returned 1:many mapping between keys and columns | |||
# PROBEID SYMBOL GENENAME | |||
# 1 1007_s_at DDR1 discoidin domain receptor tyrosine kinase 1 | |||
# 2 1007_s_at MIR4640 microRNA 4640 | |||
# 3 1053_at RFC2 replication factor C subunit 2 | |||
# 4 117_at HSPA6 heat shock protein family A (Hsp70) member 6 | |||
# 5 121_at PAX8 paired box 8 | |||
# 6 1255_g_at GUCA1A guanylate cyclase activator 1A | |||
# 7 1294_at UBA7 ubiquitin like modifier activating enzyme 7 | |||
# 8 1294_at MIR5193 microRNA 5193 | |||
</syntaxhighlight> | |||
== ExpressionSets objects == | |||
For example: load(file.path(system.file(package = "Bioc2020Anno", "extdata"), "eset.Rdata") | |||
* exprs(eset) | |||
* pData(phenoData(eset)) | |||
* pData(featureData(eset)) | |||
== Genomic centric == | |||
* GRanges() | |||
* GrangesList() | |||
GRanges() and GRangesLists act like data.frames and lists, and can be subsetted using the '''[''' function. | |||
== SummarizedExperiment objects == | |||
SummarizedExperiment objects are like ExpressionSets, but the row-wise annotations are GRanges, so you can subset by genomic locations. | |||
[https://rdrr.io/bioc/SummarizedExperiment/man/SummarizedExperiment-class.html SummarizedExperiment] objects are like '''ExpressionSets''', but the row-wise annotations are GRanges, so you can subset by genomic locations. | |||
SummarizedExperiment objects are popular objects for representing expression data and other rectangular data (feature x sample data). | |||
* array() or assays(): genes x samples | |||
* colData(): samples x sample features | |||
* rowData(): genes x gene features. The gene features could be result from running analyses; see [http://www.bioconductor.org/packages/release/bioc/vignettes/EnrichmentBrowser/inst/doc/EnrichmentBrowser.pdf EnrichmentBrowser] vignette. | |||
Examples: [https://bioconductor.org/packages/release/bioc/vignettes/GSEABenchmarkeR/inst/doc/GSEABenchmarkeR.html GSEABenchmarkeR] package. | |||
== biomaRt == | |||
<pre> | |||
library(biomaRt) | |||
listMarts(host = "useast.ensembl.org") | |||
# biomaRt data sets | |||
mart <- useEnsembl("ensembl") | |||
head(listDatasets(mart)) | |||
# biomaRt queries | |||
mart <- useEnsembl("ensembl","hsapiens_gene_ensembl") | |||
# biomaRt attributes and filters | |||
atrib <- listAttributes(mart) | |||
filts <- listFilters(mart) | |||
head(atrib) | |||
head(filts) | |||
# biomaRt query | |||
afyids <- c("1000_at","1001_at","1002_f_at","1007_s_at") | |||
getBM(c("affy_hg_u95av2", "hgnc_symbol"), c("affy_hg_u95av2"), afyids, mart) | |||
</pre> | |||
== AnnotationHub == | |||
<pre> | |||
library(AnnotationHub) | |||
hub <- AnnotationHub() | |||
hub | |||
# Querying AnnotationHub | |||
names(mcols(hub)) | |||
# AnnotationHub Data providers | |||
unique(hub$dataprovider) | |||
# AnnotationHub Data classes | |||
unique(hub$rdataclass) | |||
# AnnotationHub Species | |||
head(unique(hub$species)) | |||
length(unique(hub$species)) | |||
# AnnotationHub Data sources | |||
unique(hub$sourcetype) | |||
# AnnotationHub query | |||
qry <- query(hub, c("granges","homo sapiens","ensembl")) | |||
qry | |||
qry$sourceurl | |||
# Selecting AnnotationHub resource | |||
whatIwant <- qry[["AH80077"]] | |||
GRCh38TxDb <- makeTxDbFromGRanges(whatIwant) | |||
GRCh38TxDb | |||
</pre> | |||
== Map Ensembl IDs to Entrez IDs == | |||
See the vignette from [https://bioconductor.org/packages/release/workflows/html/SingscoreAMLMutations.html SingscoreAMLMutations]. | |||
Multimapped Ensembl IDs are replaced by NAs, then discarded to enforce unique mapping. Similarly, Entrez IDs that map to multiple Ensembl IDs are identified from the mapping, and discarded. Only features with unique Ensembl ID to Entrez ID mappings remain. | |||
== Gene name errors from Excel == | |||
[[Data_science#Gene_name_errors_from_Excel|Gene name errors from Excel]]. Gene names such as MAR1, DEC1, OCT4 and SEPT9 are now reformatted as dates. | |||
== Example == | |||
[https://phys.org/news/2018-05-notch2nl-human-specific-genes-big-brains.html Meet '''NOTCH2NL''', the human-specific genes that may have given us our big brains] | |||
{{Pre}} | |||
library(hgu133plus2.db) | |||
select(hgu133plus2.db, keys="NOTCH2NL", columns=c("SYMBOL","GENENAME"), keytype="SYMBOL") | |||
# Error in .testForValidKeys(x, keys, keytype, fks) : | |||
# None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method | |||
# to see a listing of valid arguments. | |||
k <- keys(hgu133plus2.db, keytype="SYMBOL") | |||
k <- sort(k) | |||
grep("NOTCH", k) | |||
# [1] 41224 41225 41226 41227 41228 41229 41230 41231 41232 | |||
k[grep("NOTCH", k) ] | |||
[1] "NOTCH1" "NOTCH2" "NOTCH2NLA" "NOTCH2NLB" "NOTCH2NLC" "NOTCH2NLR" "NOTCH2P1" | |||
[8] "NOTCH3" "NOTCH4" | |||
</pre> | |||
= Workflow = | |||
== [https://www.bioconductor.org/help/workflows/high-throughput-sequencing/ Using Bioconductor for Sequence Data] == | |||
== CyTOF == | |||
[https://bioconductor.org/packages/release/workflows/html/cytofWorkflow.html CyTOF workflow]: differential discovery in high-throughput high-dimensional cytometry datasets | |||
= Some packages = | |||
== Biobase, GEOquery and limma == | |||
How to create an ExpressionSet object from scratch? Here we use the code from GEO2R to help to do this task. | |||
<syntaxhighlight lang='rsplus'> | |||
library(Biobase) | |||
library(GEOquery) | |||
library(limma) | |||
# Load series and platform data from GEO | |||
gset <- getGEO("GSE32474", GSEMatrix =TRUE, AnnotGPL=TRUE) | |||
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1 | |||
gset <- gset[[idx]] | |||
# save(gset, file = "~/Downloads/gse32474_gset.rda") | |||
# load("~/Downloads/gse32474_gset.rda") | |||
table(pData(gset)[, "cell line:ch1"]) | |||
pData(gset) | |||
# Create an ExpressionSet object from scratch | |||
# We take a shortcut to obtain the pheno data and feature data matrices | |||
# from the output of getGEO() | |||
phenoDat <- new("AnnotatedDataFrame", | |||
data=pData(gset)) | |||
featureDat <- new("AnnotatedDataFrame", | |||
data=fData(gset)) | |||
exampleSet <- ExpressionSet(assayData=exprs(gset), | |||
phenoData=phenoDat, | |||
featureData=featureDat, | |||
annotation="hgu133plus2") | |||
gset <- exampleSet | |||
# Make proper column names to match toptable | |||
fvarLabels(gset) <- make.names(fvarLabels(gset)) | |||
# group names for all samples | |||
gsms <- paste0("00000000111111111XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", | |||
"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", | |||
"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", | |||
"XXXXXXXXXXXXXXXXXXXXXXXX") | |||
sml <- c() | |||
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) } | |||
# Subset an ExpressionSet by eliminating samples marked as "X" | |||
sel <- which(sml != "X") | |||
sml <- sml[sel] | |||
gset <- gset[ ,sel] | |||
# Decide if it is necessary to do a log2 transformation | |||
ex <- exprs(gset) | |||
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T)) | |||
LogC <- (qx[5] > 100) || | |||
(qx[6]-qx[1] > 50 && qx[2] > 0) || | |||
(qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2) | |||
if (LogC) { ex[which(ex <= 0)] <- NaN | |||
exprs(gset) <- log2(ex) } | |||
# Set up the data and proceed with analysis | |||
sml <- paste("G", sml, sep="") # set group names | |||
fl <- as.factor(sml) | |||
gset$description <- fl | |||
design <- model.matrix(~ description + 0, gset) | |||
colnames(design) <- levels(fl) | |||
fit <- lmFit(gset, design) | |||
cont.matrix <- makeContrasts(G1-G0, levels=design) | |||
fit2 <- contrasts.fit(fit, cont.matrix) | |||
fit2 <- eBayes(fit2, 0.01) | |||
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250) | |||
# Display the result with selected columns | |||
tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title")) | |||
tT[1:2, ] | |||
# ID adj.P.Val P.Value t B logFC Gene.symbol | |||
# 209108_at 209108_at 0.08400054 4.438757e-06 6.686977 3.786222 3.949088 TSPAN6 | |||
# 204975_at 204975_at 0.08400054 6.036355e-06 6.520775 3.550036 2.919995 EMP2 | |||
# Gene.title | |||
# 209108_at tetraspanin 6 | |||
# 204975_at epithelial membrane protein 2 | |||
</syntaxhighlight> | |||
== Biostrings == | |||
* Find the location of a particular sequence. [https://www.rdocumentation.org/packages/Biostrings/versions/2.40.2/topics/matchPattern ?vmatchPattern] | |||
* https://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/BiostringsBSgenomeOverview.pdf | |||
<syntaxhighlight lang='rsplus'> | |||
library(Biostrings) | |||
library(BSgenome.Hsapiens.UCSC.hg19) | |||
vmatchPattern("GCGATCGC", Hsapiens) | |||
</syntaxhighlight> | |||
== plyranges == | |||
http://bioconductor.org/packages/devel/bioc/vignettes/plyranges/inst/doc/an-introduction.html | |||
= Misc = | |||
== Package release history == | |||
https://support.bioconductor.org/p/69657/ | |||
Search the DESCRIPTION file (eg. [https://github.com/Bioconductor/VariantAnnotation/commits/master/DESCRIPTION VariantAnnotation] package) in github and the release information can be found there. | |||
== | == Papers/Overview == | ||
* [https://www.sciencedirect.com/science/article/pii/S1525157819303976 Using R and Bioconductor in Clinical Genomics and Transcriptomics] 2019 | |||
* [https://youtu.be/nzY7bPQOXUs Lori Ann Shepherd: "What is Bioconductor? : How the Core Team Plays an Essential Role] (video) 2021 | |||
* [https://almeidasilvaf.github.io/blog/2022-01-03-bioc_publications/ Where can I publish a paper describing my Bioconductor package?] | |||
=== | = Cloud = | ||
[https://f1000research.com/slides/9-1456?s=09 Bioconductor On The Cloud] | |||
= | = Conferences = | ||
* [https://bioc2021.bioconductor.org/ Bioc 2021], https://www.youtube.com/user/bioconductor | |||
* [https://bioc2020.bioconductor.org/ Bioc 2020], [https://biocasia2020.bioconductor.org/ BiocAsia 2020] & [https://www.youtube.com/playlist?list=PLdl4u5ZRDMQQT5Vw0T_CbVru5_YnzauRq all videos], [https://eurobioc2020.bioconductor.org/workshops?s=09 European Bioconductor Meeting 2020] | |||
* [http://bioc2019.bioconductor.org/ Bioc 2019], [https://bioconductor.github.io/BiocAsia/ BiocAsia 2019] | |||
* [https://csama2022.bioconductor.eu/ CSAMA Bioconductor 2022] https://github.com/Bioconductor/CSAMA |
Latest revision as of 13:45, 17 April 2024
Project
Bioconductor Notes published in The R Journal, Nov 2022.
Release News
Annual reports
http://bioconductor.org/about/annual-reports/
Download stats
- See the overview vignette of BiocPkgTools
- Download stats for Bioconductor
- bioconductor.riken.jp mirror in Japan
- biocpkg package in github
From the director of the project
Community blog
https://bioconductor.github.io/biocblog/
Publications
https://www.bioconductor.org/help/publications/
Bioconductor User Support
https://support.bioconductor.org/
Bioconductor Packages: Development, Maintenance, and Peer Review
https://contributions.bioconductor.org/index.html
Submit packages
Create A Package from Bioc2020
Update to newer version
http://bioconductor.org/install/. Suppose I have version 3.11 and I like to upgrade to version 3.12.
BiocManager::install(version = "3.12") # Upgrade all my installed packages BiocManager::install(ask = FALSE) # To use the latest version of Bioconductor for your version of R BiocManager::version() # [1] ‘3.12’
Check if my current Bioconductor version is outdated
Below is an example that shows I am using an outdated Bioconductor.
require(BiocManager) # Loading required package: BiocManager # Bioconductor version 3.15 (BiocManager 1.30.20), R 4.2.2 # Patched (2022-11-10 r83330) # Bioconductor version '3.15' is out-of-date; the current # release version '3.16' is available with R version # '4.2'; see https://bioconductor.org/install
Upgrade to the latest Bioconductor version
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(version = "3.12") # the latest version can be obtained from the # output of require(BiocManager) OR # checking the Bioconductor website # Upgrade 53 packages to Bioconductor version '3.12'? [y/n]: y
Upgrading installed Bioconductor packages
CAlling BiocManager::install() without any arguments will attempt to update all installed packages in your R environment. It checks for updates on Bioconductor, CRAN, and GitHub, and if newer versions of the packages are available, it will download and install them.
When I run it on Docker, it will also install BiocVersion package from Bioconductor.
BiocManager::install()
BiocVersion
- The BiocManager::install() function in R automatically installs the BiocVersion package during its first use. The BiocVersion package provides repository information for the appropriate version of Bioconductor. It helps BiocManager determine the version of Bioconductor in use. If BiocVersion has not yet been installed, the version is determined by code in base R. This ensures that all Bioconductor packages work best when they are all from the same release. See the document Install or update Bioconductor, CRAN, or GitHub packages from BiocManager.
- This package was suggested (not a default option by install.packages()) by BiocManager and more importantly imported by AnnotationHub.
- It seems BiocVersion gives the patch version if we like to get that information. For example the current Bioc is 3.14 but I have Bioconductor version 3.13.1.
R> BiocManager::version() [1] ‘3.13’ R> packageVersion("BiocVersion") [1] ‘3.13.1’
Devel version of Bioconductor
BiocManager::install(version = "devel")
Mirrors
https://www.bioconductor.org/about/mirrors/
Github mirror
- https://support.bioconductor.org/p/68824/ Announcement (Update: it is dead)
Japan
Package source
- Bioconductor Code & Git Browser
- R > create Bioconductor repository
- https://bioconductor.org/developers/how-to/git/. Create a local repository for private use
Code search
http://search.bioconductor.jp/
code.bioconductor.org
https://code.bioconductor.org/
Resource
- Teaching resources, HarvardX Biomedical Data Science Open Online Training from rafalab
- Module: Bioconductor R analyses from GTN Video Library
BiocManager from CRAN
The reason for using BiocManager instead of biocLite() is mostly to stop sourcing an R script from URL which isn’t so safe. So biocLite() should not be recommended anymore.
It allows to have multiple versions of Bioconductor installed on the same computer. For example, R 3.5 works with Bioconductor 3.7 and 3.8.
On the other hand, setRepositories(ind=1:4) and install.packages() still lets you install Bioconductor packages.
BiocPkgTools
- BiocPkgTools: Collection of simple tools for learning about Bioc Packages
- https://www.biorxiv.org/content/10.1101/642132v1
- GitHub Action executed as a CRON job to update the website daily. Note that the 'bioc' version needs to be updated once a new version of Bioconductor has been released.
BioCExplorer
Explore Bioconductor packages more nicely
source("https://bioconductor.org/biocLite.R") biocLite("BiocUpgrade") biocLite("biocViews") devtools::install_github("seandavi/BiocPkgTools") devtools::install_github("shians/BioCExplorer") library(BioCExplorer) bioc_explore()
BiocViews
- Software
- AnnotationData
- ChipManufacturer
- ChipName
- CustomArray
- CustomCDF
- CustomDBSchema
- FunctionalAnnotation
- Organism
- PackageType
- SequenceAnnotation
- ExperimentData
- AssayDomainData
- DiseaseModel
- OrganismData
- PackageTypeData
- RepositoryData
- ReproducibleResearch
- SpecimenSource
- TechnologyData
- Workflow
- AnnotationWorkflow
- BasicWorkflow
- DifferentialSplicingWorkflow
- EpigeneticsWorkflow
- GeneExpressionWorkflow
- GenomicVariantsWorkflow
- ImmunoOncologyWorkflow
- ProteomicsWorkflow
- ResourceQueryingWorkflow
- SingleCellWorkflow
Annotation packages
- http://bioconductor.org/help/course-materials/2012/SeattleFeb2012/Annotation.pdf
- https://bioconductor.org/help/course-materials/2017/CSAMA/lectures/1-monday/lecture-04-a-annotation-intro/lecture-04a-annotation-intro.html
- Making and Utilizing TxDb Objects
- Genomic Annotation Resources (2018) Introduction to using gene, pathway, gene ontology, homology annotations and the AnnotationHub. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.
- AnnotationHub allows us to query and download many different annotation objects without having to explicitly install them.
- OrgDb
- TxDb
- OrganismDb packages are meta-packages that contain an OrgDb, a TxDb, and a GO.db packages and allow cross-queries between those packages.
- BSgenome packages contain sequence information for a given species/build.
- biomaRt allows queries to an Ensembl Biomart server.
- http://genomicsclass.github.io/book/pages/bioc1_annoCheat.html
- ensembldb package
- Introduction to Bioconductor annotation resources 2020 Bioc workshop
-
Package type Example ChipDb hgu133plus2.db OrgDb org.Hs.eg.db TxDb/EnsDb TxDb.Hsapiens.UCSC.hg19.knownGene; EnsDb.Hsapiens.v75 OrganismDb Homo.sapiens BSgenome BSgenome.Hsapiens.UCSC.hg19 Others GO.db; KEGG.db biomaRt - Database concept was used. So there are some terms like keys, select, ..., borrowed from there.
- How do you know what the central keys are? If it's ChipDb (e.g. hugene20sttranscriptcluster.db), the central key are the manufactorer's probe IDs. If it's in the name - org.Hs.eg.db, where 'eg' means Entrez gene ID. You can see examples using e.g., head(keys(annopkg)), and infer from that.
- Functions (select, mapIds, transcripts accessors work on ChipDb, TxDb, OrgDb, ...). See the blog Bioconductor annotation packages
AnnotationDbi::keys(hgu133plus2.db, keytype="PROBEID") AnnotationDbi::keytypes(hgu133plus2.db) # columns(hgu133plus2.db) AnnotationDbi::select(annopkg, keys, columns, keytype): keytype is optional AnnotationDbi::mapIds(annopkg, keys, column, keytype, multiVals): keytype is required, multiVals is used to handle duplicates.
- Packages
TxDb packages - eg TxDb.Hsapiens.UCSC.hg19.knownGene. It contains positional information. EnsDb packages - eg EnsDb.Hsapiens.v79
- Examples: ChipDb
library(hgu133plus2.db) # Loading required package: org.Hs.eg.db keytypes(hgu133plus2.db) # [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" # [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" #[11] "GO" "GOALL" "IPI" "MAP" "OMIM" #[16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID" #[21] "PROBEID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" #[26] "UNIGENE" "UNIPROT" ls("package:hgu133plus2.db") # Get all probe IDs length(hgu133plus2ACCNUM) # [1] 54675 x <- hgu133plus2ACCNUM # hgu133plus2ACCNUM is an R object that contains mappings between # a manufacturer’s identifiers and manufacturers accessions. mapped_probes <- mappedkeys(x) str(mapped_probes) # chr [1:54675] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" ... length(unique(mapped_probes)) # [1] 54675 length(hgu133plus2ENTREZID) # [1] 54675 x <- hgu133plus2ENTREZID mapped_probes <- mappedkeys(x) str(mapped_probes) # chr [1:41905] "1053_at" "117_at" "121_at" "1255_g_at" "1316_at" "1320_at" ... length(unique(mapped_probes)) # [1] 41905 # Using select() to map from probe ID to others # Note select() will return 1:many mapping library(hugene20sttranscriptcluster.db) ids <- c('16737401','16657436' ,'16678303') select(hugene20sttranscriptcluster.db, ids, c("SYMBOL","ALIAS", "ENSEMBL", "ENTREZID")) ## 'select()' returned 1:many mapping between keys and columns # PROBEID SYMBOL ALIAS ENSEMBL ENTREZID # 1 16737401 TRAF6 MGC:3310 ENSG00000175104 7189 # 2 16737401 TRAF6 RNF85 ENSG00000175104 7189 # 3 16737401 TRAF6 TRAF6 ENSG00000175104 7189 # 4 16657436 DDX11L1 DDX11L1 ENSG00000223972 100287102 # 5 16657436 DDX11L17 DDX11L17 ENSG00000223972 102725121 # 6 16657436 DDX11L2 DDX11L2 ENSG00000223972 84771 # 7 16657436 DDX11L9 DDX11L9 ENSG00000248472 100288486 # 8 16657436 DDX11L9 DDX11L9 ENSG00000288170 100288486 # 9 16657436 DDX11L10 DDX11L1 ENSG00000233614 100287029 # 10 16657436 DDX11L10 DDX11P ENSG00000233614 100287029 # 11 16657436 DDX11L10 DDX11L10 ENSG00000233614 100287029 # 12 16657436 DDX11L5 DDX11L5 ENSG00000223972 100287596 # 13 16657436 DDX11L5 DDX11L5 ENSG00000248472 100287596 # 14 16657436 DDX11L5 DDX11L5 ENSG00000288170 100287596 # 15 16657436 DDX11L16 DDX11L16 ENSG00000223972 727856 # 16 16657436 DDX11L16 DDX11L16 ENSG00000248472 727856 # 17 16657436 DDX11L16 DDX11L16 ENSG00000233614 727856 # 18 16657436 DDX11L16 DDX11L16 ENSG00000288170 727856 # 19 16678303 ARF1 PVNH8 ENSG00000143761 375 # 20 16678303 ARF1 ARF1 ENSG00000143761 375 keytypes(hugene20sttranscriptcluster.db) ## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" ## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" ## [11] "GO" "GOALL" "IPI" "MAP" "OMIM" ## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID" ## [21] "PROBEID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" ## [26] "UNIGENE" "UNIPROT" # mapIds # Note that an additional argument, multiVals can be used to control duplicates mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID") # only select the 1st one ## 16737401 16657436 16678303 ## "TRAF6" "DDX11L1" "ARF1" mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "list") # list mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "CharacterList") # IRanges mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "filter") # character vector # OrganismDb packages library(Homo.sapiens) Homo.sapiens head(genes(Homo.sapiens, columns = c("ENTREZID", "ALIAS", "UNIPROT")), 4)
-
Gene centric
- AnnotationDbi: Introduction To Bioconductor Annotation Packages
library(hgu133a.db) library(AnnotationDbi) k <- head(keys(hgu133a.db, keytype="PROBEID")) k # [1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" "1294_at" # then call select select(hgu133a.db, keys=k, columns=c("SYMBOL","GENENAME"), keytype="PROBEID") # 'select()' returned 1:many mapping between keys and columns # PROBEID SYMBOL GENENAME # 1 1007_s_at DDR1 discoidin domain receptor tyrosine kinase 1 # 2 1007_s_at MIR4640 microRNA 4640 # 3 1053_at RFC2 replication factor C subunit 2 # 4 117_at HSPA6 heat shock protein family A (Hsp70) member 6 # 5 121_at PAX8 paired box 8 # 6 1255_g_at GUCA1A guanylate cyclase activator 1A # 7 1294_at UBA7 ubiquitin like modifier activating enzyme 7 # 8 1294_at MIR5193 microRNA 5193
ExpressionSets objects
For example: load(file.path(system.file(package = "Bioc2020Anno", "extdata"), "eset.Rdata")
- exprs(eset)
- pData(phenoData(eset))
- pData(featureData(eset))
Genomic centric
- GRanges()
- GrangesList()
GRanges() and GRangesLists act like data.frames and lists, and can be subsetted using the [ function.
SummarizedExperiment objects
SummarizedExperiment objects are like ExpressionSets, but the row-wise annotations are GRanges, so you can subset by genomic locations.
SummarizedExperiment objects are like ExpressionSets, but the row-wise annotations are GRanges, so you can subset by genomic locations.
SummarizedExperiment objects are popular objects for representing expression data and other rectangular data (feature x sample data).
- array() or assays(): genes x samples
- colData(): samples x sample features
- rowData(): genes x gene features. The gene features could be result from running analyses; see EnrichmentBrowser vignette.
Examples: GSEABenchmarkeR package.
biomaRt
library(biomaRt) listMarts(host = "useast.ensembl.org") # biomaRt data sets mart <- useEnsembl("ensembl") head(listDatasets(mart)) # biomaRt queries mart <- useEnsembl("ensembl","hsapiens_gene_ensembl") # biomaRt attributes and filters atrib <- listAttributes(mart) filts <- listFilters(mart) head(atrib) head(filts) # biomaRt query afyids <- c("1000_at","1001_at","1002_f_at","1007_s_at") getBM(c("affy_hg_u95av2", "hgnc_symbol"), c("affy_hg_u95av2"), afyids, mart)
AnnotationHub
library(AnnotationHub) hub <- AnnotationHub() hub # Querying AnnotationHub names(mcols(hub)) # AnnotationHub Data providers unique(hub$dataprovider) # AnnotationHub Data classes unique(hub$rdataclass) # AnnotationHub Species head(unique(hub$species)) length(unique(hub$species)) # AnnotationHub Data sources unique(hub$sourcetype) # AnnotationHub query qry <- query(hub, c("granges","homo sapiens","ensembl")) qry qry$sourceurl # Selecting AnnotationHub resource whatIwant <- qry[["AH80077"]] GRCh38TxDb <- makeTxDbFromGRanges(whatIwant) GRCh38TxDb
Map Ensembl IDs to Entrez IDs
See the vignette from SingscoreAMLMutations.
Multimapped Ensembl IDs are replaced by NAs, then discarded to enforce unique mapping. Similarly, Entrez IDs that map to multiple Ensembl IDs are identified from the mapping, and discarded. Only features with unique Ensembl ID to Entrez ID mappings remain.
Gene name errors from Excel
Gene name errors from Excel. Gene names such as MAR1, DEC1, OCT4 and SEPT9 are now reformatted as dates.
Example
Meet NOTCH2NL, the human-specific genes that may have given us our big brains
library(hgu133plus2.db) select(hgu133plus2.db, keys="NOTCH2NL", columns=c("SYMBOL","GENENAME"), keytype="SYMBOL") # Error in .testForValidKeys(x, keys, keytype, fks) : # None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method # to see a listing of valid arguments. k <- keys(hgu133plus2.db, keytype="SYMBOL") k <- sort(k) grep("NOTCH", k) # [1] 41224 41225 41226 41227 41228 41229 41230 41231 41232 k[grep("NOTCH", k) ] [1] "NOTCH1" "NOTCH2" "NOTCH2NLA" "NOTCH2NLB" "NOTCH2NLC" "NOTCH2NLR" "NOTCH2P1" [8] "NOTCH3" "NOTCH4"
Workflow
Using Bioconductor for Sequence Data
CyTOF
CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets
Some packages
Biobase, GEOquery and limma
How to create an ExpressionSet object from scratch? Here we use the code from GEO2R to help to do this task.
library(Biobase) library(GEOquery) library(limma) # Load series and platform data from GEO gset <- getGEO("GSE32474", GSEMatrix =TRUE, AnnotGPL=TRUE) if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1 gset <- gset[[idx]] # save(gset, file = "~/Downloads/gse32474_gset.rda") # load("~/Downloads/gse32474_gset.rda") table(pData(gset)[, "cell line:ch1"]) pData(gset) # Create an ExpressionSet object from scratch # We take a shortcut to obtain the pheno data and feature data matrices # from the output of getGEO() phenoDat <- new("AnnotatedDataFrame", data=pData(gset)) featureDat <- new("AnnotatedDataFrame", data=fData(gset)) exampleSet <- ExpressionSet(assayData=exprs(gset), phenoData=phenoDat, featureData=featureDat, annotation="hgu133plus2") gset <- exampleSet # Make proper column names to match toptable fvarLabels(gset) <- make.names(fvarLabels(gset)) # group names for all samples gsms <- paste0("00000000111111111XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "XXXXXXXXXXXXXXXXXXXXXXXX") sml <- c() for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) } # Subset an ExpressionSet by eliminating samples marked as "X" sel <- which(sml != "X") sml <- sml[sel] gset <- gset[ ,sel] # Decide if it is necessary to do a log2 transformation ex <- exprs(gset) qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T)) LogC <- (qx[5] > 100) || (qx[6]-qx[1] > 50 && qx[2] > 0) || (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2) if (LogC) { ex[which(ex <= 0)] <- NaN exprs(gset) <- log2(ex) } # Set up the data and proceed with analysis sml <- paste("G", sml, sep="") # set group names fl <- as.factor(sml) gset$description <- fl design <- model.matrix(~ description + 0, gset) colnames(design) <- levels(fl) fit <- lmFit(gset, design) cont.matrix <- makeContrasts(G1-G0, levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2, 0.01) tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250) # Display the result with selected columns tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title")) tT[1:2, ] # ID adj.P.Val P.Value t B logFC Gene.symbol # 209108_at 209108_at 0.08400054 4.438757e-06 6.686977 3.786222 3.949088 TSPAN6 # 204975_at 204975_at 0.08400054 6.036355e-06 6.520775 3.550036 2.919995 EMP2 # Gene.title # 209108_at tetraspanin 6 # 204975_at epithelial membrane protein 2
Biostrings
- Find the location of a particular sequence. ?vmatchPattern
- https://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/BiostringsBSgenomeOverview.pdf
library(Biostrings) library(BSgenome.Hsapiens.UCSC.hg19) vmatchPattern("GCGATCGC", Hsapiens)
plyranges
http://bioconductor.org/packages/devel/bioc/vignettes/plyranges/inst/doc/an-introduction.html
Misc
Package release history
https://support.bioconductor.org/p/69657/
Search the DESCRIPTION file (eg. VariantAnnotation package) in github and the release information can be found there.
Papers/Overview
- Using R and Bioconductor in Clinical Genomics and Transcriptomics 2019
- Lori Ann Shepherd: "What is Bioconductor? : How the Core Team Plays an Essential Role (video) 2021
- Where can I publish a paper describing my Bioconductor package?