GEO: Difference between revisions
Revision as of 13:54, 16 March 2018
The Gene Expression Omnibus (GEO) website is located at http://www.ncbi.nlm.nih.gov/geo/. GEO is a public functional genomics data repository that supports MIAME-compliant data submissions. Both array-based and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.
Browse Content
Repository Browser/Summary
Click on 'Browse Content' > 'Repository Browser' to go to the summary page. It has 4 tabs.
Series/Series Type
Series type                                                        Count
Expression profiling by array                                     40,319
Expression profiling by genome tiling array                          639
Expression profiling by high throughput sequencing                 4,772
Expression profiling by SAGE                                         242
Expression profiling by MPSS                                          20
Expression profiling by RT-PCR                                       329
Expression profiling by SNP array                                     13
Genome variation profiling by array                                  596
Genome variation profiling by genome tiling array                  1,068
Genome variation profiling by high throughput sequencing              63
Genome variation profiling by SNP array                              826
Genome binding/occupancy profiling by array                          174
Genome binding/occupancy profiling by genome tiling array          2,114
Genome binding/occupancy profiling by high throughput sequencing   3,940
Genome binding/occupancy profiling by SNP array                       12
Methylation profiling by array                                       556
Methylation profiling by genome tiling array                         718
Methylation profiling by high throughput sequencing                  764
Methylation profiling by SNP array                                     9
Protein profiling by protein array                                   167
Protein profiling by Mass Spec                                         6
SNP genotyping by SNP array                                          514
Other                                                              1,147
Non-coding RNA profiling by array                                  2,166
Non-coding RNA profiling by genome tiling array                      104
Non-coding RNA profiling by high throughput sequencing             1,478
Third-party reanalysis                                               135
The R code to query this information is
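    library(GEOmetadb)
    library(dplyr)
    library(tidyr)
    library(pander)

    # Download the GEOmetadb SQLite file only if it is not already present
    sfile = 'GEOmetadb.sqlite'
    if(!file.exists(sfile)) {
      sfile = getSQLiteFile()
    }

    gmdb = src_sqlite(sfile)
    # List available tables in the database
    src_tbls(gmdb)
    tgse = tbl(gmdb,'gse')
    tgsm = tbl(gmdb,'gsm')
    tgpl = tbl(gmdb,'gpl')

    # A series can carry several types separated by ';\t'; split them into one row per type.
    # collect() pulls the two columns into R first, because strsplit() cannot run inside SQLite.
    gse_type = select(tgse,gse,type) %>%
      collect() %>%
      mutate(type = strsplit(type,';\\t')) %>%
      unnest(type)

    # Count series per type, most frequent first
    type_count = select(gse_type,type) %>%
      group_by(type) %>%
      summarize(count=n()) %>%
      arrange(desc(count))
    pander(type_count,justify=c('left','right'))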
Platform/Technology
Technology                            Count
in situ oligonucleotide               5,657
spotted oligonucleotide               2,852
spotted DNA/cDNA                      2,869
antibody                                 24
MS                                       17
SAGE NlaIII                              67
SAGE Sau3A                                4
SAGE RsaI                                 1
SARST                                     2
MPSS                                     18
RT-PCR                                  277
other                                   174
oligonucleotide beads                   227
mixed spotted oligonucleotide/cDNA       16
spotted peptide or protein              110
high-throughput sequencing            2,073
Some examples
- hgu-133a GPL96 22283 rows
- hgu-133b GPL97 22645 rows
- hgu-133 plus 2.0 GPL570 54675 rows
- hgu-133a 2.0 GPL571 22277 rows
Samples/Samples Type
Sample type      Count
RNA          1,017,959
genomic        244,511
protein         12,860
SAGE             1,763
mixed            3,976
other            7,509
SARST                9
MPSS               207
SRA            135,247
Organism
A partial list:
Organism                   Series  Platforms  Samples
Homo sapiens               22,477      4,590  792,844
Mus musculus               15,758      1,959  240,935
Rattus norvegicus           2,358        475   68,583
Saccharomyces cerevisiae    1,790        550   37,435
Arabidopsis thaliana        2,416        331   30,709
Drosophila melanogaster     2,422        317   23,601
Sus scrofa                    405        107    9,809
Caenorhabditis elegans      1,154        183    8,898
Zea mays                      265         91    8,667
Bos taurus                    462        147    7,780
Oryza sativa                  493        173    5,616
Glycine max                   179         41    5,863
Gallus gallus                 375        105    5,509
Escherichia coli              508        127    5,056
Macaca mulatta                245         40    4,504
Xenopus laevis                111         25    1,054
Series, Samples, Platforms, DataSets
R packages
GEOmetadb
- http://gbnci.abcc.ncifcrf.gov/geo/ Meltzerlab GEO Microarray Tool
- https://nsaunders.wordpress.com/2010/08/30/geo-database-curation-lagging-behind-submission/
- http://rpubs.com/seandavi/GEOMetadbSurvey2014 dplyr and the GEOmetadb package for mining NCBI GEO metadata
GEOsubmission
Converts a microarray dataset and the corresponding sample information into a SOFT file to be used for GEO submission.
GEOquery
- http://bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html
- http://watson.nci.nih.gov/~sdavis/tutorials/CSHL2010/publicRepos.html
- Accessing Public Data using R and Bioconductor
- http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo
- https://www.biostars.org/p/4896/, https://www.biostars.org/p/111791/
- Creating Annotated Data Frames from GEO with the GEOquery package
Question: how many GSE series from GPL198? How many samples in each of these series?
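One possible way to answer this question is through the GEOmetadb SQLite file, which (to my knowledge) contains the link tables gse_gpl (series to platform) and gse_gsm (series to sample). The sketch below is an assumption-laden illustration, not a verified recipe: it builds a tiny in-memory database with made-up accessions (GSE_A, GSM_1, ...) so that it runs without the large download, and the helper name samples_per_series is hypothetical.

```r
library(DBI)

# Toy stand-ins for the GEOmetadb link tables; all accessions are made up.
gse_gpl <- data.frame(
  gse = c("GSE_A", "GSE_B", "GSE_C"),
  gpl = c("GPL198", "GPL198", "GPL96"),
  stringsAsFactors = FALSE
)
gse_gsm <- data.frame(
  gse = c("GSE_A", "GSE_A", "GSE_B", "GSE_B", "GSE_B", "GSE_C"),
  gsm = c("GSM_1", "GSM_2", "GSM_3", "GSM_4", "GSM_5", "GSM_6"),
  stringsAsFactors = FALSE
)

# In-memory SQLite database mimicking the assumed table layout
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "gse_gpl", gse_gpl)
dbWriteTable(con, "gse_gsm", gse_gsm)

# Hypothetical helper: one row per series on the platform, with its sample count
samples_per_series <- function(con, gpl) {
  dbGetQuery(con, "
    SELECT gse_gpl.gse AS gse, COUNT(DISTINCT gse_gsm.gsm) AS n_samples
    FROM gse_gpl
    JOIN gse_gsm ON gse_gsm.gse = gse_gpl.gse
    WHERE gse_gpl.gpl = ?
    GROUP BY gse_gpl.gse
    ORDER BY gse_gpl.gse",
    params = list(gpl))
}

res <- samples_per_series(con, "GPL198")
n_series <- nrow(res)   # number of series that use the platform
dbDisconnect(con)
```

Against the real database, the in-memory connection would be replaced by dbConnect(RSQLite::SQLite(), 'GEOmetadb.sqlite'). Note that for a series spanning several platforms this query counts all samples of the series, not only those run on the requested platform.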
SRAdb
The SRA website is located at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi
Some cases
GSE22631
Agilent-014879 Whole Rat Genome Microarray 4x44K G4131F. 12 samples.
Soft format
The file is 27 MB after decompression. It contains the gene annotation for the platform, the gene expression values and the sample information. The format, however, is not a matrix: after the gene annotation, the rest of the file is organized sample by sample, and each sample block has a single column (VALUE) holding the gene expression.
See https://gist.github.com/arraytools/9a6e954ef423d6634695#file-gse22631_family-soft.
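Because the sample blocks are stacked one after another, turning a SOFT family file into a probe-by-sample matrix takes a small amount of parsing. The following base R sketch shows one way this could be done. It assumes the usual SOFT markers (`^SAMPLE = `, `!sample_table_begin`, `!sample_table_end`) and the column names ID_REF and VALUE; the input lines are a made-up miniature file, and the function name soft_samples_to_matrix is hypothetical. In practice GEOquery's getGEO() already parses these files.

```r
# Made-up miniature SOFT family file: one platform block, two sample blocks
soft_lines <- c(
  "^PLATFORM = GPL_TOY",
  "!platform_table_begin",
  "ID\tGENE_SYMBOL",
  "p1\tGeneA",
  "p2\tGeneB",
  "!platform_table_end",
  "^SAMPLE = GSM_TOY1",
  "!Sample_title = control",
  "!sample_table_begin",
  "ID_REF\tVALUE",
  "p1\t1.5",
  "p2\t2.5",
  "!sample_table_end",
  "^SAMPLE = GSM_TOY2",
  "!Sample_title = treated",
  "!sample_table_begin",
  "ID_REF\tVALUE",
  "p1\t3.5",
  "p2\t4.5",
  "!sample_table_end"
)

# Hypothetical helper: collect the VALUE column of every sample block
# into a matrix with probes in rows and samples in columns.
soft_samples_to_matrix <- function(lines) {
  begin <- grep("^!sample_table_begin", lines)
  end   <- grep("^!sample_table_end", lines)
  ids   <- sub("^\\^SAMPLE = ", "", grep("^\\^SAMPLE = ", lines, value = TRUE))
  stopifnot(length(begin) == length(end), length(begin) == length(ids))

  # Read each per-sample table (header line plus data rows)
  tables <- lapply(seq_along(begin), function(i) {
    block <- lines[(begin[i] + 1):(end[i] - 1)]
    read.delim(text = paste(block, collapse = "\n"), stringsAsFactors = FALSE)
  })

  # Align every sample to the probe order of the first sample
  probes <- tables[[1]]$ID_REF
  mat <- do.call(cbind, lapply(tables, function(tb) tb$VALUE[match(probes, tb$ID_REF)]))
  dimnames(mat) <- list(probes, ids)
  mat
}

expr_soft <- soft_samples_to_matrix(soft_lines)
```

For a real download, the lines could be read with readLines(gzfile(...)) on the compressed family file; with a 27 MB file this line-based approach is slow but workable.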
Series matrix
These files are suitable for loading into MS Excel.
The file is 6 MB after decompression. It contains the gene expression values (without gene annotation) and the sample information.
See https://gist.github.com/arraytools/9a6e954ef423d6634695#file-gse22631_series_matrix-txt.
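Since a series matrix file is essentially a tab-delimited table preceded by metadata lines that start with "!", it can also be read with base R. The sketch below is a minimal illustration under that assumption; the file content is a made-up miniature example, not the actual GSE22631 data.

```r
# Made-up miniature series matrix file
matrix_lines <- c(
  '!Series_title\t"toy series"',
  '!Sample_title\t"control"\t"treated"',
  '!series_matrix_table_begin',
  '"ID_REF"\t"GSM_TOY1"\t"GSM_TOY2"',
  '"p1"\t1.5\t3.5',
  '"p2"\t2.5\t4.5',
  '!series_matrix_table_end'
)

# Treat every line starting with "!" as a comment, so only the
# expression table (header row plus data rows) is parsed.
expr_sm <- as.matrix(read.delim(
  text = paste(matrix_lines, collapse = "\n"),
  comment.char = "!",
  row.names = 1,
  check.names = FALSE
))

# Sample information lives in the "!" lines, e.g. the sample titles
title_line <- grep("^!Sample_title", matrix_lines, value = TRUE)
sample_titles <- gsub('"', "", strsplit(title_line, "\t")[[1]][-1])
```

For a downloaded file, the text argument would be replaced by the file path (read.delim reads gzip-compressed files directly). GEOquery's getGEO() offers a higher-level route to the same data.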
Experiment descriptor (ArrayTools)
See https://gist.github.com/arraytools/9a6e954ef423d6634695#file-experiment-descriptors.
GDS507
This data set is used by the GEOquery package. It has 17 samples on the HG-U133B platform.
There is a related GSE number (GSE781) for this GDS. However, GSE781 is a larger data set that covers two platforms (HG-U133A and HG-U133B) and has 34 samples.
The raw CEL files are available too.
GSE30786
This is actually an RNA-Seq data set. There are 4 samples from 2 conditions. See