GEO

From 太極
Revision as of 11:08, 12 October 2018 by Brb (talk | contribs) (→‎Tips)

The Gene Expression Omnibus (GEO) website is located at http://www.ncbi.nlm.nih.gov/geo/. GEO is a public functional genomics data repository that supports MIAME-compliant data submissions. Both array-based and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.

Browse Content

Repository Browser/Summary

Click on 'Browse Content' > 'Repository Browser' to go to the summary page. It has 4 tabs.

Series/Series Type

Series type	Count
Expression profiling by array	40,319
Expression profiling by genome tiling array	639
Expression profiling by high throughput sequencing	4,772
Expression profiling by SAGE	242
Expression profiling by MPSS	20
Expression profiling by RT-PCR	329
Expression profiling by SNP array	13
Genome variation profiling by array	596
Genome variation profiling by genome tiling array	1,068
Genome variation profiling by high throughput sequencing	63
Genome variation profiling by SNP array	826
Genome binding/occupancy profiling by array	174
Genome binding/occupancy profiling by genome tiling array	2,114
Genome binding/occupancy profiling by high throughput sequencing	3,940
Genome binding/occupancy profiling by SNP array	12
Methylation profiling by array	556
Methylation profiling by genome tiling array	718
Methylation profiling by high throughput sequencing	764
Methylation profiling by SNP array	9
Protein profiling by protein array	167
Protein profiling by Mass Spec	6
SNP genotyping by SNP array	514
Other	1,147
Non-coding RNA profiling by array	2,166
Non-coding RNA profiling by genome tiling array	104
Non-coding RNA profiling by high throughput sequencing	1,478
Third-party reanalysis	135

The R code to query this information is

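library(GEOmetadb)
# Download the SQLite database only if it is not already present
sfile = 'GEOmetadb.sqlite'
if(!file.exists(sfile)) {
  sfile = getSQLiteFile()
}

library(dplyr)
gmdb = src_sqlite(sfile)
# List available tables in the database
src_tbls(gmdb)

tgse = tbl(gmdb,'gse')
tgsm = tbl(gmdb,'gsm')
tgpl = tbl(gmdb,'gpl')

library(tidyr)
library(pander)
# A series can have several types separated by ";<tab>". Pull the two
# columns into R first (strsplit cannot run inside SQLite), then split
# so that each (series, type) pair gets its own row.
gse_type = select(tgse,gse,type) %>%
  collect() %>%
  mutate(type = strsplit(type,';\\t')) %>%
  unnest(type)
# Count series per type, most frequent first
type_count = select(gse_type,type) %>%
  group_by(type) %>%
  summarize(count=n()) %>%
  arrange(desc(count))
pander(type_count,justify=c('left','right'))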
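
The split step in the code above matters because a single series can carry several types in one string. The following base R sketch shows the same split-and-count idea on a toy table; the series IDs and rows are made up for illustration and the ";<tab>" separator is taken from the code above.

```r
# Toy stand-in for the gse table: one row per series, with multiple
# types joined by ";<tab>". All IDs and rows here are made up.
gse <- data.frame(
  gse  = c("GSE_A", "GSE_B", "GSE_C"),
  type = c("Expression profiling by array",
           "Expression profiling by array;\tMethylation profiling by array",
           "Methylation profiling by array"),
  stringsAsFactors = FALSE
)

# Split each type string on ";<tab>" and flatten to one type per element.
types <- unlist(strsplit(gse$type, ";\t", fixed = TRUE))

# Count series per type, most frequent first.
type_count <- sort(table(types), decreasing = TRUE)
print(type_count)
```

Because GSE_B is counted under both of its types, the per-type counts can add up to more than the number of series.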

Platform/Technology

Technology	Count
in situ oligonucleotide	5,657
spotted oligonucleotide	2,852
spotted DNA/cDNA	2,869
antibody	24
MS	17
SAGE NlaIII	67
SAGE Sau3A	4
SAGE RsaI	1
SARST	2
MPSS	18
RT-PCR	277
other	174
oligonucleotide beads	227
mixed spotted oligonucleotide/cDNA	16
spotted peptide or protein	110
high-throughput sequencing	2,073

Some examples

  • hgu-133a GPL96 22283 rows
  • hgu-133b GPL97 22645 rows
  • hgu-133 plus 2.0 GPL570 54675 rows
  • hgu-133a 2.0 GPL571 22277 rows

Samples/Samples Type

Sample type	Count
RNA	1,017,959
genomic	244,511
protein	12,860
SAGE	1,763
mixed	3,976
other	7,509
SARST	9
MPSS	207
SRA	135,247

Organism

A partial list:

Organism	Series	Platforms	Samples
Homo sapiens	22,477	4,590	792,844
Mus musculus	15,758	1,959	240,935
Rattus norvegicus	2,358	475	68,583
Saccharomyces cerevisiae	1,790	550	37,435
Arabidopsis thaliana	2,416	331	30,709
Drosophila melanogaster	2,422	317	23,601
Sus scrofa	405	107	9,809
Caenorhabditis elegans	1,154	183	8,898
Zea mays	265	91	8,667
Bos taurus	462	147	7,780
Oryza sativa	493	173	5,616
Glycine max	179	41	5,863
Gallus gallus	375	105	5,509
Escherichia coli	508	127	5,056
Macaca mulatta	245	40	4,504
Xenopus laevis	111	25	1,054

Series, Samples, Platforms, DataSets

[Images: Geo series.png, Geo samples.png, Geo platform.png, Geo datasets.png]

R packages

GEOmetadb

GEOsubmission

Converts a microarray dataset and the corresponding sample information into a SOFT file to be used for GEO submission.

GEOquery

Question: how many GSE series use platform GPL198? How many samples are in each of these series?
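
One possible way to answer this question is through the GEOmetadb link tables rather than GEOquery. The sketch below assumes two data frames shaped like the GEOmetadb tables gse_gpl (columns gse, gpl) and gse_gsm (columns gse, gsm); samples_per_series is a hypothetical helper, and the toy rows are made up so that the code runs without the database.

```r
# Hypothetical helper: for every series that uses a given platform,
# count the samples in that series.
# gse_gpl needs columns gse, gpl; gse_gsm needs columns gse, gsm.
samples_per_series <- function(gse_gpl, gse_gsm, gpl_id) {
  series <- unique(gse_gpl$gse[gse_gpl$gpl == gpl_id])
  hits   <- gse_gsm[gse_gsm$gse %in% series, , drop = FALSE]
  counts <- table(factor(hits$gse, levels = series))
  data.frame(gse = series, n_samples = as.vector(counts),
             stringsAsFactors = FALSE)
}

# Toy link tables with made-up IDs. With the real database they could be
# loaded along these lines (assumes the DBI and RSQLite packages):
#   con     <- DBI::dbConnect(RSQLite::SQLite(), "GEOmetadb.sqlite")
#   gse_gpl <- DBI::dbGetQuery(con, "select * from gse_gpl")
#   gse_gsm <- DBI::dbGetQuery(con, "select * from gse_gsm")
gse_gpl <- data.frame(gse = c("GSE1", "GSE2", "GSE3"),
                      gpl = c("GPL198", "GPL198", "GPL96"),
                      stringsAsFactors = FALSE)
gse_gsm <- data.frame(gse = c("GSE1", "GSE1", "GSE2", "GSE3", "GSE3"),
                      gsm = c("GSM1", "GSM2", "GSM3", "GSM4", "GSM5"),
                      stringsAsFactors = FALSE)

res <- samples_per_series(gse_gpl, gse_gsm, "GPL198")
print(res)        # one row per series on GPL198
print(nrow(res))  # number of series on GPL198
```

Note that the count covers all samples of a series, including samples run on other platforms when a series spans more than one.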

SRAdb

The SRA website is located at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi

rentrez

Provides an R interface to the NCBI's 'EUtils' API, allowing users to search databases like 'GenBank' <https://www.ncbi.nlm.nih.gov/genbank/> and 'PubMed' <https://www.ncbi.nlm.nih.gov/pubmed/>, process the results of those searches and pull data into their R sessions.

BART (bioinformatics array research tool): R shiny application for microarray analysis

Some cases

GSE22631

Agilent-014879 Whole Rat Genome Microarray 4x44K G4131F. 12 samples.

Soft format

The file is 27 MB after decompression. It contains the gene annotation for the platform, the gene expression values and the sample information. It is not in matrix format, however: after the gene annotation, the rest of the file is split into one block per sample, each with a single column (VALUE) holding the gene expression values.

See https://gist.github.com/arraytools/9a6e954ef423d6634695#file-gse22631_family-soft.
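
Since the SOFT file stores one block per sample, the blocks have to be stitched together to get a probe-by-sample matrix. The following is a minimal base R sketch of that step, assuming each sample block starts with a "^SAMPLE = ..." line and wraps its two-column table (ID_REF, VALUE) between "!sample_table_begin" and "!sample_table_end". The function name soft_to_matrix and the toy lines are made up for illustration; for real files the GEOquery package provides its own parser.

```r
# Hypothetical helper: turn per-sample SOFT tables into a matrix with
# probes in rows and samples in columns.
soft_to_matrix <- function(lines) {
  sample_id   <- NULL    # sample currently being read
  in_table    <- FALSE   # TRUE between the table begin/end markers
  header_seen <- FALSE   # TRUE once the "ID_REF<tab>VALUE" row is passed
  values <- list()
  for (ln in lines) {
    if (grepl("^\\^SAMPLE", ln)) {
      sample_id <- trimws(sub("^\\^SAMPLE *= *", "", ln))
      values[[sample_id]] <- numeric(0)
    } else if (grepl("^!sample_table_begin", ln, ignore.case = TRUE)) {
      in_table <- TRUE
      header_seen <- FALSE
    } else if (grepl("^!sample_table_end", ln, ignore.case = TRUE)) {
      in_table <- FALSE
    } else if (in_table && !header_seen) {
      header_seen <- TRUE  # skip the column header row
    } else if (in_table) {
      f <- strsplit(ln, "\t", fixed = TRUE)[[1]]
      values[[sample_id]][f[1]] <- as.numeric(f[2])  # assumes VALUE is column 2
    }
  }
  probes <- names(values[[1]])
  m <- do.call(cbind, lapply(values, function(v) v[probes]))
  rownames(m) <- probes
  m
}

# Toy SOFT fragment with made-up sample and probe IDs.
soft <- c(
  "^SAMPLE = GSM_X1",
  "!Sample_title = control",
  "!sample_table_begin",
  "ID_REF\tVALUE",
  "probe1\t7.5",
  "probe2\t3.25",
  "!sample_table_end",
  "^SAMPLE = GSM_X2",
  "!Sample_title = treated",
  "!sample_table_begin",
  "ID_REF\tVALUE",
  "probe1\t8",
  "probe2\t2.5",
  "!sample_table_end"
)
expr <- soft_to_matrix(soft)
print(expr)
```

Lines outside a sample table, such as the platform annotation and the "!Sample_..." metadata, are simply ignored by this sketch.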

Series matrix

These files are suitable for loading into MS Excel.

The file is 6 MB after decompression. It contains the gene expression values (without gene annotation) and the sample information.

See https://gist.github.com/arraytools/9a6e954ef423d6634695#file-gse22631_series_matrix-txt.
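
Because the series matrix file is already one tab-delimited table plus metadata lines, it can be read with base R alone. The sketch below assumes that every metadata line starts with "!" (including the "!series_matrix_table_begin" and "!series_matrix_table_end" markers) and that the first table column is ID_REF. The helper names and the toy lines are made up for illustration.

```r
# Hypothetical helper: read the expression table from series matrix lines.
read_series_matrix <- function(lines) {
  # Metadata lines start with "!"; everything else is the expression table.
  table_lines <- lines[!grepl("^!", lines) & nzchar(lines)]
  tab <- read.delim(text = paste(table_lines, collapse = "\n"),
                    check.names = FALSE, stringsAsFactors = FALSE)
  m <- as.matrix(tab[, -1, drop = FALSE])
  rownames(m) <- tab[[1]]   # first column holds the probe IDs (ID_REF)
  m
}

# Hypothetical helper: values of one "!Key<tab>v1<tab>v2..." metadata line.
meta_field <- function(lines, key) {
  ln <- grep(paste0("^!", key, "\t"), lines, value = TRUE)[1]
  gsub('"', "", strsplit(ln, "\t", fixed = TRUE)[[1]][-1])
}

# Toy series matrix fragment with made-up sample and probe IDs.
sm <- c(
  "!Series_title\t\"toy series\"",
  "!Sample_title\t\"control\"\t\"treated\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM_X1\"\t\"GSM_X2\"",
  "\"probe1\"\t7.5\t8",
  "\"probe2\"\t3.25\t2.5",
  "!series_matrix_table_end"
)
expr   <- read_series_matrix(sm)
titles <- meta_field(sm, "Sample_title")
print(expr)
print(titles)
```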

Experiment descriptor (ArrayTools)

See https://gist.github.com/arraytools/9a6e954ef423d6634695#file-experiment-descriptors.

GDS507

This dataset is used as an example by the GEOquery package. It has 17 samples on the HG-U133B platform.

There is a related GSE number (GSE781) for this GDS dataset. However, GSE781 is a larger dataset that covers two platforms (HG-U133A and HG-U133B) and has 34 samples.

The raw CEL files are also available.

GSE30786

This is actually an RNA-Seq dataset. There are 4 samples in 2 conditions. See

GSE20986

HGU133 plus 2

The CEL files were processed with the Bioconductor gcrma function.

GSE68465

HGU133A.

The CEL files were processed with MAS5.

Tips

Search by SRRxxxxxx

Search the SRA website; for example SRR902884.

Explore GEO/SRA

MetaSRA & the paper