Revision as of 10:54, 11 April 2019

The Gene Expression Omnibus (GEO) website is located at http://www.ncbi.nlm.nih.gov/geo/. GEO is a public functional genomics data repository that supports MIAME-compliant data submissions. Both array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.

Browse Content

Repository Browser/Summary

Click on 'Browse Content' > 'Repository Browser' to go to the summary page. It has 4 tabs.

Series/Series Type

Series type	Count
Expression profiling by array	40,319
Expression profiling by genome tiling array	639
Expression profiling by high throughput sequencing	4,772
Expression profiling by SAGE	242
Expression profiling by MPSS	20
Expression profiling by RT-PCR	329
Expression profiling by SNP array	13
Genome variation profiling by array	596
Genome variation profiling by genome tiling array	1,068
Genome variation profiling by high throughput sequencing	63
Genome variation profiling by SNP array	826
Genome binding/occupancy profiling by array	174
Genome binding/occupancy profiling by genome tiling array	2,114
Genome binding/occupancy profiling by high throughput sequencing	3,940
Genome binding/occupancy profiling by SNP array	12
Methylation profiling by array	556
Methylation profiling by genome tiling array	718
Methylation profiling by high throughput sequencing	764
Methylation profiling by SNP array	9
Protein profiling by protein array	167
Protein profiling by Mass Spec	6
SNP genotyping by SNP array	514
Other	1,147
Non-coding RNA profiling by array	2,166
Non-coding RNA profiling by genome tiling array	104
Non-coding RNA profiling by high throughput sequencing	1,478
Third-party reanalysis	135

The R code to query this information is:

library(GEOmetadb)
sfile = 'GEOmetadb.sqlite'
if(!file.exists(sfile)) {
  sfile = getSQLiteFile() 
}

library(dplyr)
gmdb = src_sqlite(sfile)
# List available tables in the database
src_tbls(gmdb)

tgse = tbl(gmdb,'gse')
tgsm = tbl(gmdb,'gsm')
tgpl = tbl(gmdb,'gpl')

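library(tidyr)
library(pander)
# Pull the two columns into memory first: strsplit() and unnest() need local data
gse_type = select(tgse,gse,type) %>%
  collect() %>%
  mutate(type = strsplit(type,';\\t')) %>%
  unnest(type) 
# Count the series per type, most frequent first
type_count = select(gse_type,type) %>%
  group_by(type) %>%
  summarize(count=n()) %>% 
  arrange(desc(count))
pander(type_count,justify=c('left','right'))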

Platform/Technology

Technology	Count
in situ oligonucleotide	5,657
spotted oligonucleotide	2,852
spotted DNA/cDNA	2,869
antibody	24
MS	17
SAGE NlaIII	67
SAGE Sau3A	4
SAGE RsaI	1
SARST	2
MPSS	18
RT-PCR	277
other	174
oligonucleotide beads	227
mixed spotted oligonucleotide/cDNA	16
spotted peptide or protein	110
high-throughput sequencing	2,073

Some examples

hgu-133a GPL96 22283 rows
hgu-133b GPL97 22645 rows
hgu-133 plus 2.0 GPL570 54675 rows
hgu-133a 2.0 GPL571 22277 rows

Samples/Samples Type

Sample type	Count
RNA	1,017,959
genomic	244,511
protein	12,860
SAGE	1,763
mixed	3,976
other	7,509
SARST	9
MPSS	207
SRA	135,247

Organism

A partial list:

Organism	Series	Platforms	Samples
Homo sapiens	22,477	4,590	792,844
Mus musculus	15,758	1,959	240,935
Rattus norvegicus	2,358	475	68,583
Saccharomyces cerevisiae	1,790	550	37,435
Arabidopsis thaliana	2,416	331	30,709
Drosophila melanogaster	2,422	317	23,601
Sus scrofa	405	107	9,809
Caenorhabditis elegans	1,154	183	8,898
Zea mays	265	91	8,667
Bos taurus	462	147	7,780
Oryza sativa	493	173	5,616
Glycine max	179	41	5,863
Gallus gallus	375	105	5,509
Escherichia coli	508	127	5,056
Macaca mulatta	245	40	4,504
Xenopus laevis	111	25	1,054

Series, Samples, Platforms, DataSets

R packages

GEOmetadb

http://gbnci.abcc.ncifcrf.gov/geo/ Meltzerlab GEO Microarray Tool
https://nsaunders.wordpress.com/2010/08/30/geo-database-curation-lagging-behind-submission/
http://rpubs.com/seandavi/GEOMetadbSurvey2014 dplyr and the GEOmetadb package for mining NCBI GEO metadata

GEOsubmission

Converts a microarray dataset and the corresponding sample information into a SOFT file to be used for GEO submission.

GEOquery

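See the vignette at http://bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html. The class of the object returned by getGEO() depends on the GSEMatrix argument: with GSEMatrix=TRUE it is a list of ExpressionSet objects (one per platform), and with GSEMatrix=FALSE it is a GSE object built from the SOFT file.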

library(GEOquery)

# Series matrix
gse <- getGEO("GSE9782",GSEMatrix=TRUE)
class(gse)
# [1] "list"
head(Meta(gse))
# Error in (function (classes, fdef, mtable)  : 
#  unable to find an inherited method for function ‘Meta’ for signature ‘"list"’
names(GPLList(gse))
# Error in (function (classes, fdef, mtable)  : 
#  unable to find an inherited method for function ‘GPLList’ for signature ‘"list"’

show(gse)  # length of 2, GPL96 & GPL97

# Get the expression matrix
mat1 <- exprs(gse[[1]])
dim(mat1) # 22283 x 264
mat1[1:2, 1:5]
#           GSM246523 GSM246524 GSM246525 GSM246526 GSM246527
# 1007_s_at   235.523  498.2220  309.2070   307.569   37.3808
# 1053_at      41.447   69.0219   69.3994    36.931   43.5677

mat2 <- exprs(gse[[2]])
dim(mat2) # 22645 x 264

library(magrittr)
intersect(rownames(mat1), rownames(mat2)) %>% length()
# [1] 168

# Get the phenotype information
pData(phenoData(gse[[1]]))[1:2, 1:3]
#               title geo_accession                status
# GSM246523 MPM002090     GSM246523 Public on Dec 06 2007
# GSM246524 MPM002091     GSM246524 Public on Dec 06 2007

# each list element is an ExpressionSet object, ready for limma
class(gse[[1]])
# [1] "ExpressionSet"
# attr(,"package")
# [1] "Biobase"

# Obtain the soft format file
gse2 <- getGEO("GSE9782",GSEMatrix=FALSE)
class(gse2)
# [1] "GSE"
# attr(,"package")
# [1] "GEOquery"
head(Meta(gse2))
# $contact_address
# [1] "35 Landsdowne Stree"
# $contact_city
# [1] "Cambridge"
# ...
names(GSMList(gse2)) %>% head()
# [1] "GSM246523" "GSM246524" "GSM246525" "GSM246526" "GSM246527" "GSM246528"
names(GSMList(gse2)) %>% length()
# [1] 528
names(GPLList(gse2))
# [1] "GPL96" "GPL97"
length(GSMList(gse2))
# [1] 528
class(GSMList(gse2)[[1]])
# [1] "GSM"
# attr(,"package")
# [1] "GEOquery"

Now we can retrieve the annotation data.

gpl96 <- getGEO("GPL96")
Meta(gpl96)
colnames(Table(gpl96))
#  [1] "ID"                               "GB_ACC"                          
#  [3] "SPOT_ID"                          "Species Scientific Name"         
#  [5] "Annotation Date"                  "Sequence Type"                   
#  [7] "Sequence Source"                  "Target Description"              
#  [9] "Representative Public ID"         "Gene Title"                      
# [11] "Gene Symbol"                      "ENTREZ_GENE_ID"                  
# [13] "RefSeq Transcript ID"             "Gene Ontology Biological Process"
# [15] "Gene Ontology Cellular Component" "Gene Ontology Molecular Function"
Table(gpl96)[1:5, c("ID", "Gene Symbol", "ENTREZ_GENE_ID")]
#          ID      Gene Symbol    ENTREZ_GENE_ID
# 1 1007_s_at DDR1 /// MIR4640 780 /// 100616237
# 2   1053_at             RFC2              5982
# 3    117_at            HSPA6              3310
# 4    121_at             PAX8              7849
# 5 1255_g_at           GUCA1A              2978

dim(Table(gpl96))
# [1] 22283    16
gpl97 <- getGEO("GPL97")
dim(Table(gpl97))
# [1] 22645    16

We can further subset the gene expression data to keep only the probes that have a gene symbol.
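
A minimal sketch of that subsetting step follows. To stay runnable without a download it uses small hand-made stand-ins for exprs(gse[[1]]) and Table(gpl96); the helper name keep_annotated_probes, the row "hypothetical_probe" and its values are assumptions made for illustration only.

```r
# Toy stand-ins (assumptions) for exprs(gse[[1]]) and Table(gpl96).
expr <- matrix(c(235.523, 41.447, 10.0,
                 498.222, 69.0219, 12.0),
               nrow = 3,
               dimnames = list(c("1007_s_at", "1053_at", "hypothetical_probe"),
                               c("GSM246523", "GSM246524")))
annot <- data.frame(ID = c("1007_s_at", "1053_at", "hypothetical_probe"),
                    `Gene Symbol` = c("DDR1 /// MIR4640", "RFC2", ""),
                    check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: keep the rows of `expr` whose probe has a non-empty symbol.
keep_annotated_probes <- function(expr, annot,
                                  id_col = "ID", symbol_col = "Gene Symbol") {
  # Look up the symbol of every probe in the expression matrix
  symbol <- as.character(annot[[symbol_col]])[match(rownames(expr),
                                                    as.character(annot[[id_col]]))]
  keep <- !is.na(symbol) & nzchar(trimws(symbol))
  expr[keep, , drop = FALSE]
}

expr_sym <- keep_annotated_probes(expr, annot)
rownames(expr_sym)
```

With the real objects the call would presumably be keep_annotated_probes(exprs(gse[[1]]), Table(gpl96)), since the "ID" and "Gene Symbol" columns appear in the GPL96 table shown above.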

Question: how many GSE series use GPL198, and how many samples are in each of these series?
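
One possible way to approach this kind of question is sketched below on toy data. It assumes the GEOmetadb database exposes two link tables, gse_gpl (columns gse, gpl) and gse_gsm (columns gse, gsm); the accessions, the counts and the helper name samples_per_series are all hypothetical.

```r
# Hypothetical link tables shaped like the assumed gse_gpl and gse_gsm tables.
gse_gpl <- data.frame(gse = c("GSE_A", "GSE_B", "GSE_C"),
                      gpl = c("GPL198", "GPL198", "GPL96"),
                      stringsAsFactors = FALSE)
gse_gsm <- data.frame(gse = c("GSE_A", "GSE_A", "GSE_B", "GSE_B", "GSE_B", "GSE_C"),
                      gsm = c("GSM_1", "GSM_2", "GSM_3", "GSM_4", "GSM_5", "GSM_6"),
                      stringsAsFactors = FALSE)

# Hypothetical helper: number of samples in every series linked to one platform.
samples_per_series <- function(gse_gpl, gse_gsm, platform) {
  series <- unique(gse_gpl$gse[gse_gpl$gpl == platform])
  linked <- gse_gsm[gse_gsm$gse %in% series, , drop = FALSE]
  # factor() keeps series that have no linked samples, with a count of 0
  counts <- table(factor(linked$gse, levels = series))
  data.frame(gse = names(counts), n_samples = as.integer(counts),
             stringsAsFactors = FALSE)
}

res <- samples_per_series(gse_gpl, gse_gsm, "GPL198")
nrow(res)  # number of series on the platform
res        # samples per series
```

If the table-name assumption holds, the two data frames could be pulled from the database with tbl(gmdb,'gse_gpl') %>% collect() and tbl(gmdb,'gse_gsm') %>% collect(). Note that for a series spanning several platforms this counts all of its samples, not only those run on the platform of interest.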

SRAdb

The SRA website is located at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi

rentrez

Provides an R interface to the NCBI's 'EUtils' API, allowing users to search databases like 'GenBank' <https://www.ncbi.nlm.nih.gov/genbank/> and 'PubMed' <https://www.ncbi.nlm.nih.gov/pubmed/>, process the results of those searches and pull data into their R sessions.

BART (bioinformatics array research tool): R shiny application for microarray analysis

Some cases

GSE22631

Agilent-014879 Whole Rat Genome Microarray 4x44K G4131F. 12 samples.

Soft format

The file is 27MB after decompression. It contains the gene annotation for the platform, the gene expression values and the sample information. However, it is not in matrix format: after the gene annotation, the rest of the file is split into one block per sample, each with a single column (VALUE) holding the gene expression.
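
Because each sample carries its own VALUE column, an expression matrix has to be assembled by aligning the samples on the probe ID. The sketch below shows the idea on toy per-sample tables; the probe and sample names, the values and the helper name soft_to_matrix are assumptions. With GEOquery, the per-sample tables could presumably be obtained with lapply(GSMList(gse2), Table).

```r
# Hypothetical per-sample tables, shaped like the ID_REF/VALUE blocks of a family SOFT file.
sample_tables <- list(
  GSM_1 = data.frame(ID_REF = c("p1", "p2", "p3"), VALUE = c(1.5, 2.5, 3.5)),
  GSM_2 = data.frame(ID_REF = c("p3", "p1", "p2"), VALUE = c(30, 10, 20))
)

# Hypothetical helper: put every sample on the probe order of the first sample,
# then bind the VALUE columns into a probes x samples matrix.
soft_to_matrix <- function(sample_tables, value_col = "VALUE") {
  probes <- as.character(sample_tables[[1]]$ID_REF)
  mat <- vapply(sample_tables, function(tab) {
    as.numeric(tab[[value_col]][match(probes, as.character(tab$ID_REF))])
  }, numeric(length(probes)))
  rownames(mat) <- probes
  mat
}

mat <- soft_to_matrix(sample_tables)
mat
```

Matching on ID_REF rather than binding the columns directly guards against samples whose rows are stored in a different order.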

See https://gist.github.com/arraytools/9a6e954ef423d6634695#file-gse22631_family-soft.

Series matrix

These files are suitable for loading into MS Excel.

The file is 6MB after decompression. It contains the gene expression values (no gene annotation) and the sample information.

The sample information is in the header, but it must be transposed to get the usual layout in which each row is a sample and each column is a variable.
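
The transposition can be sketched in base R as follows. The excerpt is a tiny made-up example in the series-matrix layout (tab-separated lines, sample fields prefixed with !Sample_); the helper name sample_info and all values are assumptions.

```r
# A tiny, made-up excerpt in series-matrix layout (tab-separated, quoted values).
lines <- c(
  "!Series_title\t\"hypothetical series\"",
  "!Sample_title\t\"ctrl rep1\"\t\"treated rep1\"",
  "!Sample_geo_accession\t\"GSM_1\"\t\"GSM_2\"",
  "!Sample_source_name_ch1\t\"liver\"\t\"liver\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM_1\"\t\"GSM_2\"",
  "\"p1\"\t1.5\t2.5",
  "!series_matrix_table_end"
)

# Hypothetical helper: one row per sample, one column per !Sample_ field.
sample_info <- function(lines) {
  hdr <- grep("^!Sample_", lines, value = TRUE)
  fields <- strsplit(hdr, "\t", fixed = TRUE)
  # Field names; make.unique() separates keys that occur more than once
  keys <- make.unique(sub("^!Sample_", "", vapply(fields, `[`, "", 1)))
  # Drop the key and the surrounding quotes from every value
  vals <- lapply(fields, function(f) gsub("\"", "", f[-1], fixed = TRUE))
  info <- as.data.frame(do.call(cbind, vals), stringsAsFactors = FALSE)
  names(info) <- keys
  info
}

pheno <- sample_info(lines)
pheno
```

On a real file the input would be something like readLines("GSE22631_series_matrix.txt"), where the file name is assumed here.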

See https://gist.github.com/arraytools/9a6e954ef423d6634695#file-gse22631_series_matrix-txt.

Experiment descriptor (ArrayTools)

See https://gist.github.com/arraytools/9a6e954ef423d6634695#file-experiment-descriptors.

GDS507

This dataset is used as an example by the GEOquery package. It has 17 samples on the HG-U133B platform.

There is a related GSE series (GSE781) for this GDS dataset. However, GSE781 is a larger dataset that covers two platforms (HG-U133A and HG-U133B) and has 34 samples.

The raw CEL files are also available.

GSE30786

This is actually an RNA-Seq dataset. There are 4 samples in 2 conditions.

GSE20986

HGU133 plus 2

The CEL files were processed with the Bioconductor gcrma function.

GSE68465

HGU133A.

The CEL files were processed with MAS5.

GSE6532 (HG-U133A + HG-U133B)

Tips

Search by SRRxxxxxx

Search the SRA website; for example SRR902884.

Explore GEO/SRA

MetaSRA & the paper

A Step‐by‐Step Guide to Submitting RNA‐Seq Data to NCBI

https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpbi.67
