| Comparative Analysis |
| Comparative Analysis of GATExplorer with other PUBLISHED RELATED APPLICATIONS |
Several other bioinformatic studies have undertaken an alternative mapping of
probes to genes for Affymetrix microarrays. Some of these re-mapping approaches and tools
are limited to certain microarray models, or do not apply to the all-exon expression
microarrays (i.e. Gene_1.0 and Exon_1.0). Moreover, as far as we can tell, none of them provides
mapping to intronic regions or ncRNAs. In the GATExplorer web server,
the mapped expression probes can be found and viewed interactively at each gene locus, with
their detailed location.
We have compared some of these studies and publications (cited in
detail in Links) and present the results in
the table below: |
| METHOD / TOOL | Gautier et al. | Harbig et al. | Dai et al. (BrainArray) | Liu et al. (AffyProbeMiner) | Risueño et al. (GATExplorer) |
|---|---|---|---|---|---|
| YEAR | 2004 | 2005 | 2005 (updated 2010) | 2007 | 2009 |
| SOURCE DATABASES | RefSeq | RefSeq | UniGene, RefSeq, DoTS, Entrez Gene, Ensembl | RefSeq, GenBank | Ensembl |
| ORGANISMS | human | human | human, mouse, rat, others | human, mouse, rat, others | human, mouse, rat |
| MICROARRAY MODELS | Expression arrays (HG_U133A, HG_U95Av2) | Expression arrays (HG_U133_Plus_2) | Expression arrays, exon arrays, tiling arrays, promoter arrays | Expression arrays (U95 series, U133 series) | Expression arrays, exon arrays |
| Minimum no. of probes in a probeset | 1 | NA | 3 | 5 | 3 |
| Type of mapping data provided | R objects | Text files, normalization tool | R packages (CDFs) | R packages (CDFs) | R packages (CDFs), text files (.txt) |
| Biomolecular entities mapped | genes | genes | genes, transcripts, exons | genes, transcripts | genes, transcripts, exons, ncRNAs |
| No. of human genes (mapped with unambiguous probes) | 11640 [HG_U133A] | (20415) (only done for U133_Plus_2) | 11853 [HG_U133A] | 12550 [HG_U133A] | 12576 [HG_U133A] |
| Mapping to ncRNA | NO | NO | NO | NO | YES |
| Web page with data | YES | YES | YES | YES | YES |
| Integration of the mapping in a GENOMIC WEB SERVER | NO | NO | NO | NO | YES |
|
The first attempt to provide an alternative mapping of Affymetrix microarray probes to the
latest versions of human genes was reported by Gautier et al. in 2004. Since that
report, several studies have been published that redefine Affymetrix array
probes and probesets to genes and transcripts, including tools to use such redefinitions (see
references in Links). Dai et al.
(2005) developed what is probably the most comprehensive mapping of microarray probes from several
species. Although the reannotation of Affymetrix microarray probes and probesets to genes
and transcripts had been reported previously, GATExplorer is the first system that
integrates the mapping of probes (including mapping to ncRNAs) with simple genomic views and locations,
plus expression signals at probe level.
Our results show that a complete, unambiguous mapping of the probes from expression
microarrays to the currently known gene set of a given organism (i.e.
re-mapping at genome-wide "omic" scale) is needed to achieve an adequate calculation of gene
expression. Details about the coverage and efficiency of the probe mapping done in
GATExplorer are presented in Statistics.
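
The idea of an unambiguous probe-to-gene mapping can be illustrated with a minimal Python sketch. It is not the GATExplorer implementation: the function name, the input format (a dictionary from probe ID to the set of matching gene IDs) and the toy identifiers are assumptions made for illustration. The default minimum of 3 probes per gene is taken from the "probes in a probeset" row of the comparison table above.

```python
from collections import defaultdict


def build_gene_probesets(probe_to_genes, min_probes=3):
    """Group probes by gene, keeping only unambiguous probes.

    probe_to_genes: dict mapping a probe ID to the set of gene IDs it matches.
    min_probes: smallest number of probes a gene needs in order to be kept.
    """
    gene_to_probes = defaultdict(list)
    for probe, genes in probe_to_genes.items():
        # A probe is unambiguous when it matches exactly one gene.
        if len(genes) == 1:
            (gene,) = genes
            gene_to_probes[gene].append(probe)
    # Drop genes supported by too few unambiguous probes.
    return {
        gene: sorted(probes)
        for gene, probes in gene_to_probes.items()
        if len(probes) >= min_probes
    }


# Hypothetical toy mapping (all identifiers are illustrative only).
toy_mapping = {
    "p1": {"geneA"},
    "p2": {"geneA"},
    "p3": {"geneA"},
    "p4": {"geneA", "geneB"},  # ambiguous probe: discarded
    "p5": {"geneB"},
    "p6": set(),               # probe with no match: discarded
}
gene_probesets = build_gene_probesets(toy_mapping)
print(gene_probesets)  # {'geneA': ['p1', 'p2', 'p3']}
```

In this toy case geneB is dropped because only one unambiguous probe supports it, which is the kind of loss that the coverage statistics are meant to quantify.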
|
| Example of EXPRESSION calculated with mapping to
PROBESETS (standard CDFs) or to GENES (GeneMapper CDFs) |
To show in practice the benefit of applying a complete re-mapping of microarray
probes to the currently known genes in expression analysis and profiling, we
present in the table below the results of a comparative study
that searches for significant differential expression in several microarray datasets,
analyzed either with the standard Chip Definition Files (CDFs) to "probesets" or with the
new CDFs that include the gene-specific remapping and assignment
provided by GATExplorer. The analyses first use three different expression
signal calculation algorithms (MAS5.0, FARMS and RMA) with CDFs to
"probesets", and then RMA with CDFs to "genes" (i.e. the GeneMapper package). The three algorithms are MAS5 (Liu et al., 2002), FARMS
(Hochreiter et al., 2006) and RMA (Irizarry et al., 2003) (see Affycomp); RMA is at present the
most widely used to calculate microarray gene expression signals. After the
expression calculation algorithms were applied with the different CDFs, a common robust algorithm for measuring
differential expression, SAM (Tusher et al., 2001), was applied to
the data.
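
The effect of the CDF on the expression signal can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the median of log2 intensities is used as a simplified stand-in for the median-polish summarization of RMA, background correction and normalization are omitted, and the probe, probeset and gene identifiers are hypothetical.

```python
import math
from statistics import median


def summarize(intensities, groups):
    """Return one log2 signal per group (probeset or gene).

    intensities: dict probe ID -> raw intensity (positive number).
    groups: dict group ID -> list of probe IDs, i.e. a toy "CDF".
    The median is a simplified stand-in for RMA's median polish.
    """
    return {
        group: median(math.log2(intensities[p]) for p in probes)
        for group, probes in groups.items()
    }


# Hypothetical raw intensities for four probes of one array.
raw = {"p1": 256, "p2": 512, "p3": 1024, "p4": 64}

# Toy CDF to "probesets": the original probeset includes probe p4.
probeset_cdf = {"ps_1_at": ["p1", "p2", "p3", "p4"]}
# Toy CDF to "genes": p4 was removed because it is not gene-specific.
gene_cdf = {"geneA": ["p1", "p2", "p3"]}

by_probeset = summarize(raw, probeset_cdf)
by_gene = summarize(raw, gene_cdf)
print(by_probeset)  # {'ps_1_at': 8.5}
print(by_gene)      # {'geneA': 9.0}
```

The same raw data give a different signal depending only on which probes the CDF groups together, which is the variable isolated in the comparison between RMA with CDFs to "probesets" and RMA with CDFs to "genes".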
The samples compared are a collection of mouse microarrays comprising 5 sets
of 6 samples. Each set includes 3 biological replicates of knock-out (KO) mice for a
specific gene, which are compared to 3 biological replicates of the corresponding wild-type (WT)
mice. The 5 gene KOs are: APOE-/-, IRS2-/-, NRAS-/-, SCD1-/- and ENG+/-.
The full names of these 5 genes, their Ensembl ID numbers (ENSGx) and the probesets
assigned to them by Affymetrix are indicated in red in the table
below.
The comparative analysis tries to answer a simple question: what is the rank of the KO gene
found by each method in each set of KO samples versus WT samples? Under optimal conditions, the
gene knocked out in the KO mice should be expected to show one of the most dramatic
differences, with a significant "repression" or "down-regulation" when compared with the WT
mice. As indicated, the differential expression analyses are all done with the same algorithm, and the
main feature evaluated in the comparison is how the mapping with the CDFs to "probesets"
or the mapping with the CDFs to "genes" affects the differential expression analyses,
particularly for the KO genes. This uniform comparison is presented in the last two
columns of the table (framed with a black line). These results
show that the probes reassigned to genes perform at least as well as the original probesets,
because in all cases the KO gene is found among the most significantly repressed genes.
The biological/functional signature associated with each KO gene will be different, and we do not
know a priori how many other genes associated with this KO gene may be affected; therefore we
cannot assume that the KO gene will always be the most repressed. In any case, the comparison
uses exactly the same samples and we observe clear differences among the four methods presented,
which cannot be due to differences in the biological samples but only to the manner in which
the microarrays are analysed. The table also shows an even comparison
between the three algorithms MAS5, FARMS and RMA using the same CDFs
to "probesets". This comparison between methods is also useful to confirm the widely reported
observation that RMA performs effectively in identifying differential expression signatures
(Bolstad et al., 2003; Barash et al., 2004). Indeed, the method that we propose to apply to the
new probe mappings uses RMA in the three steps of the expression calculation: background
correction, normalization and summarization.
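
The ranking question posed above can be made concrete with a short Python sketch. It assumes that each method yields one differential expression statistic per gene (negative when the gene is lower in KO than in WT samples); the function name, the gene labels and the score values are hypothetical and are not results from this study.

```python
def knockout_rank(scores, ko_gene):
    """Return the 1-based rank of ko_gene, from most repressed to most induced.

    scores: dict gene ID -> differential expression statistic (KO versus WT);
            negative values mean lower expression in the KO samples.
    Rank 1 is the most strongly repressed gene.
    """
    ordered = sorted(scores, key=scores.get)  # ascending: most negative first
    return ordered.index(ko_gene) + 1


# Hypothetical statistics for one KO-versus-WT comparison.
toy_scores = {
    "Nras": -9.1,
    "geneW": -10.5,
    "geneX": -3.2,
    "geneY": 0.4,
    "geneZ": 5.0,
}
rank = knockout_rank(toy_scores, "Nras")
print(rank)  # 2
```

Computing this rank once per method, on the same samples, gives the kind of figure that the table compares across MAS5, FARMS, RMA and GeneMapper.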
The best results are highlighted in yellow in the table below, showing
that the KO gene obtains the best rank in 4 cases when we use GeneMapper
CDFs: IRS2, NRAS, SCD1 and ENG. The percentage of significant gene loci
(q-value < 0.10) with respect to the total number of gene loci is the largest for 2 KO
genes (APOE and NRAS), and the p-value of the statistical test is the lowest
for 2 KO genes (IRS2 and ENG). The results consistently indicate that the method
that uses CDFs with the new remapping to "genes" detects changes at least as significant as the
best of the three methods based on standard Affymetrix CDFs.
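
The percentage of significant gene loci mentioned above is a simple ratio, sketched here in Python. The 0.10 threshold is the one stated in the text; the function name and the toy q-values are assumptions for illustration only.

```python
def percent_significant(q_values, threshold=0.10):
    """Percentage of gene loci whose q-value is strictly below the threshold.

    q_values: dict gene locus ID -> q-value from the differential expression test.
    """
    if not q_values:
        raise ValueError("q_values must contain at least one gene locus")
    significant = sum(1 for q in q_values.values() if q < threshold)
    return 100.0 * significant / len(q_values)


# Hypothetical q-values for five gene loci.
toy_q = {"g1": 0.01, "g2": 0.05, "g3": 0.10, "g4": 0.20, "g5": 0.50}
pct = percent_significant(toy_q)
print(pct)  # 40.0
```

Note that a locus with a q-value of exactly 0.10 is not counted, since the criterion is a strict inequality.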
|
 |
A more detailed comparison of the differential expression signal achieved with each
method is shown in the figure below, which presents the Volcano Plots obtained
for the case of gene NRAS-/- (3 KOs versus 3 WTs). A Volcano Plot arranges
genes along two dimensions of biological and statistical significance. The first (horizontal)
dimension is the fold change between KO and WT samples (on a log2 scale, so that
up- and down-regulation appear symmetric), and the second (vertical) axis represents the q-value
(i.e. the p-value corrected for multiple testing) of the statistical test done with SAM, which
indicates the significance of the differences between samples. These q-values are most conveniently
presented on a negative log10 scale, so that smaller q-values appear higher up. In this way,
the first axis indicates the biological size or impact of the change and the second axis
indicates the statistical evidence or reliability of the change.
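
The two Volcano Plot coordinates described above can be computed as in the following Python sketch. It assumes that the expression signals are already on a log2 scale (as RMA output is), so the log2 fold change is a simple difference of means. The function name and the `min_q` floor, which keeps a q-value of 0 from producing an infinite coordinate, are assumptions of this sketch.

```python
import math


def volcano_point(mean_ko_log2, mean_wt_log2, q_value, min_q=1e-6):
    """Return (x, y) for one gene in a Volcano Plot.

    x: log2 fold change of KO versus WT (difference of log2 means).
    y: -log10 of the q-value, so that smaller q-values plot higher up.
    min_q: hypothetical floor applied to q-values reported as 0.
    """
    x = mean_ko_log2 - mean_wt_log2
    y = -math.log10(max(q_value, min_q))
    return x, y


# Hypothetical genes: one repressed 8-fold, one induced 8-fold.
down = volcano_point(6.0, 9.0, 0.001)
up = volcano_point(9.0, 6.0, 0.001)
print(down, up)
```

The two toy genes land at x = -3 and x = +3 with the same height, which is the symmetry between up- and down-regulation that the log2 scale provides.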
The results of these Volcano Plots show that the use of GeneMapper
gives results at least as good as the best method for finding significant differential expression
in these samples (it finds 22 significant genes). As a whole it provides a data
distribution similar to that of RMA with standard Affymetrix CDFs, but with a clear improvement in the
number of significant genes found. |
 |
|
Download the microarrays corresponding
to the mouse gene KO data sets described above
(6 .CEL files for each KO gene) |
| APOEko |
NRASko |
SCD1ko |
IRS2ko |
ENGko |
|
|
| Public data repository where the microarrays
are available |
| APOE |
GEO Series GSE2372 - Platform
GPL1261 - Array numbers: GSM44658, GSM44663, GSM44659, GSM44660, GSM44661, GSM44662. |
| NRAS |
GEO Series GSE14829 -
Platform GPL81 - Array numbers: GSM371168, GSM371169, GSM371170, GSM371174, GSM371175, GSM371176. |
| SCD1 |
GEO Series GSE2926 - Platform
GPL32 - Array numbers: GSM63851, GSM63852, GSM63853, GSM63856, GSM63857, GSM63858. |
| IRS2 |
Arrays not yet published (original data from FONT DE MORA J. et al. 2010). |
| ENG |
Arrays not yet published (original data from RODRIGUEZ-BARBERO A. et al. 2010). |
|