Comparative Analysis of GATExplorer with other RELATED APPLICATIONS PUBLISHED

Several other bioinformatic studies have undertaken an alternative mapping of probes to genes for Affymetrix microarrays. Some of these re-mapping approaches and tools are limited to certain microarray models, or do not apply to all-exon expression microarrays (i.e. Gene_1.0 and Exon_1.0). Moreover, as far as we know, none of them presents mapping to intronic regions or ncRNAs. In the GATExplorer web server, the mapped expression probes can be interactively found and viewed in each gene locus with their detailed location.

We have compared some of these studies and publications (cited in detail in Links) and present the results in the table below:
METHOD / TOOL | Gautier et al. | Harbig et al. | Dai et al. (BrainArray) | Liu et al. (AffyProbeMiner) | Risueño et al. (GATExplorer)
YEAR | 2004 | 2005 | 2005 (updated 2010) | 2007 | 2009
SOURCE DATABASES | RefSeq | RefSeq | UniGene, RefSeq, DoTS, Entrez gene, Ensembl | RefSeq, GenBank | Ensembl
ORGANISMS | human | human | human, mouse, rat, others | human, mouse, rat, others | human, mouse, rat
MICROARRAY MODELS | Expression Arrays (HG_U133A, HG_U95Av2) | Expression Arrays (HG_U133_Plus_2) | Expression Arrays, Exon Arrays, Tiling Arrays, Promoter Arrays | Expression Arrays (U95 series, U133 series) | Expression Arrays, Exon Arrays
Minimal Nº of probes in a probeset | 1 | NA | 3 | 5 | 3
Type of mapping data provided | R objects | Text files, Normalization Tool | R Packages (CDFs) | R Packages (CDFs) | R Packages (CDFs), Text files (.txt)
Biomolecular entities mapped | genes | genes | genes, transcripts, exons | genes, transcripts | genes, transcripts, exons, ncRNAs
Nº of human genes (mapped with unambiguous probes) | 11640 [HG_U133A] | (20415) (only done for U133_Plus_2) | 11853 [HG_U133A] | 12550 [HG_U133A] | 12576 [HG_U133A]
Mapping to ncRNA | NO | NO | NO | NO | YES
Web page with data | YES | YES | YES | YES | YES
Integration of the mapping in a GENOMIC WEB SERVER | NO | NO | NO | NO | YES

The first attempt to provide an alternative mapping of Affymetrix microarray probes to the latest versions of human genes was reported by Gautier et al. in 2004. Since then, several studies have been published that redefine Affymetrix array probes and probesets to genes and transcripts, including tools to use such redefinitions (see references in Links). Dai et al. (2005) developed probably the most comprehensive mapping of microarray probes from several species. Although the reannotation of Affymetrix microarray probes and probesets to genes and transcripts has been reported previously, GATExplorer is the first system that integrates the mapping of probes (including maps to ncRNAs) with simple genomic views and location, plus expression signals at probe level.

Our results show that a complete unambiguous mapping of the probes from expression microarrays to the currently known version of the gene-set of a given organism (i.e. re-mapping at genome-wide "omic" scale) is needed to achieve adequate gene expression calculation. Details about the coverage and efficiency of the probe mapping done in GATExplorer are presented in Statistics.
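
As an illustration of what "unambiguous mapping" implies in practice, the following minimal Python sketch keeps only probes that match a single gene and then discards genes supported by too few probes. It is not the GATExplorer pipeline: the function name, the input dictionary and the gene and probe identifiers are hypothetical, and the threshold of 3 probes is simply the minimal probeset size listed for GATExplorer in the table above.

```python
from collections import defaultdict

# Minimum probes per gene-level probeset (value listed for GATExplorer above).
MIN_PROBES = 3

def build_gene_probesets(probe_hits, min_probes=MIN_PROBES):
    """Group probes by gene, keeping only probes that hit exactly one gene."""
    by_gene = defaultdict(list)
    for probe_id, genes in probe_hits.items():
        unique_genes = set(genes)
        if len(unique_genes) == 1:  # unambiguous probe
            by_gene[next(iter(unique_genes))].append(probe_id)
    # Drop genes that are supported by fewer than min_probes probes.
    return {g: sorted(p) for g, p in by_gene.items() if len(p) >= min_probes}

# Hypothetical alignment results: probe id -> genes it matches.
probe_hits = {
    "p1": ["GENE_A"], "p2": ["GENE_A"], "p3": ["GENE_A"],
    "p4": ["GENE_A", "GENE_B"],          # ambiguous probe, discarded
    "p5": ["GENE_B"], "p6": ["GENE_B"],  # GENE_B keeps only 2 probes
    "p7": [],                            # probe with no gene hit
}
gene_probesets = build_gene_probesets(probe_hits)
print(gene_probesets)
```

In this toy input only GENE_A survives, because GENE_B is left with two unambiguous probes once the cross-hybridizing probe is removed.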
Example of EXPRESSION calculated with mapping to PROBESETS (standard CDFs) or to GENES (GeneMapper CDFs)

To show in practice the benefit of applying a complete re-mapping of microarray probes to the currently known genes in expression analysis and profiling, the table below presents a comparative study of significant differential expression in several microarray datasets, analyzed either with the standard Chip Definition Files (CDFs) to "probesets" or with the new CDFs that include the gene-specific remapping and assignment provided by GATExplorer. The analyses first use three different expression signal calculation algorithms (MAS5.0, FARMS and RMA) with CDFs to "probesets", and then RMA with CDFs to "genes" (i.e. the GeneMapper package). The three algorithms are MAS5 (Liu et al., 2002), FARMS (Hochreiter et al., 2006) and RMA (Irizarry et al., 2003) (see Affycomp); RMA is at present the most widely used to calculate microarray gene expression signals. After applying the expression calculation algorithms with the different CDFs, a common robust algorithm for measuring differential expression, SAM (Tusher et al., 2001), was applied to the data.
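
For readers unfamiliar with SAM, the sketch below computes its per-gene "relative difference" statistic, d = (mean_KO - mean_WT) / (s + s0), where s is the pooled standard error of the two groups. This is only a simplified illustration: the expression values are invented, the fudge factor s0 is passed in as an arbitrary assumed value (SAM estimates it from the data), and the permutation step that SAM uses to assign significance is not shown.

```python
import math

def sam_d(ko, wt, s0):
    """SAM-style relative difference for one gene: (mean_ko - mean_wt) / (s + s0)."""
    n1, n2 = len(ko), len(wt)
    m1, m2 = sum(ko) / n1, sum(wt) / n2
    # Pooled sum of squared deviations of both groups around their own means.
    ss = sum((x - m1) ** 2 for x in ko) + sum((x - m2) ** 2 for x in wt)
    # Pooled standard error of the difference between the two means.
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

# Hypothetical log2 expression values: 3 KO and 3 WT biological replicates.
ko = [5.0, 5.2, 4.8]
wt = [8.0, 8.1, 7.9]
d = sam_d(ko, wt, s0=0.1)  # s0 = 0.1 is an assumed value, for illustration only
print(round(d, 2))
```

A strongly negative d indicates a gene whose signal is consistently lower in the KO samples than in the WT samples.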

The samples compared are a collection of mouse microarrays corresponding to 5 sets of 6 samples. Each set includes 3 biological replicates of knock-out (KO) mice for a specific gene, which are compared to 3 biological replicates of the corresponding wild-type (WT) mice. The 5 gene KOs are: APOE-/-, IRS2-/-, NRAS-/-, SCD1-/- and ENG+/-. The full names of these 5 genes, their Ensembl ID numbers (ENSGx) and the probesets assigned to them by Affymetrix are indicated in red in the table below.

The comparative analysis tries to answer a simple question: what is the rank of the KO gene found by each method in each set of KO samples versus WT samples? Under optimal conditions, the gene that is not present in the KO mice is expected to show one of the most dramatic differences, appearing significantly "repressed" or "down-regulated" when compared with the WT mice. As indicated, the differential expression analyses are all done with the same algorithm, so the main feature evaluated in the comparison is how mapping with the CDFs to "probesets" or with the CDFs to "genes" affects the differential expression analyses, particularly for the KO genes. This uniform comparison is presented in the last two columns of the table (framed with a black line). These results show that the probes reassigned to genes perform at least as well as the original probesets, because in all cases the KO gene is found among the most significantly repressed genes. The biological/functional signature associated with each KO gene will be different, and we do not know a priori how many other genes associated with the KO gene are affected; therefore we cannot assume that the KO gene will always be the most repressed. In any case, the comparison uses exactly the same samples, and we observe clear differences among the four methods presented that cannot be due to differences in the biological samples but only to the manner in which the microarrays are analysed. The table also shows an even comparison between the three algorithms MAS5, FARMS and RMA using the same CDFs to "probesets". This comparison between methods is also useful to confirm the widely reported observation that RMA performs effectively in identifying differential expression signatures (Bolstad et al., 2003; Barash et al., 2004).
Indeed, the method that we propose to apply to the new probe mappings uses RMA in the three steps of the expression calculation: background correction, normalization and summarization.
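
The ranking question posed above can be sketched in a few lines of Python. The example assumes that each gene already has a differential expression score in which negative values mean repression in the KO samples; the gene names and scores are invented for illustration and are not results from this study.

```python
def repression_rank(scores, gene):
    """1-based rank of `gene` when genes are ordered from most to least repressed."""
    ordered = sorted(scores, key=scores.get)  # most negative score first
    return ordered.index(gene) + 1

# Hypothetical per-gene scores for one KO versus WT comparison.
scores = {"Nras": -13.1, "GeneX": -4.2, "GeneY": 0.3, "GeneZ": 6.0, "GeneW": -15.0}
rank = repression_rank(scores, "Nras")
print(rank)
```

A rank close to 1 means that the method places the KO gene among the most repressed genes of the comparison.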

The best results are highlighted in yellow in the table below, showing that the KO gene obtains the best rank in 4 cases when we use GeneMapper CDFs: IRS2, NRAS, SCD1 and ENG. The percentage of significant gene loci (q-value < 0.10) with respect to the total number of gene loci is the largest for 2 KO genes (APOE and NRAS), and the p-value of the statistical test is the lowest for 2 KO genes (IRS2 and ENG). The results consistently indicate that the method using CDFs with the new remapping to "genes" detects changes at least as significant as the best of the three methods based on standard Affymetrix CDFs.
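
The percentage of significant gene loci mentioned above is a simple proportion, as the following sketch shows. The q-values are hypothetical; only the 0.10 threshold is taken from the text.

```python
def percent_significant(q_values, threshold=0.10):
    """Percentage of gene loci whose q-value is strictly below the threshold."""
    hits = sum(1 for q in q_values if q < threshold)
    return 100.0 * hits / len(q_values)

# Hypothetical q-values, one per gene locus.
q_values = [0.01, 0.04, 0.09, 0.10, 0.35, 0.60, 0.80, 0.95]
pct = percent_significant(q_values)
print(pct)
```

The comparison is strict, so a locus with a q-value of exactly 0.10 is not counted as significant.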

A more detailed comparison of the differential expression signal achieved with each method is shown in the figure below, which presents the Volcano Plots obtained for the case of gene NRAS-/-, 3 KOs versus 3 WTs. A Volcano Plot arranges genes along two dimensions of biological and statistical significance. The first (horizontal) axis is the fold change between KO and WT samples (on a log2 scale, so that up- and down-regulation appear symmetric), and the second (vertical) axis represents the q-value (i.e. the p-value corrected for multiple testing) of the statistical test done with SAM, which indicates the significance of the differences between samples. These q-values are most conveniently presented on a negative log10 scale, so that smaller q-values appear higher up. In this way, the first axis indicates the biological size or impact of the change and the second axis indicates the statistical evidence or reliability of the change.
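
The two coordinates of a gene in such a plot can be computed as in the sketch below. It assumes that the expression values are already on a log2 scale (as RMA output is), so that the difference of the group means is the log2 fold change; the values and the q-value are hypothetical.

```python
import math

def volcano_point(ko_log2, wt_log2, q_value):
    """Return (log2 fold change KO vs WT, -log10 of the q-value) for one gene."""
    log2_fc = sum(ko_log2) / len(ko_log2) - sum(wt_log2) / len(wt_log2)
    return log2_fc, -math.log10(q_value)

# Hypothetical gene: log2 signal of 5 in the KO samples and 8 in the WT samples.
x, y = volcano_point([5.0, 5.0, 5.0], [8.0, 8.0, 8.0], q_value=0.01)
print(x, y)
```

A gene repressed in the KO samples therefore falls on the left side of the plot, and the smaller its q-value, the higher it appears.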

These Volcano Plots show that GeneMapper gives results at least as good as the best method for finding significant differential expression in these samples (it finds 22 significant genes). As a whole, it provides a data distribution similar to that of RMA using standard Affymetrix CDFs, but with a clear improvement in the number of significant genes found.
Download the microarrays corresponding to the mouse gene KOs data sets described above
(6 .CEL files for each KO gene)
APOEko NRASko SCD1ko IRS2ko ENGko
Public data repository where microarrays are available
APOE: GEO Series GSE2372 - Platform GPL1261 - Array numbers: GSM44658, GSM44663, GSM44659, GSM44660, GSM44661, GSM44662.
NRAS: GEO Series GSE14829 - Platform GPL81 - Array numbers: GSM371168, GSM371169, GSM371170, GSM371174, GSM371175, GSM371176.
SCD1: GEO Series GSE2926 - Platform GPL32 - Array numbers: GSM63851, GSM63852, GSM63853, GSM63856, GSM63857, GSM63858.
IRS2: Arrays not yet published (original data borrowed from FONT DE MORA J. et al. 2010).
ENG: Arrays not yet published (original data borrowed from RODRIGUEZ-BARBERO A. et al. 2010).