Which set of genes does GREAT use?
Human and mouse
To limit the gene sets to only extremely high-confidence gene predictions, GREAT uses only the subset of the UCSC Known Genes<ref name="hsu">Hsu, F. et al. The UCSC Known Genes. Bioinformatics. 22(9):1036-1046 (2006).</ref> that are protein-coding (cdsStart != cdsEnd), are on non-random and non-haplotype chromosomes, and possess at least one meaningful Gene Ontology (GO) annotation<ref>Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 25(1):25-29 (2000).</ref>.
GO includes information on the biological processes, cellular components, and molecular functions of genes. Thus, GREAT assumes that if a gene has been annotated for function at all then it is annotated in GO. Uninformative GO terms that do not allow entry into the gene set are 'Gene Ontology', 'biological process', 'cellular component', 'molecular function', 'obsolete biological process', 'obsolete cellular component', and 'obsolete molecular function'.
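The three criteria for the human and mouse gene sets can be sketched as a simple filter. This is a minimal illustration, not GREAT's actual code: the `Gene` record, the function names, and the chromosome-name test (UCSC-style names containing `_random` or `_hap`) are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Root and obsolete terms listed in the text; they do not count as
# meaningful annotations.
UNINFORMATIVE_GO_TERMS = {
    "Gene Ontology",
    "biological process",
    "cellular component",
    "molecular function",
    "obsolete biological process",
    "obsolete cellular component",
    "obsolete molecular function",
}


@dataclass
class Gene:
    """Hypothetical gene record used only for this sketch."""
    name: str
    chrom: str
    cds_start: int
    cds_end: int
    go_terms: set = field(default_factory=set)


def is_standard_chrom(chrom: str) -> bool:
    # Assumption: UCSC-style names, where random sequences contain
    # "_random" and haplotype sequences contain "_hap".
    return "_random" not in chrom and "_hap" not in chrom


def passes_filter(gene: Gene) -> bool:
    is_coding = gene.cds_start != gene.cds_end
    # At least one GO term must remain after dropping uninformative ones.
    has_meaningful_go = bool(gene.go_terms - UNINFORMATIVE_GO_TERMS)
    return is_coding and is_standard_chrom(gene.chrom) and has_meaningful_go
```

A gene annotated only with 'biological process', for example, fails the filter even though it technically has a GO annotation.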
Zebrafish
The zebrafish genome has no gold-standard set of coding genes mapped to the danRer6 assembly. In particular, danRer6 has no UCSC Known Genes set. Furthermore, most ontology data is linked to ZFIN gene identifiers, which are not mapped to the genome.
As GREAT relies on high-quality mappings of genes to the genome, we created a custom, quality gene set using the following transcript/gene sources:
First, we mapped all RefSeq transcripts using the latest transcripts downloaded from NCBI. Next, we mapped all Ensembl transcripts belonging to coding genes and retained only those loci that did not already contain a RefSeq mapping, since many Ensembl genes were further refined by RefSeq after the Ensembl build. In a third step, we mapped zebrafish proteins from RefSeq and UniProt, again keeping only loci that did not already contain a RefSeq or Ensembl transcript mapping.
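The tiered procedure above amounts to a priority merge: a mapping from a lower-priority source is kept only if its locus is not already occupied by a higher-priority source. The sketch below illustrates the idea under stated assumptions. The `Mapping` type and function names are hypothetical, and "same locus" is approximated as a same-strand overlap of half-open intervals, which the text does not specify.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """Hypothetical genomic mapping; coordinates are half-open [start, end)."""
    source: str
    chrom: str
    start: int
    end: int
    strand: str


def overlaps(a: Mapping, b: Mapping) -> bool:
    # Assumption: two mappings share a locus if they overlap on the same
    # chromosome and strand.
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def merge_by_priority(tiers):
    """Merge tiers given in priority order (highest priority first)."""
    kept = []
    for tier in tiers:
        # Compare only against mappings kept from higher-priority tiers,
        # so mappings within one tier never exclude each other.
        higher = list(kept)
        for m in tier:
            if not any(overlaps(m, h) for h in higher):
                kept.append(m)
    return kept
```

Calling `merge_by_priority([refseq, ensembl, proteins])` with three lists of mappings mirrors the order described in the text: RefSeq transcripts first, then Ensembl transcripts, then protein mappings.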
All transcripts and proteins were mapped using BLAT, requiring that at least 80% of the sequence matches with at least 95% identity to one co-linear locus in the zebrafish genome. These parameters are more stringent than those used in the mappings provided by the UCSC Genome Browser, which also annotates genes to loci where only a small fraction of the gene sequence matches. GREAT needs this higher stringency because inflating the number of loci for a gene compromises its statistical tests.
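As a rough illustration of these thresholds, the sketch below filters a single BLAT alignment in PSL format. The exact formulas used for GREAT are not given in the text, so this rests on assumptions: coverage is taken as aligned query bases divided by query size, and identity as matching bases divided by aligned bases. The function name is hypothetical, and the co-linearity requirement is not checked.

```python
def passes_stringency(psl_line: str,
                      min_coverage: float = 0.80,
                      min_identity: float = 0.95) -> bool:
    """Check one tab-separated PSL alignment line against both thresholds."""
    fields = psl_line.rstrip("\n").split("\t")
    # PSL columns: 0 = matches, 1 = misMatches, 2 = repMatches, 10 = qSize.
    matches = int(fields[0])
    mismatches = int(fields[1])
    rep_matches = int(fields[2])
    q_size = int(fields[10])

    aligned = matches + mismatches + rep_matches
    if aligned == 0 or q_size == 0:
        return False

    # Assumed definitions; GREAT's own calculation may differ.
    coverage = aligned / q_size
    identity = (matches + rep_matches) / aligned
    return coverage >= min_coverage and identity >= min_identity
```

Header lines of a PSL file would need to be skipped before calling this function, since it expects numeric values in the first columns.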
We retained only the best hit per locus, which effectively handles matches of paralogs. As a substantial number of bona fide genes (such as Ctnnbl1 or Wnt9a) map to scaffolds, we include all gene-containing scaffolds in zebrafish GREAT. In contrast to the human and mouse gene sets, we also keep genes that currently do not possess a meaningful GO annotation, because manual inspection found that the human ortholog often has annotations, indicating that zebrafish genes are simply less well annotated in GO. Furthermore, many of these genes have annotations in other ontologies. We expect that many of the genes currently without GO annotations will receive annotations in the near future.
Our set of reliably mapped genes contains 14,039 genes mapped to 14,834 genomic loci for danRer6.
How does GREAT determine a single transcription start site for each gene?
Many genes have multiple splice variants; however, the vast majority of annotations available for these genes do not (and often cannot) distinguish between the different isoforms. Motivated by this observation, GREAT uses a single transcription start site to represent each gene when calculating gene regulatory domains: the transcription start site of the gene's canonical isoform. The definition of the canonical isoform is taken from the