Which set of genes does GREAT use?
To limit the gene set to extremely high-confidence gene predictions, we use only the subset of the UCSC Known Genes (Hsu et al., 2006) that are protein-coding (cdsStart != cdsEnd), lie on non-random and non-haplotype chromosomes, and possess at least one meaningful Gene Ontology (GO) annotation (Ashburner et al., 2000). GO is an ontological representation of the biological processes, cellular components, and molecular functions of genes; our rationale is that a gene that has been annotated for function should be included in the gene set. The following uninformative GO terms do not qualify a gene for entry into the gene set: 'Gene Ontology', 'biological process', 'cellular component', 'molecular function', 'obsolete biological process', 'obsolete cellular component', and 'obsolete molecular function'.
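The three criteria above can be pictured as a simple filter. The following is only an illustrative sketch, not GREAT's actual code: the Gene record, its field names, and the helper functions are hypothetical, and detecting random and haplotype chromosomes by the substrings "_random" and "_hap" in the chromosome name is an assumption based on common UCSC naming. GO term names are compared case-insensitively with underscores treated as spaces, which is also an assumption.

```python
from dataclasses import dataclass, field

# Uninformative GO terms listed in the text above (lower-cased for comparison).
UNINFORMATIVE_GO_TERMS = {
    "gene ontology",
    "biological process",
    "cellular component",
    "molecular function",
    "obsolete biological process",
    "obsolete cellular component",
    "obsolete molecular function",
}


@dataclass
class Gene:
    """Hypothetical gene record; field names are illustrative only."""
    name: str
    chrom: str
    cds_start: int
    cds_end: int
    go_terms: list = field(default_factory=list)


def is_protein_coding(gene):
    # Protein-coding means the coding region is non-empty: cdsStart != cdsEnd.
    return gene.cds_start != gene.cds_end


def on_standard_chromosome(gene):
    # Assumption: random and haplotype chromosomes are recognizable by name,
    # e.g. "chr1_random" or "chr6_cox_hap1".
    chrom = gene.chrom.lower()
    return "_random" not in chrom and "_hap" not in chrom


def has_informative_go_term(gene):
    # At least one annotation must fall outside the uninformative set.
    return any(
        term.replace("_", " ").lower() not in UNINFORMATIVE_GO_TERMS
        for term in gene.go_terms
    )


def passes_filters(gene):
    return (
        is_protein_coding(gene)
        and on_standard_chromosome(gene)
        and has_informative_go_term(gene)
    )
```

For example, a gene on chr1 with cds_start=100, cds_end=200 and a single specific GO annotation would pass, while the same gene annotated only with 'molecular function' would not.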
How does GREAT determine a single transcription start site for each gene?
A single gene may have multiple splice variants, but GREAT uses a single transcription start site per gene to calculate that gene's basin of attraction. To choose which transcription start site to use as the canonical one for a gene, we rely on the definition of the canonical isoform given by the knownCanonical table used by the UCSC Known Genes track (Hsu et al., 2006).
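As a rough sketch of the lookup described above: once the canonical isoform of a gene is known, its transcription start site can be read from the transcript's coordinates. Everything named here is hypothetical and for illustration only. The two dictionaries stand in for simplified views of the knownCanonical and Known Genes tables, and taking the transcript start on the + strand and the transcript end on the - strand is an assumption about how the site is derived; coordinate off-by-one conventions are ignored.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    """Hypothetical, simplified transcript record."""
    transcript_id: str
    strand: str      # "+" or "-"
    tx_start: int    # lower genomic coordinate of the transcript
    tx_end: int      # higher genomic coordinate of the transcript


def transcription_start_site(transcript):
    # Assumption: transcription begins at the 5' end of the transcript,
    # which is tx_start on the + strand and tx_end on the - strand.
    if transcript.strand == "+":
        return transcript.tx_start
    if transcript.strand == "-":
        return transcript.tx_end
    raise ValueError(f"unknown strand: {transcript.strand!r}")


def canonical_tss(gene, canonical_by_gene, transcripts_by_id):
    # canonical_by_gene: gene name -> canonical transcript id (illustrative
    # stand-in for the knownCanonical table).
    # transcripts_by_id: transcript id -> Transcript.
    canonical_id = canonical_by_gene[gene]
    return transcription_start_site(transcripts_by_id[canonical_id])
```

With a hypothetical gene whose canonical transcript spans 1000 to 5000 on the - strand, canonical_tss would return 5000, regardless of how many other splice variants the gene has.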