Supported file formats
GREAT requires its input files to be in BED format. BED is a standard file format used by the UCSC genome browser (and others) for defining genomic regions.
What is BED format?
Browser Extensible Data (BED) format is a file format used by the UCSC Genome Browser for defining genomic regions. It defines one genomic region (a "BED record") per line. GREAT requires each line to contain three mandatory fields (chromosome, start position, and end position for the region) separated by whitespace (i.e. spaces or tabs). GREAT also accepts an optional region name as the fourth field. Additional optional fields (5 and beyond) are ignored by GREAT but are fine to include in your input file. Full documentation of the BED format is available from UCSC.
The coordinates in a BED record are 0-based and half-open. The first base on a chromosome is numbered 0, the start coordinate is included in the region, and the end coordinate is excluded from it. So, the coordinates in a BED record are slightly different from those used to find a region in the genome browser. The genome browser region "chr1:1-1000" would be described in a BED record as "chr1 0 1000", with the start coordinate being one smaller and the end coordinate being the same, describing the half-open interval [0,1000) of length 1000 bp starting at base 0. UCSC's documentation discusses this discrepancy.
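The conversion between the two coordinate systems can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names `browser_to_bed` and `bed_to_browser` are made up for this example and are not part of GREAT or UCSC.

```python
import re


def browser_to_bed(region):
    """Convert a 1-based, closed browser region such as 'chr1:1-1000'
    into a 0-based, half-open BED triple (chrom, start, end)."""
    match = re.fullmatch(r"([^:\s]+):([\d,]+)-([\d,]+)", region.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {region!r}")
    chrom = match.group(1)
    # Browser positions are sometimes written with thousands separators.
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in region: {region!r}")
    # Only the start shifts by one; the end stays the same.
    return chrom, start - 1, end


def bed_to_browser(chrom, start, end):
    """Convert a 0-based, half-open BED triple back to browser notation."""
    return f"{chrom}:{start + 1}-{end}"


chrom, start, end = browser_to_bed("chr1:1-1000")
print(chrom, start, end)                  # chr1 0 1000
print(end - start)                        # 1000 (region length in bp)
print(bed_to_browser(chrom, start, end))  # chr1:1-1000
```

Because the interval is half-open, the length of a BED region is simply the end minus the start, with no "+1" correction.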
Can I use a different format?
GREAT supports only BED format, which is a popular standard used by the UCSC Genome Browser and others. Converting to this format is often very straightforward. If you do convert, make sure all of your BED records have unique names.
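As an illustration of such a conversion, the sketch below turns a comma-separated table into BED lines. The input layout is purely hypothetical: it assumes columns named `chrom`, `start` and `end` holding 1-based, inclusive coordinates. The generated names (`region_1`, `region_2`, ...) are likewise just one way to guarantee uniqueness.

```python
import csv
import io


def table_to_bed(handle, prefix="region"):
    """Convert a CSV table with 1-based inclusive 'chrom', 'start', 'end'
    columns (a hypothetical layout) into tab-separated BED lines."""
    bed_lines = []
    for index, row in enumerate(csv.DictReader(handle), start=1):
        start = int(row["start"]) - 1  # shift to a 0-based start
        end = int(row["end"])          # the end is unchanged (half-open)
        # A running counter gives every record a unique name.
        bed_lines.append(f"{row['chrom']}\t{start}\t{end}\t{prefix}_{index}")
    return bed_lines


example = io.StringIO("chrom,start,end\nchr1,1,1000\nchr2,501,900\n")
for line in table_to_bed(example):
    print(line)
```

If your source format already uses 0-based, half-open coordinates, the subtraction on the start column should be dropped.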
What should my test regions file contain?
The test regions file should contain one BED record per input region. You must assign each region a unique name.
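Before uploading, it can be useful to check a file against these requirements. The following is a minimal sketch, not GREAT's own validation: the helper names are invented for this example, and skipping blank lines and lines starting with `#` is an assumption of the sketch rather than documented GREAT behavior.

```python
from collections import Counter


def parse_bed_line(line):
    """Split one BED record into (chrom, start, end, name).
    Fields beyond the fourth are ignored; name is None if absent."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields: {line!r}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid coordinates: {line!r}")
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name


def read_bed(lines):
    """Parse an iterable of lines, skipping blanks and '#' comments
    (an assumption of this sketch)."""
    records = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        records.append(parse_bed_line(stripped))
    return records


def duplicate_names(records):
    """Return the sorted region names that occur more than once."""
    counts = Counter(name for _, _, _, name in records if name is not None)
    return sorted(name for name, count in counts.items() if count > 1)


records = read_bed([
    "chr1 0 1000 regionA",
    "chr2\t500\t900\tregionB\t0\t+",
    "chr3 10 20 regionA",
])
print(duplicate_names(records))  # ['regionA']
```

To check a real file, pass an open file object, for example `read_bed(open("test_regions.bed"))`, where the file name is a placeholder.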
How can I create a test set from a UCSC Genome Browser annotation track?
The UCSC Table Browser [1] provides an interface for exporting an annotation track or a combination of annotation tracks to a file. One option for the output format is "BED - Browser Extensible Data", the input format used by GREAT. So, you can use the Table Browser to identify regions of interest in the genome, and then easily and directly use GREAT to examine the functional annotation enrichments of these regions.
For example, you can export the most conserved of the non-coding regions in the genome to BED format with the Table Browser (protocol explained in [2]), then pass the BED file as input to GREAT to see the biological roles of the conserved regions.
What should my background regions file contain?
The background regions file, like the test regions file, must be in BED format. Again, you must assign each region a unique name.
Importantly, the background must be a superset of the main input set (that is, every record in the input set must also be in the background set).
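The superset requirement can be checked ahead of time with a short script. The sketch below assumes that "the same record" means identical chromosome, start and end coordinates; the exact matching rule GREAT applies is not stated here, so treat that as an assumption. The function name `missing_from_background` is invented for this example.

```python
def missing_from_background(test_records, background_records):
    """Return the test records whose (chrom, start, end) does not
    appear in the background. Records are tuples whose first three
    items are chrom, start and end; further items are ignored."""
    background = {tuple(record[:3]) for record in background_records}
    return [record for record in test_records
            if tuple(record[:3]) not in background]


test_set = [("chr1", 0, 1000, "a"), ("chr2", 5, 10, "b")]
background_set = [("chr1", 0, 1000, "x"), ("chr3", 1, 2, "y")]

# An empty list means the background is a superset of the test set.
print(missing_from_background(test_set, background_set))
# [('chr2', 5, 10, 'b')]
```

Using a set for the background keeps the check fast even when the background file contains many more regions than the test file.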
[1] Karolchik D, et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D493-6.
[2] Bejerano G, et al. Computational screening of conserved genomic DNA in search of functional noncoding elements. Nature Methods. 2005 Jul;2(7):535-45.