 Post subject: Correcting for LD?
Posted: Mon Jul 01, 2013 6:09 pm
I did a region-based association analysis (think SKAT) focusing on regulatory regions. I want to see whether the top 10% (for example) of regions are enriched for some functions vs. all regions that went into the analysis.

A collaborator pointed out that at least some of the top 5% of regions will be in LD with each other (since there are multiple regulatory regions per gene and some of those regions will be in LD) and is concerned that GREAT assumes that all the foreground regions are independent, while the foreground regions in my situation are not independent.

I don't want to just pick one region per gene, since that would penalize me too severely: not all of the regions for a given gene are actually in LD, and you might expect different regulatory regions for a given gene to have distinct effects. But I'm not sure how concerned I should be about LD, or what I can do about it.
Posted: Tue Jul 09, 2013 10:20 am
I would like to try some permutations to see how badly LD is affecting my GREAT results for a particular GO term. However, I'm having trouble replicating the GREAT p-value in R.

I used an explicit foreground and background with these values (variable names based on the GREAT Statistics page):

N = 44749
n = 2237* (GREAT reported that "12 of all 2,237 genomic regions (0.5%) are not associated with any genes", but this shouldn't make a big difference in p-val)
K = 15
k = 8

GREAT gave my GO term of interest a p value of 1.81e-7

When I use R to calculate the p-value, I get a rather different result:
1-phyper(8, 15, 44749-12-15, 2237-12) = 6.981758e-09
1-phyper(8, 15, 44749, 2237) = 7.279212e-09
Posted: Tue Jul 09, 2013 9:58 pm
Oops I see, if I do:
1-phyper(7, 15, 44749-12-15, 2237-12)
[1] 1.737449e-07

That's close enough to GREAT's value of 1.8e-7.... now back to addressing the LD. :)
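For anyone checking this outside R, here is a minimal sketch of the same upper-tail calculation in plain Python (standard library only). It uses the N, n, K, k values listed earlier in the thread, without subtracting the 12 unassociated regions; the function name hypergeom_upper_tail is made up for illustration. The point it illustrates: phyper(q, ...) returns P(X <= q), so 1-phyper(7, ...) is P(X >= 8), while 1-phyper(8, ...) is P(X >= 9), which is why the first attempt came out too small.

```python
from math import comb


def hypergeom_upper_tail(k, K, N, n):
    """P(X >= k) when n regions are drawn without replacement from N,
    of which K are annotated with the term (hypothetical helper)."""
    total = comb(N, n)
    max_hits = min(K, n)
    favorable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, max_hits + 1)
    )
    # Exact integer counts; Python's int / int division rounds correctly
    # even for very large integers.
    return favorable / total


# Values quoted in the thread (names follow the GREAT Statistics page).
N, n, K, k = 44749, 2237, 15, 8

# Counterpart of 1 - phyper(k - 1, K, N - K, n): at least k hits.
p_at_least_k = hypergeom_upper_tail(k, K, N, n)
# Counterpart of 1 - phyper(k, K, N - K, n): strictly more than k hits.
p_more_than_k = hypergeom_upper_tail(k + 1, K, N, n)

print(f"P(X >= {k}) = {p_at_least_k:.3e}")
print(f"P(X >  {k}) = {p_more_than_k:.3e}")
```

The first number should land near the value GREAT reports, and the second near the smaller values from the earlier attempt.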
Posted: Thu Jul 18, 2013 11:28 am
Site Admin

1) To obtain the p-value that GREAT outputs, use 1-phyper(7, 15, 44749-15, 2237). The 12 regions that are unassociated with genes can be included in the statistic, since the question we are posing is whether the set of regions selected as the foreground is significantly enriched for a particular function. Unassociated regions can thus be treated as regions that do not have the function being tested.

2) Since I am unfamiliar with how your regions are being chosen (or the exact hypothesis being tested), I cannot provide a definitive answer. But, I would suggest the following:

Run GREAT on your dataset and obtain the list of region/gene associations (from "Global Controls" > "View all region-gene associations").

Next, for each gene that has multiple regions, I would compute the level of LD between the regions using a tool such as Haploview. Alternatively, you can try downloading precomputed data from sources such as HapMap or the UCSC genome browser.

If the LD between any two regions that target the same gene is high, I would consider dropping one of them, unless you have reason to believe both regions are important. Take care not to lose other genes when doing so, since one region may be in the regulatory domain of two genes.
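One way the pruning step above might look in code, as a rough sketch only. The region-gene map would come from the GREAT association export and the pairwise r² values from whatever LD tool is used. The region and gene names, the 0.8 threshold, and the function name prune_ld_regions below are all made-up placeholders, not anything GREAT or Haploview provides.

```python
def prune_ld_regions(region_genes, pair_r2, threshold=0.8):
    """Greedily drop one region from each high-LD pair that targets a
    common gene, but only if every gene of the dropped region is still
    covered by some kept region. The threshold is an assumed cutoff."""
    kept = set(region_genes)
    # Handle the most strongly linked pairs first.
    for (a, b), r2 in sorted(pair_r2.items(), key=lambda item: -item[1]):
        if r2 < threshold or a not in kept or b not in kept:
            continue
        if not (region_genes[a] & region_genes[b]):
            continue  # the pair does not target a common gene
        for candidate in (b, a):
            others = kept - {candidate}
            covered = set().union(*(region_genes[r] for r in others))
            if region_genes[candidate] <= covered:
                kept.remove(candidate)  # no gene is lost by dropping it
                break
    return kept


# Hypothetical toy input: region -> genes it is associated with.
region_genes = {
    "r1": {"GENE_A"},
    "r2": {"GENE_A", "GENE_B"},
    "r3": {"GENE_A"},
    "r4": {"GENE_C"},
}
# Hypothetical pairwise LD (r^2) between regions.
pair_r2 = {("r1", "r2"): 0.95, ("r1", "r3"): 0.2, ("r3", "r4"): 0.9}

kept = prune_ld_regions(region_genes, pair_r2)
print(sorted(kept))  # ['r2', 'r3', 'r4']
```

Here r1 is dropped rather than r2 because r2 is the only region covering GENE_B. The pair r3/r4 is left alone because the two regions share no gene, and r1/r3 because their LD is below the cutoff.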

Let us know how this turns out, since I am sure other users would be interested in this type of analysis.

Bejerano Team