Utility of Hardy-Weinberg Equilibrium filtering in UKB genomic data: p<1e-15 is not a good cutoff.

This is less of a question and more of a discussion point. Feel free to chime in in the comments.

 

I got a comment on my GWAS git repo asking why I did not include an HWE filter in my QC step, and thought I should share my response here as well.

 

In my experience, filtering by Hardy-Weinberg Equilibrium (HWE) removes risk variants and variants related to population structure that I would prefer to keep in the analysis. This is especially bad in a cohort as large as the UK Biobank (UKB), where even very minor deviations from HWE come out as highly significant. If you do filter on HWE, you must use a very small p-value threshold because of the massive N in the UKB. The value of 1e-15 recommended in the tutorials is not small enough. In my testing, I was using 1e-30 and still getting too many false positives. You probably should use something like 1e-60 or even smaller. The plink2 manual even suggests 1e-50 as a reasonable starting filter, with the acknowledgement that it should be even smaller for very large datasets.
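
To make the population-structure point concrete, here is a minimal Python sketch (my own illustration, not part of any plink workflow). It pools two hypothetical subpopulations that are each in perfect HWE but have different minor allele frequencies (0.10 and 0.20, values picked arbitrarily), then tests the pooled genotype counts. It uses a 1-df chi-square approximation rather than an exact test, and the names `hwe_chi2_pvalue` and `pooled_counts` are made up for this example.

```python
import math

def hwe_chi2_pvalue(n_hom1, n_het, n_hom2):
    """Approximate HWE p-value from a 1-df chi-square goodness-of-fit test."""
    n = n_hom1 + n_het + n_hom2
    p = (2 * n_hom1 + n_het) / (2 * n)   # frequency of allele 1
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom1, n_het, n_hom2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))

def pooled_counts(n_per_group, minor_freqs):
    """Pool equally sized groups, each in perfect HWE at its own allele frequency."""
    hom1 = het = hom2 = 0.0
    for q in minor_freqs:
        p = 1 - q
        hom1 += p * p * n_per_group
        het += 2 * p * q * n_per_group
        hom2 += q * q * n_per_group
    return hom1, het, hom2

minor_freqs = (0.10, 0.20)  # assumed values, for illustration only
p_small = hwe_chi2_pvalue(*pooled_counts(2_500, minor_freqs))    # pooled N = 5,000
p_large = hwe_chi2_pvalue(*pooled_counts(250_000, minor_freqs))  # pooled N = 500,000
print(f"pooled N=5,000:   p = {p_small:.3g}")
print(f"pooled N=500,000: p = {p_large:.3g}")
```

The pooled sample shows a small heterozygote deficit even though neither subpopulation deviates from HWE. At N=5,000 the test does not flag it; at UKB scale the same deficit falls far below 1e-15.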

 

The makers of plink also have a nice HWE calculator here, which is a great way to see the effect of sample size on the HWE p-value.

 

test1: N=4,990, het=1,100, homoz1=3,800, homoz2=90, HWE p-value = 0.311

test2: N=49,900, het=11,000, homoz1=38,000, homoz2=900, HWE p-value = 0.0017

test3: N=499,000, het=110,000, homoz1=380,000, homoz2=9,000, HWE p-value = 3.7e-23

 

MAF is the same in all tests, and so are the genotype proportions; only N changes. Tests 1 and 2 are the size of reasonable lab and consortium recruitments. Test 3 is the size of the UKB. The same variant that would pass HWE in the smaller-scale studies would be excluded in the UKB analysis at the 1e-15 cutoff.
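
The three tests can be roughly reproduced with a few lines of Python. This is only a sketch: it uses a 1-df chi-square approximation instead of the exact test that I assume the plink calculator reports, so the p-values land close to, but not exactly on, the numbers quoted above. The function name `hwe_chi2_pvalue` is made up for this example.

```python
import math

def hwe_chi2_pvalue(n_hom1, n_het, n_hom2):
    """Approximate HWE p-value from a 1-df chi-square goodness-of-fit test."""
    n = n_hom1 + n_het + n_hom2
    p = (2 * n_hom1 + n_het) / (2 * n)   # frequency of allele 1
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom1, n_het, n_hom2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))

# (homoz1, het, homoz2) counts taken from the three tests in the post
tests = {
    "test1": (3_800, 1_100, 90),
    "test2": (38_000, 11_000, 900),
    "test3": (380_000, 110_000, 9_000),
}

results = {name: hwe_chi2_pvalue(*counts) for name, counts in tests.items()}
for name, pval in results.items():
    print(f"{name}: N={sum(tests[name]):,}, approx HWE p = {pval:.3g}")
```

Because the genotype proportions are fixed, the chi-square statistic grows linearly with N, which is why the p-value collapses as the sample size goes up.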

 

So, to summarize: Can you use HWE for filtering? Yes. Should you use it? It depends on what you are looking for. If you do use it, you MUST use a much smaller p-value threshold than the current tutorials suggest. Otherwise you will be excluding a very large number of SNPs that you should keep in your analysis.

 

And yes, this is a critique of many published UKB genomic studies.

 

Link to the plink2 reference page on hwe:

https://www.cog-genomics.org/plink/2.0/filter#hwe

Comments

2 comments

  • Comment author
    Chai Fungtammasan DNAnexus Team

    I enjoyed reading this analysis and have a very naive question.

    I saw a tutorial (see box1) that said the HWE threshold in cases could be less stringent than in controls. If we try to map your suggestion above to that, should we consider the suggested threshold for the controls? Or would you say it depends on how skewed the case/control ratio is?

  • Comment author
    Former User of DNAx Community_28

    I never bothered with different thresholds for cases and controls, because when you reject a SNP in either group it will not be included in the analysis anyway. I always used the case threshold as the cutoff for the pooled dataset. My initial filter level was 1e-10 for sample sizes under 10K, and I would relax that depending on the percentage of SNPs that were removed from the analysis.

     

    I started reducing the p-value threshold even more once I realized it was removing SNPs due to population structure in mixed-ancestry cohorts. About 3 years ago I started working on UKB data and found that the sample size here amplified even the slightest deviation from HWE. I eventually came to the conclusion that filtering on HWE poorly may have a more deleterious effect on the outcome than not filtering at all.

     

    Throwing in a couple of papers on HWE, population structure, and QC:

     

    https://www.frontiersin.org/articles/10.3389/fgene.2020.00210/full

    https://www.frontiersin.org/articles/10.3389/fgene.2017.00167/full

     
