Number of sites removed by ukb23158_500k_OQFE.90pct10dp_qc_variants.txt
Hi all,
I am applying a set of QC filters to the whole-exome sequencing data and I wanted to validate some numbers with others who might have experience with this file (ukb23158_500k_OQFE.90pct10dp_qc_variants.txt).
It contains 5,798,366 sites, so applying it removes 21.4% of all variant sites in the UKB exome (total n=27,051,678; autosomes: 21.0%, X chromosome: 37.8%).
It is by far the QC step that removes the most variant sites (the other filters each remove <3% of sites). I wanted to make sure the variant sites in this file are correct, and to ask whether other users found a similar proportion of variant sites removed with this file.
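For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes that 27,051,678 is the total number of variant sites before filtering and that every site listed in the file is removed; the variable names are only illustrative.

```python
# Counts quoted in the post above.
sites_in_qc_file = 5_798_366      # sites listed in the 90pct10dp file
total_variant_sites = 27_051_678  # assumed: all variant sites before filtering

# Proportion of sites removed if every listed site is dropped.
fraction_listed = sites_in_qc_file / total_variant_sites
sites_remaining = total_variant_sites - sites_in_qc_file

print(f"Listed in QC file: {fraction_listed:.1%}")
print(f"Sites remaining:   {sites_remaining:,}")
```

Under those assumptions, this reproduces the 21.4% figure and leaves 21,253,312 sites.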
Thanks in advance,
Comments
Hi,
To date we have not had queries from other researchers regarding this filtering file (the last update was in July 2022).
For more information about how this file was generated please see: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/generation-and-utilization-of-quality-control-set-90pct10dp-on-oqfe-data/details-on-processing-the-300k-exome-data-to-generate-the-quality-control-set
For best analysis practices please see: https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=914
Hope this helps
Hi George - thanks for your reply.
I applied the filter (90% of readings with DP≥10) in Hail and I get the same numbers as when I apply the ukb23158_500k_OQFE.90pct10dp_qc_variants.txt file.
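To make the filter logic concrete, here is a minimal plain-Python sketch (not the Hail code I actually ran). It assumes the rule is "flag a site when fewer than 90% of its genotypes have DP≥10" and that a missing DP counts as failing; the function names and the toy data are hypothetical.

```python
from typing import Dict, List, Optional


def fraction_at_depth(depths: List[Optional[int]], min_dp: int = 10) -> float:
    """Fraction of genotypes with DP >= min_dp.

    Assumption: a missing depth (None) counts as not meeting the threshold.
    """
    if not depths:
        return 0.0
    passing = sum(1 for dp in depths if dp is not None and dp >= min_dp)
    return passing / len(depths)


def sites_failing_coverage(
    site_depths: Dict[str, List[Optional[int]]],
    min_dp: int = 10,
    min_fraction: float = 0.90,
) -> List[str]:
    """Return the sites where fewer than min_fraction of genotypes reach min_dp."""
    return [
        site
        for site, depths in site_depths.items()
        if fraction_at_depth(depths, min_dp) < min_fraction
    ]


# Toy data: per-sample read depths for three made-up sites.
toy_sites = {
    "chr1:1000:A:G": [12, 15, 30, 11, 10, 22, 18, 14, 19, 25],     # 10/10 pass
    "chr1:2000:C:T": [12, 9, 30, 3, 10, 22, 18, 14, 19, 25],       # 8/10 pass
    "chrX:3000:G:GA": [12, None, 30, 11, 10, 22, 18, 14, 19, 25],  # 9/10 pass
}

failing = sites_failing_coverage(toy_sites)
print(failing)
```

Only the second toy site is flagged, because exactly 90% still meets the threshold. `min_dp` and `min_fraction` are parameters, so other thresholds can be tried the same way. In Hail, the per-site fraction can be computed with `hl.agg.fraction(mt.DP >= 10)`, assuming the entry field is named `DP`.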
Would it be possible to update Table 1 in https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/generation-and-utilization-of-quality-control-set-90pct10dp-on-oqfe-data/details-on-processing-the-300k-exome-data-to-generate-the-quality-control-set with filtering proportions in the final release?
Also, are the proportions removed for SNPs in Table 1 obtained after filtering for allele frequency (e.g., >1%)? It still seems to me that the file contains quite a lot of variant sites compared to the proportions reported in Table 1.
Best,
@Diana Cornejo I am continuing this conversation from the old community board. We also removed 15-20% of sites using a DP threshold of 10 for indels and 7 for SNPs (90% of readings for both). We are using bcftools.
Hi Bastien,
Thanks for sharing these numbers!
I was wondering: does this file contain the low-quality variants that were filtered out, or the high-quality variants that passed QC? I want to make sure I interpret the filtering correctly.
Anushka Sinha, from this page it looks like they are SNPs that fail QC, specifically:
Bastien Rioux, can you find the URL for the gitbook page on GitHub? In theory it's possible to make a PR against the docs, but I think they made the docs closed source.