Hello, I would like to know the best way of doing QC on the final WES data. Previously we built in-house pipelines to QC the data at the genotype and variant level, using SoS on our own cluster (when the 200K pVCFs were available to download). However, this is not transferable to the RAP system. Has anyone had success doing variant- and genotype-specific quality control on the pVCFs within RAP? We have been discussing whether Hail would be ideal for this, but even with Hail I've seen many people complaining, and there's no good way of estimating the resources it would take (instance type and analysis time). I would appreciate any insights from someone who has successfully QC'ed this data. Thanks in advance.
I have heard good reviews of Hail regarding QC and its flexibility. It has a serious learning curve and could be expensive if you aren't sure what needs to be done, because you would have to run a large cluster at on-demand pricing. However, if you go with this option, check out these two threads on the community version of Hail and the problems we have run into with Hail so far.
I recommend an easier approach: using our sample notebook to QC the data. You can see the documentation here. This tutorial also includes the new Regenie app, which we will officially announce next week. https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/gwas-ex
QC isn't exactly a fixed process though, so I would love to hear from other community members if they would like to share their best practices.
I would also recommend this thread by @Phil Greer. It's a good read and emphasizes that people might need to rethink several standard processing steps when it comes to biobank-scale data.
{@005t000000149vjAAA} thanks a lot for the quick reply. I have some follow-up questions.
I've looked at the links you posted, and even though they provide some useful information, I'm still not sure they cover our need, which is to filter variants, genotypes and samples. For the samples part, I can see that at least for the 200K WES there's a way of filtering samples: keeping those whose reported sex matches genetic sex, removing individuals with aneuploidies, filtering by British ancestry and so forth. However, for genotype QC we would like to filter on DP, GQ and AB, and for variant QC on something like call rate, etc. I do not see a simple way of doing this. I've tried swiss-army-knife with bcftools as well, but this implies copying each pVCF block into the cloud, when each of these files is 30 GB. My understanding is that there's no data folder mounted on each worker node that would let it use this data without copying it first (I'm not sure I'm phrasing this correctly). This is why Hail seemed like a better way of doing QC. Thanks a lot for your response.
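To make the genotype and variant filters described above concrete, here is a minimal pure-Python sketch of the logic on toy data. It assumes the genotype-level fields are depth (DP), genotype quality (GQ) and allele balance (AB, computed here from the allelic depths of heterozygous calls). The `Genotype` class, the function names and all thresholds are hypothetical placeholders, not values recommended by UK Biobank or DNAnexus, and the same logic would need to be expressed in Hail or bcftools to run on real pVCFs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Genotype:
    """Toy stand-in for one sample's call at one variant (hypothetical type)."""
    gt: Optional[Tuple[int, int]]  # allele indices, e.g. (0, 1); None if missing
    dp: int                        # total read depth (DP)
    gq: int                        # genotype quality (GQ)
    ad: Tuple[int, int]            # allelic depths (ref reads, alt reads)


def passes_genotype_qc(g: Genotype, min_dp: int = 10, min_gq: int = 20,
                       min_het_ab: float = 0.2) -> bool:
    """Return True if the call survives the DP, GQ and AB filters.

    Thresholds are illustrative assumptions only.
    """
    if g.gt is None:
        return False
    if g.dp < min_dp or g.gq < min_gq:
        return False
    if g.gt[0] != g.gt[1]:  # allele balance is only checked for heterozygotes
        total = sum(g.ad)
        if total == 0:
            return False
        ab = g.ad[1] / total
        if ab < min_het_ab or ab > 1 - min_het_ab:
            return False
    return True


def call_rate(genotypes: Sequence[Genotype], **thresholds) -> float:
    """Fraction of samples whose call at this variant passes genotype QC."""
    if not genotypes:
        return 0.0
    kept = sum(passes_genotype_qc(g, **thresholds) for g in genotypes)
    return kept / len(genotypes)


if __name__ == "__main__":
    toy_variant = [
        Genotype((0, 1), 30, 60, (15, 15)),  # balanced het, passes
        Genotype((0, 1), 30, 60, (28, 2)),   # het with skewed allele balance
        Genotype((0, 0), 5, 60, (5, 0)),     # too shallow
        Genotype(None, 0, 0, (0, 0)),        # missing call
    ]
    print(call_rate(toy_variant))
```

A variant-level filter would then keep only variants whose `call_rate` is above some cutoff chosen for the study.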
Hmm. It's fair to say that if you want to avoid saving intermediate files, doing the filtering in Hail is a good solution (albeit with the resource-estimation problem that you raise).
However, before settling on that option, have you checked whether this QC description would be useful for you?
Maybe if you first create the list of variants/individuals you want to remove, you can just use it to filter the PLINK or BGEN files, which would be more manageable than the pVCFs.
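As a rough illustration of the "build the removal lists first" idea, the sketch below writes two plain-text files: one with the IDs of variants to drop and one with the samples to drop. The function name, file names, input structures and the 0.9 call-rate cutoff are all assumptions made for the example. The output layout (one variant ID per line, and one `FID IID` pair per line) is the format PLINK accepts for its `--exclude` and `--remove` options.

```python
from pathlib import Path
from typing import Dict, Iterable, Tuple


def write_exclusion_lists(variant_call_rates: Dict[str, float],
                          failing_samples: Iterable[Tuple[str, str]],
                          out_dir: str,
                          min_call_rate: float = 0.9) -> Tuple[Path, Path]:
    """Write variant and sample removal lists (hypothetical helper).

    variant_call_rates maps a variant ID to its call rate; failing_samples
    is an iterable of (FID, IID) pairs that failed sample-level QC.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Variants below the (assumed) call-rate cutoff, sorted for stable output.
    bad_variants = sorted(v for v, cr in variant_call_rates.items()
                          if cr < min_call_rate)

    variants_path = out / "exclude_variants.txt"
    samples_path = out / "remove_samples.txt"

    variants_path.write_text("".join(f"{v}\n" for v in bad_variants))
    samples_path.write_text(
        "".join(f"{fid}\t{iid}\n" for fid, iid in failing_samples))
    return variants_path, samples_path
```

The two small files could then be passed to a PLINK run over the PLINK-format exome files, which avoids touching the 30 GB pVCF blocks for the final filtering step.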
There are several options for this.
https://community.dnanexus.com/s/question/0D5t000004CbOq3CAF/isnt-it-about-time-the-jupyterlab-instance-got-an-overhaul
https://community.dnanexus.com/s/question/0D5t000004AflSiCAJ/hail-troubleshooting-for-ukb-data
Our member Anastazie will also present an end-to-end target discovery with GWAS and PheWAS webinar on March 9. I recommend you join it. https://community.dnanexus.com/s/question/0D5t000004SBd6xCAD/webinar-mar-9-end-to-end-target-discovery-with-gwas-and-phewas Also, if you have follow-up questions and would like to discuss them interactively in real time, there is a breakout session where you can consult with her: https://dnanexus.zoom.us/meeting/register/tJIpduGoqTojHNVHdSQoINr0P46ZL5Quh12k (see the note on breakout room 1).
https://saigegit.github.io/SAIGE-doc/docs/UK_Biobank_WES_analysis.html
https://community.dnanexus.com/s/question/0D5t00000416OTeCAM/performing-gwas-on-rap-using-dx-toolkit-and-swissarmyknife
https://community.dnanexus.com/s/question/0D5t000004AeqshCAB/utility-of-hardyweinberg-equilibrium-filtering-in-ukb-genomic-data-p1e15-is-not-a-good-cutoff
https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/generation-and-utilization-of-quality-control-set-90pct10dp-on-oqfe-data
https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/generation-and-utilization-of-quality-control-set-90pct10dp-on-oqfe-data/details-on-processing-the-300k-exome-data-to-generate-the-quality-control-set