A variant from a UKB case in gnomAD is missing from the UKB VCF file
I have a variant with an allele count of 2 in gnomAD, one of which is from a UKB exome case. I looked up the variant in the corresponding VCF file and folder (Population level exome OQFE variants, pVCF format - final release) using bcftools, but I couldn't find it. The VCF file does contain variants at other coordinates in that region of the genome, which confirms that I used the correct file, but the variant from gnomAD is missing. What could be the reason for that?
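For anyone retracing this kind of lookup, here is a minimal sketch of one way to do it, not necessarily what the poster ran. It builds a `bcftools view` command for a single position and then checks whether a returned record line carries the allele of interest. The file name, coordinates, and helper names are hypothetical. Because ALT is a comma-separated list in VCF, the check looks at every ALT allele on the line rather than expecting the variant on a line of its own.

```python
def build_region_query(vcf_path, chrom, pos):
    """Return a bcftools command that prints records overlapping one position.

    -H suppresses the header and -r restricts output to the region;
    -r requires a bgzipped, indexed file.
    """
    region = f"{chrom}:{pos}-{pos}"
    return ["bcftools", "view", "-H", "-r", region, vcf_path]


def record_has_allele(record_line, pos, ref, alt):
    """Check one tab-separated VCF record line for an exact POS/REF/ALT match."""
    fields = record_line.rstrip("\n").split("\t")
    rec_pos = int(fields[1])
    rec_ref = fields[3]
    rec_alts = fields[4].split(",")  # multiallelic sites list several ALTs
    return rec_pos == pos and rec_ref == ref and alt in rec_alts


# Hypothetical example values, for illustration only.
cmd = build_region_query("exome_chunk.vcf.gz", "chr1", 12345)
example_line = "chr1\t12345\t.\tA\tG,T\t.\t.\t."
found = record_has_allele(example_line, 12345, "A", "T")
```

The command list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` and each output line fed to `record_has_allele`. An exact string match can miss indels whose REF and ALT are padded differently at multiallelic sites, so a miss here is not conclusive on its own.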
Comments
How do you know that one is a UKB exome case?
The gnomAD site https://gnomad.broadinstitute.org/help suggests that it isn't possible to tell which cohort a variant comes from, i.e.:
Can I filter out a particular cohort for my analysis?
Unfortunately, for many reasons (including consent and data usage restrictions) we cannot provide the information required for such filtering. In addition, as gnomAD has grown, we have stopped creating subsets and plan to phase out support of past subsets in the browser.
Am I misinterpreting that?
If it definitely was from UKB, then it is possible that the UKB participant concerned has since withdrawn their consent to the use of their data, in which case their genetic variant would not be visible in the UKB file. Since the UKB study started, approximately 1100 participants out of ~500k have withdrawn consent, so it isn't very likely, but certainly possible.
Is the variant present in the UKB Allele Frequency Browser (https://afb.ukbiobank.ac.uk/)?
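To put the withdrawal figures quoted above in proportion, here is a rough back-of-the-envelope calculation. It assumes, purely for illustration, that withdrawal is equally likely for every participant and uses the approximate counts from the comment.

```python
# Approximate figures quoted in the comment above.
withdrawn = 1100
participants = 500_000

# Chance that one given participant is among those who withdrew,
# assuming withdrawal is equally likely for everyone.
fraction = withdrawn / participants
print(f"{fraction:.2%}")  # prints 0.22%
```

So, under that assumption, roughly 1 in 450 participants would be affected, which is why withdrawal is a possible but unlikely explanation for a single missing carrier.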
Thank you for your response and for sharing the link to the browser. The gnomAD browser and the gnomAD VCF files both clearly indicate whether a variant is non-UKB. Although they don't directly specify which variants come from UKB, it is still clear which ones do.
The variant is present in the browser. Does this mean that it should be present in the pVCF files as well?
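The reasoning about non-UKB annotations can be sketched as a subtraction: if a gnomAD record reports both a total allele count and a count for the non-UKB subset, the difference is the number of alleles contributed by UKB samples. The INFO key `AC_non_ukb` used here is an assumption about how the release names that field, and the INFO string is made up; check the header of the actual gnomAD VCF before relying on it.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def ukb_allele_count(info, total_key="AC", non_ukb_key="AC_non_ukb"):
    """Total allele count minus the non-UKB subset count.

    The key names are assumptions; both values are expected to be
    single integers (one ALT allele per record).
    """
    fields = parse_info(info)
    return int(fields[total_key]) - int(fields[non_ukb_key])


# Made-up INFO string mirroring the case in this thread: AC of 2, 1 non-UKB.
example_info = "AC=2;AN=1400000;AC_non_ukb=1"
ukb_ac = ukb_allele_count(example_info)
```

With the made-up values above, `ukb_ac` is 1, which matches the poster's reading that one of the two alleles comes from a UKB exome sample.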
It means that, at the time the allele frequency browser summary file was generated, there must have been a UKB participant with that variant who had not withdrawn.
The browser is based on the Whole Genome Sequencing data, not on the Exome Sequencing data. It is still possible that the participant withdrew between the date the summary file was generated and now, but it is even less likely.
Have you tried the RAP cohort browser Genomic search? That is using the Exome Sequencing data. See https://documentation.dnanexus.com/user/cohort-browser#add-genomic-filter
I am not a geneticist, but I suspect there could be some kind of filtering for quality control. It is a bit odd that a low-quality variant could have got into gnomAD, but they could have applied different cut-offs.
If the variant is present in the RAP cohort browser Genomic search, then I believe that would suggest that it should be in the VCF. If you find it in the RAP cohort browser, then you can get the participant EID and then look in the Exome VCF, field 23142.
If it is in the RAP cohort browser, but not in the VCF, please raise a ticket.
Thank you very much! This information is very helpful.
Based on your response, the individual with the variant in the UKB browser (the genome data) is not the same individual who has the variant in gnomAD (UKB exome cohort).
For using the RAP cohort browser, do we need to prepare our own dataset or is there an exome dataset available for running with the RAP cohort browser?
Sorry, I wasn't clear. For such a rare variant, it probably is the same participant in the UKB Genome data as in the UKB Exome data. Most UKB participants have both Exome Sequencing data and Genome Sequencing data.
There is no need to prepare a dataset for the RAP cohort browser. From the cohort browser, select Genomics, click Add filter, select Geno, select Variant ID, enter the variant rsID or location, select All/homo/hetero, select Apply geno filter, then select Data Preview, and you should see the EID(s) of the participant(s) with that variant.
Thank you so much for the response. It makes sense that the participant has both exome and genome.
I looked for the variant in the RAP cohort browser, but it was absent. Just to confirm, does the RAP browser contain the entire genome dataset or just the data we currently have in the project? My project currently only has half of the genome data since I created it last year, and I need to create a new project for the data to be updated.
The RAP cohort browser genomic search uses Exome Sequencing data, so creating a new project to include the new Whole Genome Sequencing data will have no effect on the cohort browser search results. Since the gnomAD data uses UKB Exome Sequencing data, it shouldn't be necessary for you to look in the Whole Genome Sequencing data.
I spoke to one of my colleagues about this, and they said it looks as if the variant you want may be in the individual VCF files, but may have failed some quality control step so it is not in the pVCF (and not in the genomic search).
Thanks. I created a new project, searched the genome pVCF files, and found the variant. I'm not sure why it's not in the exome data, but I have the individual ID now. Are individual IDs the same between projects? For example, if the variant is found in case 10, will it still be case 10 in my other project?
Yes, both RAP projects will be part of the same AMS application, so they will have the same IDs.