Hello everyone,

I'm trying to analyze the 200k pVCF data with some Swiss Army Knife tools (e.g., bcftools, plink2) on the RAP, but I always get errors.
After checking some pVCF files, I found that the header line and the variant lines near it look like this:

.......
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT2907487 3384636 2220572 4291816 4205425 3315497....

chr9 138250014 chr9:138250014:SG C G 14257 PASS AAScore=0.3715;ABHet=0.3697;ABHetMulti=....
.......

The VCF format does not seem to be valid: 1) 'FORMAT' in the header line is not followed by a space, so it runs into the first sample ID, and 2) there is an unexpected blank line after the last header line.
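For anyone who wants to test their own files, here is a minimal Python sketch of the two checks described above. It is not an official validator. It assumes the fixed columns should be tab-delimited (as the VCF specification requires), and the function names and any file path you pass in are hypothetical.

```python
import gzip

# Fixed VCF columns; FORMAT follows only when the file carries genotypes.
VCF_FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf_header(lines):
    """Return a list of problems found at the #CHROM header line."""
    problems = []
    lines = iter(lines)
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("##"):
            continue  # meta-information lines precede the column header
        if not line.startswith("#CHROM"):
            return ["expected the #CHROM line right after the ## lines"]
        fields = line.split("\t")
        if fields[:8] != VCF_FIXED_COLUMNS:
            problems.append("column header is not tab-delimited")
        elif len(fields) > 8 and fields[8] != "FORMAT":
            problems.append("FORMAT is not separated from the first sample ID")
        # Look at the line directly after the header: it should be a record.
        following = next(lines, None)
        if following is not None and following.strip() == "":
            problems.append("blank line after the #CHROM line")
        return problems
    return ["no #CHROM line found"]


def check_vcf_file(path):
    """Run the check on a gzipped/bgzipped VCF (hypothetical local path)."""
    with gzip.open(path, "rt") as handle:
        return check_vcf_header(handle)
```

The check stops right after the header, so it only reads the start of the file. An empty list means neither of the two problems was detected; it does not prove the rest of the file is valid.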
Can anyone reproduce this issue? Any help would be appreciated.

Best,
Wei-Yang
In Bulk > Whole Genome Sequences, I see only one folder for population-level data, "Population level WGS variants, pVCF format - interim 200k release", and it seems to include only .vcf.gz.tbi files. Where are the .vcf.gz files located?
Also, for whole exome data there are three sets of population-level data (PLINK, BGEN and pVCF), so I am wondering whether PLINK and/or BGEN files will become available later.
Hi Delnaz,

You can search files by 'Any Name' in 'Population level WGS variants, pVCF format - interim 200k release', e.g., search for name.vcf.gz; only two files should then be left, i.e., name.vcf.gz and name.vcf.gz.tbi.

And if you try to analyze some of the VCF data, please let me know whether you can get it to work. I'm wondering whether the VCF format is valid or not.
Best regards,
Wei-Yang
Thanks Wei-Yang, this was very helpful.
I am not familiar with this pVCF format and don't think plink2 will accept it as input:
https://www.cog-genomics.org/plink/2.0/formats
bcftools also seems to accept only vcf and bcf files as input:
http://samtools.github.io/bcftools/bcftools.html
Hi Wei-Yang,
Thank you for your message. We (UK Biobank) are working with DNAnexus and the originator of the files to resolve this problem.
I will post an update when we have further information.
Regards,
Caroline
UK Biobank
Thanks, Caroline. I'll wait for that.
Another question: will the BGEN or PLINK format for the 200k population-level data become available, and if so, how soon will it be released?
Best,
Hi,
I just wanted to note that I'm having the same problem. I'm unable to query the files because the VCF format is not tab-delimited (per the error message). Please let me know if this issue is resolvable.
Best,
Natalie
Is there a timeline for the resolution of this problem?
Also curious - thanks!
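Hello,
I would like to understand more about the pVCF files that I find in "Population level WGS variants, pVCF format - interim 200k release" before starting paid-for analysis.
Is there a resource that describes those files in some detail?
Do they differ at all from a generic multi-sample VCF in terms of format?
I see that for the 150k release there was a QC subfolder with some information. Is there anything like that for 200k? Do the data in the 150k QC folder also apply to the pVCF in 200k?
Thanks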
Hi Andrew,
Thank you for your question! I will follow up with UK Biobank for resources that I can point you to.
Best,
Brenton
The data comes from this field:
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=24304
The pipeline used to produce this data is well documented in this Nature publication:
https://www.nature.com/articles/s41586-022-04965-x
Here is a good reference on the pVCF format and how it differs from gVCF:
https://www.biorxiv.org/content/10.1101/343970v1.full.pdf
We have some QC metrics (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=187), but they are currently all restricted.
They should be released in a future showcase refresh.
Best,
Aleks
I would like to know whether this was resolved. Thank you.
Hi Diana,
Yes, please see Alek S' answer below.
Best,
Brenton
Hello,
Is there a way to obtain the coverage information for the UK Biobank WGS data?
The detailed coverage info is currently restricted, as Aleks mentioned. However, the publication reports an average coverage of 32.5× with at least 23.5× per individual for the 150k data release.
Hi,
According to the DNAnexus data release table (https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/data-release-versions), data field 23196 (the whole genome GATK joint call pVCF files) should be available in the folder "/Bulk/Whole genome sequences/Whole genome GATK joint call pVCF/". However, neither the folder nor the data can be found on the RAP. Only the GraphTyper version of the pVCF is available. What happened to the GATK version? Will it be available later?
Thanks for your advice.
I think this was restricted after newer data became available. See the note here:
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23196
Thanks, Chai, for your reply. I think you are referring to the 200k WGS release as the new data? Both GraphTyper pVCF releases, 150k and 200k, are available (23352 & 24304), but not the GATK version. Is there a way to request access to it?
Thanks.
I see. You are right. They are different protocols.
I think it's best if you send a request to UKB directly, since they control which data DNAnexus makes available. If you do not know the e-mail address, you can start from the AMS message system. I will try to get clarification from UKB on the appropriate contact for this type of request, since we have received many of them recently.
Will do, thanks so much Chai!
Hi Andrew,
The data comes from this field:
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=24304
Here is a good reference on the pVCF format and how it differs from gVCF:
https://www.biorxiv.org/content/10.1101/343970v1.full.pdf
The pipeline used to produce this data is well documented in this Nature publication:
https://www.nature.com/articles/s41586-022-04965-x
We have some QC metrics (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=187), but they are currently restricted.
They should be released in a future showcase refresh.
Best wishes,
Aleks
Are any base-level coverage summary statistics (mean, median DP across samples) available for either the WGS or WES data? I.e. per-site, rather than per-sample data. If not, could you suggest the best approach to do this? I know that per-sample DP information is available for genotyped variants in the pVCFs, but I would like to have DP data for all sites, not just those with a called variant.
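Not an official answer, but for the sites that do appear in a pVCF, the per-site mean and median DP across samples can be computed directly from each record. The Python sketch below assumes a tab-delimited record whose FORMAT column contains a DP key, as the question describes; the function name is hypothetical. It does not cover sites without a called variant, which would need a different data source.

```python
from statistics import mean, median


def site_dp_summary(record_line):
    """Summarise per-sample DP for one tab-delimited VCF record line.

    Returns None when the record has no DP key in FORMAT or no sample
    reports a numeric depth.
    """
    fields = record_line.rstrip("\r\n").split("\t")
    if len(fields) < 10:
        return None  # no FORMAT/sample columns on this line
    format_keys = fields[8].split(":")
    if "DP" not in format_keys:
        return None
    dp_index = format_keys.index("DP")

    depths = []
    for sample in fields[9:]:
        values = sample.split(":")
        # Trailing keys may be dropped and missing values are written as "."
        if dp_index < len(values) and values[dp_index].isdigit():
            depths.append(int(values[dp_index]))
    if not depths:
        return None

    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "n_samples_with_dp": len(depths),
        "mean_dp": mean(depths),
        "median_dp": median(depths),
    }
```

Samples with a missing DP are skipped rather than counted as zero, so the summary reflects only samples that report a depth at that site.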
Any additional information may be present in helper_files subdirectories that may or may not be present.