Have questions about the 200k WGS Joint Variant Call Data Release? Ask them here!

Brenton Pyle DNAnexus Team

08 June 2022 00:00
24 comments

The DNAnexus team will monitor this post to help answer any of your questions about accessing and working with the new data release on RAP.

Comments


Permanently deleted user
- 10 June 2022 15:37
Hello everyone,

I'm trying to analyze the 200k pVCF data with some Swiss Army Knife tools (e.g., bcftools, plink2) on the RAP, but I always get errors.
After checking some pVCF files, I found that the header line and the nearby variant lines look like this:

.......
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT2907487 3384636 2220572 4291816 4205425 3315497....

chr9 138250014 chr9:138250014:SG C G 14257 PASS AAScore=0.3715;ABHet=0.3697;ABHetMulti=....
.......

The VCF format does not seem valid: 1) 'FORMAT' in the header line is not followed by a separator before the first sample ID, and 2) there is an unexpected blank line after the last header line.
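For anyone who wants to test their own files for the two symptoms described here, the following is a minimal sketch, not an official validator. It assumes the standard VCF layout, in which the nine fixed columns of the #CHROM line are tab-separated and no blank lines appear between the header and the first data line. The function names `check_vcf_header` and `check_vcf_file` are made up for illustration.

```python
import gzip

# The nine fixed columns that should open the #CHROM line of a multi-sample VCF.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def check_vcf_header(lines):
    """Return a list of problems found up to the first data line."""
    problems = []
    header_seen = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("##"):
            continue  # meta-information lines are not inspected in this sketch
        if line.startswith("#CHROM"):
            header_seen = True
            fields = line.split("\t")
            # A fused "FORMAT<sample>" field or space-separated columns fail this test.
            if fields[:len(FIXED_COLUMNS)] != FIXED_COLUMNS:
                problems.append(f"line {number}: fixed columns are not tab-delimited as expected")
            continue
        if not line.strip():
            problems.append(f"line {number}: blank line")
            continue
        if header_seen:
            break  # first data line reached, nothing more to check
    if not header_seen:
        problems.append("no #CHROM header line found")
    return problems


def check_vcf_file(path):
    """Run the same check on a gzip- or bgzip-compressed VCF file, reading lazily."""
    with gzip.open(path, "rt") as handle:
        return check_vcf_header(handle)
```

An empty list from `check_vcf_header` means neither symptom was found. Note that pasting a VCF excerpt into a web form can itself turn tabs into spaces, so checking the file directly is more reliable than checking copied text.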
Can anyone reproduce this issue? Please give me a hand...

Best,
Wei-Yang

Permanently deleted user
- 14 June 2022 14:24
In Bulk > Whole Genome Sequences, I see only one folder with population-level data, "Population level WGS variants, pVCF format - interim 200k release", and it seems to include only .vcf.gz.tbi files. Where are the .vcf.gz files located?
Also, for the whole exome data there are three sets of population-level data (PLINK, BGEN and pVCF), so I am wondering whether PLINK and/or BGEN files will become available later.

Permanently deleted user
- 14 June 2022 14:40
Hi Delnaz,

You can search files by 'Any Name' in 'Population level WGS variants, pVCF format - interim 200k release', e.g., search for name.vcf.gz; only two files should then be left, i.e., name.vcf.gz and name.vcf.gz.tbi.

And if you are trying to analyze some of the VCF data, please let me know whether you can get it to work. I'm wondering whether the VCF format is valid or not...
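If you already have a listing of the folder's file names, the pairing described above (each .vcf.gz next to its .vcf.gz.tbi index) can be sketched as follows. This is only an illustration: the function name and the example file names are hypothetical, not real names from the release.

```python
def split_vcf_and_index(names):
    """Map each .vcf.gz name to its .tbi index name, or None if the index is not listed."""
    present = set(names)
    return {
        name: (name + ".tbi" if name + ".tbi" in present else None)
        for name in sorted(present)
        if name.endswith(".vcf.gz")
    }


# Hypothetical listing: one data file with its index, one data file without.
listing = ["example_b1.vcf.gz.tbi", "example_b1.vcf.gz", "example_b2.vcf.gz"]
pairs = split_vcf_and_index(listing)
```

Entries mapped to None are data files whose index was not found in the listing.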

Best regards,
Wei-Yang

Permanently deleted user
- 14 June 2022 14:50
Thanks Wei-Yang, this was very helpful.
I am not familiar with this pVCF format and don't think plink2 will accept it as an input:
https://www.cog-genomics.org/plink/2.0/formats
bcftools also seems to accept only vcf and bcf files as input:
http://samtools.github.io/bcftools/bcftools.html

Permanently deleted user
- 14 June 2022 15:47
Hi Wei-Yang,

Thank you for your message. We (UK Biobank) are working with DNAnexus and the originator of the files to resolve this problem.

I will post an update when we have further information.

Regards,
Caroline
UK Biobank

Permanently deleted user
- 15 June 2022 03:51
Thanks, Caroline. I'll wait for that.
Another question: will the BGEN or PLINK format for the 200k population-level data become available, and if so, how soon will it be released?

Best,

Permanently deleted user
- 15 June 2022 20:00
Hi,

I just wanted to note that I'm having the same problem. I'm unable to query the files because the VCF format is not tab-delimited (per the error message). Please let me know if this issue is resolvable.

Best,

Natalie

Permanently deleted user
- 20 June 2022 17:43
Is there a timeline for the resolution of this problem?

Permanently deleted user
- 14 July 2022 22:53
Also curious - thanks!

Permanently deleted user
- 10 November 2022 13:50
Hello,
I would like to understand more about the pVCF files that I find in "Population level WGS variants, pVCF format - interim 200k release" before starting paid analysis.
- Is there a resource that describes those files in some detail?
- Do they differ at all from a generic multi-sample VCF in terms of format?
- I see that for the 150k release there was a QC subfolder with some information. Is there anything like that for 200k? Do the data in the 150k QC folder also apply to the pVCF files in 200k?
Thanks

Brenton Pyle DNAnexus Team
- 10 November 2022 18:05
Hi Andrew,

Thank you for your question! I will follow up with UK Biobank for resources that I can point you to.

Best,
Brenton

Aleks S Data Analyst UKB Community team
- 14 November 2022 17:07
The data comes from this field:
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=24304

The pipeline used to produce this data is well documented in a Nature publication:
https://www.nature.com/articles/s41586-022-04965-x

Here is a good reference on the pVCF format and how it differs from gVCF:
https://www.biorxiv.org/content/10.1101/343970v1.full.pdf

We have some QC metrics (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=187), but they are currently all restricted.
They should be released in a future showcase refresh.

Best,
Aleks

Permanently deleted user
- 09 January 2023 19:18
I would like to know whether this was resolved. Thank you.

Brenton Pyle DNAnexus Team
- 09 January 2023 20:42
Hi Diana,

Yes, please see Aleks S's answer below.

Best,
Brenton

Permanently deleted user
- 06 February 2023 20:09
Hello,

Is there a way to obtain the coverage information for the UK Biobank WGS data?

Chai Fungtammasan DNAnexus Team
- 13 February 2023 17:09
The detailed coverage info is currently restricted, as Aleks mentioned. However, the publication reports an average coverage of 32.5×, with at least 23.5× per individual, for the 150k data release.

Permanently deleted user
- 24 February 2023 14:59
Hi,

According to the DNAnexus data release table (https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/data-release-versions), data field 23196, the whole genome GATK joint call pVCF files, should be available in the folder "/Bulk/Whole genome sequences/Whole genome GATK joint call pVCF/". However, neither the folder nor the data can be found on RAP; only the GraphTyper version of the pVCF is available. What happened to the GATK version? Will it be available later?

Thanks for your advice.

Chai Fungtammasan DNAnexus Team
- 24 February 2023 17:30
I think this was restricted after newer data became available. See the note here:
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23196

Permanently deleted user
- 24 February 2023 17:45
Thanks for your reply, Chai. I think you are referring to the 200k WGS release as the newer data? Both the GraphTyper 150k and 200k release pVCFs are available (23352 and 24304), but not the GATK version. Is there a way to request access to it?
Thanks.

Chai Fungtammasan DNAnexus Team
- 24 February 2023 17:51
I see. You are right. They are different protocols.

I think it's best to send a request to UKB directly, since they control which data DNAnexus makes available. If you do not know the e-mail address, you can start with the AMS message system. I will try to get clarification from UKB on the appropriate contact for this type of request, since we have received many of them recently.

Permanently deleted user
- 24 February 2023 17:56
Will do, thanks so much Chai!

Aleks S Data Analyst UKB Community team
- 02 March 2023 12:46
Hi Andrew,

The data comes from this field:
https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=24304

Here is a good reference on the pVCF format and how it differs from gVCF:
https://www.biorxiv.org/content/10.1101/343970v1.full.pdf

The pipeline used to produce this data is well documented in a Nature publication:
https://www.nature.com/articles/s41586-022-04965-x

We have some QC metrics (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=187), but they are currently restricted.
They should be released in a future showcase refresh.

Best wishes,
Aleks

Permanently deleted user
- 14 March 2023 16:12
Are any base-level coverage summary statistics (mean, median DP across samples) available for either the WGS or WES data? I.e. per-site, rather than per-sample data. If not, could you suggest the best approach to do this? I know that per-sample DP information is available for genotyped variants in the pVCFs, but I would like to have DP data for all sites, not just those with a called variant.
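For the sites that are present in a pVCF, a per-site summary of DP across samples can be computed from the FORMAT column. The following is a minimal sketch only: it assumes tab-delimited data lines with an integer DP entry in the FORMAT field, the function name `site_dp_summary` is made up for illustration, and it does not address sites without a called variant, which the question also asks about.

```python
from statistics import mean, median


def site_dp_summary(vcf_line):
    """Return (chrom, pos, n_samples_with_dp, mean_dp, median_dp) for one VCF data line."""
    fields = vcf_line.rstrip("\r\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    format_keys = fields[8].split(":")
    if "DP" not in format_keys:
        return chrom, pos, 0, None, None
    dp_index = format_keys.index("DP")
    depths = []
    for sample in fields[9:]:
        parts = sample.split(":")
        # Trailing FORMAT entries may be omitted, and missing values are written as ".".
        if dp_index < len(parts) and parts[dp_index] not in (".", ""):
            depths.append(int(parts[dp_index]))
    if not depths:
        return chrom, pos, 0, None, None
    return chrom, pos, len(depths), mean(depths), median(depths)


# Made-up example line with four samples, one of which has no DP value.
example = "chr9\t138250014\tchr9:138250014:SG\tC\tG\t14257\tPASS\t.\tGT:DP\t0/1:30\t0/0:20\t./.:.\t1/1:40"
summary = site_dp_summary(example)
```

Samples with a missing DP are excluded from both statistics, so the returned sample count is worth keeping alongside the mean and median.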

James Y UKB Community team Data Analyst
- 20 March 2023 10:16
Any additional information would be in the helper_files subdirectories, which may or may not be present.

