Have questions about the 200k WGS Joint Variant Call Data Release? Ask them here!

Brenton Pyle DNAnexus Team
The DNAnexus team will monitor this post to help answer any of your questions about accessing and working with the new data release on RAP.

Comments

24 comments

  • Comment author
    Permanently deleted user

    Hello everyone,

    I'm trying to analyze the 200k pVCF data with some Swiss Army Knife tools (e.g., bcftools, plink2) on the RAP. However, I always get errors.

    After checking some pVCF files, I found that the last header line and the variant line that follows it look like this:

    .......

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT2907487 3384636 2220572 4291816 4205425 3315497....


    chr9 138250014 chr9:138250014:SG C G 14257 PASS AAScore=0.3715;ABHet=0.3697;ABHetMulti=....

    .......

    The VCF format does not seem valid: 1) 'FORMAT' in the header line is not followed by a space, 2) there is an unexpected blank line after the last header line.
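
    A check like the one described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names `check_vcf_header` and `check_vcf_file` are hypothetical, the check relies on the VCF convention that the `#CHROM` line is tab-delimited with nine fixed columns followed by sample IDs, and it looks only at the `#CHROM` line and the line right after it.

```python
import gzip

# The nine fixed columns of a VCF with genotype data, in order.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT",
                 "QUAL", "FILTER", "INFO", "FORMAT"]


def check_vcf_header(lines):
    """Return a list of problems found around the #CHROM header line.

    `lines` is any iterable of text lines, for example an open file.
    Only the #CHROM line and the line directly after it are inspected.
    """
    problems = []
    seen_header = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not seen_header:
            if line.startswith("##"):
                continue  # meta-information lines are not checked here
            if not line.startswith("#CHROM"):
                problems.append("no #CHROM line before the first record")
                break
            seen_header = True
            fields = line.split("\t")
            if fields[:9] != FIXED_COLUMNS:
                problems.append("#CHROM line is not tab-delimited "
                                "into the nine fixed columns")
            elif len(fields) == 9:
                problems.append("#CHROM line has no sample columns")
            continue
        # This is the first line after the #CHROM line.
        if line.strip() == "":
            problems.append("blank line directly after the #CHROM line")
        break
    if not seen_header and not problems:
        problems.append("no #CHROM line found")
    return problems


def check_vcf_file(path):
    """Run check_vcf_header on a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return check_vcf_header(handle)
```

    Calling `check_vcf_file` on a downloaded pVCF returns an empty list when neither problem is present, so it can serve as a quick sanity check before running bcftools or plink2.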

    Can anyone reproduce this issue? Any help would be appreciated.

    Best,

    Wei-Yang

    0
  • Comment author
    Permanently deleted user

    In Bulk > Whole Genome Sequences, I just see one folder for population-level data, "Population level WGS variants, pVCF format - interim 200k release", and it seems to include only .vcf.gz.tbi files. I am wondering where the .vcf.gz files are located.

    Besides, for whole exome data there are three sets of population-level data (PLINK, BGEN and pVCF), so I am wondering whether PLINK and/or BGEN files will become available later.

    1
  • Comment author
    Permanently deleted user

    Hi Delnaz,

    You can just search files by 'Any Name' in 'Population level WGS variants, pVCF format - interim 200k release', e.g., search for name.vcf.gz. Then there should be only two files left, i.e., name.vcf.gz and name.vcf.gz.tbi.

    And if you are trying to analyze some VCF data, please let me know whether you can get it to run. I'm wondering whether the VCF format is valid or not.

     

    Best regards,

    Wei-Yang

    0
  • Comment author
    Permanently deleted user

    Thanks Wei-Yang, this was very helpful.

    I am not familiar with this pVCF format and don't think plink2 will accept it as an input:

    https://www.cog-genomics.org/plink/2.0/formats

    bcftools also seems to accept only vcf and bcf files as input:

    http://samtools.github.io/bcftools/bcftools.html
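
    For what it is worth, the files in this folder carry a .vcf.gz extension, and a pVCF is generally understood as a project-level, multi-sample VCF, so the usual VCF entry points of both tools are the natural thing to try once the header is well-formed. The sketch below only builds the command lines; the helper names and the example file names are hypothetical, and whether the commands succeed on these particular files is an assumption, not something confirmed in this thread.

```python
import shlex


def bcftools_header_command(vcf_path):
    """Command that prints only the header of a VCF (bcftools view -h)."""
    return ["bcftools", "view", "-h", vcf_path]


def plink2_import_command(vcf_path, out_prefix):
    """Command that imports a VCF into PLINK 2 pgen/pvar/psam files."""
    return ["plink2", "--vcf", vcf_path, "--make-pgen", "--out", out_prefix]


def as_shell(command):
    """Render an argument list as a single, safely quoted shell string."""
    return shlex.join(command)
```

    The strings returned by `as_shell` can be pasted into the command field of Swiss Army Knife, or the argument lists can be passed to `subprocess.run` in a notebook.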

    0
  • Comment author
    Permanently deleted user

    Hi Wei-Yang,

     

    Thank you for your message. We (UK Biobank) are working with DNAnexus and the originator of the files to resolve this problem.

     

    I will post an update when we have further information.

     

    Regards,

    Caroline

    UK Biobank

    0
  • Comment author
    Permanently deleted user

    Thanks, Caroline. I'll wait for that.

    Another question: will the BGEN or PLINK format for the 200k population-level data become available, and if so, how soon will it be released?

    Best,

    0
  • Comment author
    Permanently deleted user

    Hi,

     

    I just wanted to note that I'm having the same problem. I'm unable to query the files because the VCF format is not tab-delimited (per the error message). Please let me know if this issue is resolvable.

     

    Best,

     

    Natalie

    0
  • Comment author
    Permanently deleted user

    Is there a timeline for the resolution of this problem?

    0
  • Comment author
    Permanently deleted user

    Also curious - thanks!

    0
  • Comment author
    Permanently deleted user

    Hello,

    I would like to understand more about the pVCF files that I find in "Population level WGS variants, pVCF format - interim 200k release" before starting any paid analysis.

    • Is there a resource that describes those files in some detail?
    • Do they differ at all from a generic multi-sample VCF in terms of format?
    • I see that for the 150k release there was a QC subfolder with some information, is there anything like that for 200k? Do data in 150k QC apply also to the pVCF in 200k?

    Thanks

    0
  • Comment author
    Brenton Pyle DNAnexus Team

    Hi Andrew,

     

    Thank you for your question! I will follow up with UK Biobank for resources that I can point you to.

     

    Best,

    Brenton

    0
  • Comment author
    Aleks S, Data Analyst, UKB Community team

    The data comes from this field:

    https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=24304

     

    The pipeline used to produce this data is well documented in this Nature publication:

    https://www.nature.com/articles/s41586-022-04965-x

     

    Here is a good reference on the pVCF format and how it differs from gVCF:

    https://www.biorxiv.org/content/10.1101/343970v1.full.pdf

     

    We have some QC metrics (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=187), but they are currently all restricted.

    They should be released in a future showcase refresh. 

     

    Best,

    Aleks

    0
  • Comment author
    Permanently deleted user

    I would like to know if this was resolved? Thank you

    0
  • Comment author
    Brenton Pyle DNAnexus Team

    Hi Diana,

     

    Yes, please see Aleks S's answer below.

     

    Best,

    Brenton

    0
  • Comment author
    Permanently deleted user

    Hello,

     

    Is there a way to obtain the coverage information for the UKBiobank WGS?

    0
  • Comment author
    Chai Fungtammasan DNAnexus Team

    The detailed coverage information is currently restricted, as Aleks mentioned. However, the publication reports an average coverage of 32.5×, with at least 23.5× per individual, for the 150k data release.

    0
  • Comment author
    Permanently deleted user

    Hi,

     

    According to the DNAnexus data release table (https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/data-release-versions), data field 23196, the whole genome GATK joint call pVCF files, should be available in the folder "/Bulk/Whole genome sequences/Whole genome GATK joint call pVCF/". However, neither the folder nor the data can be found on RAP. Only the GraphTyper version of the pVCF is available. What happened to the GATK version? Will it be available later?

     

    Thanks for your advice.

    0
  • Comment author
    Chai Fungtammasan DNAnexus Team

    I think this was restricted after newer data became available. See the note here:

    https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23196

    0
  • Comment author
    Permanently deleted user

    Thanks, Chai, for your reply. I think you are referring to the 200k WGS release as the newer data? Both GraphTyper pVCF releases, 150k and 200k, are available (23352 and 24304), but not the GATK version. Is there a way to request access to it?

    Thanks.

    0
  • Comment author
    Chai Fungtammasan DNAnexus Team

    I see. You are right. They are different protocols.

     

    I think it's best to send the request to UKB directly, since they control which data DNAnexus makes available. If you do not know the e-mail address, you can start from the AMS message system. I will try to get clarification from UKB on the appropriate contact for this type of request, since we have received many of them recently.

    0
  • Comment author
    Permanently deleted user

    Will do, thanks so much Chai!

    0
  • Comment author
    Aleks S, Data Analyst, UKB Community team

    Hi Andrew,

     

    The data comes from this field:

    https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=24304

     

    Here is a good reference on the pVCF format and how it differs from gVCF:

    https://www.biorxiv.org/content/10.1101/343970v1.full.pdf

     

    The pipeline used to produce this data is well documented in this Nature publication:

    https://www.nature.com/articles/s41586-022-04965-x

     

    We have some QC metrics (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=187), but they are currently restricted.

    They should be released in a future showcase refresh. 

     

    Best wishes,

    Aleks

    0
  • Comment author
    Permanently deleted user

    Are any base-level coverage summary statistics (mean, median DP across samples) available for either the WGS or WES data? I.e. per-site, rather than per-sample data. If not, could you suggest the best approach to do this? I know that per-sample DP information is available for genotyped variants in the pVCFs, but I would like to have DP data for all sites, not just those with a called variant.
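
    For the sites that are present in a pVCF, a per-site summary can be computed directly from the per-sample DP values. The sketch below is a minimal illustration, assuming tab-delimited records whose FORMAT column includes a `DP` key; the function name `site_dp_summary` is hypothetical. It does not address sites without a called variant, which are not in the pVCF at all.

```python
import statistics


def site_dp_summary(record_line):
    """Return (chrom, pos, mean_dp, median_dp) for one VCF record line.

    mean_dp and median_dp are None when the record has no DP key in
    FORMAT or when no sample carries a numeric DP value.
    """
    fields = record_line.rstrip("\r\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    format_keys = fields[8].split(":")
    if "DP" not in format_keys:
        return chrom, pos, None, None
    dp_index = format_keys.index("DP")

    depths = []
    for sample in fields[9:]:
        values = sample.split(":")
        # Trailing fields may be omitted and missing values are written ".",
        # so only keep entries that are plain non-negative integers.
        if dp_index < len(values) and values[dp_index].isdigit():
            depths.append(int(values[dp_index]))

    if not depths:
        return chrom, pos, None, None
    return chrom, pos, statistics.mean(depths), statistics.median(depths)
```

    Applied line by line to the record lines of a pVCF (skipping lines that start with `#`), this gives one mean and one median depth per variant site. With roughly 200k sample columns per record, a streaming tool such as bcftools may be a more practical way to extract the DP column first.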

    0
  • Comment author
    James Y, Data Analyst, UKB Community team

    Any additional information would be in the helper_files subdirectories, which may or may not be present.

    0
