I am analyzing the data from the 200k WGS pVCF (field 24304), and around 200 pVCF files per chromosome contain no variants (only the VCF file header), mostly between blocks 0 and 200. Is this intentional? Thank you!

Comments

9 comments

  • Comment author
    Chai Fungtammasan DNAnexus Team

    Hmm. I would expect some regions of the genome to have no variants, but this is more than I would expect. Could you share the list of file IDs?

  • Comment author
    Former User of DNAx Community_30

    Thank you for the answer. I also think that there are just too many. For example, in chromosome 21 there are 167 empty pVCF blocks (e.g. ukb24304_c21_b934_v1.vcf.gz, ukb24304_c21_b35_v1.vcf.gz, ukb24304_c53_b934_v1.vcf.gz, ukb24304_c21_b56_v1.vcf.gz, ukb24304_c57_b934_v1.vcf.gz, ukb24304_c21_b227_v1.vcf.gz). All of these have a size of 1.55 MiB, corresponding to the VCF header and nothing more.
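A minimal sketch of how one might confirm that a downloaded block really is header-only, rather than relying on file size. It assumes the file has been downloaded locally and that the bgzip-compressed pVCF can be read with Python's standard `gzip` module; the function name `has_variant_records` is illustrative, not part of any DNAnexus or UK Biobank tooling.

```python
import gzip


def has_variant_records(path):
    """Return True if the gzip/bgzip-compressed VCF at `path` contains
    at least one non-header line, i.e. at least one variant record."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header lines start with '#'; skip those and any blank lines.
            if line.strip() and not line.startswith("#"):
                return True
    return False
```

The function stops at the first record it finds, so it only reads the whole file when the file is in fact header-only.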

  • Comment author
    Former User of DNAx Community_30

    Just a correction for the last two files (I repeated them): other examples include ukb24304_c21_b87_v1.vcf.gz, ukb24304_c21_b90_v1.vcf.gz, ukb24304_c21_b75_v1.vcf.gz.

  • Comment author
    Chai Fungtammasan DNAnexus Team

    I confirm that I see 2466 files like that. I will look into this and get back to you.

  • Comment author
    Former User of DNAx Community_30

    Thanks! I'll wait!

  • Comment author
    Chai Fungtammasan DNAnexus Team

    {@00560000001jOfvAAE} and I looked into it and didn't find any evidence that these files are empty by mistake. The DNAnexus data team checks MD5 checksums when we receive data from UKB, so these should be exactly the data that were published.

    {@005t0000009fE8nAAE} found that the WGS files are chunked into 50 kb windows, with a few exceptions at 5 kb: https://community.dnanexus.com/s/question/0D5t0000048q6XmCAI/is-there-a-map-for-which-regions-are-in-each-200k-wgs-pvcf-block Since GRCh38 contains regions of unknown sequence, unmappable regions, and regions of low heterozygosity, some blocks are expected to be empty, especially around minisatellites. We also checked the 150k pVCF release: the files with no variants correspond almost exactly between the 150k and 200k releases, with the exception of c10_b2683 and c21_b166. I'm not sure what's going on there. Maybe these are low-heterozygosity regions that pick up some variants once enough people are sequenced, or the chunking was done differently. Note that the 150k data seems to have only 50 kb window files, so for chr4 and chr10 the number of files differs between the 150k and 200k releases.
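The comparison between releases described above could be sketched roughly as follows. This is only an illustration: it assumes both releases follow the `ukb<field>_c<chrom>_b<block>_v<version>.vcf.gz` naming seen in this thread, and the function names are hypothetical. Block numbers are only comparable where both releases use the same chunking, which the comment above says is not the case for chr4 and chr10.

```python
import re

# Assumed naming pattern, based on the file names quoted in this thread.
BLOCK_RE = re.compile(r"_c(?P<chrom>[0-9A-Za-z]+)_b(?P<block>\d+)_v\d+\.vcf\.gz$")


def block_key(filename):
    """Extract (chromosome, block number) from a pVCF file name."""
    match = BLOCK_RE.search(filename)
    if match is None:
        raise ValueError(f"unrecognized pVCF file name: {filename}")
    return match.group("chrom"), int(match.group("block"))


def compare_empty_blocks(names_a, names_b):
    """Given the empty-file names from two releases, return the blocks
    that are empty only in the first and only in the second release."""
    blocks_a = {block_key(name) for name in names_a}
    blocks_b = {block_key(name) for name in names_b}
    return sorted(blocks_a - blocks_b), sorted(blocks_b - blocks_a)
```

Keying on chromosome and block number, rather than the full file name, lets the two lists be compared even though the field ID prefix differs between releases.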

     

    {@005t0000009fE8nAAE} FYI, in case it's useful for your research planning.

  • Comment author
    Former User of DNAx Community_30

    Thanks @Chai Fungtammasan for your reply and for looking into the data. I emailed UK Biobank almost two weeks ago and have not had a reply yet. So, bottom line: is the data correct as presented? Thank you.

  • Comment author
    Chai Fungtammasan DNAnexus Team

    That would be my best guess. You do have to be careful when locating the right file for a given variant, because of the chunking differences I mentioned.

  • Comment author
    Former User of DNAx Community_30

    Got it. Thanks!

