Why are >2000 WGS pVCF chunks (200k release) apparently missing from the QC file 'qc_metrics_graphtyper_v2.7.1_qc.tab.gz'?
Hi all,
I have been trying to create a table listing which genomic region is covered by each pVCF 200k WGS chunk. In other words, I need the table to have the following columns (examples in parentheses are for the first chunk of chromosome 1): CHR (e.g. "c1"), START (e.g. "1"), END (e.g. "50000"), BLOCK (e.g. "b0"), FILENAME (e.g. "ukb24304_c1_b0_v1.vcf.gz").
I thought the quickest workaround would be to use the file qc_metrics_graphtyper_v2.7.1_qc.tab.gz inside /Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/QC, by extracting unique occurrences from the 5th column (e.g. "chr1:1-50000") and sequentially assigning "b0....bN" block labels until the last chunk of each chromosome (assuming all chunks would be in the qc metrics file).
However, by doing so (using the command line interface), I obtain a table with 58165 chunks mapped, while there are 60630 WGS pVCF files on the platform, and I cannot think of an explanation for the difference.
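For reference, the extraction described above can be sketched roughly as follows in Python. This is only an illustration under several assumptions: that the QC file is tab-delimited, that the 5th column holds regions formatted like "chr1:1-50000", and that file names follow the "ukb24304_c1_b0_v1.vcf.gz" pattern from the example. The function names build_chunk_table and read_qc_rows are made up for this sketch. Block labels are assigned consecutively, so any region absent from the QC file shifts all later labels on that chromosome.

```python
import csv
import gzip
import re
from collections import defaultdict

# Assumed region format in the 5th column, e.g. "chr1:1-50000".
REGION_RE = re.compile(r"^chr(\w+):(\d+)-(\d+)$")


def read_qc_rows(path):
    """Yield each line of the gzipped, tab-delimited QC file as a list of fields."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


def build_chunk_table(rows, region_col=4, field_id="24304"):
    """Collect unique regions per chromosome and label them b0..bN by start position.

    Assumption: blocks are numbered consecutively with no gaps. If a region is
    missing from the input, every later label on that chromosome will be off.
    """
    seen = set()
    per_chrom = defaultdict(list)
    for fields in rows:
        if len(fields) <= region_col:
            continue
        match = REGION_RE.match(fields[region_col])
        if match is None:
            continue  # header line or unexpected value
        key = (match.group(1), int(match.group(2)), int(match.group(3)))
        if key not in seen:
            seen.add(key)
            per_chrom[key[0]].append((key[1], key[2]))

    table = []
    for chrom, regions in per_chrom.items():  # chromosomes in order of first appearance
        for index, (start, end) in enumerate(sorted(regions)):
            table.append({
                "CHR": f"c{chrom}",
                "START": start,
                "END": end,
                "BLOCK": f"b{index}",
                "FILENAME": f"ukb{field_id}_c{chrom}_b{index}_v1.vcf.gz",
            })
    return table
```

It could be called as build_chunk_table(read_qc_rows("qc_metrics_graphtyper_v2.7.1_qc.tab.gz")), and comparing the number of rows per chromosome against the number of pVCF files per chromosome would show where the two counts diverge.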
Of note, one cannot simply rely on the assumption that each chunk covers 50kb, as unfortunately there are exceptions (see: https://community.dnanexus.com/s/question/0D5t0000048q6XmCAI/is-there-a-map-for-which-regions-are-in-each-200k-wgs-pvcf-block).
Does anyone know why these >2000 WGS chunks seem to be missing from the qc metrics file? Also, how would you proceed to create the table I need?
Thanks
Comments
Hello - this is a big issue and it's stopping dozens of researchers from working on WGS data. Please provide an index file for all to use!
We will let UKB know that this information is critical for working with this data. We did not receive this file from UKB, and most likely UKB did not receive it from the data provider.
Meanwhile, have you checked whether the chunking information that the community member discussed in https://community.dnanexus.com/s/question/0D5t0000048q6XmCAI/is-there-a-map-for-which-regions-are-in-each-200k-wgs-pvcf-block explains the overall discrepancy?
Hi Emanuele,
The extra WGS chunks come from small 5kb blocks on chromosomes 4 and 10.
There is an excellent post by Rob Denroche in this thread:
https://community.dnanexus.com/s/question/0D5t0000048q6XmCAI/is-there-a-map-for-which-regions-are-in-each-200k-wgs-pvcf-block
It explains how to calculate genomic coordinates from block numbers. It's a good workaround until the "Whole Genome pVCF file blocks" resource (similar to the one we have for exomes: https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=837) is released.
Best wishes,
Aleks