Why are >2000 WGS pVCF chunks (200k release) apparently missing from the QC file 'qc_metrics_graphtyper_v2.7.1_qc.tab.gz'?

Hi all,

I am trying to build a table listing the genomic region covered by each 200k WGS pVCF chunk. The table needs the following columns (examples in parentheses are for the first chunk of chromosome 1): CHR (e.g. "c1"), START (e.g. "1"), END (e.g. "50000"), BLOCK (e.g. "b0"), FILENAME (e.g. "ukb24304_c1_b0_v1.vcf.gz").

I thought the quickest approach would be to use the file qc_metrics_graphtyper_v2.7.1_qc.tab.gz in /Bulk/Whole genome sequences/Whole genome GraphTyper joint call pVCF/QC: extract the unique values from the 5th column (e.g. "chr1:1-50000") and assign block labels "b0 ... bN" sequentially up to the last chunk of each chromosome (assuming every chunk appears in the QC metrics file).
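To make the approach concrete, here is a minimal Python sketch of what I mean. It assumes the QC file is tab-delimited, that the region string sits in the 5th column in the form "chr1:1-50000", and that the filename pattern seen in the first chunk (ukb24304_c1_b0_v1.vcf.gz) holds for every chunk. The function names and the `field_id` parameter are just my own labels. The block numbers it produces are only correct if every chunk is present in the QC file.

```python
import csv
import gzip
import re
from collections import defaultdict

# Assumed layout of the region column, e.g. "chr1:1-50000".
REGION_RE = re.compile(r"^chr([0-9A-Za-z]+):(\d+)-(\d+)$")


def read_qc_rows(path):
    """Yield the rows of a gzipped, tab-delimited file as lists of strings."""
    with gzip.open(path, "rt", newline="") as fh:
        yield from csv.reader(fh, delimiter="\t")


def build_chunk_table(rows, region_col=4, field_id="ukb24304"):
    """Return (CHR, START, END, BLOCK, FILENAME) tuples, one per unique region.

    region_col is zero-based, so 4 means the 5th column. Rows whose region
    field does not match REGION_RE (a header line, for instance) are skipped.
    """
    seen = set()
    regions = defaultdict(list)
    for row in rows:
        if len(row) <= region_col:
            continue
        match = REGION_RE.match(row[region_col])
        if match is None:
            continue
        key = match.groups()
        if key in seen:
            continue
        seen.add(key)
        chrom, start, end = key
        regions[chrom].append((int(start), int(end)))

    table = []
    # Chromosomes keep the order in which they first appear in the file.
    for chrom, spans in regions.items():
        # Sort numerically by start so block labels follow genomic order.
        for index, (start, end) in enumerate(sorted(spans)):
            filename = f"{field_id}_c{chrom}_b{index}_v1.vcf.gz"
            table.append((f"c{chrom}", start, end, f"b{index}", filename))
    return table
```

Usage would be along the lines of `build_chunk_table(read_qc_rows("qc_metrics_graphtyper_v2.7.1_qc.tab.gz"))`, with the result written out as a tab-separated file.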

However, doing this (via the command-line interface) gives me a table of 58165 chunks, whereas there are 60630 WGS pVCF files on the platform, and I cannot think of an explanation for the difference.
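One way to narrow the discrepancy down is to compare chunk counts per chromosome. The sketch below assumes a plain list of the pVCF file names is available (for example, a saved directory listing) and that they follow the pattern of the example name ukb24304_c1_b0_v1.vcf.gz. The function names are hypothetical, and `qc_counts` stands for the per-chromosome number of unique regions found in the QC file.

```python
import re
from collections import Counter

# Assumed filename pattern, based on "ukb24304_c1_b0_v1.vcf.gz".
# Anchoring on ".vcf.gz" at the end skips index files such as ".vcf.gz.tbi".
FILE_RE = re.compile(r"_c([0-9A-Za-z]+)_b(\d+)_v\d+\.vcf\.gz$")


def blocks_per_chromosome(filenames):
    """Count pVCF files per chromosome label (e.g. "1", "X")."""
    counts = Counter()
    for name in filenames:
        match = FILE_RE.search(name)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_differences(file_counts, qc_counts):
    """Return {chromosome: files minus QC regions} where the two disagree."""
    return {
        chrom: n_files - qc_counts.get(chrom, 0)
        for chrom, n_files in file_counts.items()
        if n_files != qc_counts.get(chrom, 0)
    }
```

If the differences are concentrated in a few chromosomes, that would at least show where the sequential block numbering stops matching the actual files.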

Of note, one cannot simply assume that each chunk covers 50 kb, as there are exceptions (see: https://community.dnanexus.com/s/question/0D5t0000048q6XmCAI/is-there-a-map-for-which-regions-are-in-each-200k-wgs-pvcf-block).

Does anyone know why these >2000 WGS chunks seem to be missing from the qc metrics file? Also, how would you proceed to create the table I need?

Thanks
