I notice that chr4 and chr10 each have 9 more blocks than expected (compared to the 150k data).
They also don't map as expected if I assume each block is 50kb.
c4_b0 does contain 0 to 50kb, but c4_b1468 contains 72950kb to 73000kb, not the 73400kb to 73450kb you'd expect from multiplying 1468 by 50kb.
Yep, you can create the map once you know where the smaller blocks are. There are three cases to handle.
For genomic positions that fall before the smaller blocks (and for all positions on chromosomes other than 4 and 10) you can simply map by dividing the position by the block size: block = floor(position/50,000).
For genomic positions that come after the small blocks on chr4 and chr10, add an offset of 9 to the above, since ten 5kb blocks take the place of one 50kb block: block = floor(position/50,000) + 9.
And for positions that fall within the small-block range on chr4 and chr10, use the standard block size of 50,000 up to the start of the small-block range and then count in small blocks from there: block = floor(small_block_start/50,000) + floor((position - small_block_start)/5,000).
The small blocks all seem to be 5,000bp in size. For chr4 the small block range seems to be from 49,100,000 to 49,150,000, and for chr10 the range is from 41,850,000 to 41,900,000, but I can't say that we've confirmed the coordinates present in every block file.
Hope that helps!
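The three cases above can be sketched in Python as follows. This is a minimal sketch, not a verified implementation: the function name `block_index` and the `SMALL_BLOCK_RANGES` table are illustrative, the small-block coordinates are the unconfirmed ones quoted above, and the arithmetic follows the formulas exactly as posted. If blocks are labelled as 1-based closed intervals (for example chr4:73400001-73450000), a position that is an exact multiple of the block size may belong to the preceding block, so check such edge cases against the files themselves.

```python
BLOCK_SIZE = 50_000        # typical block size in bp
SMALL_BLOCK_SIZE = 5_000   # size of the unusually small blocks
EXTRA_BLOCKS = 9           # ten 5kb blocks replace one 50kb block

# Assumed small-block ranges (start, end) as quoted in the thread; unconfirmed.
SMALL_BLOCK_RANGES = {
    "chr4": (49_100_000, 49_150_000),
    "chr10": (41_850_000, 41_900_000),
}


def block_index(chrom: str, position: int) -> int:
    """Return the pVCF block index for a position, following the three cases."""
    small = SMALL_BLOCK_RANGES.get(chrom)
    # Case 1: other chromosomes, or positions before the small blocks.
    if small is None or position < small[0]:
        return position // BLOCK_SIZE
    start, end = small
    # Case 2: positions after the small blocks get a fixed offset.
    if position >= end:
        return position // BLOCK_SIZE + EXTRA_BLOCKS
    # Case 3: positions inside the small-block range.
    return start // BLOCK_SIZE + (position - start) // SMALL_BLOCK_SIZE


if __name__ == "__main__":
    print(block_index("chr4", 73_403_784))
```

For chr4:73403784 this gives block 1477 rather than 1468, which is the nine-block shift on chr4 noted in the original question.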
Comments
10 comments
There is a discussion about block coordinates for WGS here, where you can find the name of a file that could be used to identify block boundaries. Could you check whether it is applicable?
https://community.dnanexus.com/s/question/0D5t000003q8Y5yCAE/i-want-to-locate-the-pvcf-ukb23352c19-file-that-has-position-chr191764203317642056-there-are-1000-files-how-do-i-do-it
So, that file (qc_metrics_graphtyper_v2.7.1_qc.tab.gz) gives the boundaries of each block but not the name of the block. E.g. it tells me I can find chr4:73403784:T:A in the chr4:73400001-73450000 block, but I don't know which pVCF file contains that block.
For the 150k, you could guess which file contained a block by assuming each is 50kb (so chr4:73400001-73450000 should be in c4_b1468). This doesn't seem to hold for the 200k (under /Bulk/Whole genome sequences/Population level WGS variants, pVCF format - interim 200k release/, right?), where c4_b1468 seems to contain chr4:72950000-73000000.
It appears there are ten blocks from chr4 and ten from chr10 that only cover 5kb instead of the typical 50kb. Knowing this, I can map coordinates to blocks again, but do let me know if there's a reason for these blocks to be a different size - very unexpected!
The ten pVCF files that cover 5kb instead of 50kb on chr4 are:
ukb24304_c4_982_v1.vcf.gz
ukb24304_c4_983_v1.vcf.gz
...
ukb24304_c4_991_v1.vcf.gz
And the ten on chr10 are:
ukb24304_c10_837_v1.vcf.gz
ukb24304_c10_838_v1.vcf.gz
...
ukb24304_c10_846_v1.vcf.gz
Thanks for sharing your investigation. I don't think there is any particular reason, but let me ask UKB just to make sure.
Hi, I am trying to achieve the same (create the table/map you also needed).
I had assumed all chunks would be present in the qc_metrics_graphtyper_v2.7.1_qc.tab.gz file, but apparently that is not the case: if I sequentially assign b0...bN to each chunk reported in the table (column 5), I get 58165 chunks, while there are 60630 on the platform.
Did you find a solution, in the end?
Thanks!
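For reference, the sequential-numbering attempt described above could look roughly like the sketch below. It assumes the table is tab-separated, that column 5 (1-based) holds labels such as chr4:73400001-73450000, and that rows are already in positional order; none of this is confirmed here, and `number_chunks` is a hypothetical helper. As noted, this approach yields fewer chunks than there are files on the platform, so the result should not be trusted as a file map on its own.

```python
import csv
import gzip
from collections import defaultdict


def number_chunks(rows, column=4):
    """Give each distinct chunk label a 0-based index per chromosome,
    in order of first appearance. `column` is 0-based (4 = fifth column)."""
    next_index = defaultdict(int)
    mapping = {}
    for row in rows:
        # Skip short rows and anything that is not a region label (e.g. a header).
        if len(row) <= column or ":" not in row[column]:
            continue
        label = row[column]
        if label in mapping:
            continue
        chrom = label.split(":", 1)[0]
        mapping[label] = (chrom, next_index[chrom])
        next_index[chrom] += 1
    return mapping


def number_chunks_from_file(path):
    """Read a gzipped, tab-separated table (assumed format) and number its chunks."""
    with gzip.open(path, "rt", newline="") as handle:
        return number_chunks(csv.reader(handle, delimiter="\t"))
```

Comparing `len(mapping)` with the number of pVCF files on the platform is a quick way to see whether the table covers every block.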
Just to add, for anyone facing the same problem: I created a public GitHub repo (https://github.com/fmazzarotto/ukb_wgs_mapping) containing the WGS pVCF block map, along with the code I used to create it and the necessary input table.
Has anyone checked if these blocks are the same for the 500k WES data?
The 500k WES blocks are not the same as the 200k WGS blocks. I have created a file that has the minimum and maximum positions covered by each block, and code to identify which pVCF files cover regions of interest (given a bed file).
https://github.com/powege/UKB_WES_file_mapping
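The idea of matching regions of interest against per-block minimum and maximum positions can be sketched as a simple interval-overlap check. This is not the code from the repository above; the tuple layouts and the name `blocks_for_regions` are assumptions for illustration, as is the convention that block positions are 1-based inclusive while BED intervals are 0-based and half-open.

```python
def blocks_for_regions(blocks, regions):
    """Return the sorted names of block files that overlap any region.

    blocks:  iterable of (file_name, chrom, min_pos, max_pos), 1-based inclusive
    regions: iterable of (chrom, start, end) in BED convention (0-based, half-open)
    """
    blocks = list(blocks)  # allow repeated passes over the block table
    hits = set()
    for chrom, start, end in regions:
        # Convert the BED interval to 1-based inclusive coordinates.
        first, last = start + 1, end
        for name, block_chrom, block_min, block_max in blocks:
            # Two closed intervals overlap when each starts before the other ends.
            if block_chrom == chrom and first <= block_max and last >= block_min:
                hits.add(name)
    return sorted(hits)
```

A linear scan like this is fine for a handful of regions; for large BED files, sorting the blocks per chromosome and using a binary search would be the more usual choice.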
If relevant to anyone: I have renamed the 200k map repository to https://github.com/fmazzarotto/ukb_wgs_mapping_200k and created a new one with the map for the UKB 500k WGS release (https://github.com/fmazzarotto/ukb_wgs_mapping_500k).