Issue retrieving the data field 23196 - Whole genome GATK joint call pVCF

06 September 2022 00:00
1 comment

Dear community,

I'm interested in retrieving the data field 23196 - Whole genome GATK joint call pVCF. However, when I go to /Bulk/Whole genome sequences/Whole genome GraphTyper joint call, I can see about 60K individual "vcf.gz" files and their respective index files, besides the QC directory. I was wondering:

- Why can we only see 60,630 files instead of 150,076 as expected from the description ( https://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=23196 )?
- Is there a single file with the joint calls from all the individuals? We haven't been able to spot it ourselves, and we are wondering if it has a specific name.

Thank you in advance for your support,
Veronica

Comments

Chai Fungtammasan DNAnexus Team
- 07 September 2022 16:38
During joint calling, the variants at each position need to be considered across the entire population in order to calculate the likelihood of a variant given the information from the whole population. However, the calculation does not need to consider other positions, even within the same individual. It therefore makes more sense for each file to contain information from all individuals but cover only a block of a chromosome rather than the whole genome, which keeps the files easier to manage.

Therefore, each of the 60k files here does contain the data from all 150k individuals for a certain section of a chromosome. The files follow the naming convention fieldId_chromosome_blockNumber_v1.vcf.gz. In other words, these are the files you are looking for; they are just split into blocks within each chromosome.
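
If it helps to find the blocks belonging to one chromosome, here is a minimal Python sketch that parses names following the fieldId_chromosome_blockNumber_v1.vcf.gz convention and groups them by chromosome. The helper names (parse_block_name, group_by_chromosome) and the example file names are hypothetical, made up only for illustration. The exact spelling of the field, chromosome, and block tokens is not specified here, so the sketch treats them as opaque strings without underscores; adjust the pattern to match the names you actually see in the folder.

```python
import re
from collections import defaultdict

# Pattern for fieldId_chromosome_blockNumber_v1.vcf.gz.
# Assumption: none of the three leading tokens contains an underscore.
BLOCK_NAME = re.compile(
    r"^(?P<field>[^_]+)_(?P<chrom>[^_]+)_(?P<block>[^_]+)_v1\.vcf\.gz$"
)


def parse_block_name(filename):
    """Return (field, chromosome, block) or None if the name does not match."""
    match = BLOCK_NAME.match(filename)
    if match is None:
        return None
    return match.group("field"), match.group("chrom"), match.group("block")


def group_by_chromosome(filenames):
    """Map each chromosome token to the list of block files for it."""
    groups = defaultdict(list)
    for name in filenames:
        parsed = parse_block_name(name)
        if parsed is not None:  # skips index files and anything else
            groups[parsed[1]].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical file names, only to show the shape of the result.
    names = [
        "23196_chr1_0_v1.vcf.gz",
        "23196_chr1_1_v1.vcf.gz",
        "23196_chr2_0_v1.vcf.gz",
        "23196_chr1_0_v1.vcf.gz.tbi",
    ]
    for chrom, files in group_by_chromosome(names).items():
        print(chrom, len(files))
```

Each file matched this way still holds all individuals, so collecting every block for a chromosome gives the joint calls for that whole chromosome.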
