Question regarding batches
I've been looking at the pVCF files of e.g. DRAGEN, and have some questions regarding batches and the VCF file-splits.
I saw in category 187 that there are e.g. different sequencing providers, different shipment batch numbers, etc.
Are there any recommendations from UK Biobank or the community on whether any of these (or other QC fields) should be included as covariates in analyses of the WGS data?
Also, in this context, is there any meaning to the 'b'-numbering of the VCF files (e.g. ukb24310_c22_b288_v1.vcf.gz, ukb24310_c22_b289_v1.vcf.gz, ...), or are they simply consecutive chunks of variants split for size reasons?
Comments
To answer the second part of your question: each “b”-numbered pVCF lists all the variants within one chunk of chromosome coordinates (approx. 20000 bp per chunk for field 24310), so the numbering reflects position along the chromosome rather than any sequencing or shipment batch. Note that areas of low variability are likely to produce header-only VCFs.
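For anyone scripting over these files, here is a minimal Python sketch of how the chunk number could be read from a file name and how a header-only chunk could be detected. The file name pattern is inferred only from the two example names quoted in the question and is an assumption, not an official specification; `parse_pvcf_name` and `is_header_only` are hypothetical helper names.

```python
import gzip
import re

# Assumed pattern, inferred from names such as ukb24310_c22_b288_v1.vcf.gz.
# It is not an official UK Biobank naming specification.
PVCF_NAME = re.compile(
    r"^ukb(?P<field>\d+)_c(?P<chrom>[0-9A-Za-z]+)_b(?P<block>\d+)_v(?P<version>\d+)\.vcf\.gz$"
)


def parse_pvcf_name(name):
    """Split a pVCF file name into field, chromosome, block and version."""
    match = PVCF_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised pVCF file name: {name}")
    return {
        "field": int(match["field"]),
        "chrom": match["chrom"],
        "block": int(match["block"]),
        "version": int(match["version"]),
    }


def is_header_only(path):
    """Return True if the compressed VCF has no variant records."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header lines start with '#'; anything else non-blank is a record.
            if line.strip() and not line.startswith("#"):
                return False
    return True
```

Sorting files by the parsed `block` value (as an integer, not as text) should then give them in chromosome order, and `is_header_only` can be used to skip the empty chunks mentioned above.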
We are currently working on releasing an index file for field 24310, as well as creating a GitHub notebook so researchers can create index files for other pVCF fields.
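Until that index is available, a rough one can be built by recording the first and last variant position in each chunk. The following is only a sketch under stated assumptions: it is not the forthcoming UK Biobank index or notebook, `block_span` and `build_index` are hypothetical names, and it assumes the files are standard gzip/bgzip-compressed VCFs with CHROM and POS as the first two tab-separated columns.

```python
import gzip


def block_span(path):
    """Return (chrom, first_pos, last_pos) for one pVCF chunk.

    Returns (None, None, None) for a header-only file.
    """
    chrom = first = last = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            # Split off only CHROM and POS; pVCF lines are very long.
            fields = line.split("\t", 2)
            chrom = fields[0]
            pos = int(fields[1])
            if first is None:
                first = pos
            last = pos
    return chrom, first, last


def build_index(paths):
    """Build one index row per pVCF file."""
    rows = []
    for path in paths:
        chrom, first, last = block_span(path)
        rows.append(
            {"file": path, "chrom": chrom, "first_pos": first, "last_pos": last}
        )
    return rows
```

Reading every chunk in full this way is slow for population-scale files, so this is best suited to a one-off pass whose output is saved and reused.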
This new post may be relevant: https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/17661209300637-Quality-control-and-metrics
This new post may be relevant: https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/18672196405277