Extracting a group of samples from the 500K WGS pVCF files
Hi,
I need to extract a small group of samples (~1,000) from the joint-called pVCF files (Data-Field 23374) for the 500K WGS data. There are about 150K pVCF segments, so it's very slow and expensive to download and filter all of them. Is there a simple, cheap way to get WGS genotype information for a small subset of the data? I appreciate that PLINK files will be available at some point in the future, but I would prefer to get this moving earlier than that.
Thanks
Comments
Would the VCF files in field 23370 https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23370 be better?
If you can't see the field 23370 files in your RAP project, you might need to dispense the third bundle of data; see section 2, Q3 in https://www.ukbiobank.ac.uk/media/dovbae03/uk-biobank-final-whole-genome-sequencing-release-faqs_v1-0.pdf
Hi Rachael,
I had seen them, but they are gVCFs that haven't been joint called. Do I need to run these through the rest of the GATK pipeline (e.g. GenotypeGVCFs, VQSR, etc.) to get results similar to the pVCFs?
Thanks
Jonny
I don't know, sorry. I'm not a geneticist, and I don't understand why you need the data for 1000 samples to be in pVCF format. If you describe what you need to do with the data, I'll ask one of my colleagues whether they can suggest anything.
Hi Jonny,
If you wish to filter the pVCF files, I would recommend submitting the filtering as jobs on the RAP; you can alter the instance type and priority to reduce costs. bcftools and plink are both part of the swiss-army-knife tool, but you can also build your own apps or use a Docker image.
For example:
# VCF_list holds one pVCF path per line; samples_list.txt must also be available to each job (e.g. as a second -iin input)
while read -r FILE; do
  VCF=${FILE##*/}
  CMD="bcftools view -S samples_list.txt ${VCF} > ${VCF%.vcf.gz}_filt1K.vcf"
  runid=$(dx run swiss-army-knife \
    -iin="${FILE}" -icmd="${CMD}" --name="filter_${VCF%.vcf.gz}" \
    --instance-type=mem2_ssd1_v2_x4 --destination=deCODE_SV_VCF/ --yes --brief)
done < VCF_list
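With around 150K segments, it may be worth printing the commands and checking them before submitting anything. The following is a minimal sketch of a dry run, assuming the same file names as the example above (VCF_list, samples_list.txt) and that samples_list.txt is made available to each job. The names build_cmd, submit_all and DRY_RUN are hypothetical helpers for illustration and are not part of dx or swiss-army-knife. It also passes --priority low to dx run as one possible way of reducing cost.

```shell
#!/usr/bin/env bash
# Sketch only: build_cmd, submit_all and DRY_RUN are hypothetical names for illustration.

# Build the bcftools command that would run inside the job for one pVCF path.
build_cmd() {
  local vcf=${1##*/}   # strip the project/folder prefix, keeping the file name
  printf 'bcftools view -S samples_list.txt %s > %s_filt1K.vcf' "$vcf" "${vcf%.vcf.gz}"
}

# Read pVCF paths (one per line) from the file given as $1, then print or submit one job per path.
submit_all() {
  local file cmd
  while read -r file || [ -n "$file" ]; do
    [ -z "$file" ] && continue   # skip blank lines
    cmd=$(build_cmd "$file")
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run (the default): show what would be submitted without calling dx.
      printf 'dx run swiss-army-knife -iin=%s -icmd="%s" --priority low --yes --brief\n' "$file" "$cmd"
    else
      dx run swiss-army-knife -iin="$file" -icmd="$cmd" \
        --priority low --yes --brief
    fi
  done < "$1"
}
```

Calling `submit_all VCF_list` prints one line per segment for inspection; `DRY_RUN=0 submit_all VCF_list` would submit the jobs. The instance type and destination from the example above can be added to the dx run line in the same way.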