Extracting group of samples from 500K WGS pvcf files

30 July 2024 09:08
4 comments

Hi,

I need to extract a small group of samples (~1000) from the joint-called pVCF files (Data-Field 23374) for the 500K WGS data. There are about 150K pVCF segments, so it's very slow and expensive to download and filter them all. Is there a simple or cheap way to get WGS genotype information for a small subset of the data? I appreciate that PLINK files will be available at some point in the future, but I would prefer to get this moving before then.

Thanks

Comments

Rachael W UKB Community team Data Analyst
- 30 July 2024 09:32
Would the VCF files in field 23370 https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23370 be better?
If you can't see field 23370 files in your RAP project, you might need to dispense the third bundle of data, see section 2 q3 in https://www.ukbiobank.ac.uk/media/dovbae03/uk-biobank-final-whole-genome-sequencing-release-faqs_v1-0.pdf

Jonny James Else
- 30 July 2024 09:48
Hi Rachael,
I had seen them, but they are gVCFs that haven't been joint called. Do I need to run these through the rest of the GATK pipeline (e.g. GenotypeGVCFs, VQSR) to get results similar to the pVCFs?
Thanks
Jonny

Rachael W UKB Community team Data Analyst
- 30 July 2024 10:25
I don't know, sorry. I'm not a geneticist, and I don't understand why you need the data for 1000 samples to be in pVCF format. If you describe what you need to do with the data, I'll ask one of my colleagues whether they can suggest anything.

George F UKB Community team Data Analyst
- 05 August 2024 09:08
Hi Jonny,
If you wish to filter the pVCF files, I would recommend submitting the filtering as jobs; you can alter the instance type and priority to reduce costs. bcftools and plink are part of the swiss-army-knife tool, but you can also build your own apps or use a Docker image.
For example:
while read -r FILE; do
  VCF=${FILE##*/}; OUT=${VCF%.vcf.gz}_filt1K
  # samples_list.txt is also passed with -iin so that it is present on the worker
  CMD="bcftools view -S samples_list.txt ${VCF} > ${OUT}.vcf"
  runid=$(dx run swiss-army-knife \
    -iin="${FILE}" -iin=samples_list.txt -icmd="${CMD}" --name="filter_${OUT}" --instance-type=mem2_ssd1_v2_x4 --destination=deCODE_SV_VCF/ \
    --yes --brief)
done < VCF_list
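
The loop above sets the instance type but not the priority. As a rough sketch of the same idea with the priority lowered too, the helper below builds one `dx run` call per pVCF and, by default, only prints it so it can be checked before anything is submitted. The names `SAMPLES`, `DEST`, `DRY_RUN`, `out_prefix`, `build_cmd` and `submit_one`, and the `filtered_pvcf/` folder, are made up for illustration. It assumes `samples_list.txt` is a file in the current project folder, and it writes bgzipped output with `-Oz`, which is a choice of this sketch rather than something from the example above. Low-priority jobs are usually cheaper but can be interrupted and restarted, so check that this suits your project.

```shell
#!/bin/sh
# Sketch only: one low-priority swiss-army-knife job per pVCF segment.
# SAMPLES, DEST and DRY_RUN are hypothetical settings, not platform variables.
SAMPLES=${SAMPLES:-samples_list.txt}   # one sample ID per line
DEST=${DEST:-filtered_pvcf/}           # hypothetical output folder in the project
DRY_RUN=${DRY_RUN:-1}                  # 1 = print the dx command instead of running it

# Derive an output prefix from a project path: /a/b/x.vcf.gz -> x_filt1K
out_prefix() {
  local base=${1##*/}
  printf '%s_filt1K\n' "${base%.vcf.gz}"
}

# Command run on the worker: keep only the listed samples, write bgzipped VCF.
build_cmd() {
  local base=${1##*/}
  printf 'bcftools view -S %s -Oz -o %s.vcf.gz %s\n' \
    "$SAMPLES" "$(out_prefix "$1")" "$base"
}

# Assemble the dx call for one file; print it (dry run) or submit it.
submit_one() {
  local file=$1
  set -- dx run swiss-army-knife \
    -iin="$file" -iin="$SAMPLES" -icmd="$(build_cmd "$file")" \
    --name="filter_$(out_prefix "$file")" \
    --instance-type=mem2_ssd1_v2_x4 --priority low \
    --destination="$DEST" --yes --brief
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

To use it, source the file and run `while read -r FILE; do submit_one "$FILE"; done < VCF_list`, first with the default dry run to inspect the commands, then with `DRY_RUN=0` to submit them.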
