Extracting a group of samples from 500K WGS pVCF files

Hi,

I need to extract a small group (~1,000) of samples from the joint-called pVCF files (Data-Field 23374) for the 500K WGS data. There are about 150K pVCF segments, so it's very slow and expensive to download and filter all of them. Is there any simple or cheap way to get WGS genotype information for a small subset of the data? I appreciate that plink files will be available sometime in the future, but I would prefer to get this moving earlier than that.

Thanks

Comments


  • Rachael W (UKB Community team, Data Analyst)

    Would the VCF files in field 23370 https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23370 be better?

    If you can't see field 23370 files in your RAP project, you might need to dispense the third bundle of data, see section 2 q3 in https://www.ukbiobank.ac.uk/media/dovbae03/uk-biobank-final-whole-genome-sequencing-release-faqs_v1-0.pdf 

  • Jonny James Else

    Hi Rachael,

    I had seen them, but they are gVCFs that haven't been joint called. Do I need to run these through the rest of the GATK pipeline (e.g. GenotypeGVCFs, VQSR, etc.) to get results similar to the pVCFs?

    Thanks

    Jonny

  • Rachael W (UKB Community team, Data Analyst)

    I don't know, sorry. I'm not a geneticist, and I don't understand why you need the data for 1,000 samples to be in pVCF format. If you describe what you need to do with the data, I'll ask one of my colleagues whether they can suggest anything.

  • George F (UKB Community team, Data Analyst)

    Hi Jonny,

    If you wish to filter the pVCF files, I would recommend submitting the filtering as jobs; you can change the instance type and priority to reduce costs. bcftools and plink are both included in the swiss-army-knife tool, but you can also build your own apps or use a Docker image.

    For example:

    while read -r FILE; do
        VCF=${FILE##*/}
        CMD="bcftools view -S samples_list.txt ${VCF} > ${VCF%.vcf.gz}_filt1K.vcf"
        # samples_list.txt probably also needs passing as an input (an extra -iin=) so the job can read it
        runid=$(dx run swiss-army-knife \
                   -iin="${FILE}" -icmd="${CMD}" --name="filter_${VCF%.vcf.gz}" --instance-type=mem2_ssd1_v2_x4 --destination=deCODE_SV_VCF/ \
                   --yes --brief)
    done < VCF_list
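
    The advice above about lowering the priority and passing a sample list can be sketched as a dry-run variant of the same loop. This is a minimal, untested sketch rather than an official UKB recipe: the folder name filtered_pvcf/, the _filt1K suffix, the location of samples_list.txt in the project, and the DRY_RUN switch are all assumptions made for illustration. By default it only prints the dx commands so they can be checked before any job is submitted.

```shell
# Sketch only: prints the dx commands unless DRY_RUN=0 is set.
# SAMPLES and DEST are hypothetical names; adjust them to your project.
SAMPLES="samples_list.txt"   # assumed: one sample ID per line, stored in the project
DEST="filtered_pvcf/"        # assumed output folder in the project
DRY_RUN="${DRY_RUN:-1}"

# Build the bcftools command that runs inside the swiss-army-knife job.
build_filter_cmd() {
    vcf="${1##*/}"                          # file name without the project folder
    out="${vcf%.vcf.gz}_filt1K.vcf.gz"      # hypothetical output name
    # -S keeps only the listed samples; -Oz writes bgzip-compressed VCF to -o
    printf 'bcftools view -S %s -Oz -o %s %s' "${SAMPLES##*/}" "$out" "$vcf"
}

# Print (dry run) or submit one low-priority job for a single pVCF path.
submit_filter_job() {
    file="$1"
    cmd="$(build_filter_cmd "$file")"
    if [ "$DRY_RUN" = "1" ]; then
        printf 'dx run swiss-army-knife -iin=%s -iin=%s -icmd="%s" --instance-type=mem2_ssd1_v2_x4 --priority low --destination=%s --yes --brief\n' \
            "$file" "$SAMPLES" "$cmd" "$DEST"
    else
        dx run swiss-army-knife -iin="$file" -iin="$SAMPLES" -icmd="$cmd" \
            --instance-type=mem2_ssd1_v2_x4 --priority low \
            --destination="$DEST" --yes --brief
    fi
}

# VCF_list: one project path per line, as in the loop above.
if [ -f VCF_list ]; then
    while read -r FILE; do
        submit_filter_job "$FILE"
    done < VCF_list
fi
```

    Running it once with the default DRY_RUN=1 and reading the printed commands is a cheap way to catch quoting or path mistakes before paying for 150K jobs; setting DRY_RUN=0 then submits them. Low-priority jobs can be interrupted and restarted, so whether they are actually cheaper overall depends on how long each segment takes.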
     
