participant IDs in individual VCFs
Hello, I wanted to confirm the following:
I am working with some individual level VCFs that I dispensed – i.e., “/Bulk/DRAGEN WGS/Whole genome STR call files (DRAGEN) [500k release]/”
Am I correct that the integer ID in the file name (i.e., `5801163` in `5801163_24062_0_0.dragen.repeats.vcf.gz`) corresponds to the participant ID, and that I should disregard the sample ID in the actual VCF? (The sample ID in that VCF differs from the one in the file name.) I searched for a participant with each of the IDs and found that only the ID in the file name matched a participant ID.
If so, is there a reason why the IDs in the VCF and the file name differ?
Thanks
Comments
Yes, you are correct.
Section 3, Question 4 of the WGS 500k FAQ (https://www.ukbiobank.ac.uk/media/dovbae03/uk-biobank-final-whole-genome-sequencing-release-faqs_v1-0.pdf) says: “Why are the EIDs in the header of the gVCF and CRAM different to the filename? The EID in the filename is pseudonymised to match your application EIDs. These EIDs are consistent across your project space, for all bulk and tabular data. Please disregard any sample IDs within the gVCF, VCF and CRAM files.”
See also https://dnanexus.gitbook.io/uk-biobank-rap/frequently-asked-questions#are-the-headers-of-gvcf-or-cram-files-pseudonymized
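In practice, this means taking the EID from the file name rather than from the VCF header. Here is a minimal Python sketch of that. It assumes every file name follows the same `<EID>_<field>_<instance>_<array>.<suffix>` layout as the single example in this thread; the function name `eid_from_filename` and the regular expression are illustrative and not part of any UK Biobank or DNAnexus tool.

```python
import re
from pathlib import Path

# Assumed layout, inferred only from the example file name in this thread:
#   <EID>_<field>_<instance>_<array>.<suffix>
# e.g. 5801163_24062_0_0.dragen.repeats.vcf.gz
_FILENAME_RE = re.compile(
    r"^(?P<eid>\d+)_(?P<field>\d+)_(?P<instance>\d+)_(?P<array>\d+)\."
)


def eid_from_filename(path: str) -> int:
    """Return the application-specific EID encoded in a bulk file name.

    The sample ID stored inside the VCF itself is deliberately ignored,
    as the FAQ advises.
    """
    name = Path(path).name  # drop any directory components
    match = _FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unexpected file name layout: {name!r}")
    return int(match.group("eid"))


if __name__ == "__main__":
    print(eid_from_filename("5801163_24062_0_0.dragen.repeats.vcf.gz"))
```

The returned integer can then be matched against the participant IDs in your project's tabular data. If your files use a different naming layout, the pattern would need adjusting.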
Thanks Rachael! Will be sure to check out both those FAQ links in the future too for any questions I may have.