Filtering of UK Biobank WGS data
I would like to use whole-genome sequencing (WGS) data for association analysis, so I selected the UK Biobank 24308 dataset (is this dataset an appropriate choice?). Before running regenie step 2, I applied several basic quality control filters to the data, including --mac 20, --geno 0.1, --mind 0.1, and --hwe 1e-15. However, after applying these filters, different chromosomes retained different numbers of individuals, and chromosome 18 in particular retained so few individuals that it cannot be used for downstream analysis.
The number of individuals retained for each chromosome is listed below:
490542 ../WGS/ukb24308_c10_b0_v1_mac20_geno01_mind01_hwe.psam
490417 ../WGS/ukb24308_c11_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c12_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c13_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c14_b0_v1_mac20_geno01_mind01_hwe.psam
490331 ../WGS/ukb24308_c15_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c16_b0_v1_mac20_geno01_mind01_hwe.psam
489727 ../WGS/ukb24308_c17_b0_v1_mac20_geno01_mind01_hwe.psam
434 ../WGS/ukb24308_c18_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c19_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c1_b0_v1_mac20_geno01_mind01_hwe.psam
289828 ../WGS/ukb24308_c20_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c21_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c22_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c2_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c2_b0_v1_maf001_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c3_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c4_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c5_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c6_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c7_b0_v1_mac20_geno01_mind01_hwe.psam
490542 ../WGS/ukb24308_c8_b0_v1_mac20_geno01_mind01_hwe.psam
488355 ../WGS/ukb24308_c9_b0_v1_mac20_geno01_mind01_hwe.psam
476320 ../WGS/ukb24308_cX_b0_v1_mac20_geno01_mind01.psam
How should I address this issue? Is it possible that filtering on sample missingness (--mind) at the per-chromosome level is inappropriate in this context?
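One way to check whether the per-chromosome --mind is the culprit is to compute each participant's missing rate across all chromosomes before excluding anyone. The sketch below is only an illustration under assumptions not stated in the thread: that `plink2 --missing` has been run per chromosome, and that each resulting .smiss table is tab-separated with IID, MISSING_CT and OBS_CT columns. The function name and the toy tables are hypothetical.

```python
import csv
import io
from collections import defaultdict


def genome_wide_missingness(smiss_texts):
    """Combine per-chromosome .smiss tables into one missing rate per sample.

    smiss_texts: iterable of strings, each the full text of one .smiss file
    (assumed tab-separated, with IID, MISSING_CT and OBS_CT columns).
    """
    missing = defaultdict(int)
    observed = defaultdict(int)
    for text in smiss_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for row in reader:
            # The first header column starts with '#', so the sample ID
            # column is named either 'IID' (after '#FID') or '#IID'.
            iid = row["IID"] if "IID" in row else row["#IID"]
            missing[iid] += int(row["MISSING_CT"])
            observed[iid] += int(row["OBS_CT"])
    # Pooled rate: total missing calls divided by total calls considered.
    return {iid: missing[iid] / observed[iid]
            for iid in observed if observed[iid] > 0}


# Toy input: s1 is missing 30% of calls on one small chromosome
# but nothing on a much larger one.
chr_a = "#IID\tMISSING_CT\tOBS_CT\tF_MISS\ns1\t30\t100\t0.3\ns2\t0\t100\t0\n"
chr_b = "#IID\tMISSING_CT\tOBS_CT\tF_MISS\ns1\t0\t900\t0\ns2\t0\t900\t0\n"
rates = genome_wide_missingness([chr_a, chr_b])
```

In this toy input, s1 would fail --mind 0.1 on the first chromosome alone, yet its pooled missing rate is well below 0.1. A single keep-list derived from the pooled rates could then be applied to every chromosome, so that all chromosomes retain the same individuals.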
Comments
Dear Huijie,
To reduce the chance of removing too many participants based on missingness, remove the non-PASS variants first, and only then apply the sample-level missingness filter (--mind).
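A toy illustration of why the order matters (a sketch with made-up data, not the actual UK Biobank pipeline): if missing calls are concentrated in non-PASS variants, a participant can exceed the --mind threshold when those variants are still present and pass once they are gone. The function names, the `variant_is_pass` flags and the genotypes below are all hypothetical.

```python
def sample_missing_rate(calls):
    """Fraction of missing calls (None) for one sample."""
    return sum(c is None for c in calls) / len(calls)


def samples_passing_mind(genotypes, variant_is_pass, mind=0.1, pass_only=True):
    """Return the IDs of samples whose missing rate is at most `mind`.

    genotypes: dict of sample ID -> list of calls (None means missing).
    variant_is_pass: list of booleans, one per variant (True = PASS).
    pass_only: if True, drop non-PASS variants before computing missingness.
    """
    kept = []
    for sample_id, calls in genotypes.items():
        if pass_only:
            calls = [c for c, ok in zip(calls, variant_is_pass) if ok]
        if calls and sample_missing_rate(calls) <= mind:
            kept.append(sample_id)
    return kept


# Ten variants; the last two are flagged non-PASS.
variant_is_pass = [True] * 8 + [False] * 2
genotypes = {
    "s1": [0, 1, 0, 2, 0, 0, 1, 0, None, None],  # missing only at non-PASS sites
    "s2": [0, 0, 1, 0, 2, 0, 0, 1, 0, 0],        # no missing calls
    "s3": [None, None, 0, 1, 0, 0, 2, 0, 0, 1],  # missing at PASS sites
}

kept_without_variant_filter = samples_passing_mind(
    genotypes, variant_is_pass, pass_only=False)
kept_after_variant_filter = samples_passing_mind(
    genotypes, variant_is_pass, pass_only=True)
```

Here s1 is dropped when all variants are counted but retained once the non-PASS variants are excluded, while s3 is dropped either way because its missing calls fall at PASS sites.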
Kind regards
George
Dear George,
Thank you very much for the suggestion. That makes a lot of sense and is very helpful.
Best regards,
Huijie