Filtering of UK Biobank WGS data

Huijie Yang

I would like to use whole-genome sequencing (WGS) data for association analysis, so I selected the UK Biobank 24308 dataset (is this dataset an appropriate choice?). Before running regenie step 2, I applied several basic quality control filters to the data, including --mac 20, --geno 0.1, --mind 0.1, and --hwe 1e-15. However, after applying these filters, different chromosomes retained different numbers of individuals, and chromosome 18 in particular retained so few individuals that it cannot be used for downstream analysis.

The number of individuals retained for each chromosome is listed below:
   490542 ../WGS/ukb24308_c10_b0_v1_mac20_geno01_mind01_hwe.psam
   490417 ../WGS/ukb24308_c11_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c12_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c13_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c14_b0_v1_mac20_geno01_mind01_hwe.psam
   490331 ../WGS/ukb24308_c15_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c16_b0_v1_mac20_geno01_mind01_hwe.psam
   489727 ../WGS/ukb24308_c17_b0_v1_mac20_geno01_mind01_hwe.psam
      434 ../WGS/ukb24308_c18_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c19_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c1_b0_v1_mac20_geno01_mind01_hwe.psam
   289828 ../WGS/ukb24308_c20_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c21_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c22_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c2_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c2_b0_v1_maf001_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c3_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c4_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c5_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c6_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c7_b0_v1_mac20_geno01_mind01_hwe.psam
   490542 ../WGS/ukb24308_c8_b0_v1_mac20_geno01_mind01_hwe.psam
   488355 ../WGS/ukb24308_c9_b0_v1_mac20_geno01_mind01_hwe.psam
   476320 ../WGS/ukb24308_cX_b0_v1_mac20_geno01_mind01.psam

How should I address this issue? Is it possible that filtering on sample missingness (--mind) at the per-chromosome level is inappropriate in this context?
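
As an illustration of the per-chromosome concern, here is a minimal Python sketch that pools per-chromosome sample missingness counts into a single genome-wide rate before applying a threshold. It assumes the counts come from plink2 `--missing` sample reports (`.smiss` files with `IID`, `MISSING_CT` and `OBS_CT` columns); the function name `genome_wide_missingness` and the toy numbers are hypothetical and are not taken from the UK Biobank data.

```python
from collections import defaultdict


def genome_wide_missingness(smiss_texts, threshold=0.1):
    """Pool per-chromosome .smiss contents and flag samples above threshold.

    smiss_texts: iterable of strings, each the full text of one .smiss file
    (assumed layout: a '#'-prefixed header with IID, MISSING_CT, OBS_CT).
    Returns (rates, flagged): pooled missing rate per IID, and the set of
    IIDs whose pooled rate exceeds the threshold.
    """
    totals = defaultdict(lambda: [0, 0])  # IID -> [missing calls, observed sites]
    for text in smiss_texts:
        lines = text.strip().splitlines()
        header = lines[0].lstrip("#").split()
        iid_col = header.index("IID")
        miss_col = header.index("MISSING_CT")
        obs_col = header.index("OBS_CT")
        for line in lines[1:]:
            fields = line.split()
            totals[fields[iid_col]][0] += int(fields[miss_col])
            totals[fields[iid_col]][1] += int(fields[obs_col])
    rates = {iid: m / o for iid, (m, o) in totals.items() if o > 0}
    flagged = {iid for iid, rate in rates.items() if rate > threshold}
    return rates, flagged


# Toy example (made-up numbers): S1 looks bad on one chromosome only.
chr18 = "#IID\tMISSING_CT\tOBS_CT\tF_MISS\nS1\t30\t100\t0.3\nS2\t5\t100\t0.05\n"
chr1 = "#IID\tMISSING_CT\tOBS_CT\tF_MISS\nS1\t10\t900\t0.0111\nS2\t9\t900\t0.01\n"
rates, flagged = genome_wide_missingness([chr18, chr1])
print(rates, flagged)
```

In this toy case S1 would fail a 0.1 threshold on the chromosome 18 file alone but passes once the counts are pooled, which is the kind of discrepancy that per-chromosome sample filtering can introduce.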

Comments

  • George F (UKB Community team, Data Analyst)

    Dear Huijie,

    To reduce the chance of removing too many participants based on missingness, remove the non-PASS variants before applying the sample-level missingness filter.

    Kind regards

    George
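
The ordering suggested above can be sketched as two separate plink2 runs: variant-level filters first, sample-level missingness second. The Python snippet below only builds the argument lists and does not run anything. Several things in it are assumptions: that the `.pvar` files carry a FILTER column so that plink2's `--var-filter` can drop non-PASS variants, that the data are in pgen format loadable with `--pfile`, and the function name and output suffixes, which are hypothetical. The thresholds are the ones from the original post.

```python
def build_two_pass_commands(prefix, out_prefix):
    """Return plink2 argument lists: variant-level QC, then sample-level QC."""
    varqc_prefix = out_prefix + "_varqc"  # hypothetical intermediate name

    # Pass 1: variant-level filters only, so poor-quality sites are removed
    # before any sample is judged on missingness.
    variant_pass = [
        "plink2", "--pfile", prefix,
        "--var-filter",  # assumption: .pvar has a FILTER column (non-PASS dropped)
        "--mac", "20",
        "--geno", "0.1",
        "--hwe", "1e-15",
        "--make-pgen", "--out", varqc_prefix,
    ]

    # Pass 2: sample-level missingness on the already-filtered variants.
    sample_pass = [
        "plink2", "--pfile", varqc_prefix,
        "--mind", "0.1",
        "--make-pgen", "--out", varqc_prefix + "_mind01",
    ]
    return variant_pass, sample_pass


# Example with a hypothetical chromosome 18 prefix.
for cmd in build_two_pass_commands("ukb24308_c18_b0_v1", "c18"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the paths are adjusted. Splitting the work into two runs makes the order explicit rather than relying on how a single invocation orders its filters.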

  • Huijie Yang

    Dear George,

    Thank you very much for the suggestion. That makes a lot of sense and is very helpful.

    Best regards,
    Huijie
