Inconsistency in participants used in phenotype regression while running PRSice2

Brandon Lien

Hello,

I am running PRSice2 to calculate polygenic risk scores (PRS) for subcortical brain volumes. I converted the UKB BGEN data into PLINK binary format.

My input files are set up as follows:

  • The phenotype file lists participants in the same order as the BGEN files and currently includes only one phenotype: volume of thalamus.
  • The covariates file includes only a subset of participants of interest.
  • Those not in the covariates file have their phenotype listed as NA (since thalamic volume is quantitative). 

The PRSice2 script runs successfully (through Swiss Army Knife). However, when I examine the .best file:

  • Some participants not present in the covariates file (and with NA volume of thalamus in the phenotype file) are still marked as included in regression, noted by a “Yes” in the “In_Regression” column of the .best file.
  • Moreover, many participants present in the covariates file with a valid phenotype value in the phenotype file are not included in the phenotype regression.

Interestingly, this issue does not arise when I run PRSice2 directly on the BGEN data. However, that approach is not sustainable for the number of SNPs I am interested in (the job takes too long, likely because working with the BGEN data is slow compared to the PLINK binaries).

I have checked multiple resources on how to properly convert BGEN to PLINK format and do not see any discrepancies in my approach. Additionally, the .fam files generated during conversion do not seem to be used by PRSice2 in a way that explains this behavior.
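
For anyone who wants to reproduce this kind of check, below is a minimal sketch of how the sample IDs in a converted .fam file could be compared against a phenotype file. It assumes both files are whitespace-delimited with FID and IID in the first two columns, that the .fam file has no header, and that the phenotype file has one. The helper names (read_ids, compare_ids) and the in-memory sample data are hypothetical and only illustrate the comparison; this is not what PRSice2 does internally.

```python
import io


def read_ids(handle, has_header, fid_col=0, iid_col=1):
    """Return the ordered list of (FID, IID) pairs from a whitespace-delimited file."""
    ids = []
    for line_number, line in enumerate(handle):
        fields = line.split()
        # Skip blank lines and, if present, the header row.
        if not fields or (has_header and line_number == 0):
            continue
        ids.append((fields[fid_col], fields[iid_col]))
    return ids


def compare_ids(fam_ids, pheno_ids):
    """Report IDs missing from either file and whether the row order matches."""
    fam_set, pheno_set = set(fam_ids), set(pheno_ids)
    return {
        "only_in_fam": sorted(fam_set - pheno_set),
        "only_in_pheno": sorted(pheno_set - fam_set),
        "same_order": fam_ids == pheno_ids,
    }


# Hypothetical stand-ins for the real files; replace with open("...") handles.
fam_file = io.StringIO(
    "1001 1001 0 0 1 -9\n"
    "1002 1002 0 0 2 -9\n"
    "1003 1003 0 0 1 -9\n"
)
pheno_file = io.StringIO(
    "FID IID Thalamus\n"
    "1001 1001 7500.2\n"
    "1003 1003 NA\n"
    "1002 1002 8100.5\n"
)

report = compare_ids(read_ids(fam_file, has_header=False),
                     read_ids(pheno_file, has_header=True))
print(report)
```

If the two ID sets agree but the order differs, or if the FID/IID values were rewritten during conversion, that would be one place to look, though I have not confirmed that either is the cause here.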

Does anyone have insight into what might be causing this mismatch in sample inclusion?

Thanks, Brandon
