Performing GWAS on RAP using DX toolkit and swiss-army-knife

22 August 2022 00:00
30 comments

First, I want to thank DNANexus for the tutorials offered to date. They have done a tremendous amount of work and I am in their debt. Currently I feel that DX toolkit is the most powerful way to interact with their data platform and I have decided to make my GWAS workflows available to the community.

https://github.com/pjgreer/ukb-rap-tools

I have take the AD by proxy tutorial and the liftover wdl script and have expanded it to include the imputed data set, as well as standalone workflows for plink2.

I modified the GT array data prep and liftover because I felt it was a waste to run liftover on all the array data for all subjects. You only need to run liftover on the subjects you will be using for your analysis, and it makes sense to perform the QC filtering prior to liftover as well. I also implemented an LD filter for variants with r^2 >0.4 bringing the total number of variants in the array after filtering to ~410K. This should be more than enough for most analyses.

Some might still want to run GWAS on the imputed datasets, so how to process that data is presented here as well.

Finally, plink2 is quite fast and requires far less preparatory steps, So for the impatient, I present how to run GWAS on both WES and imputed data inside the RAP here as well. A complete plink GWAS can be run in less than 4 hours.

I do not go into how to create the phenotype and covariate files, but there is an important caveat here.for plink NA values should be coded as -9 and in regenie they must be coded as NA.

This is a work in progress, but it appears to work well for me. I may be altering the scripts specifically the directory structure in your project.

I would appreciate any and all feedback.

Comments

30 comments

Former User of DNAx Community_10
- 30 September 2022 14:30
When I run .\11a-gwas-s2-imp37-qc-filter.sh I get error "could not find a project named C"
What am I doing wrong?

0
Former User of DNAx Community_10
- 30 September 2022 14:37
How do I do this? I can't find these scripts to execute

# Requirements:
# 0-4 - please refer to readme.md
# 5. Must have executed:
# - scripts 1-3 from partA_prep_gtfile_4_GWAS

0
Former User of DNAx Community_28
- 30 September 2022 14:38
Are you logged in via the dx toolkit and have a default project selected?

https://documentation.dnanexus.com/getting-started/cli-quickstart

0
Former User of DNAx Community_10
- 30 September 2022 14:39
Yes, that worked fine

0
Former User of DNAx Community_28
- 30 September 2022 14:51
This comment was left over from the scripts I was using to create these sets.

The imputed plink gwas is standalone and does not require any prior steps other than: install dx toolkit. login with dx toolkit, select a default project, and generate phenotype and covariate files.

I prefer to use subsets of the ukb cohort rather than everyone to make the analysis a bit more balanced and keep the control to case ration around 30:1 .

The phenotype file would be a tab or space delimited text file with a minimum of 3 columns. for plink, missing values should be coded "-9"

FID IID pheno1 pheno2 pneno3

The covariate file will look similar with "-9" for missing data

FID IID Sex Age BMI pca1 pca2 pca3 ... pca10

0
Former User of DNAx Community_10
- 30 September 2022 15:01
Here is the output

Note: Use dx select --level VIEW or dx select --public to
select from projects for which you only have VIEW permissions.

Available projects (CONTRIBUTE or higher):
0) Untitled Project - Dec 07, 2021 (ADMINISTER)
1) Untitled Project - Dec 03, 2021 (ADMINISTER)
2) Christchurch mutation in Alzheimer's Disease (ADMINISTER)

Pick a numbered choice [2]: 0
Setting current project to: Untitled Project - Dec 07, 2021
You are now logged in. Your credentials are stored in
C:\Users\steve/.dnanexus_config and will expire in 30 days, 0:00:00. Use
dx login --timeout to control the expiration date, or dx
logout to end this session.
PS C:\Users\steve> .\11a-gwas-s2-imp37-qc-filter.sh

I get error cannot find a project named "c"

0
Former User of DNAx Community_28
- 30 September 2022 15:28
You also need to make a couple of directories in your project, $data_file_dir for the directory you will put your data, and a $txt_file_dir where all the phenotype files will be stored.

If those directories do not exist you get a different error.

${project} is a global variable within dx toolkit listing the project you selected with dx select.

It may be possible that you have a global variable on your linux system. I would run "echo $project" at the command line and see if it returns a value.
If it returns a value, you need to unset that global variable.

"unset project"

0
Former User of DNAx Community_10
- 30 September 2022 18:16
I am running win 10. What would be the command

0
Former User of DNAx Community_28
- 30 September 2022 18:47
I mainly use linux/mac, so I am not familiar with windows environment variables.

try:
https://www.alphr.com/clear-environment-variables-windows/
or
https://www.digitalcitizen.life/remove-edit-clear-environment-variables/

0
Former User of DNAx Community_10
- 30 September 2022 18:51
I have installed your app on a linux system
Have run $ cd dx-toolkit && source environment
What do I do now?

0
Former User of DNAx Community_10
- 30 September 2022 18:56
I ran this
[lehres01@li03c02 UKBGWAS]$ source $DNANEXUS_HOME/environment
worked fine. What now?

0
Former User of DNAx Community_10
- 30 September 2022 19:15
I'm able to login and run the shell script. I'll create the directories and see what happens. On linux your app works fine.

0
Former User of DNAx Community_28
- 30 September 2022 19:18
I just updated the README files for clearer explanation.

see both the base readme as well as the gwas readme.
https://github.com/pjgreer/ukb-rap-tools
and
https://github.com/pjgreer/ukb-rap-tools/tree/main/GWAS_pipeline

0
Former User of DNAx Community_10
- 30 September 2022 19:23
I created the directories above but get this error
DXCLIError: Value provided for input field "in" could not be parsed as file: could not resolve "/gwas_cohort_textfiles/phenotypes.v08-04-22.txt" to a name or ID

0
Former User of DNAx Community_10
- 30 September 2022 19:34
I am now getting output but I'm not sure what it is or where it's going
[lehres01@li03c02 UKBGWAS]$ sh 11a-gwas-s2-imp37-qc-filter.sh
job-GGqZ7F0J89141YvY4zyp66q5
job-GGqZ7F8J891284G86z9z12Qb
job-GGqZ7FQJ8917XZk125k1pQJ2
job-GGqZ7FjJ8916zJx34gQzpq50
job-GGqZ7G0J89171FyK503Z4204
job-GGqZ7G0J891BpvF12yjz470G
job-GGqZ7GjJ8917k2Bx4gV4K869
job-GGqZ7J0J8914X2Zb4zv8QkPj
job-GGqZ7J8J8916BXGG4fY2zPBy
job-GGqZ7J8J8917XZk125k1pQJ6
job-GGqZ7JQJ891175xQ4kbPpZ0G
job-GGqZ7JjJ89102Vz54fKgJ5f7
job-GGqZ7K0J891KyGv30Vzbb7qx
job-GGqZ7K0J891JZY9K4fPf5bfq
job-GGqZ7K8J891284G86z9z12Qf
job-GGqZ7KQJ891G5qpv4jQq1ZfQ
job-GGqZ7KjJ8914X2Zb4zv8QkPk
job-GGqZ7P0J89139bFK4zk205bP
job-GGqZ7P0J8918zqJ86Yk0B6Fx
job-GGqZ7PQJ891PxZv44p6kK5y9
job-GGqZ7PQJ89102Vz54fKgJ5f8
job-GGqZ7PjJ891PxZv44p6kK5yB
job-GGqZ7Q0J8919zBBB4gZp1K53
[lehres01@li03c02 UKBGWAS]$

0
Former User of DNAx Community_28
- 30 September 2022 19:35
You need to generate a phenotype file, upload it to your project, and replace the name in the script with? the name you gave your phenotype file.
The same will be true for the covariate file.?

0
Former User of DNAx Community_10
- 30 September 2022 19:38
I created directory and phenotype file as below
"/gwas_cohort_textfiles/phenotypes.v08-04-22.txt"
This is what app was looking for
DXCLIError: Value provided for input field "in" could not be parsed as file: could not resolve "/gwas_cohort_textfiles/phenotypes.v08-04-22.txt" to a name or ID
but I dont seem to be getting output

0
Former User of DNAx Community_10
- 30 September 2022 20:04
It might be working. It is generating multiple ukb22828 bed bim fam files. I guess they just have to be merged when finished.

0
Former User of DNAx Community_28
- 30 September 2022 21:04
The next script 15a runs a gwas on each chromosome. ?16a combines all the results together.

0
Former User of DNAx Community_10
- 30 September 2022 21:12
Once I've run 16a, how do I turn the results into a txt file I can use with locuszoom

0
Former User of DNAx Community_28
- 30 September 2022 21:43
The output of 16a is the association text file for chromosomes 1-22+X that you can use in locuszoom.?

0
Chai Fungtammasan DNAnexus Team
- 30 September 2022 22:04
@Steven Lehrer? You may find our basic tutorial on how to use RAP useful.
Overview RAP basic function: https://www.youtube.com/watch?v=8bcHeoEggBI&t=517s
LocusZoom that @Phil Greer? mentioned: https://www.youtube.com/watch?v=bKiBk4UtbY0&t=117s
Run tools: https://www.youtube.com/watch?v=U8QZAGwnUm0&list=PLRkZ0Fz-n3Z7Jg0Vz4vudLYnBza4EUGLM&index=20&t=1755s

Here are the documentation for the platform if you have questions on any functionalities.
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/ukb-rap
https://documentation.dnanexus.com/

0
Former User of DNAx Community_10
- 02 October 2022 20:40
11a and 15a ran perfectly on linux. 16a gives this error
[lehres01@li03c03 UKBGWAS]$ sh 16a-gwas-imp37-result-merge-plink.sh
DXCLIError: Value provided for input field "in" could not be parsed as file: could not resolve "//data/ap_imp37_gwas/ukb22828_AP_c2_v3.AP.glm.logistic.hybrid" to a name or ID
The error is described here
https://community.dnanexus.com/s/question/0D5t000003yJQvECAW/how-to-use-dxfuse-style-file-paths-in-swiss-army-knife
I tried loading 16a 11a and 15a into subdirectory ukb22828_AP_c2_v3.AP.glm.logistic.hybrid
Didn't help. Any suggestions would be appreciated.

0
Former User of DNAx Community_28
- 03 October 2022 13:03
if 15a ran and you should have the *.glm.logistic.hybrid files in your data directory, you need to make sure that the -iin file is points to the correctly names outfile from step15a. It will likely be named something other than "ukb22828_AP_c2_v3.AP.glm.logistic.hybrid" which is for chromosome 2, (c2) and has the output names I set for my specific phenotype.

The gwas scripts were written specifically to show how to use plink and regenie using dx run command. I each plink (or regenie) command can and should be edited to add any number of additional analysis features that would be specific to your analysis.

-Phil

0
Permanently deleted user
- 19 January 2023 15:57
@Phil Greer? Hi Phil! thanks a lot for creating these tutorials. I'm trying to do a very simple operation that involves plink in order to see the list of samples that have WES final release data. I'm trying to use swiss army knife in the GUI but I don't see any output being generated and no instance allocated (but still is charging me money for being queued)??? not sure how this works.

This is the command I'm submitting

plink --bfile "/mnt/project(my_project_name)/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - final release/ukb23158_c1_b0_v1" --write-samples --no-id-header --out final_WES.id

I would appreciate any help regarding this! I'm not sure I'm giving the correct paths (maybe that's the issue). Any help will be appreciated

Thanks

Diana

0
Chai Fungtammasan DNAnexus Team
- 19 January 2023 16:57
I'm not sure I understand when you say "no instance allocated", but you get charged. You might want to reach out to ukbiobank-support@dnanexus.com

0
Former User of DNAx Community_33
- 05 November 2023 19:57
Hi, Phil! Thanks for all the work doing these tutorials and scripts. I'm trying to do a study of chronic imbalance, using genotype calls for Step 1 of Regenie, then WES for Step 2, did all the liftover, etc. After merging in plink, I have 488,738 variants. After QC (--geno 0.1, --mac 20, --maf 0.01), I end up with only 381,649 variants (prior to liftover). And I wanted to wait to do HWE until after I put in the phenotype and ancestry, since it seemed to me that HWE without ancestry information would eliminate variants that I might want to keep, but I could be wrong).
After pruning with a very conservative 0.3, I'm left with 210,640 variants.
Are these sufficient variants to run Regenie? It's many fewer than the ~410,000 you mentioned to run against WES. Am I doing something wrong?
Thanks for any input you might have.

0
Former User of DNAx Community_28
- 21 November 2023 18:39
I am not sure why you are having this issue. Forgive me for being slow, but it has been over a year since I wrote this repo, but the scripts I wrote have a specific order for analysis. Lets go over the specific order for the WES GWAS on regenie.

The three scripts in GTfile_prep (01, 02, & 03) are run in that order. The first 01 combines all the individual chromosome files into a single file. The second 02 filters out missing data and rare variants, and the last 03 prunes out variants in high LD. Next we have GTfile_liftover, (04, 05, & 06) again, run in that order. 04 converts the plink file to vcf, 05 lifts-over from grch37 to grch38 using picard, and 06 converts the vcf back to plink format.
The last set is gwas_wes_regenie with scripts 09b, 11b, 12b and 13b. 09b is the regenie step1 on the pruned, grch38, genotype data. This is done to generate the independent block for regenie step2. 11b qc filters the WES dataset removing the most rare variants that do not behave well in association tests. 12b is running regenie on the filtered WES data using the results from step1, and finally 13b combines all the results together into a single files for download and graphing.

HWE filtering is completely unnecessary on the GT file, and I would highly discourage HWE filtering on the WES data. I will hopefully have a paper on that in the near future.

0
Nick Shrine
- 11 February 2025 15:03
You don't actually need to liftOver the b37 GT data.
You can run regenie step1 on b37 and then use the loco predictions for step 2 with b38 data such as WES.
The loco predictions are independent of genome build:
https://github.com/rgcgithub/regenie/issues/82

0
Shuxin Liu
- 14 January 2026 02:29
Subject: PLINK2 GWAS fails on Chromosome 1 & 20 due to cloud instance termination
Body:
Hello ,
I am running a GWAS on a binary phenotype (two-class outcome) with ~330,000 UKB samples using PLINK2. I am using a high-memory cloud instance (mem1_ssd1_v2_x36) and the following command:

plink2 --pfile chr1_qc \ --pheno lonely_new.pheno \ --pheno-name lo \ --covar covariates.cov \ --covar-name age sex GPC1-GPC20 \ --glm hide-covar cols=+a1freq no-firth \ --threads 34 \ --memory 80000 \ --out gwas_chr1
Issue:
- The jobs for Chromosome 1 and 20 always fail with the error:
  "Cause of Failure: The machine running the job was terminated by the cloud provider"
- Jobs for Chromosomes 21 and 22 run successfully.
- Each chromosome has several hundred thousand SNPs (e.g., chr1 ~820k SNPs).
I have tried reducing threads and memory allocation, but the problem persists for the larger chromosomes.
Questions:
1. Is this a cloud spot instance preemption issue or something else?
2. Are there recommended strategies for running large chromosomes GWAS with PLINK2 on UKB?
3. Would splitting the chromosome into smaller chunks or using a different instance type help?
Any advice on how to reliably run large-scale GWAS on UKB data using PLINK2 would be greatly appreciated.
Thank you!
0

Please sign in to leave a comment.