Performing GWAS on RAP using DX toolkit and swiss-army-knife Pinned Featured
First, I want to thank DNANexus for the tutorials offered to date. They have done a tremendous amount of work and I am in their debt. Currently I feel that DX toolkit is the most powerful way to interact with their data platform and I have decided to make my GWAS workflows available to the community.
https://github.com/pjgreer/ukb-rap-tools
I have take the AD by proxy tutorial and the liftover wdl script and have expanded it to include the imputed data set, as well as standalone workflows for plink2.
I modified the GT array data prep and liftover because I felt it was a waste to run liftover on all the array data for all subjects. You only need to run liftover on the subjects you will be using for your analysis, and it makes sense to perform the QC filtering prior to liftover as well. I also implemented an LD filter for variants with r^2 >0.4 bringing the total number of variants in the array after filtering to ~410K. This should be more than enough for most analyses.
Some might still want to run GWAS on the imputed datasets, so how to process that data is presented here as well.
Finally, plink2 is quite fast and requires far less preparatory steps, So for the impatient, I present how to run GWAS on both WES and imputed data inside the RAP here as well. A complete plink GWAS can be run in less than 4 hours.
I do not go into how to create the phenotype and covariate files, but there is an important caveat here.for plink NA values should be coded as -9 and in regenie they must be coded as NA.
This is a work in progress, but it appears to work well for me. I may be altering the scripts specifically the directory structure in your project.
I would appreciate any and all feedback.
Comments
28 comments
When I run .\11a-gwas-s2-imp37-qc-filter.sh I get error "could not find a project named C"
What am I doing wrong?
How do I do this? I can't find these scripts to execute
# Requirements:
# 0-4 - please refer to readme.md
# 5. Must have executed:
# - scripts 1-3 from partA_prep_gtfile_4_GWAS
Are you logged in via the dx toolkit and have a default project selected?
https://documentation.dnanexus.com/getting-started/cli-quickstart
Yes, that worked fine
This comment was left over from the scripts I was using to create these sets.
The imputed plink gwas is standalone and does not require any prior steps other than: install dx toolkit. login with dx toolkit, select a default project, and generate phenotype and covariate files.
I prefer to use subsets of the ukb cohort rather than everyone to make the analysis a bit more balanced and keep the control to case ration around 30:1 .
The phenotype file would be a tab or space delimited text file with a minimum of 3 columns. for plink, missing values should be coded "-9"
FID IID pheno1 pheno2 pneno3
The covariate file will look similar with "-9" for missing data
FID IID Sex Age BMI pca1 pca2 pca3 ... pca10
Here is the output
Note: Use dx select --level VIEW or dx select --public to
select from projects for which you only have VIEW permissions.
Available projects (CONTRIBUTE or higher):
0) Untitled Project - Dec 07, 2021 (ADMINISTER)
1) Untitled Project - Dec 03, 2021 (ADMINISTER)
2) Christchurch mutation in Alzheimer's Disease (ADMINISTER)
Pick a numbered choice [2]: 0
Setting current project to: Untitled Project - Dec 07, 2021
You are now logged in. Your credentials are stored in
C:\Users\steve/.dnanexus_config and will expire in 30 days, 0:00:00. Use
dx login --timeout to control the expiration date, or dx
logout to end this session.
PS C:\Users\steve> .\11a-gwas-s2-imp37-qc-filter.sh
I get error cannot find a project named "c"
You also need to make a couple of directories in your project, $data_file_dir for the directory you will put your data, and a $txt_file_dir where all the phenotype files will be stored.
If those directories do not exist you get a different error.
${project} is a global variable within dx toolkit listing the project you selected with dx select.
It may be possible that you have a global variable on your linux system. I would run "echo $project" at the command line and see if it returns a value.
If it returns a value, you need to unset that global variable.
"unset project"
I am running win 10. What would be the command
I mainly use linux/mac, so I am not familiar with windows environment variables.
try:
https://www.alphr.com/clear-environment-variables-windows/
or
https://www.digitalcitizen.life/remove-edit-clear-environment-variables/
I have installed your app on a linux system
Have run $ cd dx-toolkit && source environment
What do I do now?
I ran this
[lehres01@li03c02 UKBGWAS]$ source $DNANEXUS_HOME/environment
worked fine. What now?
I'm able to login and run the shell script. I'll create the directories and see what happens. On linux your app works fine.
I just updated the README files for clearer explanation.
see both the base readme as well as the gwas readme.
https://github.com/pjgreer/ukb-rap-tools
and
https://github.com/pjgreer/ukb-rap-tools/tree/main/GWAS_pipeline
I created the directories above but get this error
DXCLIError: Value provided for input field "in" could not be parsed as file: could not resolve "/gwas_cohort_textfiles/phenotypes.v08-04-22.txt" to a name or ID
I am now getting output but I'm not sure what it is or where it's going
[lehres01@li03c02 UKBGWAS]$ sh 11a-gwas-s2-imp37-qc-filter.sh
job-GGqZ7F0J89141YvY4zyp66q5
job-GGqZ7F8J891284G86z9z12Qb
job-GGqZ7FQJ8917XZk125k1pQJ2
job-GGqZ7FjJ8916zJx34gQzpq50
job-GGqZ7G0J89171FyK503Z4204
job-GGqZ7G0J891BpvF12yjz470G
job-GGqZ7GjJ8917k2Bx4gV4K869
job-GGqZ7J0J8914X2Zb4zv8QkPj
job-GGqZ7J8J8916BXGG4fY2zPBy
job-GGqZ7J8J8917XZk125k1pQJ6
job-GGqZ7JQJ891175xQ4kbPpZ0G
job-GGqZ7JjJ89102Vz54fKgJ5f7
job-GGqZ7K0J891KyGv30Vzbb7qx
job-GGqZ7K0J891JZY9K4fPf5bfq
job-GGqZ7K8J891284G86z9z12Qf
job-GGqZ7KQJ891G5qpv4jQq1ZfQ
job-GGqZ7KjJ8914X2Zb4zv8QkPk
job-GGqZ7P0J89139bFK4zk205bP
job-GGqZ7P0J8918zqJ86Yk0B6Fx
job-GGqZ7PQJ891PxZv44p6kK5y9
job-GGqZ7PQJ89102Vz54fKgJ5f8
job-GGqZ7PjJ891PxZv44p6kK5yB
job-GGqZ7Q0J8919zBBB4gZp1K53
[lehres01@li03c02 UKBGWAS]$
You need to generate a phenotype file, upload it to your project, and replace the name in the script with? the name you gave your phenotype file.
The same will be true for the covariate file.?
I created directory and phenotype file as below
"/gwas_cohort_textfiles/phenotypes.v08-04-22.txt"
This is what app was looking for
DXCLIError: Value provided for input field "in" could not be parsed as file: could not resolve "/gwas_cohort_textfiles/phenotypes.v08-04-22.txt" to a name or ID
but I dont seem to be getting output
It might be working. It is generating multiple ukb22828 bed bim fam files. I guess they just have to be merged when finished.
The next script 15a runs a gwas on each chromosome. ?16a combines all the results together.
Once I've run 16a, how do I turn the results into a txt file I can use with locuszoom
The output of 16a is the association text file for chromosomes 1-22+X that you can use in locuszoom.?
@Steven Lehrer? You may find our basic tutorial on how to use RAP useful.
Overview RAP basic function: https://www.youtube.com/watch?v=8bcHeoEggBI&t=517s
LocusZoom that @Phil Greer? mentioned: https://www.youtube.com/watch?v=bKiBk4UtbY0&t=117s
Run tools: https://www.youtube.com/watch?v=U8QZAGwnUm0&list=PLRkZ0Fz-n3Z7Jg0Vz4vudLYnBza4EUGLM&index=20&t=1755s
Here are the documentation for the platform if you have questions on any functionalities.
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/ukb-rap
https://documentation.dnanexus.com/
11a and 15a ran perfectly on linux. 16a gives this error
[lehres01@li03c03 UKBGWAS]$ sh 16a-gwas-imp37-result-merge-plink.sh
DXCLIError: Value provided for input field "in" could not be parsed as file: could not resolve "//data/ap_imp37_gwas/ukb22828_AP_c2_v3.AP.glm.logistic.hybrid" to a name or ID
The error is described here
https://community.dnanexus.com/s/question/0D5t000003yJQvECAW/how-to-use-dxfuse-style-file-paths-in-swiss-army-knife
I tried loading 16a 11a and 15a into subdirectory ukb22828_AP_c2_v3.AP.glm.logistic.hybrid
Didn't help. Any suggestions would be appreciated.
if 15a ran and you should have the *.glm.logistic.hybrid files in your data directory, you need to make sure that the -iin file is points to the correctly names outfile from step15a. It will likely be named something other than "ukb22828_AP_c2_v3.AP.glm.logistic.hybrid" which is for chromosome 2, (c2) and has the output names I set for my specific phenotype.
The gwas scripts were written specifically to show how to use plink and regenie using dx run command. I each plink (or regenie) command can and should be edited to add any number of additional analysis features that would be specific to your analysis.
-Phil
@Phil Greer? Hi Phil! thanks a lot for creating these tutorials. I'm trying to do a very simple operation that involves plink in order to see the list of samples that have WES final release data. I'm trying to use swiss army knife in the GUI but I don't see any output being generated and no instance allocated (but still is charging me money for being queued)??? not sure how this works.
This is the command I'm submitting
plink --bfile "/mnt/project(my_project_name)/Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - final release/ukb23158_c1_b0_v1" --write-samples --no-id-header --out final_WES.id
I would appreciate any help regarding this! I'm not sure I'm giving the correct paths (maybe that's the issue). Any help will be appreciated
Thanks
Diana
I'm not sure I understand when you say "no instance allocated", but you get charged. You might want to reach out to ukbiobank-support@dnanexus.com
Hi, Phil! Thanks for all the work doing these tutorials and scripts. I'm trying to do a study of chronic imbalance, using genotype calls for Step 1 of Regenie, then WES for Step 2, did all the liftover, etc. After merging in plink, I have 488,738 variants. After QC (--geno 0.1, --mac 20, --maf 0.01), I end up with only 381,649 variants (prior to liftover). And I wanted to wait to do HWE until after I put in the phenotype and ancestry, since it seemed to me that HWE without ancestry information would eliminate variants that I might want to keep, but I could be wrong).
After pruning with a very conservative 0.3, I'm left with 210,640 variants.
Are these sufficient variants to run Regenie? It's many fewer than the ~410,000 you mentioned to run against WES. Am I doing something wrong?
Thanks for any input you might have.
I am not sure why you are having this issue. Forgive me for being slow, but it has been over a year since I wrote this repo, but the scripts I wrote have a specific order for analysis. Lets go over the specific order for the WES GWAS on regenie.
The three scripts in GTfile_prep (01, 02, & 03) are run in that order. The first 01 combines all the individual chromosome files into a single file. The second 02 filters out missing data and rare variants, and the last 03 prunes out variants in high LD. Next we have GTfile_liftover, (04, 05, & 06) again, run in that order. 04 converts the plink file to vcf, 05 lifts-over from grch37 to grch38 using picard, and 06 converts the vcf back to plink format.
The last set is gwas_wes_regenie with scripts 09b, 11b, 12b and 13b. 09b is the regenie step1 on the pruned, grch38, genotype data. This is done to generate the independent block for regenie step2. 11b qc filters the WES dataset removing the most rare variants that do not behave well in association tests. 12b is running regenie on the filtered WES data using the results from step1, and finally 13b combines all the results together into a single files for download and graphing.
HWE filtering is completely unnecessary on the GT file, and I would highly discourage HWE filtering on the WES data. I will hopefully have a paper on that in the near future.
Please sign in to leave a comment.