Hi folks,
Did anyone manage to run GWAS in Hail and successfully save summary stats to the workspace?
I've left the save running for up to a couple of hours, yet no files appear in the workspace and the Jupyter notebook cell is still running. However, if I stop the cell and try to save the file again, I am told that the file already exists on the Hadoop master.
Equally, when I try to generate a Manhattan or QQ plot, the notebook runs forever without any output.
I am using the default Hail configuration (mem1_hdd1_v2_x16).
The command to save the file (converted to Spark) is:
df.write.format('csv').option("header", "true").save("./gwas_sumstats.csv")
Using .export() from Hail does not work either.
Any tips on how to get that saved will be much appreciated!
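One possible explanation, offered as an assumption rather than a confirmed diagnosis: on a Spark cluster, a relative path such as ./gwas_sumstats.csv is usually resolved against the cluster's default filesystem (HDFS) rather than the notebook's local directory, which would be consistent with the "already exists" message. The sketch below only builds the commands that would copy such output out of HDFS and upload it to the project. The helper names and the merged file name are hypothetical, and it assumes the `hdfs` and `dx` command-line tools are available in the JupyterLab environment.

```python
import shlex
import subprocess


def build_export_commands(hdfs_dir, local_file):
    """Return the commands that merge Spark part files from HDFS into one
    local file and then upload that file to the DNAnexus project.

    Assumption: hdfs_dir is the directory Spark created for the CSV output.
    """
    return [
        # Concatenate every part file under hdfs_dir into a single local file.
        ["hdfs", "dfs", "-getmerge", hdfs_dir, local_file],
        # Upload the local file to the current project.
        ["dx", "upload", local_file],
    ]


def run_export(hdfs_dir, local_file, dry_run=True):
    """Print the commands (dry_run=True) or execute them in order."""
    for cmd in build_export_commands(hdfs_dir, local_file):
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling `run_export("gwas_sumstats.csv", "gwas_sumstats_merged.csv")` prints the two commands without running them; pass `dry_run=False` to execute. Note that with the header option enabled, each part file carries its own header line, so the merged file may contain repeated headers unless the DataFrame is reduced to a single partition before writing.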
Hi Ondrej,
Please see my responses:
1) no - not even the smallest files produced by Hail/Spark get saved to local storage. Smaller files do "finish running in the cell", but nothing is saved locally;
2) no - toPandas() or collect() never complete;
3) I tried with just one chromosome and with all of them - no luck with either. My other GWAS summary stats on a university cluster are ~25Mb. On my laptop or the cluster it does indeed take a considerable amount of time to save the whole file; however, as soon as I run the cell, a new folder is created that stores all the chunks of the csv - this does not happen on RAP;
4) I just tried with mem3_ssd3_x12 - no luck...
Thanks,
Krzys
Hi @Ondrej Klempir,
Did you by any chance manage to look into this issue? Alternatively, is there anyone else I could try to contact?
Thanks,
Krzys
Is there anything suspicious in the screenshots of the Spark UI?
https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data#tips-for-retrieving-fields
--> If a particular Spark command is taking too long to evaluate, you can monitor the Spark status by visiting the Spark console page. To do that, copy the URL of your current JupyterLab session (typically ending in ".dnanexus.cloud/lab?"), open a new browser tab, and paste the URL. Replace "/lab?" with ":8081/jobs/" and press Enter.
Please share a couple of Spark UI screenshots here.
You could also contact ukbiobank-support@dnanexus.com, and the DNAnexus Support team can try to reproduce the problem and help you resolve it. You can share the screenshots and your full notebook with the Support team.
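For convenience, the URL substitution described above can be sketched as a small helper. The function name and the example job URL are hypothetical; only the "/lab?" to ":8081/jobs/" replacement comes from the quoted documentation.

```python
def spark_ui_url(lab_url):
    """Turn a JupyterLab session URL into the Spark console URL by
    replacing the first "/lab?" with ":8081/jobs/"."""
    marker = "/lab?"
    if marker not in lab_url:
        raise ValueError("expected a JupyterLab URL containing '/lab?'")
    return lab_url.replace(marker, ":8081/jobs/", 1)


# Hypothetical session URL, used only to illustrate the substitution.
example = spark_ui_url("https://job-xxxx.dnanexus.cloud/lab?")
```

Here `example` holds "https://job-xxxx.dnanexus.cloud:8081/jobs/", which is the address to open in a new browser tab.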
Hi @Krzysztof Marianski,
I would like to let you know that @Chai Fungtammasan has recently published the following post about Hail troubleshooting for UKB data:
https://community.dnanexus.com/s/question/0D5t000004AflSiCAJ/hail-troubleshooting-for-ukb-data