Unable to save Hail output files

Hi folks,

Has anyone managed to run a GWAS in Hail and successfully save the summary stats to the workspace? I've left the save running for up to a couple of hours, yet no files appear in the workspace and the Jupyter notebook cell keeps running. However, if I stop the cell and try to save the file again, I am told that the file already exists on the Hadoop master.

Similarly, when I try to generate a Manhattan or QQ plot, the notebook runs forever without producing any output.

I am using the default Hail configuration (mem1_hdd1_v2_x16). The command to save the file (after converting to a Spark DataFrame) is:

df.write.format('csv').option("header", "true").save("./gwas_sumstats.csv")

Using .export() from Hail does not work either.

Any tips on how to get this saved would be much appreciated!

Comments

6 comments

  • Ondrej Klempir (DNAnexus Team)
    1. Are you able to filter it and save just a subset of rows of your Spark df?
    2. Is it possible to first convert your Spark df to Pandas df and save?
    3. How big are the Hail GWAS results? Is it just one chromosome?
    4. How does this behave with let's say instance type mem3 and ssd instead of hdd?
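
Suggestions 1 and 2 above can be tried together with something like the sketch below. It is only an illustration, not a confirmed fix: the helper name `save_small_subset`, the output path, and the default row count are hypothetical. It uses only the standard PySpark DataFrame members `limit()`, `collect()`, and `columns`, and writes with the standard-library `csv` module instead of pandas so that the file is created by the driver process on its local disk.

```python
import csv


def save_small_subset(df, out_path, n_rows=1000):
    """Write the first n_rows of a Spark DataFrame to a CSV on the driver's local disk.

    `df` is assumed to be a PySpark DataFrame (anything exposing `.columns`,
    `.limit()` and `.collect()` works). The helper itself is hypothetical.
    """
    # limit() keeps the job small; collect() pulls the rows to the driver.
    rows = df.limit(n_rows).collect()
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(df.columns)  # header row
        writer.writerows(rows)       # PySpark Row objects behave like tuples
    return len(rows)
```

If a call such as `save_small_subset(df, "subset.csv", n_rows=100)` also hangs, the problem is probably upstream of the write step (for example in the Hail computation itself) rather than in how the file is saved.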
  • Hi Ondrej,

    Please see my responses:

     

    1) No. Not even the smallest files produced by Hail/Spark get saved to local storage. With smaller files the cell does finish running, but nothing is saved locally.

    2) No. toPandas() and collect() never finish.

    3) I tried with just one chromosome and with all of them, with no luck either way. My other GWAS summary stats on a university cluster are ~25 MB. On my laptop or the cluster it does take a considerable amount of time to save the whole file, but as soon as I run the cell a new folder is created that stores all the chunks of the CSV. This does not happen on RAP.

    4) I just tried mem3_ssd3_x12, with no luck.
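
One possible explanation, not confirmed anywhere in this thread, is that on a Spark cluster a relative path such as "./gwas_sumstats.csv" is resolved against the cluster's default filesystem (typically HDFS) rather than the notebook's local disk. That would be consistent with the "already exists on the Hadoop master" message in the original post. Under that assumption, the sketch below builds the shell commands to look for the output in HDFS, merge the part files into one local file, and upload it with `dx upload`. The helper names and file names are hypothetical; `hdfs dfs -ls`, `hdfs dfs -getmerge`, and `dx upload` are existing commands.

```python
import shlex
import subprocess


def hdfs_to_project_commands(hdfs_dir, local_file):
    """Build commands to inspect a Spark output directory in HDFS and retrieve it.

    Assumption: the Spark write landed in HDFS under `hdfs_dir`.
    """
    return [
        # 1. Check whether the output directory (part-* files) exists in HDFS.
        ["hdfs", "dfs", "-ls", hdfs_dir],
        # 2. Concatenate all part files into a single file on local disk.
        #    With header=true every part file carries its own header line,
        #    so the merged file may contain repeated headers.
        ["hdfs", "dfs", "-getmerge", hdfs_dir, local_file],
        # 3. Upload the merged file to the current DNAnexus project context.
        ["dx", "upload", local_file],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(hdfs_to_project_commands("gwas_sumstats.csv", "gwas_sumstats_merged.csv"))
```

The dry-run default only prints the commands, so they can be reviewed (or pasted into a notebook terminal) before anything is executed.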

     

    Thanks,

    Krzys

  • Hi @Ondrej Klempir,

     

    Did you by any chance manage to look into this issue? Alternatively, is there anyone else I could try to contact?

     

    Thanks,

    Krzys

  • Ondrej Klempir (DNAnexus Team)

    Is there anything suspicious in the screenshots of the Spark UI?

     

    https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/using-spark-to-analyze-tabular-data#tips-for-retrieving-fields

     

    --> If a particular Spark command is taking too long to evaluate, you can monitor the Spark status by visiting the Spark console page. To do that, copy the URL of your current JupyterLab session (typically ending in ".dnanexus.cloud/lab?"), open a new browser tab, and paste the URL. Replace "/lab?" with ":8081/jobs/" and press Enter.
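
The URL rewrite described above can also be done programmatically. The following is a small sketch: the function name and the example hostname are hypothetical, and dropping anything that follows "/lab?" is an assumption about what the instructions intend.

```python
def spark_ui_url(jupyterlab_url: str) -> str:
    """Turn a JupyterLab session URL into the Spark console URL.

    Follows the quoted instructions: replace "/lab?" with ":8081/jobs/".
    Anything after "/lab?" is discarded (an assumption).
    """
    head, marker, _rest = jupyterlab_url.partition("/lab?")
    if not marker:
        raise ValueError('expected a JupyterLab URL containing "/lab?"')
    return head + ":8081/jobs/"


if __name__ == "__main__":
    # Hypothetical session URL, for illustration only.
    print(spark_ui_url("https://job-example.dnanexus.cloud/lab?"))
```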

     

    Please share a couple of Spark UI screenshots here.

  • Ondrej Klempir (DNAnexus Team)

    You could also contact ukbiobank-support@dnanexus.com; the DNAnexus Support team can try to reproduce the problem and help you resolve it. You can share the screenshots and your full notebook with them.

  • Ondrej Klempir (DNAnexus Team)

    Hi @Krzysztof Marianski,

     

    I would like to let you know that @Chai Fungtammasan has recently published the following post about Hail troubleshooting for UKB data:

     

    https://community.dnanexus.com/s/question/0D5t000004AflSiCAJ/hail-troubleshooting-for-ukb-data

