What is the cleanest way to extract phenotype data at scale?

Eric Kernfeld

I know there's a lot to read about this already. The canonical post on this topic seems to be this one:

https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019569797021-Query-of-the-Week-1-Export-Phenotypic-Data-to-a-File

But none of those options meet my needs. 

- I need something that is fully automated with no manual or GUI steps, so that it's reproducible in line with Ziemann et al.'s guidance. This rules out JupyterLab and the Table Exporter GUI. 
- I need this to work at scale, with thousands of fields. At that size, both dx extract_dataset and the Table Exporter either fail with errors or take a very long time to run. 
- I need to make queries involving multiple entities (tables), which rules out the Table Exporter CLI. 
- I need a consistent header style. My downstream code currently works with the default output from dx extract_dataset, which looks like participant.p22009_i0_a1 for example. None of the Table Exporter options for --header_style seem to match this; none are fully qualified with the entity name. 
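
To make the header requirement concrete, here is a minimal sketch of post-processing bare field names into the entity-qualified style that dx extract_dataset produces. It assumes the input header contains bare names such as p22009_i0_a1 (or eid) and that all columns in one file belong to a single entity. The function qualify_header and the FIELD_NAME pattern are hypothetical helpers written for illustration; they are not part of any DNAnexus tool, and the pattern is my guess at the naming scheme rather than a documented grammar.

```python
import re

# Assumed shape of a bare field name: "eid", or "p<field>" with optional
# "_i<instance>" and "_a<array>" suffixes, e.g. p22009_i0_a1.
FIELD_NAME = re.compile(r"^(eid|p\d+(_i\d+)?(_a\d+)?)$")


def qualify_header(columns, entity="participant"):
    """Prefix bare field names with the entity name.

    Names that already contain a "." are treated as qualified and kept.
    Anything else that does not look like a field name raises an error,
    so an unexpected header style fails loudly instead of silently.
    """
    qualified = []
    for col in columns:
        if "." in col:
            qualified.append(col)
        elif FIELD_NAME.match(col):
            qualified.append(f"{entity}.{col}")
        else:
            raise ValueError(f"Unrecognized column name: {col!r}")
    return qualified


if __name__ == "__main__":
    print(qualify_header(["eid", "p22009_i0_a1"]))
```

This only addresses the header style. It does nothing for the multi-entity requirement, since each exported file would still need to be produced per entity.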

Right now I have a WDL pipeline that calls dx extract_dataset, but it keeps getting more elaborate. Because dx extract_dataset does not handle large numbers of fields, I had to use scatter-gather to split the job into chunks. Some of the chunks usually time out, so I also have to handle retries. Some chunks generate errors I have never seen before with DNAnexus' WDL workflows, with Smart Reuse failing and workers becoming unresponsive (but no OOM issues). Sorry that I am not prepared to fully document that last bit, but I am not asking for help with these specific errors. I am just saying that my current solution is not great, and I am looking to brainstorm with other users solving the same general problem. 
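
For readers who want to see the shape of the chunk, retry, and gather logic described above, here is a rough Python sketch. It is not my actual WDL pipeline. It assumes every chunk includes participant.eid as a join key, and it uses only the dx extract_dataset options --fields, --delimiter, and --output. The helper names (chunk_fields, build_command, run_with_retries, merge_chunks), the retry counts, and the timeouts are all hypothetical choices for illustration.

```python
import csv
import subprocess
import time


def chunk_fields(fields, size, key="participant.eid"):
    """Scatter step: split fields into chunks of at most `size` non-key
    fields, prepending the join key to every chunk."""
    if size < 1:
        raise ValueError("size must be positive")
    rest = [f for f in fields if f != key]
    return [[key] + rest[i:i + size] for i in range(0, len(rest), size)]


def build_command(dataset, fields, output):
    """Build the dx extract_dataset argument list for one chunk."""
    return ["dx", "extract_dataset", dataset,
            "--fields", ",".join(fields),
            "--delimiter", ",",
            "--output", output]


def run_with_retries(cmd, attempts=3, timeout_s=3600, backoff_s=30,
                     runner=subprocess.run):
    """Run cmd, retrying on failure or timeout. Returns the attempt number
    that succeeded; re-raises the last error if all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            runner(cmd, check=True, timeout=timeout_s)
            return attempt
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)  # linear backoff between tries


def merge_chunks(paths, output, key="participant.eid"):
    """Gather step: outer-join chunk CSVs on `key`. Holds every row in
    memory, so this is illustrative rather than suited to very wide data."""
    header = [key]
    rows = {}
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            new_cols = [c for c in reader.fieldnames if c != key]
            header.extend(new_cols)
            for record in reader:
                row = rows.setdefault(record[key], {key: record[key]})
                for col in new_cols:
                    row[col] = record[col]
    with open(output, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=header, restval="")
        writer.writeheader()
        writer.writerows(rows.values())
```

A driver would call chunk_fields once, then build_command and run_with_retries per chunk, then merge_chunks on the resulting files. The dataset argument would be a placeholder such as a project and record ID pair for the dataset in question. This still has the same weakness as the WDL version: the retry logic lives in my code rather than in the platform.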

This seems like the type of thing that many if not most other users would want. Does anyone have a good solution? What are your go-to workflows for scalable, fully automated tabular data extraction from multiple entities with consistent output header format?

Thanks very much!

Eric Kernfeld
Alden Scientific

Comments

3 comments

  • Comment author
    Eric Kernfeld
    • Edited

    More reading on this topic:

    Jonathan Margoliash's threads provide a possible path to calling Spark jobs from WDL. But they end with no resolution to the DNAnexus bugs and with a recommendation to use dx extract_dataset instead.

    This thread provides a path to calling Python notebooks non-interactively. But there's no way to pass arguments such as which fields to extract or which entities to reference. 

    Spark apps seem to be the solution DNAnexus intends, and the Table Exporter is itself a Spark app. 

     

  • Comment author
    Eric Kernfeld

    Note to self: if I decide to go down that path, a good example of the unusual DNAnexus failures is analysis-J60PjYQJbkgq6yx42q98gj97.

  • Comment author
    Eric Kernfeld

    Related thread: https://community.dnanexus.com/s/question/0D582000004TONTCA4/are-there-simpler-ways-of-doing-things

