What is the cleanest way to extract phenotype data at scale?
I know there's a lot to read about this already. The canonical post on this topic seems to be this one:
https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019569797021-Query-of-the-Week-1-Export-Phenotypic-Data-to-a-File
But none of those options meet my needs.
- I need something that is fully automated with no manual or GUI steps, so that it's reproducible in line with Ziemann et al.'s guidance. This rules out JupyterLab and the Table Exporter GUI.
- I need this to work at scale, with thousands of fields. This causes errors or very long runtimes in dx extract_dataset and the Table Exporter.
- I need to make queries involving multiple entities (tables), which rules out the Table Exporter CLI.
- I need a consistent header style. My downstream code currently works with the default output from dx extract_dataset, which looks like participant.p22009_i0_a1 for example. None of the Table Exporter options for --header_style seem to match this; none are fully qualified with the entity name.
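For the header requirement, one workaround is to rewrite the header after export. The following is a minimal sketch, not something the Table Exporter provides: it assumes the exported CSV has bare column names such as p22009_i0_a1 and that every column in the file belongs to a single entity. The function names qualify_header and requalify_csv are hypothetical.

```python
import csv
import io


def qualify_header(columns, entity="participant"):
    """Prefix each bare column name with the entity name.

    Columns that already contain a dot are assumed to be qualified
    and are left unchanged.
    """
    return [c if "." in c else f"{entity}.{c}" for c in columns]


def requalify_csv(text, entity="participant"):
    """Return CSV text whose header row is entity-qualified."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return text
    rows[0] = qualify_header(rows[0], entity)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

This reads the whole file into memory, so for large exports it would need to be adapted to stream rows instead.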
Right now I have a WDL pipeline that calls dx extract_dataset, but it keeps getting more elaborate. Because dx extract_dataset does not handle large numbers of fields, I had to use scatter-gather to split the job into chunks. Some of the chunks usually time out, so I have to handle retries. Other chunks generate errors I have never seen before with DNAnexus's WDL workflows, with Smart Reuse failing and workers becoming unresponsive (but no OOM issues). Sorry, I am not prepared to fully document that last bit, but I am not asking for help with these specific errors. I am just saying that my current solution is not great, and I am looking to brainstorm with other users solving the same general problem.
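To make the chunk-and-retry pattern concrete, here is a rough Python sketch of the same idea outside WDL. It is not my actual pipeline. It assumes dx extract_dataset is called with its --fields and --output options, that participant.eid is the join key included in every chunk, and that a chunk size of 200 is workable; the helper names, chunk size, timeout, and backoff values are all placeholders.

```python
import subprocess
import time


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_command(dataset, fields, output):
    """Build the dx extract_dataset call for one chunk of fields."""
    return ["dx", "extract_dataset", dataset,
            "--fields", ",".join(fields), "--output", output]


def run_with_retries(cmd, runner=subprocess.run, max_attempts=3,
                     timeout=3600, backoff=30):
    """Run cmd, retrying on failure or timeout; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        try:
            runner(cmd, check=True, timeout=timeout)
            return attempt
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if attempt == max_attempts:
                raise
            time.sleep(backoff * attempt)  # linear backoff between attempts


def extract_in_chunks(dataset, fields, chunk_size=200,
                      key="participant.eid", **retry_options):
    """Extract fields chunk by chunk; return the list of output paths."""
    outputs = []
    for n, part in enumerate(chunk(fields, chunk_size)):
        # Repeat the key column in every chunk so the files can be joined later.
        columns = [key] + [f for f in part if f != key]
        output = f"chunk_{n:04d}.csv"
        run_with_retries(build_command(dataset, columns, output), **retry_options)
        outputs.append(output)
    return outputs
```

The runner argument exists only so the retry logic can be exercised without calling dx. Joining the chunk files on the key column is left out, and this does nothing about the worker-level failures described above.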
This seems like the type of thing that many if not most other users would want. Does anyone have a good solution? What are your go-to workflows for scalable, fully automated tabular data extraction from multiple entities with consistent output header format?
Thanks very much!
Eric Kernfeld
Alden Scientific
Comments
More reading on this topic:
Jonathan Margoliash's threads provide a possible path to calling Spark jobs from WDL. But they end with no resolution to the DNAnexus bugs and with a recommendation to use dx extract_dataset instead.
This thread provides a path to calling Python notebooks non-interactively. But there's no way to provide arguments such as what fields to extract or what entities to reference.
Spark apps seem like the DNAnexus intended solution, and the table exporter is a Spark app.
Note to self: a good example of unusual DNAnexus failures if I decide to go down that path is in analysis-J60PjYQJbkgq6yx42q98gj97
Related thread: https://community.dnanexus.com/s/question/0D582000004TONTCA4/are-there-simpler-ways-of-doing-things