I can create a DNAnexus database and confirm that it is present with dx describe.
E.g.:
import dxpy
db_name = "mydb"
mt_name = "my_table"
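# Create the database if it does not already exist
stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
spark.sql(stmt).show()
# Look up the database object's ID and build the dnax:// URL for the table
db_uri = dxpy.find_one_data_object(name=db_name, classname="database")['id']
mt_url = f"dnax://{db_uri}/{mt_name}"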
# Save Hail MatrixTable (defined elsewhere) to database table
mt.write(mt_url)
How can I get this data to be accessible in another format (e.g. tsv) in my project?
Currently, no. dxfuse is mounted read-only in this case, so the only way to write an object is to write it to the instance's local storage and then upload it to the platform.
The largest instance can hold up to 60 TB, which should be sufficient.
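A minimal sketch of that write-to-the-instance-then-upload route, in case it helps. It assumes the cluster's default filesystem is HDFS, that the hdfs command-line tool is available on the driver node, and that the file should go to the current project context. The function name export_and_upload and the paths are hypothetical. The run and upload parameters are there only so the steps can be swapped out or dry-run.

```python
import subprocess


def export_and_upload(mt, hdfs_path, local_path, run=None, upload=None):
    """Export a Hail MatrixTable as TSV, copy it to local disk, and upload it.

    mt         -- a Hail MatrixTable (for example, the one written in the question)
    hdfs_path  -- where Hail should write the TSV (assumed to be an HDFS path)
    local_path -- where to copy the TSV on the driver's local disk
    run        -- callable that executes a command list (default: subprocess.check_call)
    upload     -- callable that uploads a local path (default: dxpy.upload_local_file)
    """
    if run is None:
        run = subprocess.check_call
    if upload is None:
        # Imported lazily so the function can be dry-run without dxpy installed.
        import dxpy
        upload = dxpy.upload_local_file

    # 1. Flatten the MatrixTable to one row per entry and write it as TSV.
    mt.entries().export(hdfs_path)

    # 2. Copy the exported file from HDFS to the driver's local disk.
    run(["hdfs", "dfs", "-get", hdfs_path, local_path])

    # 3. Upload the local file to the platform and return the upload result.
    return upload(local_path)
```

With the mt from the question, a call might look like export_and_upload(mt, "hdfs:///tmp/my_table.tsv", "/tmp/my_table.tsv"). The driver's local disk still has to be large enough to hold the whole exported file.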
Comments
I think that saving it to CSV and manipulating the data on HDFS might resolve this:
https://community.dnanexus.com/s/question/0D5t0000045I12HCAS/where-does-saved-data-go-on-a-jupyter-spark-cluster
https://discuss.hail.is/t/exporting-data-from-matrixtable-into-tsv/2406
Thanks for your reply.
I guess this means that if I produce a large file (many terabytes) then the local HD of a single node would need to be large enough to hold that file?
It seems like there ought to be a way for a cluster to produce data that can be directly stored to the project / cloud buckets...