I am referring to the following incredibly useful table:
geno=spark.sql(f"SELECT * FROM genotype_23146")
geno.printSchema()
While the documentation states that these genotype tables are not documented because they may change frequently, it would be very useful if one could query them effectively (naive "head()" call takes forever).
They must be indexed somehow, because the cohort browser access them very fast and retrieves genotypes, calculated allele frequencies, for a cohort on the fly.
How can I access them properly? How are they indexed?
Can I index them if the index is not available?
If will be extremely useful since other methods of loading working with WES genotypes in a Jupyter spark cluster are not yet working properly, in particular hail.
Thanks!
Thanks!
Thanks {@005t000000AD4v6AAD}?, this is very useful information!
I was interested in this topic and checked documentation on how to query data using spark sql. I found this Best Practices note:
When querying large datasets - such as those containing genomic data - ensure that your Spark cluster is scaled up appropriately with multiple clusters to parallelize across.
With that, I started Spark based JL with 8 nodes of mem3_ssd1_v2_x8 instance type. I was able to query genotype data in let's say 10 minutes which I was pretty happy with it.
My commands:
import pyspark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
spark.sql('SHOW DATABASES').show(20, False)
spark.sql('SHOW TABLES IN database_X).show(20, False)
spark.sql(f"SELECT * FROM database_X.genotype_23146").printSchema()
df = spark.sql('SELECT sample_id, type, depth FROM X.genotype_23146')
I am curious in how to achieve the efficiency that they achieve when plotting the genomic data and calculating allele frequencies on the cohort browser, they must have am index to be able to do it so fast (unless they use so many expensive nodes and not charging you for them, which I doubt).?
Comments
4 comments
Thanks {@005t000000AD4v6AAD}?, this is very useful information!
I was interested in this topic and checked documentation on how to query data using spark sql. I found this Best Practices note:
When querying large datasets - such as those containing genomic data - ensure that your Spark cluster is scaled up appropriately with multiple clusters to parallelize across.
[https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster/exploring-and-querying-with-spark#best-practices]
With that, I started Spark based JL with 8 nodes of mem3_ssd1_v2_x8 instance type. I was able to query genotype data in let's say 10 minutes which I was pretty happy with it.
My commands:
import pyspark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
spark.sql('SHOW DATABASES').show(20, False)
spark.sql('SHOW TABLES IN database_X).show(20, False)
spark.sql(f"SELECT * FROM database_X.genotype_23146").printSchema()
df = spark.sql('SELECT sample_id, type, depth FROM X.genotype_23146')
df.show(10)
I was monitoring each spark task based on this doc page: https://documentation.dnanexus.com/developer/apps/developing-spark-apps#monitoring-the-spark-ui
When I used even more resources (10 nodes, mem3, 16 CPU) I got results approximately in 1 min.
Thanks, did you look up specific IDs? How many?
?
I am curious in how to achieve the efficiency that they achieve when plotting the genomic data and calculating allele frequencies on the cohort browser, they must have am index to be able to do it so fast (unless they use so many expensive nodes and not charging you for them, which I doubt).?
Please sign in to leave a comment.