Can you annotate with CADD, gnomAD, ClinVar and dbNSFP options when using Hail on Spark JupyterLab notebooks?

06 May 2022 00:00
2 comments

I'm wondering how to specify the CADD, gnomAD, ClinVar and dbNSFP options when annotating with Hail on dxjupyterlab_spark_cluster. According to the Hail website, the following commands annotate a MatrixTable with these datasets:

db = hl.experimental.DB(region='us', cloud='gcp')

mt = db.annotate_rows_db(mt, 'CADD', 'clinvar_gene_summary', 'clinvar_variant_summary', 'dbNSFP_genes', 'dbNSFP_variants', 'dbSNP_rsid', 'gnomad_exome_sites')

weblink: https://hail.is/docs/0.2/annotation_database_ui.html

Unfortunately, this command does not work with Hail in the Spark JupyterLab Python 3 console. The error is:

Java stack trace:

java.io.IOException: No FileSystem for scheme: gs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:164)
    at is.hail.io.fs.FS$class.isDir(FS.scala:175)
    at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:70)
    at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:30)
    at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:73)
    at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:581)
    at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.78-b17627756568

Error summary: IOException: No FileSystem for scheme: gs

Also, the EU version of this Hail command is not available, but the US version is. I am based in the EU, so I don't know whether this matters or not.

My instance type is a dxjupyterlab_spark_cluster with mem3_ssd2_v2_x16 and 3 nodes.

Any help with this would be great. I've managed to figure out how to annotate with Hail's vep, but these annotation options are not available that way.

Comments

Ondrej Klempir DNAnexus Team
- 10 May 2022 14:41
Thomas, were you able to resolve this issue?

I am sharing my thoughts (not tested). It would be interesting to check whether the annotation data is also stored somewhere other than gs, and whether the DB object can be loaded with a different Hail function (one that can read from a local file instead of the cloud). Another idea off the top of my head would be to check whether it is possible to download the databases from the original sources and set them up manually for Hail.
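
As a rough sketch of the first idea: the error in the question means that Hadoop on the cluster has no filesystem registered for the gs scheme, so it can help to check which URL scheme a dataset path uses before handing it to Hail. The helper names, the set of readable schemes and the example paths below are all assumptions made for illustration; they are not taken from Hail or from the annotation database.

```python
from urllib.parse import urlparse

# ASSUMPTION: schemes this cluster's Hadoop installation can read.
# Adjust the set to match your own environment.
READABLE_SCHEMES = {"", "file", "hdfs", "s3", "s3a"}


def path_scheme(path):
    """Return the URL scheme of a path ('' for a plain local path)."""
    return urlparse(path).scheme


def is_readable(path, readable=READABLE_SCHEMES):
    """True if the path's scheme is in the set assumed to be readable."""
    return path_scheme(path) in readable


# Hypothetical paths, for illustration only.
for p in ["gs://example-bucket/table.ht", "s3://example-bucket/table.ht"]:
    print(p, "->", "readable" if is_readable(p) else "no filesystem for scheme")
```

A path that fails this check would be expected to raise the same "No FileSystem for scheme" error when Hail tries to open it.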

Former User of DNAx Community_4
- 26 May 2022 15:10
Hi Ondrej,

Yes, I was able to solve this by switching the cloud from gs (gcp) to aws, and I kept the region as US.
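
For anyone landing here later, a minimal sketch of what that change might look like, assuming the same call as in the question with only the cloud argument switched. The function name annotate_with_aws_db is hypothetical, and whether every dataset in the list is published for the aws/us combination has not been checked here.

```python
# Dataset names copied from the command in the question.
DATASETS = [
    "CADD",
    "clinvar_gene_summary",
    "clinvar_variant_summary",
    "dbNSFP_genes",
    "dbNSFP_variants",
    "dbSNP_rsid",
    "gnomad_exome_sites",
]


def annotate_with_aws_db(mt, names=DATASETS, region="us"):
    """Annotate the rows of a MatrixTable from the AWS copy of the database.

    Hypothetical wrapper: the only difference from the original command
    is cloud='aws' instead of cloud='gcp'.
    """
    import hail as hl  # imported here so the module loads without Hail

    db = hl.experimental.DB(region=region, cloud="aws")
    return db.annotate_rows_db(mt, *names)
```

In a notebook this would be called as mt = annotate_with_aws_db(mt) after Hail has been initialised and mt has been loaded.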
