Error Reading Public Datasets

Hello,

I am trying to read the publicly available gnomAD dataset from AWS S3 (s3:// paths), but I am getting a "Timeout waiting for connection from pool" error. How should I proceed with this error?

Code:

```
import hail as hl

mt = hl.read_matrix_table("s3://gnomad-public-us-east-1/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt")

var_qc_mt = hl.variant_qc(mt)

# Filter to variants with AF between 0.05 and 0.95, and call rate greater than 0.999
filtered_mt = var_qc_mt.filter_rows(((var_qc_mt.variant_qc.AF[0] > 0.05) & (var_qc_mt.variant_qc.AF[1] > 0.05)) &
                 ((var_qc_mt.variant_qc.AF[0] < 0.95) & (var_qc_mt.variant_qc.AF[1] < 0.95)) &
                 (var_qc_mt.variant_qc.call_rate > 0.999))

prunned_mt = hl.ld_prune(mt_hgdp_tgp_clean.GT, r2=0.1, bp_window_size=500000)
```
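To make the row filter in the code above concrete, here is a plain-Python mirror of the same condition. It does not use Hail; the function name and the sample values are hypothetical and only illustrate which rows the thresholds would keep.

```python
def passes_common_variant_filter(af, call_rate,
                                 af_min=0.05, af_max=0.95,
                                 min_call_rate=0.999):
    """Mirror of the filter_rows condition: AF[0] and AF[1] must both lie
    strictly between af_min and af_max, and call_rate must exceed
    min_call_rate. Illustration only, not Hail code."""
    af_in_range = all(af_min < freq < af_max for freq in af[:2])
    return af_in_range and call_rate > min_call_rate


# Hypothetical rows: (allele frequencies, call rate)
sample_rows = [
    ([0.70, 0.30], 0.9995),  # common variant, well genotyped
    ([0.98, 0.02], 0.9995),  # alternate allele too rare
    ([0.70, 0.30], 0.9900),  # call rate too low
]
kept = [row for row in sample_rows if passes_common_variant_filter(*row)]
```

Only the first sample row satisfies all three conditions, so `kept` holds that single row.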

Error:

```
FatalError: ConnectionPoolTimeoutException: Timeout waiting for connection from pool

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost task 55.3 in stage 1.0 (TID 522, ip-10-60-49-94.eu-west-2.compute.internal, executor 0): com.amazonaws.AmazonClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
```

 

Please do let me know how to fix this issue.

Regards

Akhil

Comments

8 comments

  • Comment author
    Alexandra Lee DNAnexus Team

    I'm not able to reproduce the timeout error based on the code provided. Do you know which command call within this code block is throwing the error?

     

    Here is a previous post about troubleshooting Hail that you might find helpful: https://community.dnanexus.com/s/question/0D5t000004AflSiCAJ/hail-troubleshooting-for-ukb-data
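Because Hail evaluates lazily, the line where an error surfaces is not necessarily the line that causes it. One way to narrow this down is to force each stage in turn and report the first one that fails. The helper below is a minimal sketch with a hypothetical name (`first_failing_step`); the demo uses stand-in steps rather than real Hail calls.

```python
def first_failing_step(steps):
    """Run (name, action) pairs in order.

    Returns (name, exception) for the first action that raises,
    or None if every action completes."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return name, exc
    return None


def _simulated_timeout():
    # Stand-in for a stage that fails the way the reported job did.
    raise TimeoutError("Timeout waiting for connection from pool")


# Stand-in steps. With Hail, each action would force evaluation of one
# stage, for example (assumed usage, not run here):
#   ("read", lambda: mt.count_rows()),
#   ("variant_qc + filter", lambda: filtered_mt.count_rows()),
demo_steps = [
    ("read_matrix_table", lambda: None),
    ("variant_qc + filter_rows", lambda: None),
    ("ld_prune", _simulated_timeout),
]

failed = first_failing_step(demo_steps)
print(failed[0] if failed else "all steps succeeded")
```

Forcing a stage can be expensive on a dataset of this size, so this is better suited to a small subset of the data.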

  • Comment author
    Former User of DNAx Community_6

    This block is throwing the error; it was running fine on Friday.

     

    prunned_mt = hl.ld_prune(mt_hgdp_tgp_clean.GT, r2=0.1, bp_window_size=500000) 

  • Comment author
    Chai Fungtammasan DNAnexus Team

    @Akhil Could you reach out to ukbiobank-support@dnanexus.com with your error? We may need our team to look into it in detail.

  • Comment author
    Former User of DNAx Community_6

    Sure, I will reach out to ukbiobank support. Thank you so much.

     

  • Comment author
    Former User of DNAx Community_6

    Thank you for the help. I haven't received any response from the ukbiobank support team. I searched online and found this StackOverflow solution (https://stackoverflow.com/questions/56259853/why-aws-is-rejecting-my-connections-when-i-am-using-wholetextfiles-with-pyspar).

    They suggested starting Spark with this builder:

     

    builder = (
      SparkSession
      .builder
      .config("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
      .config("fs.s3a.awsAccessKeyId", aws_access_key)
      .config("fs.s3a.awsSecretAccessKey", aws_secret_key)
      .config("fs.s3a.fast.upload", "true")
      .config("fs.s3a.multipart.size", "1G")
      .config("fs.s3a.fast.upload.buffer", "disk")
      .config("fs.s3a.connection.maximum", 200)
      .config("fs.s3a.attempts.maximum", 20)
      .config("fs.s3a.connection.timeout", 30)
      .config("fs.s3a.threads.max", 10)
      .config("fs.s3a.buffer.dir", "hdfs:///user/hadoop/temporary/s3a")
    )

     

    I tried running this but got an error that aws_access_key was not found. Since DNAnexus is AWS-based, is there a way to get an AWS access key ID and secret access key in DNAnexus?

     

    Regards

    Akhil
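For a public bucket, the connection-pool settings from the builder above can in principle be expressed without any access keys. The sketch below only builds the configuration dictionary. The function name is hypothetical, and it assumes that the bucket allows anonymous reads, that the data is read through the Hadoop S3A connector (s3a:// paths), and that the cluster lets you pass Spark properties at startup. None of this is confirmed for the DNAnexus environment.

```python
def public_s3a_conf(max_connections=200, max_attempts=20):
    """Build Spark properties for anonymous S3A reads with a larger
    HTTP connection pool. The values mirror the builder quoted above
    and are not tuned recommendations."""
    return {
        # Anonymous access, so no access key or secret key is needed
        # (assumption: the bucket permits unsigned requests).
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
        # Size of the S3A connection pool; the reported error indicates
        # that tasks timed out waiting for a free connection.
        "spark.hadoop.fs.s3a.connection.maximum": str(max_connections),
        # Number of retry attempts for failed requests.
        "spark.hadoop.fs.s3a.attempts.maximum": str(max_attempts),
    }


conf = public_s3a_conf()

# Assumed usage with Hail, not run here; it must happen before any
# other Hail call:
#   import hail as hl
#   hl.init(spark_conf=conf)
```

These properties affect only the S3A connector. If the cluster maps s3:// paths to a different filesystem implementation, they may have no effect.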

     

     

     

  • Comment author
    Chai Fungtammasan DNAnexus Team

    We cannot share the secret key, and I'm afraid it's best to let the support team handle this, since you can grant them access to the project.

  • Comment author
    Former User of DNAx Community_6

    Thank you so much for the information. I reached out to the help desk last week but have had no response so far.

  • Comment author
    Chai Fungtammasan DNAnexus Team

    Responses can be slow at times, depending on ticket volume. We will discuss this with them and see how we can resolve the issue. It would be hard for the public community to help, since most of us are focused on UKB data.

