Error Reading Public Datasets
Hello,
I am trying to access the publicly available gnomAD dataset from AWS S3 (s3://), but I am getting a "Timeout waiting for connection from pool" error. How should I proceed?
Code:
```
import hail as hl

mt = hl.read_matrix_table("s3://gnomad-public-us-east-1/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt")
var_qc_mt = hl.variant_qc(mt)
# Filter to variants with AF between 0.05 and 0.95, and call rate greater than 0.999
filtered_mt = var_qc_mt.filter_rows(
    (var_qc_mt.variant_qc.AF[0] > 0.05) & (var_qc_mt.variant_qc.AF[1] > 0.05)
    & (var_qc_mt.variant_qc.AF[0] < 0.95) & (var_qc_mt.variant_qc.AF[1] < 0.95)
    & (var_qc_mt.variant_qc.call_rate > 0.999)
)
# mt_hgdp_tgp_clean is not defined in this snippet (presumably created elsewhere in the session)
prunned_mt = hl.ld_prune(mt_hgdp_tgp_clean.GT, r2=0.1, bp_window_size=500000)
```
Error:
```
FatalError: ConnectionPoolTimeoutException: Timeout waiting for connection from pool
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost task 55.3 in stage 1.0 (TID 522, ip-10-60-49-94.eu-west-2.compute.internal, executor 0): com.amazonaws.AmazonClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
```
Please do let me know how to fix this issue.
Regards
Akhil
Comments
I'm not able to reproduce the timeout error based on the code provided. Do you know which command call within this code block is throwing the error?
Here is a previous post about troubleshooting Hail that you might find helpful: https://community.dnanexus.com/s/question/0D5t000004AflSiCAJ/hail-troubleshooting-for-ukb-data
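One way to narrow down which call fails is to run the steps one at a time and stop at the first exception. The helper below is only a sketch: `first_failing_step` is a hypothetical name, not part of Hail or DNAnexus. Keep in mind that Hail builds most of its pipeline lazily, so a step such as `hl.variant_qc` may return immediately and the error may only surface at a step that forces computation (for example `count_rows()` or `hl.ld_prune`).

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def first_failing_step(
    steps: Iterable[Tuple[str, Callable[[], object]]]
) -> Optional[Tuple[str, Exception]]:
    """Run (name, thunk) pairs in order.

    Returns (name, exception) for the first step that raises,
    or None if every step completes.
    """
    for name, thunk in steps:
        start = time.perf_counter()
        try:
            thunk()
        except Exception as exc:
            elapsed = time.perf_counter() - start
            print(f"{name}: FAILED after {elapsed:.1f}s: {exc!r}")
            return name, exc
        elapsed = time.perf_counter() - start
        print(f"{name}: ok ({elapsed:.1f}s)")
    return None


# Hypothetical usage with the pipeline from the question
# (requires Hail and S3 access, so it is left commented out):
#
# import hail as hl
# state = {}
# steps = [
#     ("read", lambda: state.update(mt=hl.read_matrix_table(PATH))),
#     ("variant_qc", lambda: state.update(qc=hl.variant_qc(state["mt"]))),
#     ("count_rows", lambda: state["qc"].count_rows()),  # forces evaluation
# ]
# first_failing_step(steps)
```

The printed timings also help distinguish a step that fails immediately from one that fails only after many S3 reads.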
This line is throwing the error. It was running fine on Friday.
prunned_mt = hl.ld_prune(mt_hgdp_tgp_clean.GT, r2=0.1, bp_window_size=500000)
@Akhil Could you reach out to ukbiobank-support@dnanexus.com with your error? We may need our team to look into it in detail.
Sure, I will reach out to ukbiobank support. Thank you so much.
Thank you for the help. I haven't received any response from the UK Biobank support team. I searched online and found this Stack Overflow solution (https://stackoverflow.com/questions/56259853/why-aws-is-rejecting-my-connections-when-i-am-using-wholetextfiles-with-pyspar).
They suggest starting Spark with this builder:
```
builder = (
    SparkSession
    .builder
    .config("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    .config("fs.s3a.awsAccessKeyId", aws_access_key)
    .config("fs.s3a.awsSecretAccessKey", aws_secret_key)
    .config("fs.s3a.fast.upload", "true")
    .config("fs.s3a.multipart.size", "1G")
    .config("fs.s3a.fast.upload.buffer", "disk")
    .config("fs.s3a.connection.maximum", 200)
    .config("fs.s3a.attempts.maximum", 20)
    .config("fs.s3a.connection.timeout", 30)
    .config("fs.s3a.threads.max", 10)
    .config("fs.s3a.buffer.dir", "hdfs:///user/hadoop/temporary/s3a")
)
```
I tried running this but got an error that aws_access_key is not defined. Since DNAnexus is AWS-based, is there a way to get an AWS access key ID and secret access key in DNAnexus?
Regards
Akhil
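Since the gnomAD bucket is public, access keys may not be needed at all. The sketch below is not a confirmed fix for the DNAnexus environment. It assumes the bucket is read through Hadoop's S3A connector (an `s3a://` path), and `s3a_spark_conf` and `init_hail` are hypothetical helper names. It raises the S3A connection pool size and selects Hadoop's anonymous credentials provider, passing both as Spark properties with the `spark.hadoop.` prefix. If `s3://` paths are served by a different connector on the cluster, the property names differ, and the settings only take effect if they are applied before the Spark context starts.

```python
from typing import Dict


def s3a_spark_conf(max_connections: int = 200, anonymous: bool = True) -> Dict[str, str]:
    """Build Spark properties that tune the Hadoop S3A connector.

    Assumption: the cluster reads the bucket via S3A (s3a:// paths).
    """
    conf = {
        # Size of the HTTP connection pool used by S3A.
        "spark.hadoop.fs.s3a.connection.maximum": str(max_connections),
    }
    if anonymous:
        # Public buckets can be read without credentials.
        conf["spark.hadoop.fs.s3a.aws.credentials.provider"] = (
            "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
        )
    return conf


def init_hail(conf: Dict[str, str]):
    """Start Hail with the given Spark properties (requires Hail to be installed)."""
    import hail as hl  # imported lazily so the helper above works without Hail

    hl.init(spark_conf=conf)
    return hl


# Example (commented out because it needs a Spark/Hail environment):
# hl = init_hail(s3a_spark_conf(max_connections=200))
```

The value 200 mirrors the `fs.s3a.connection.maximum` setting in the builder above; a suitable pool size depends on how many tasks each executor runs concurrently.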
We cannot share the secret key, and I'm afraid it's best to let the support team handle this, since you can grant them access to the project.
Thank you so much for the information. I reached out to the help desk last week but have had no response so far.
The response can be slow from time to time, depending on ticket volume. We will discuss this with them and see how we can resolve the issue. It would be hard for the public community to help, since most of us are focused on UKB data.