Getting INFO scores for .bgen files

I understand that I can use Swiss Army Knife to get INFO scores for imputed GEL data, but the run times are far too long. I have been running this command in Swiss Army Knife: qctool -g name.bgen -s name.sample -snp-stats -osnp snp_stats.txt. Even on the small chromosome 22 it takes forever. Is there an alternative way of getting INFO scores for GEL that does not involve signing up for EGA or submitting jobs that never seem to end?

 

Comments


  • Comment author
    Gabriele Maria Sgarlata

    Hi Alexandra

    I am still learning to use the UKB-RAP, so I am not sure whether what I am suggesting is the best way, but in principle you can do this with the Python package Hail:

    import hail as hl

    # Create an index file for the BGEN file

    bgen_path = '/mnt/project/Bulk/Imputation/UKB imputation from genotype'
    filename = 'ukb22828_c22_b0_v3'
    file_url = f"file://{bgen_path}/{filename}.bgen"

    # Map the BGEN file to the location where its index will be written
    index_file_map = {file_url: f"hdfs:///{filename}.idx2"}

    hl.index_bgen(path=file_url, index_file_map=index_file_map, reference_genome="GRCh37", contig_recoding=None, skip_invalid_loci=False)

    # Import the BGEN file into a Hail MatrixTable, reusing the index created above

    mt = hl.import_bgen(path=file_url, entry_fields=['GT', 'GP'], sample_file=f"file://{bgen_path}/{filename}.sample", n_partitions=None, block_size=None, index_file_map=index_file_map, variants=None)

    # Compute the IMPUTE info score of each variant from the genotype probabilities (GP)
    # and keep only the variants with a score of at least 0.8
    mt_with_infoscore = mt.filter_rows(hl.agg.info_score(mt.GP).score >= 0.8)
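For reference, the score returned by hl.agg.info_score is the IMPUTE info score, which compares the uncertainty left in each sample's imputed dosage with the variance expected from the allele frequency. The sketch below is a plain-Python illustration of that formula, not Hail's or QCTOOL's implementation. The function name impute_info_score is hypothetical, and returning 1.0 for monomorphic variants follows the IMPUTE convention; other tools may report such variants differently.

```python
def impute_info_score(genotype_probs):
    """IMPUTE-style info score for one biallelic variant.

    genotype_probs: list of (p_AA, p_AB, p_BB) tuples, one per sample,
    each summing to 1.
    """
    n = len(genotype_probs)
    if n == 0:
        raise ValueError("at least one sample is required")

    sum_dosage = 0.0    # sum of expected B-allele dosages e_i
    sum_variance = 0.0  # sum of per-sample dosage variances f_i - e_i^2
    for _p_aa, p_ab, p_bb in genotype_probs:
        e = p_ab + 2.0 * p_bb   # expected dosage
        f = p_ab + 4.0 * p_bb   # expected squared dosage
        sum_dosage += e
        sum_variance += f - e * e

    theta = sum_dosage / (2.0 * n)  # estimated B-allele frequency
    if theta <= 0.0 or theta >= 1.0:
        # Monomorphic variant: assumed here to score 1 (IMPUTE convention)
        return 1.0
    return 1.0 - sum_variance / (2.0 * n * theta * (1.0 - theta))


# Hard calls carry no uncertainty; flat Hardy-Weinberg-like
# probabilities carry no information beyond the allele frequency.
certain = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
uninformative = [(0.25, 0.5, 0.25)] * 4
print(impute_info_score(certain), impute_info_score(uninformative))
```

A threshold such as the 0.8 used above therefore keeps variants whose imputed dosages retain most of the information that perfectly observed genotypes would carry.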

  • Comment author
    Alexandra Baousi

    Gabriele Maria Sgarlata did you do this on a Spark instance with Hail? Also, how long are the run times? Spark clusters seem to take a while to launch for me, so I was just wondering whether it is the same for all of us, aha! Thank you for your response, by the way!

  • Comment author
    Gabriele Maria Sgarlata

    Hi Alexandra, 

    I did this on the Spark instance with Hail. The analysis itself seems to take only a few seconds, although you may need to check with your full pipeline. From what I have read in other threads, Hail is "lazy": it defers the actual computation until a result is needed, for instance at the moment of writing the MatrixTable (for example, to a database). The few seconds I saw may therefore not include the real computation.
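Because of this deferred evaluation, an annotation or filter only runs once something forces it, such as an export. The following is a minimal sketch of writing per-variant INFO scores to a text file, which is also where the real run time would show up. It assumes mt is a MatrixTable imported from a BGEN file with a GP entry field; the function name export_info_scores and the output path are hypothetical, and the sketch has not been run against GEL or UKB data.

```python
def export_info_scores(mt, out_path):
    """Annotate each variant with its IMPUTE info score and export a TSV.

    mt: a Hail MatrixTable with a GP entry field (assumed).
    out_path: destination of the tab-separated file (hypothetical).
    """
    import hail as hl  # imported here so the sketch can be loaded without Hail

    # Lazy step: this only records the aggregation, nothing is computed yet
    mt = mt.annotate_rows(info=hl.agg.info_score(mt.GP))

    # Keep the variant keys (locus, alleles) plus the two info_score fields
    ht = mt.rows()
    ht = ht.select(info_score=ht.info.score, n_included=ht.info.n_included)

    # Action: exporting forces Hail to read the BGEN and run the aggregation
    ht.export(out_path)


# Example call (hypothetical output location):
# export_info_scores(mt, "hdfs:///chr22_info_scores.tsv")
```

Timing the export call, rather than the annotate or filter step, gives a more realistic estimate of how long the analysis takes.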

    Regarding the Spark cluster, yes, it takes a while to launch JupyterLab (in my case about 10 minutes, although sometimes even 20). I do not know what it depends on. Maybe the number of cores?

    I hope it helps.

    Best,

    Gabriele

     

  • Comment author
    Alexandra Baousi

    Gabriele Maria Sgarlata did you get this error at all when running your script?

    ValueError: index_bgen: no file or directory at file:///mnt/project/Bulk/Imputation/UKB%20imputation%20from%20genotype/ukb21008_c22_b0_v1.bgen

    Initially it had an issue with the path having two forward slashes rather than three, but now it simply does not see the file at all, so I am wondering: have you run the code, and has it worked? I am trying to run it as a bash shell script rather than on the UKB RAP, because the RAP is very inefficient.

  • Comment author
    Gabriele Maria Sgarlata

    Hi Alexandra, 

    No, I did not encounter this problem. Maybe it is related to how you write the path (specifically, how you handle the spaces in 'UKB imputation from genotype')?

    I cannot tell. But try following what I wrote above; it worked for me.
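One way to narrow this down is to turn the file:// URL from the error message back into a plain path and check whether that path exists on the machine running the script. The sketch below uses only the Python standard library; the helper name local_path_from_file_url is hypothetical. It decodes %20 back into spaces and rejects the two-slash form, in which "mnt" is parsed as a host name instead of a directory. Note also that, as far as I know, /mnt/project is only present where the RAP project is mounted, so the check is expected to fail outside the RAP.

```python
import os
from urllib.parse import unquote, urlparse


def local_path_from_file_url(file_url):
    """Convert a file:// URL into a local filesystem path."""
    parsed = urlparse(file_url)
    if parsed.scheme != "file":
        raise ValueError(f"not a file:// URL: {file_url}")
    if parsed.netloc:
        # "file://mnt/..." (two slashes) makes "mnt" the host part
        raise ValueError(
            f"unexpected host {parsed.netloc!r}; use file:/// with three slashes"
        )
    # Percent-decoding restores the spaces in the folder name
    return unquote(parsed.path)


# URL copied from the error message quoted in this thread
url = (
    "file:///mnt/project/Bulk/Imputation/"
    "UKB%20imputation%20from%20genotype/ukb21008_c22_b0_v1.bgen"
)
path = local_path_from_file_url(url)
print(path)
print("exists on this machine:", os.path.exists(path))
```

If the decoded path does not exist, the problem lies with where the script runs or with the file name, not with how Hail handles the spaces.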

    Best,

    Gabriele

     

