cram cache download/internet access from samtools

Permanently deleted user

I am trying to use samtools view on selected genome positions with the CRAM files in my applet. It fetches the reference cache files from www.ebi.ac.uk, which requires internet access. The process can take 2 to 3 hours for a 50 GB CRAM file on a weekday. In addition, I get these messages:

[W::cram_populate_ref] Creating reference cache directory /home/dnanexus/.cache/hts-ref

This may become large; see the samtools(1) manual page REF_CACHE discussion

 

I do not know the exact hg38 reference fasta version used in my cram file, so the cache files generated following the instructions in

http://www.htslib.org/workflow/cram.html

didn't work.

I managed to download the cache MD5 files from the EBI website, and I no longer see the messages above. However, if I remove internet access from dxapp.json, the job fails.

So how do I correctly set the environment variables in an applet so that samtools uses a local copy of the cache?

I am using python in the applet.
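For reference, a minimal sketch of how the two htslib environment variables, REF_CACHE and REF_PATH, could be set from Python before calling samtools. It assumes the local cache directory uses the same `%2s/%2s/%s` layout as the default `~/.cache/hts-ref` and already holds every sequence the CRAM refers to. The function names and paths are hypothetical. Setting REF_PATH to a local pattern replaces the default EBI URL, so no network lookup should be attempted.

```python
import os
import subprocess


def local_ref_env(cache_dir, base_env=None):
    """Return a copy of the environment that points htslib at a local cache only.

    Assumption: cache_dir is laid out as <first 2 hex>/<next 2 hex>/<rest of MD5>,
    which is what the "%2s/%2s/%s" pattern describes.
    """
    env = dict(os.environ if base_env is None else base_env)
    pattern = os.path.join(cache_dir, "%2s", "%2s", "%s")
    env["REF_CACHE"] = pattern  # where htslib looks first
    env["REF_PATH"] = pattern   # overrides the default EBI URL
    return env


def view_region(cram_path, region, out_bam, cache_dir):
    """Run samtools view on one region using only the local reference cache."""
    cmd = ["samtools", "view", "-b", "-o", out_bam, cram_path, region]
    subprocess.run(cmd, check=True, env=local_ref_env(cache_dir))
```

If a sequence is missing from the cache, samtools should then fail with an error instead of silently downloading it, which makes gaps in the cache easier to spot.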

Thank you for any help.

Comments

4 comments

  • Comment author
    Anastazie Sedlakova DNAnexus Team

    Hello,

    When we work with CRAM files, we usually provide the reference file using the -T parameter; see the example here.
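As an illustration of the -T approach from a Python applet, here is a small sketch that builds and runs the samtools command. The file names and the helper names are hypothetical. It assumes the CRAM has an index (.crai) alongside it so that region queries work, and that the FASTA is the one the CRAM was written against.

```python
import subprocess


def build_view_command(cram_path, reference_fasta, regions, out_bam):
    """Build a samtools view command that decodes a CRAM against a local FASTA."""
    cmd = ["samtools", "view", "-b", "-T", reference_fasta, "-o", out_bam, cram_path]
    cmd.extend(regions)  # e.g. ["chr1:100000-200000"]
    return cmd


def extract_regions(cram_path, reference_fasta, regions, out_bam):
    """Run the command and raise CalledProcessError if samtools fails."""
    subprocess.run(
        build_view_command(cram_path, reference_fasta, regions, out_bam),
        check=True,
    )
```

With -T given, samtools reads the reference from the local FASTA and should not need the MD5 cache or internet access.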

     

    You can download the reference file here, or see here for a description of how to work out the reference file from the CRAM header.

     

    Hope that helps

  • Comment author
    Permanently deleted user

    Thank you. I couldn't complete the download of the hg38 FASTA file from your link, but I found one in the GATK4 bundle that matches the M5 tags in the CRAM file header. samtools runs significantly faster now.
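The M5 comparison mentioned above can be sketched in Python. The M5 value on an @SQ header line is the MD5 of the sequence in upper case with whitespace removed, so a candidate FASTA can be checked against the header text printed by `samtools view -H`. The function names here are hypothetical, and the sketch reads the FASTA as an iterable of lines.

```python
import hashlib


def header_m5s(header_text):
    """Map sequence name -> M5 value from the @SQ lines of a SAM/CRAM header."""
    m5s = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "M5" in fields:
            m5s[fields["SN"]] = fields["M5"]
    return m5s


def fasta_m5s(lines):
    """Map sequence name -> MD5 of the upper-cased sequence without whitespace."""
    m5s = {}
    name, digest = None, None
    for line in lines:
        line = "".join(line.split())  # drop newlines and any stray whitespace
        if line.startswith(">"):
            if name is not None:
                m5s[name] = digest.hexdigest()
            name = line[1:] if " " not in line else line[1:]
            digest = hashlib.md5()
        elif name is not None and line:
            digest.update(line.upper().encode("ascii"))
    if name is not None:
        m5s[name] = digest.hexdigest()
    return m5s


def mismatched_sequences(header_text, fasta_lines):
    """Return the names whose header M5 is absent from or differs in the FASTA."""
    actual = fasta_m5s(
        # keep only the first word of each FASTA title line as the name
        (l.split()[0] + "\n" if l.startswith(">") and l.split() else l)
        for l in fasta_lines
    )
    expected = header_m5s(header_text)
    return sorted(n for n, m5 in expected.items() if actual.get(n) != m5)
```

An empty result from `mismatched_sequences` suggests the FASTA is the reference the CRAM was encoded against; for a full hg38 FASTA the scan reads the whole file once, so it takes a while.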

  • Comment author
    Anastazie Sedlakova DNAnexus Team

    Hello, that is strange; the link I provided worked for me. Nevertheless, it is good that samtools is much faster now. Is the FASTA file you used GRCh38.primary_assembly.genome.fa?

  • Comment author
    Permanently deleted user

    Actually the link works. I guess my local server had some problems.

    wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

    only downloaded part of the sequence.

    I found a version of Homosapiens_assembly38.fasta in the GATK4 bundle that contains the same M5 values as the ones in the CRAM files, so it worked. (I had to remove the "_" before "sapiens" in the file name, or else I couldn't post this reply.)

    Thank you so much for the help.

