Are there alignment reports available?
Hi everyone,
We are currently applying for access to UKB. Unfortunately, for our research purposes we would need to realign the entire WGS dataset (or, at the very least, a large subset of the available data) to a slightly modified genome assembly. This is obviously financially problematic, if not outright impossible for us at the moment, so we are exploring ways around it.
We also don't have much experience with BWA, since we have done all of our previous WGS work with bowtie2, plus bismark and STAR for WGBS and RNA-Seq, respectively, which all provide at the very least a short report on the results of the alignments. BWA doesn't appear to include any, and it seems to require post-hoc processing to obtain any information about the alignment results.
Does anyone know if the datasets include any information about the alignment results, be it from samtools idxstats, bamtools stats or any other similar tool? If so, what information is already present?
Cheers,
Fran
Comments
Hi Fran,
BWA is a great general purpose aligner and I highly recommend using it. For basic alignment stats `samtools flagstat` is a great option. You can find the manual page here: http://www.htslib.org/doc/samtools-flagstat.html. The example output of this tool looks like this:
661181998 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
2661408 + 0 supplementary
71466644 + 0 duplicates
660796799 + 0 mapped (99.94% : N/A)
658520590 + 0 paired in sequencing
329251148 + 0 read1
329269442 + 0 read2
643899768 + 0 properly paired (97.78% : N/A)
657750192 + 0 with itself and mate mapped
385199 + 0 singletons (0.06% : N/A)
6284682 + 0 with mate mapped to a different chr
3170603 + 0 with mate mapped to a different chr (mapQ>=5)
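If you want to compare these numbers across many samples, the text output is easy to parse. Below is a minimal Python sketch, assuming the `N + M label` line layout shown above; the function name `parse_flagstat` and the regular expression are illustrative and not part of samtools.

```python
import re

# Matches lines such as "660796799 + 0 mapped (99.94% : N/A)".
# Only a trailing "(x% : y)" summary is stripped, so labels such as
# "... (mapQ>=5)" stay distinct from their unqualified counterparts.
LINE_RE = re.compile(r"^(\d+) \+ (\d+) (.+?)(?: \([^()]*:[^()]*\))?$")


def parse_flagstat(text):
    """Return {label: (qc_passed, qc_failed)} from samtools flagstat text."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            passed, failed, label = match.groups()
            counts[label] = (int(passed), int(failed))
    return counts


# Three lines copied from the example output above.
sample = """661181998 + 0 in total (QC-passed reads + QC-failed reads)
660796799 + 0 mapped (99.94% : N/A)
643899768 + 0 properly paired (97.78% : N/A)"""

stats = parse_flagstat(sample)
total = stats["in total (QC-passed reads + QC-failed reads)"][0]
mapped = stats["mapped"][0]
mapped_pct = 100 * mapped / total
print(f"mapped: {mapped_pct:.2f}%")
```

Recomputing the percentage from the raw counts is a cheap sanity check against the value that flagstat prints.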
For more advanced alignment statistics I highly recommend the GenomicAlignments package (https://bioconductor.org/packages/release/bioc/html/GenomicAlignments.html). With this package, you can leverage great statistics tools and visualization in R to get the exact result you need.
There are some alignment stats fields in the "QC metrics for WGS processing" category, which include the proportion of mapped reads and coverage. Please check the UKB Showcase for further details: https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=180
All the best,
Aleks
Hi Aleks,
Thanks for your reply. We use samtools extensively already, so that's not the issue. Our concern is that computation time on the RAP is costly, so when considering whether it is really worth paying the upfront fee to UKB we need to also account for the additional processing time that the data would require. Since alignment stats is something that many users would need to check for QC purposes, I was wondering if the output of any of the tools that provide such stats might already be present and accessible, to avoid redundant computation (and save money).
Cheers,
Fran
Hi Fran,
Could you explain what kind of project you would like to do and why you need to re-align the dataset?
The alignment will be a considerable compute cost, significantly higher than the UKB access fee even for a small subset of the dataset. The individual CRAM files are 15-20 GB each (please note that the CRAM format is designed for storage optimization). So for the 150k cohort, we are talking about multiple PiB of data to re-align.
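As a rough back-of-the-envelope check of that figure, here is a small Python sketch. It assumes decimal gigabytes (10^9 bytes) and exactly 150,000 samples, and it only counts the compressed CRAM files, not the larger uncompressed reads that an aligner would actually process.

```python
SAMPLES = 150_000   # assumed cohort size ("150k")
GB = 10**9          # assumption: decimal gigabytes
PIB = 2**50         # bytes in one pebibyte


def cohort_size_pib(per_sample_gb, n_samples=SAMPLES):
    """Total compressed size in PiB for a given per-sample CRAM size in GB."""
    return per_sample_gb * GB * n_samples / PIB


low = cohort_size_pib(15)
high = cohort_size_pib(20)
print(f"{low:.2f} to {high:.2f} PiB of CRAM")
```

Under these assumptions the compressed files alone come to roughly 2 to 2.7 PiB.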
Have you checked these files: https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=187 ?
Are any of these stats useful for your project?
Best wishes,
Aleks
In broad terms, we intend to estimate the copy number of a repetitive part of the genome that is not explicitly included in the assembly. For this, we always use a tweaked genome assembly in which pseudocopies (and complete units included in unplaced contigs) are masked, and to which we explicitly append a curated version of the region of interest. Since we are fully aware that realigning the entire dataset (or even just a fraction of it) would be prohibitively costly, we are trying to find ways around it. Fortunately, it seems that knowing the total number of reads aligned to the entire assembly and the number aligned to one or more of the unplaced contigs might suffice.
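To make that last idea concrete, here is a minimal Python sketch of the calculation, assuming per-contig counts in the tab-separated layout that `samtools idxstats` prints (reference name, length, mapped reads, unmapped reads). The function names, the `chrUn_` prefix test and all numbers in the example are hypothetical. As far as I understand, idxstats on a CRAM file still has to read through the whole file, so this does not avoid the compute cost by itself.

```python
def parse_idxstats(text):
    """Parse idxstats-style text into (name, length, mapped, unmapped) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        records.append((name, int(length), int(mapped), int(unmapped)))
    return records


def mapped_fraction(records, is_target):
    """Fraction of all mapped reads that fall on contigs selected by is_target."""
    total = sum(mapped for _, _, mapped, _ in records)
    target = sum(mapped for name, _, mapped, _ in records if is_target(name))
    return target / total if total else 0.0


# Toy input with made-up contig names, lengths and counts.
example = "\n".join([
    "chr1\t1000\t9000\t10",
    "chr2\t800\t900\t5",
    "chrUn_example1\t50\t100\t0",
    "*\t0\t0\t50",  # final line: reads with no reference assigned
])

records = parse_idxstats(example)
frac = mapped_fraction(records, lambda name: name.startswith("chrUn_"))
print(f"fraction of mapped reads on unplaced contigs: {frac:.4f}")
```

Turning such a fraction into a copy-number estimate would still need normalisation by contig length and sequencing depth, which is left out here.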