Which RAP tools work natively with the segmentation of the genome into tiny chunks (with no index)?

26 March 2024 00:23
12 comments

The only way to access UKB WGS data is on the RAP and the only format of WGS data (currently provided) on the RAP is a vast number of pVCF files which segment the genome into ~20kb chunks, without an index to know which chunks pertain to which genomic regions.

I was just wondering if anyone has found any tools (or workflows) that DNAnexus might have considered providing us with in order to allow the extraction of specific genomic regions for testing against phenotype?

Or has anyone even managed to find an index file which tells us the start and end of each tiny genome chunk?

Surely DNAnexus would have pre-empted this problem?

How are we supposed to use these data?

Comments

George F UKB Community team Data Analyst
- Edited 02 April 2024 13:17
The 500k DRAGEN WGS and GATK/GraphTyper pipeline data are currently only available in pVCF format; however, we are planning to release them in PLINK and BGEN format as well: https://biobank.ctsu.ox.ac.uk/showcase/label.cgi?id=185 and https://biobank.ctsu.ox.ac.uk/showcase/label.cgi?id=270
The previous releases of GATK/GraphTyper (https://biobank.ctsu.ox.ac.uk/showcase/label.cgi?id=271) and the exome data are available in PLINK, BGEN and pVCF formats.
The pVCF blocks are sequential, so it is possible to calculate the starting position of a block by multiplying the block number by the chunk length. We are planning to release index files for several pVCF fields as well as a notebook to help calculate the start positions of each block.
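
As a rough illustration of the block number x chunk length calculation, here is a minimal Python sketch. It assumes a fixed chunk length of exactly 20,000 bp (the original post only says ~20kb), blocks numbered from 0, and block 0 starting at position 1; none of these is confirmed in this thread, so treat the official index files as authoritative once you have them. The function names are made up for this example, and the filename pattern is taken from the example name ukb23374_c1_b0_v1.vcf.gz quoted later in the thread.

```python
# Sketch only. Assumptions (NOT confirmed by UKB in this thread):
#   - every block covers exactly CHUNK_LENGTH base pairs
#   - blocks are numbered from 0 and block 0 starts at position 1
CHUNK_LENGTH = 20_000  # assumed value; the thread only says "~20kb"


def block_start(block: int, chunk_length: int = CHUNK_LENGTH) -> int:
    """Estimated first position covered by a block (1-based)."""
    return block * chunk_length + 1


def block_for_position(pos: int, chunk_length: int = CHUNK_LENGTH) -> int:
    """Estimated block number containing a 1-based position."""
    return (pos - 1) // chunk_length


def blocks_for_region(start: int, end: int,
                      chunk_length: int = CHUNK_LENGTH) -> list[int]:
    """Estimated block numbers overlapping the region start..end (inclusive)."""
    first = block_for_position(start, chunk_length)
    last = block_for_position(end, chunk_length)
    return list(range(first, last + 1))


def pvcf_name(field: int, chrom: str, block: int, version: int = 1) -> str:
    """Filename following the pattern ukb<field>_c<chrom>_b<block>_v<version>.vcf.gz."""
    return f"ukb{field}_c{chrom}_b{block}_v{version}.vcf.gz"


if __name__ == "__main__":
    # Candidate files for a hypothetical region on chromosome 1, field 23374
    for b in blocks_for_region(39_990, 60_010):
        print(pvcf_name(23374, "1", b), "estimated start:", block_start(b))
```

If the real chunk length or starting offset differs, the estimates drift further along the chromosome, so it is worth checking a few blocks against the actual positions in the files before relying on this.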

Rachael W UKB Community team Data Analyst
- 31 March 2024 16:44
This post might also be relevant https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/17290902247453-WGS-QC-and-list-of-variants

David Curtis
- 02 August 2024 14:51
Hi.
I am very interested in this issue. Is there any update on it?
If not, you write “it possible to calculate the starting position by the block number x the chunk length”. Can we rely on this across a whole chromosome? What is the exact chunk length?
Thanks!
- Dave Curtis

Rachael W UKB Community team Data Analyst
- 02 August 2024 15:12
Resource 2008 https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=2008 is an index file for field 23374 500k GraphTyper WGS
Resource 2009 https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=2009 is an index file for field 24310 500k DRAGEN WGS

David Curtis
- 02 August 2024 15:35
Thanks. But where are the DRAGEN WGS files which this indexes? I can't find a folder which has them in? (As I've posted in another thread.)

David Curtis
- 02 August 2024 15:38
Likewise, where are the GraphTyper WGS files which this indexes? I can find some GraphTyper VCFs but they have names like ukb23352_c7_b3106_v1.vcf.gz.tbi whereas the index has names like ukb23374_c1_b0_v1.vcf.gz, so the version numbers look different?
Thanks.

Rachael W UKB Community team Data Analyst
- 02 August 2024 15:41
Please see this thread https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/17290902247453-WGS-QC-and-list-of-variants

Rachael W UKB Community team Data Analyst
- 02 August 2024 15:45
The sub-categories within Category 180 hold lists of the WGS field numbers, see https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=180

David Curtis
- 02 August 2024 15:51
Ah, thanks, it looks like I need to “Dispense more data” to get the individual level VCFs. At the moment if I go to my project settings the “Check for updates” button is greyed out because it is "unavailable due to system maintenance". I'll have a look next week and see if it's sorted then.

Rachael W UKB Community team Data Analyst
- 02 August 2024 16:21
The “Check for updates” is a different concept.
Before the WGS release in November 2023, it was possible to Refresh a RAP project and thereby receive all new fields and all updates to old fields. This was done with “Check for Updates”. It is not currently possible to Refresh a project. Anyone needing updates to old RAP projects (older than Nov 2023) needs to dispense a new RAP project.
When the 500k WGS data was released, a new system of dispensing data was introduced. Under the new system, it takes 3 dispenses to receive all the data. See https://www.ukbiobank.ac.uk/media/dovbae03/uk-biobank-final-whole-genome-sequencing-release-faqs_v1-0.pdf . However, I think the pVCFs are actually in the second set.
One way to check whether a file exists, and where it is, is with the dx find command.
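
For anyone who wants to script that search, here is a minimal Python sketch that wraps the dx command-line tool. It assumes the dx toolkit is installed and you are logged in to the relevant project; the helper names are made up for this example, and the optional path argument is a placeholder for whatever PROJECT:FOLDER you want to restrict the search to.

```python
import subprocess
from typing import Optional


def dx_find_command(pattern: str, path: Optional[str] = None) -> list[str]:
    """Build a `dx find data` command that searches by object name.

    `pattern` may contain the wildcards * and ?.
    `path` is an optional PROJECT:FOLDER restriction (placeholder here).
    """
    cmd = ["dx", "find", "data", "--name", pattern]
    if path is not None:
        cmd += ["--path", path]
    return cmd


def find_files(pattern: str, path: Optional[str] = None) -> list[str]:
    """Run the search and return one output line per matching object.

    Requires the dx toolkit on PATH and an active login.
    """
    result = subprocess.run(
        dx_find_command(pattern, path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Look for the first chromosome 1 block of field 23374
    for line in find_files("ukb23374_c1_b0_v1.vcf.gz"):
        print(line)
```

An empty result means no object with that name is visible in the searched project, which would be consistent with the data not having been dispensed yet.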

Rachael W UKB Community team Data Analyst
- 11 August 2024 18:28
Field 23352 holds GraphTyper pVCF data from the 150k release, see UKB Showcase https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23352 .
Field 23374 holds GraphTyper pVCF data from the 500k release, see UKB Showcase https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23374 .

David Curtis
- 02 October 2024 08:14
I've emailed Access about this but just to say here that it would be helpful if people were also able to download the index file for the exome pVCFs, which I think is called field_23157_pVCF_500k_Exome_starter_pos.txt.

