500k BEAGLE Phased VCFs data release

500k BEAGLE Phased VCFs are available on the UKB-RAP and are described on Showcase in Field 30108. These data enable more accurate genotype imputation, fine-mapping of disease-associated loci in genetic association testing, and detection of compound heterozygosity, as well as analyses of recombination, ancestry, and parent-of-origin effects.

Data Processing

The 500k BEAGLE Phased data were generated using the ML-Corrected DRAGEN whole genome pVCFs (Field 24311).

Data processing prior to phasing used bcftools v1.19 and consisted of the following steps:

The first shard of each chromosome was filtered using the bcftools command:

bcftools norm -m +any -Ou in.vcf.gz \
| bcftools view -i "FILTER=\"PASS\"&MAX(INFO/AC)>=2&N_ALT<255&QUAL>=30.0" \
-Oz -o out.vcf.gz

The subsequent shards of each chromosome were filtered in the same way, with the -H option added to omit the VCF header from the output, using the bcftools command:

bcftools norm -m +any -Ou in.vcf.gz \
| bcftools view -i "FILTER=\"PASS\"&MAX(INFO/AC)>=2&N_ALT<255&QUAL>=30.0" \
-H -Oz -o out.vcf.gz

These commands combined VCF records with the same POS field into a single VCF record and then excluded any records that met one or more of the following conditions:

The VCF QUAL field is < 30.0
The VCF FILTER field is not "PASS"
The number of alternate alleles is >= 255
The allele count for every alternate allele is < 2
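
As an illustration only, the four exclusion conditions can be written as a single awk predicate over uncompressed VCF text. This is a sketch and not the release pipeline: the function name keep_passing_records is hypothetical, it assumes records at the same position have already been merged, and it treats a missing QUAL value as failing the threshold.

```shell
# Hypothetical helper (not part of the release pipeline): reads uncompressed
# VCF text on stdin and prints header lines plus records passing all four checks.
keep_passing_records() {
  awk -F'\t' '
    /^#/ { print; next }                      # pass header lines through
    {
      n_alt = split($5, alt, ",")             # number of alternate alleles
      max_ac = 0
      n_info = split($8, info, ";")
      for (i = 1; i <= n_info; i++) {
        if (info[i] ~ /^AC=/) {               # per-allele counts, e.g. AC=1,7
          n_ac = split(substr(info[i], 4), ac, ",")
          for (j = 1; j <= n_ac; j++)
            if (ac[j] + 0 > max_ac) max_ac = ac[j] + 0
        }
      }
      if ($6 != "." && $6 + 0 >= 30 && $7 == "PASS" && n_alt < 255 && max_ac >= 2)
        print
    }'
}
```

A record with AC=1,3 is kept because at least one alternate allele has a count of 2 or more, whereas a record with AC=1,1 is dropped.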

The resulting filtered VCF files, one per genomic interval (shard), were then combined to create a chromosome-wide VCF file for each chromosome.
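
Because only the first shard keeps its header and every shard is written as compressed VCF (-Oz), the shards can be joined by byte-level concatenation: a stream of concatenated gzip members decompresses to the concatenation of their contents. A minimal sketch follows; the function name concat_shards and the file names are hypothetical, and the shards must be listed in genomic order.

```shell
# Hypothetical helper: concat_shards OUT SHARD1 SHARD2 ...
# SHARD1 must be the shard that still carries the VCF header.
concat_shards() {
  out=$1
  shift
  cat "$@" > "$out"
}

# Example (file names are placeholders):
#   concat_shards chr21.unphased.vcf.gz chr21.shard1.vcf.gz chr21.shard2.vcf.gz
```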

Subsequently, each of these unphased chromosome-wide VCF files was phased with Beagle v5.5. For the chromosome X VCF, male genotypes in the non-PAR region were changed to homozygous diploid genotypes by Beagle immediately prior to genotype phasing.

Phasing was carried out using the following commands:

Each chromosome-wide VCF file, except chromosome 6, was phased with the command:

java -Xmx685g -jar beagle.jar gt=unphased.vcf.gz \
map=map_file out=chr${chr}.phased window-markers=3500000

To satisfy memory constraints, the chromosome 6 VCF file was phased with a smaller window-markers value:

java -Xmx685g -jar beagle.jar gt=unphased.vcf.gz \
map=map_file out=phased window-markers=3200000

The VCF QUAL, FILTER, and INFO fields in the input unphased VCF file are copied to the output phased VCF file. Consequently, the INFO/AC and INFO/AN fields in the output VCF file do not count sporadic missing alleles in the input VCF file that were imputed during genotype phasing.
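
If allele counts that reflect the phased (and imputed) genotypes are needed, they can be recomputed from the GT fields. The awk sketch below does this for uncompressed VCF text; the function name recount_alleles is hypothetical, and it assumes GT is the first FORMAT subfield. The bcftools fill-tags plugin offers similar functionality for compressed files.

```shell
# Hypothetical helper: reads uncompressed VCF text on stdin and prints
# CHROM, POS, AN and per-allele AC recomputed from the genotypes.
recount_alleles() {
  awk -F'\t' '
    /^#/ { next }                               # skip header lines
    {
      n_alt = split($5, alt, ",")
      for (a = 1; a <= n_alt; a++) ac[a] = 0    # reset per-record counters
      an = 0
      for (i = 10; i <= NF; i++) {
        split($i, fmt, ":")                     # GT assumed to be first subfield
        n = split(fmt[1], g, "[|/]")            # phased or unphased separator
        for (k = 1; k <= n; k++) {
          if (g[k] == ".") continue             # missing allele: not counted
          an++
          if (g[k] + 0 > 0) ac[g[k] + 0]++
        }
      }
      acs = ac[1]
      for (a = 2; a <= n_alt; a++) acs = acs "," ac[a]
      print $1 "\t" $2 "\tAN=" an "\tAC=" acs
    }'
}
```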

Data Considerations

Males in the chromosome X phased VCF file have diploid genotypes in the non-PAR region. These male genotypes are therefore homozygous, except in some cases where the input male genotype is missing and the true allele cannot be determined with high confidence. Haploid male genotypes can be recovered by deleting the second haplotype of each male sample in the chromosome X non-PAR region; however, this will substantially increase the size of the gzip-compressed VCF file.
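
A minimal awk sketch of that recovery step is shown below. The function name haploidize_males, the sample-list file, and the PAR boundary arguments are all hypothetical inputs. The input is assumed to contain chromosome X records only, and the boundaries must match the reference build (the usage comment uses commonly cited GRCh38 values, which should be checked against your reference).

```shell
# Hypothetical helper: haploidize_males MALES_FILE PAR1_END PAR2_START < in.vcf
# MALES_FILE lists one male sample ID per line. For positions strictly between
# PAR1_END and PAR2_START, the second haplotype of each male genotype is removed.
haploidize_males() {
  awk -F'\t' -v OFS='\t' -v males="$1" -v par1_end="$2" -v par2_start="$3" '
    BEGIN { while ((getline id < males) > 0) is_male[id] = 1 }
    /^##/ { print; next }
    /^#CHROM/ {
      for (i = 10; i <= NF; i++) if ($i in is_male) male_col[i] = 1
      print; next
    }
    ($2 + 0) > (par1_end + 0) && ($2 + 0) < (par2_start + 0) {
      for (i in male_col) sub(/\|[^:]*/, "", $i)   # "1|1" becomes "1"
    }
    { print }'
}

# Example (GRCh38 boundaries are an assumption to verify):
#   zcat chrX.phased.vcf.gz | haploidize_males males.txt 2781479 155701383 | bgzip > chrX.haploid.vcf.gz
```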

The data processing using bcftools (described above) results in some long entries in the ID field. In some cases, the variant ID in the phased VCF files exceeds plink/plink2's 16,000-character limit for variant IDs, and plink reports the following error: "Error: Invalid ID on line xxxx of --vcf file (max 16000 chars)."
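
One possible workaround, sketched here rather than taken from the release pipeline, is to replace any over-long ID with a short CHROM:POS identifier before loading the file into plink. Because records sharing a POS were merged, CHROM:POS is expected to be unique per record. The function name shorten_long_ids is hypothetical.

```shell
# Hypothetical helper: shorten_long_ids [MAX_LEN] < in.vcf > out.vcf
# IDs longer than MAX_LEN characters (default 16000) are replaced by CHROM:POS.
shorten_long_ids() {
  awk -F'\t' -v OFS='\t' -v max_len="${1:-16000}" '
    /^#/ { print; next }                          # keep header lines unchanged
    length($3) > max_len + 0 { $3 = $1 ":" $2 }   # rewrite only over-long IDs
    { print }'
}
```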

As mentioned above, VCF records with the same POS field were combined into a single VCF record. Some software packages require multiallelic VCF records to be split into multiple biallelic VCF records.
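
For such tools, bcftools norm can split the merged records again. The wrapper below is a sketch: the function name split_multiallelic and the file names are placeholders, and it has not been validated against the released files.

```shell
# Hypothetical wrapper: split_multiallelic IN.vcf.gz OUT.vcf.gz
# "-m -any" splits multiallelic records of any variant type into biallelic records.
split_multiallelic() {
  bcftools norm -m -any -Oz -o "$2" "$1"
}

# Example (file names are placeholders):
#   split_multiallelic chr21.phased.vcf.gz chr21.phased.biallelic.vcf.gz
```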

In the future, we plan to provide a notebook with guidance on how to start using the UKB pVCF data, including the 500k BEAGLE phased data. This notebook will be made available on the UK Biobank GitHub.

Computational Information

Filtering using bcftools:

Filtering jobs were initially run on "mem3_ssd1_v2_x2" spot instances. Filtering jobs that failed due to insufficient memory were re-run on "mem3_ssd1_v2_x4" or "mem3_ssd1_v2_x8" instances. Filtering jobs that repeatedly failed due to spot instance interruptions were re-run on on-demand instances.

If all filtering jobs were run on on-demand "mem3_ssd1_v2_x4" instances, which cost 0.1604 GBP/hour, the cost of VCF record filtering is estimated to be between 4,000 and 5,000 GBP.
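
As a rough cross-check, the quoted cost range and hourly rate imply a total number of instance-hours. This back-of-the-envelope calculation is an inference from the two figures above, not a reported statistic, and the function name instance_hours is hypothetical.

```shell
# Hypothetical helper: instance_hours COST_GBP RATE_GBP_PER_HOUR
# Prints the implied number of instance-hours, rounded to the nearest hour.
instance_hours() {
  awk -v cost="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", cost / rate }'
}

instance_hours 4000 0.1604   # lower bound of the estimate
instance_hours 5000 0.1604   # upper bound of the estimate
```

This corresponds to roughly 25,000 to 31,000 hours of "mem3_ssd1_v2_x4" time.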

Concatenation using the Linux cat command:

The cost of concatenating the filtered shards of each chromosome to create chromosome-wide VCF files was 6 GBP.

Phasing using Beagle:

Each unphased chromosome-wide VCF file was phased on an on-demand "mem3_ssd1_v2_x96" instance with Beagle v5.5.

The cost of genome-wide phasing with Beagle was 11,064 GBP.

Citation

The most recent published reference for the Beagle genotype phasing methodology is:

Browning, B. L., Tian, X., Zhou, Y., & Browning, S. R. (2021). Fast two-stage phasing of large-scale sequence data. American Journal of Human Genetics, 108(10), 1880–1890. https://doi.org/10.1016/j.ajhg.2021.08.005
