Query of the week #8: Filtering BGEN file and exporting geno data to Pandas df

Ondrej Klempir DNAnexus Team

After a one-week pause (we were enjoying public holidays), we are back with Query of the week number 8. This week we follow up on Query #7. There was a great discussion below that post, and this week's post elaborates on one way to load genotype data and process it into a Pandas dataframe. One participant mentioned two approaches (PLINK and Hail) in that discussion. Today I am sharing my notes and one example of how to query a BGEN file directly.

We are going to explore a UKB BGEN file along with its sample file, using bgen-reader (https://github.com/jeremymcrae/bgen). According to its authors, "This has been optimized for UKBiobank bgen files (i.e. bgen version 1.2 with zlib compressed 8-bit genotype probabilities, but the other bgen versions and zstd compression have also been tested using example bgen files)". Among other things, it offers a set of methods for interacting with a BGEN file, such as with_rsid, at_position, varids, rsids, chroms and positions. See the GitHub page (https://github.com/jeremymcrae/bgen) for more details, and give them a star if you like it! On UKB-RAP, you can easily install it within JupyterLab.

Now to the example itself and what worked for me. I was able to:

a) read a BGEN file together with its sample file

b) list all rsids in the BGEN file

[Screenshot: steps a) and b)]

c) get the genotype data for a selected rsid for all samples in that BGEN file

d) subset the samples according to a filter

[Screenshot: steps c) and d)]
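
Since the screenshots are not reproduced as text, here is a minimal sketch of steps a) to d). It is not the exact notebook code. It assumes the package from the linked repository is installed and importable as `bgen`, that it exposes a `BgenReader` class taking a `sample_path` argument, and that a variant object has a `probabilities` attribute (class and attribute names may differ between versions, so please check the README). The file paths, the rsid and the helper names `open_reader` and `probabilities_for_rsid` are hypothetical.

```python
import numpy as np


def open_reader(bgen_path, sample_path):
    """Step a): open a BGEN file together with its sample file."""
    # Assumption: the package is importable as `bgen` and provides BgenReader.
    from bgen import BgenReader
    return BgenReader(bgen_path, sample_path=sample_path)


def probabilities_for_rsid(reader, rsid, keep_samples=None):
    """Steps c) and d): genotype probabilities for one rsid, optionally
    restricted to the sample IDs listed in keep_samples."""
    matches = reader.with_rsid(rsid)
    # Depending on the version, with_rsid may return one variant or a list.
    if isinstance(matches, (list, tuple)):
        if not matches:
            raise KeyError(f"{rsid} not found in BGEN file")
        variant = matches[0]
    else:
        variant = matches

    # Rows are samples, columns are genotype probabilities.
    probs = np.asarray(variant.probabilities)
    samples = list(reader.samples)

    if keep_samples is not None:
        keep = set(keep_samples)
        mask = np.array([s in keep for s in samples], dtype=bool)
        probs = probs[mask]
        samples = [s for s, m in zip(samples, mask) if m]
    return samples, probs


# Hypothetical usage (paths and rsid are placeholders):
# reader = open_reader("chr21.bgen", "chr21.sample")
# print(reader.rsids()[:10])                      # step b): list rsids
# samples, probs = probabilities_for_rsid(
#     reader, "rs0000", keep_samples=reader.samples[:100]
# )
```

The filter in step d) is expressed here as a set of sample IDs, but any boolean mask over the samples would work the same way.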

With this tool and the steps described above, you get a NumPy array of genotype data that can easily be converted to a Pandas dataframe. Looking forward to hearing about your experience with processing geno data!
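
For the NumPy to Pandas step, one possible sketch is below. It uses a small made-up probability array rather than real UKB data, and assumes a biallelic, unphased, diploid variant, so that each row holds three probabilities (homozygous first allele, heterozygous, homozygous second allele). The sample IDs, the rsid, the column names and the helper `probabilities_to_frame` are all hypothetical.

```python
import numpy as np
import pandas as pd


def probabilities_to_frame(sample_ids, probs, rsid):
    """Turn an (n_samples, 3) probability array into a DataFrame."""
    probs = np.asarray(probs, dtype=float)
    if probs.ndim != 2 or probs.shape[1] != 3:
        raise ValueError("expected an (n_samples, 3) array")
    if len(sample_ids) != probs.shape[0]:
        raise ValueError("one sample ID is needed per row")

    df = pd.DataFrame(
        probs,
        index=pd.Index(sample_ids, name="sample_id"),
        columns=["p_hom_first", "p_het", "p_hom_second"],
    )
    # Expected count of the second allele: P(het) + 2 * P(hom second).
    df["dosage_second_allele"] = df["p_het"] + 2.0 * df["p_hom_second"]
    df.insert(0, "rsid", rsid)
    return df


# Made-up example data for three samples.
sample_ids = ["s1", "s2", "s3"]
probs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.25, 0.5, 0.25],
])

df = probabilities_to_frame(sample_ids, probs, rsid="rs0000")

# Keep only the samples that pass a filter (here: a list of IDs).
subset = df.loc[df.index.isin(["s1", "s3"])]
print(subset)
```

Keeping the sample ID as the index makes it straightforward to join the genotype columns with phenotype tables later on.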
