Hi there. Can you help me extract genotype data for specific variants from imputed BGEN files using the "bigsnpr" R package?
I'm new to working with imputed genotype data and BGEN files. I wrote some code to extract an example SNP on chromosome 22 (22_42524947_C_T), but I'm not sure my method is correct. The genotype data for this SNP came out without participant IDs, so I took the IDs from the ".sample" file and combined the two columns (the genotype column with the ID column). However, I'm not sure the IDs are in the same order as the genotype data. Please have a look at my code below and let me know if anything needs correcting; if there are errors, corrected syntax would be much appreciated:
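install.packages("bigsnpr")
library(bigsnpr)

# Read one SNP from the chromosome 22 BGEN file (a matching .bgen.bgi index file is required)
x <- snp_readBGEN(
  bgenfiles   = "ukb22828_c22_b0_v3.bgen",
  backingfile = "ukb22828_c22_b0_v3.bgen.bk",
  list_snp_id = list("22_42524947_C_T"),
  ind_row     = NULL,      # NULL reads all individuals
  read_as     = "dosage",  # first (default) choice of c("dosage", "random")
  ncores      = 8
)

# Attach the backing file and pull out the dosages for all 487,409 individuals
R <- snp_attach(rdsfile = "ukb22828_c22_b0_v3.bgen.bk.rds")
Genotype.data <- as.data.frame(R$genotypes[1:487409, ])

# Read the sample file, drop the second header row (column types), keep ID and sex
sample <- bigreadr::fread2("ukb22828_c22_b0_v3.sample")[-1, ]
ID.sex <- sample[, c(2, 4)]
Final.Genotype <- cbind(ID.sex, Genotype.data)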
Comments
If you can use Python, I can recommend some packages that I have used successfully with BGEN files.
Hi Chai,
Thank you so much for your reply.
I'm afraid I have no experience with Python, to be honest, but many thanks for taking the time to help. I appreciate it.
I hope someone here can help me with R. If not, I may ask you how to do this in Python, and I will start learning.
I have experimented with the bgen and bgen_reader packages in Python. They work great. Each has pros and cons in terms of what information it extracts for you.
https://bgen-reader.readthedocs.io/en/latest/quickstart.html
https://github.com/jeremymcrae/bgen
bgen-reader has more documentation, so you could start with that. I don't think you actually need to learn Python extensively. I have found Python very helpful in my bioinformatics career, but it takes about a week to learn the basics and many months to get good at it. Basic Python knowledge should be sufficient here. You can just extract the data, save it to a CSV file, and run the rest in R if that's your preferred choice.
Here are the steps you can follow if you want to try the Python option.
1) Launch JupyterLab and select the Python/R kernel. You can use ttyd or cloud_workstation as well. I made this example using cloud_workstation with ipython inside it.
2) Use the terminal to download the bgen file of interest to the workstation with dx download <file-id>
3) Install bgen-reader with pip install bgen-reader
4) Launch a Python notebook or ipython and copy the code from the bgen_reader documentation into it. If that looks confusing, you can use the minimal example I made below.
from pathlib import Path
from bgen_reader import read_bgen

filename = "ukb22828_c21_b0_v3.bgen"  # replace with the file name you want
file_path = Path("/home/dnanexus") / filename  # the path depends on where you keep the file
bgen = read_bgen(file_path, verbose=True)
# At this point the data has been read. See the examples in
# https://bgen-reader.readthedocs.io/en/latest/quickstart.html for how to get the data you need.
# For example:
print(bgen["variants"].head())
5) Now you need a bit of Python knowledge to format the data the way you want and save it to a file, so that you can use it in R.
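For step 5, here is a rough sketch of one way to do the formatting, not a tested recipe for the real UKB files. It assumes you have already pulled the per-sample genotype probabilities for your SNP out of bgen_reader as rows of [P(AA), P(AB), P(BB)] (see the quickstart for how to get them), and that the rows are in the same order as the lines of the .sample file. The function names, the "eid" column name and the output layout are made up for illustration.

```python
import csv


def read_sample_ids(sample_path):
    """Return the ID_2 column of an Oxford-format .sample file, in file order."""
    with open(sample_path) as handle:
        header = handle.readline().split()
        handle.readline()  # second line holds column types (0 0 0 D ...), not a participant
        id_col = header.index("ID_2")
        return [line.split()[id_col] for line in handle if line.strip()]


def alt_dosage(probs):
    """Expected count of the second allele from [P(AA), P(AB), P(BB)]."""
    p_aa, p_ab, p_bb = probs
    return p_ab + 2.0 * p_bb


def write_dosage_csv(sample_ids, probabilities, out_path):
    """Pair IDs with dosages by row position and write them as a two-column CSV."""
    # Pairing by position is only safe if both inputs describe the same samples
    # in the same order, so at least refuse to continue when the counts differ.
    if len(sample_ids) != len(probabilities):
        raise ValueError(
            f"{len(sample_ids)} sample IDs but {len(probabilities)} genotype rows"
        )
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["eid", "dosage"])  # hypothetical column names
        for sample_id, probs in zip(sample_ids, probabilities):
            writer.writerow([sample_id, round(alt_dosage(probs), 4)])
```

You would then call write_dosage_csv(read_sample_ids("your.sample"), probabilities, "snp_dosage.csv") and load the CSV in R with read.csv. The length check does not prove the order is right, it only catches the obvious mismatch.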
I will leave this thread open in case anyone has tried the R package you are using.
Thank you so much, Chai. This is so helpful.
Many thanks for sharing this with me.
Kind regards
Alternatively, you can extract the SNP from the bgen file using bgenix via swiss-army-knife, then read the single-SNP bgen with plink and export it as a raw file. The raw file will have the FID, the IID and the dosage for the SNP, and can be read into R or Python easily enough.
-Phil
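If you end up reading the raw file in Python, something along these lines may be enough. This is a minimal sketch that assumes the usual .raw layout: a whitespace-delimited header with FID, IID, PAT, MAT, SEX, PHENOTYPE followed by one column per SNP, and NA for missing values. The function name and the returned structure are just for illustration.

```python
def read_raw(path):
    """Parse a PLINK .raw file into a list of dicts, one per participant."""
    records = []
    with open(path) as handle:
        header = handle.readline().split()
        snp_columns = header[6:]  # assumed: first six columns are FID IID PAT MAT SEX PHENOTYPE
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            record = {"FID": fields[0], "IID": fields[1]}
            for name, value in zip(snp_columns, fields[6:]):
                # Missing genotypes are written as NA; keep them as None
                record[name] = None if value == "NA" else float(value)
            records.append(record)
    return records
```

Because each row carries its own IID, there is no need to line the dosages up against a separate sample file by position.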