Hi team,
I am doing this in a single-node Jupyter notebook, but it takes around 3 hours for a single folder, and there are around 51 folders in one category of liver MRI. What I actually want is to pull the zipped images, extract them, and process them locally.
When I look through the available Spark Jupyter notebook examples, I cannot find a way to do this with imaging data. Can somebody help with this?
Regards
Richa
In my opinion, unzipping bulk files is not a good candidate for Spark parallelization. It would be possible to parallelize in batch mode or run several jobs simultaneously, e.g. you could create an applet that unzips and processes the data per IDEAL folder (that would be around 50 jobs in total for the IDEAL protocol). Of course, this would depend on the specific use case.
Anyway, I went ahead and tried the following experiment in bash. I used single-node Jupyter (R/Python):
The zip files for the Liver MRI (IDEAL protocol) are relatively small (around 5 MB per participant). I downloaded one folder (811 participants) onto the JupyterLab worker and avoided reading from the mounted directory (dxfuse). My command was:
time dx download --no-progress --lightweight project-XYZ:"/Bulk/Liver MRI/IDEAL/10/" -r
This took 12m37s.
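If you would rather drive the same download from a notebook cell, here is a minimal sketch that wraps the command above with subprocess. The function names are hypothetical, "project-XYZ" is a placeholder rather than a real project ID, and it assumes the dx CLI is installed and logged in on the worker.

```python
import shlex
import subprocess


def build_download_cmd(project, folder):
    """Return the dx download command from the post as an argument list."""
    # project and folder are placeholders supplied by the caller.
    return [
        "dx", "download",
        "--no-progress", "--lightweight",
        f"{project}:{folder}",
        "-r",
    ]


def download_folder(project, folder, dest="."):
    """Run the download inside dest; requires the dx CLI and a valid login."""
    subprocess.run(build_download_cmd(project, folder), cwd=dest, check=True)


if __name__ == "__main__":
    # Only prints the command; call download_folder(...) to actually run it.
    cmd = build_download_cmd("project-XYZ", "/Bulk/Liver MRI/IDEAL/10/")
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids quoting problems with the space in "Liver MRI".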
Then I iterated over the downloaded files and unzipped them. The command was:
time for f in *.zip; do unzip "$f" -d "${f%.zip}"; done
This was a pretty quick operation and took 1m53s, i.e. this was the time to unzip all participants in one IDEAL folder. In total, about 14 minutes for one folder. Theoretically, 14 min * 50 folders = 700 min, or roughly 12 hours, so an overnight job. Not too bad. I used the mem2_ssd1_v2_x8 instance type, which would cost about 3 pounds in total, per my estimate.
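The back-of-the-envelope estimate can be reproduced with the two timings quoted above. This is only the arithmetic, assuming every folder behaves like the one that was measured and that there are 50 folders; real folders may differ in size.

```python
# Timings measured for one IDEAL folder (811 participants).
download_s = 12 * 60 + 37   # dx download: 12m37s
unzip_s = 1 * 60 + 53       # unzip loop: 1m53s

per_folder_min = (download_s + unzip_s) / 60
folders = 50                # approximate folder count, an assumption
total_hours = per_folder_min * folders / 60

print(f"{per_folder_min:.1f} min per folder, {total_hours:.1f} h total")
# prints "14.5 min per folder, 12.1 h total"
```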
It would depend on which steps you are planning to run on the unzipped folders, and also on whether you would like to process all participants or just a selected cohort (in that case I would expect some kind of bulk data filtering).
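For the selected-cohort case, one possible way to filter is by file name after the download. The sketch below assumes each zip name starts with the participant ID followed by an underscore; that naming pattern is an assumption, so please check it against your own files. The function name and the example IDs are hypothetical.

```python
from pathlib import Path


def select_cohort_zips(folder, cohort_ids):
    """Return zip paths whose leading ID (text before the first "_") is in cohort_ids."""
    wanted = {str(i) for i in cohort_ids}
    return [
        p for p in sorted(Path(folder).glob("*.zip"))
        # Assumed naming pattern: <participant id>_<rest>.zip
        if p.name.split("_", 1)[0] in wanted
    ]


# Example call with made-up IDs:
# select_cohort_zips("10", [1000001, 1000002])
```

Filtering before the download would save more time, but how to do that depends on how you list the files in your project.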
Overall, I'm not saying that this is an optimization, just a quick summary of what I observed. I would love to hear more Community ideas :).
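For anyone who prefers to stay in Python for the unzip step, here is a rough equivalent of the bash loop that uses only the standard library and extracts several archives at a time. The function names and the worker count are my own choices, and whether threads actually speed things up will depend on the disk and instance type, so treat it as a sketch rather than a benchmark.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_one(zip_path):
    """Extract one archive into a sibling folder named after the zip."""
    target = zip_path.with_suffix("")  # same idea as "${f%.zip}" in bash
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
    return target


def extract_all(folder, workers=8):
    """Extract every *.zip in folder, running up to `workers` extractions at once."""
    zips = sorted(Path(folder).glob("*.zip"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, zips))
```

Calling extract_all("10") after the download would unpack each participant's zip into its own folder, as the bash loop does.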
Thanks a lot, @Ondrej Klempir, for the detailed reply. I was using Python code to pull and unzip at the same time; maybe that is why it was taking much longer. I will follow your instructions to do it quickly.
Regards
Richa
P.S. I suspect that the instance configuration plays a big role in which libraries are pre-installed on the machine. For example, I was using the mem1_hdd1_v2_x16 instance, which did not have the unzip package installed in bash. Then I tried your suggested instance and it worked!
Interesting, perhaps this would depend on which JupyterLab flavor you choose.
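If the unzip binary is missing on a given image, one workaround is to check for it and fall back to Python's built-in zipfile module, which does not depend on what is installed in bash. A small sketch (the function name is hypothetical):

```python
import shutil
import subprocess
import zipfile


def unzip(zip_path, dest):
    """Extract with the unzip binary if it is on PATH, else with zipfile."""
    if shutil.which("unzip"):
        # -q: quiet, -o: overwrite without prompting, -d: target directory
        subprocess.run(
            ["unzip", "-q", "-o", str(zip_path), "-d", str(dest)],
            check=True,
        )
        return "unzip"
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return "zipfile"
```

The return value only reports which route was taken, which can help when comparing instance types.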