Guest Author
This post was written by a guest author, Jillian Rowe, who can be reached at jillian@dabbleofdevops.com.
Running CellProfiler in batch mode is the ideal way to automate large-scale analyses. Or not-so-large-scale analyses that you would simply prefer to automate!
One of the benefits of running CellProfiler in batch is that you can split your analysis, which is helpful when you have a very large dataset. Say you have a dataset that would take 1 hour to analyze in a single run. You could split that analysis into 4 chunks, run them in parallel, and each would complete in roughly 15 minutes.
This is also an important consideration if you are running out of memory or CPU. The larger the dataset you are analyzing, the more memory it consumes, so if you need to decrease the computational resources each job uses, you can often split your dataset. The instructions for the CreateBatchFiles module describe how to set up a CellProfiler pipeline and submit it to a cluster. This post is a tutorial on step 7: submitting your batches to the cluster. Let's get started!
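The divide-and-conquer arithmetic can be sketched in a few lines of shell. The numbers here are hypothetical (1000 image sets, 4 chunks); the first/last indices this computes are exactly what CellProfiler's -f and -l flags, covered below, expect:

```shell
# Hypothetical example: split a 1000-image-set analysis into 4 equal chunks.
TOTAL=1000
CHUNKS=4
SIZE=$(( TOTAL / CHUNKS ))
for i in $(seq 0 $(( CHUNKS - 1 ))); do
  FIRST=$(( i * SIZE + 1 ))
  LAST=$(( FIRST + SIZE - 1 ))
  echo "chunk $(( i + 1 )): image sets ${FIRST}-${LAST}"
done
# → chunk 1: image sets 1-250 ... chunk 4: image sets 751-1000
```

Each chunk is then a separate CellProfiler invocation, which is what lets you run them in parallel.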
If you prefer to watch, here is a video where I go through the steps described.
Using Docker
Docker is a way of packaging applications. A Docker container is a bit like a lightweight virtual machine, except without a graphical interface. Once you have it all set up, you can treat it much as you would a regular computer.
Quick disclaimer: if you are very uncomfortable with the command line you may want to reach out for help. This tutorial does not require much Linux command-line knowledge, but you will need to be able to type commands and navigate a directory structure. Here's a quick explanation and tutorial from Ubuntu to get you started.
We will be using the default CellProfiler Docker image with a few changes. We are making these changes because the image is set up in a way that is very well suited to a job-queue environment, but what we want here is to dig around and do some exploratory analysis.
Dockerfile
Create a project directory, cellprofiler-batch-tutorial, and cd into it.
mkdir cellprofiler-batch-tutorial
cd cellprofiler-batch-tutorial
Then create a file called Dockerfile with this:
FROM cellprofiler/cellprofiler:3.1.9
RUN apt-get update -y; apt-get install -y unzip imagemagick
ENV TINI_VERSION v0.16.1
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /usr/bin/tini
RUN chmod +x /usr/bin/tini
ENTRYPOINT [ "/usr/bin/tini", "--" ]
CMD [ "/bin/bash" ]
Now we'll build our new CellProfiler image!
docker build -t cellprofiler .
Simple Analysis with the Example Human Dataset
We're going to start off with a very simple example just to get a feel for how we would run things in batch mode. Once we're there we will move on to more complex pipelines. (WOOOO!)
Let's grab the first dataset.
wget http://cellprofiler-examples.s3.amazonaws.com/ExampleHuman.zip
unzip ExampleHuman.zip
Here's what the dataset looks like -
.
├── ExampleHuman.cppipe
├── README.md
└── images
├── AS_09125_050116030001_D03f00d0.tif
├── AS_09125_050116030001_D03f00d1.tif
└── AS_09125_050116030001_D03f00d2.tif
The ExampleHuman.cppipe file is a CellProfiler pipeline, the README is the usual README, and the images directory holds the images that we want to analyze with the pipeline!
Drop into our CellProfiler Image
Earlier I said that a Docker container is (mostly) a computer. We're going to use it as one now.
docker run -it --name cellprofiler -v "$(pwd)":/project \
cellprofiler \
bash
Now you have a shell inside the Docker container. cd to your project directory and check that the expected files are there.
cd /project/ExampleHuman
ls -lah # Should show the ExampleHuman dataset
Run CellProfiler
Make sure you can run the CellProfiler CLI by executing cellprofiler with --help. (This is always a nice sanity check.)
cellprofiler --run --run-headless --help
Now let's run with our ExampleHuman dataset!
cellprofiler --run --run-headless \
-p ExampleHuman.cppipe \
-o output -i images
Split your Dataset with CellProfiler
First of all, this CellProfiler analysis only uses one image set, so splitting it isn't very interesting, but it is informative.
You can use the -f (first image set) and -l (last image set) flags to tell CellProfiler which slice of the dataset to process; this is how you split your dataset.
cellprofiler --run --run-headless \
-p ExampleHuman.cppipe \
-o output \
-i images \
-f 1 -l 1
Once that finishes, you should see an image along with some CSV files in the output directory!
Profile a Single Image
If you are designing a large-scale batch pipeline with CellProfiler, you need to know how much memory and CPU you're using. We're going to grab this information using a tool called Portainer.
Portainer does a LOT of things, and those things are very cool, but right now we are only using it to profile our CellProfiler process running in a Docker container.
Start the portainer service like so:
docker volume create portainer_data
docker run -d -p 8000:8000 -p 9000:9000 --name=portainer --restart=always -v /var/run/docker.sock:/var/run/docker.sock -v portainer_data:/data portainer/portainer
Go to your browser and open localhost:9000 to see the Portainer service. You may be prompted to create a username and password. If so, do that, and you will land on the Portainer home page.
Now profile your dataset like so:
Here's a video walkthrough of the profiling step: https://www.youtube.com/embed/SUwdjyI8RjA
Notice that the command to run a single image exits very quickly. You will have to be quick to see the memory profiling in real time, or you may have to rerun it several times.
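If you'd rather stay in the terminal than watch the Portainer UI, Docker's built-in `docker stats` command reports the same CPU and memory counters. The container name below assumes you used `--name cellprofiler` as we did above:

```shell
# One-shot snapshot of the container's CPU and memory usage.
# Drop --no-stream to watch the numbers update live while CellProfiler runs.
FORMAT='table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'
docker stats --no-stream --format "$FORMAT" cellprofiler || true
# The `|| true` just keeps scripted runs going if the container has already exited.
```

Like Portainer, this only catches the process while it is alive, so a fast single-image run may still require a few tries.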
A More Complex Dataset
We've now discussed how to plan and think about batching a CellProfiler pipeline using a simple example. Now let's get into a real dataset! We'll use BBBC021 from the Broad Bioimage Benchmark Collection, specifically the Week1 data, which is about 15GB.
I'm going to show you how I ran this dataset, but CellProfiler is very flexible and I'm sure there are other ways.
Grab the data
You should still be inside your Docker container. From there, let's grab some data!
cd /project
mkdir -p BBBC021/Week1
cd BBBC021/Week1
wget https://data.broadinstitute.org/bbbc/BBBC021/BBBC021_v1_images_Week1_221...
wget https://data.broadinstitute.org/bbbc/BBBC021/BBBC021_v1_images_Week1_221...
wget https://data.broadinstitute.org/bbbc/BBBC021/BBBC021_v1_images_Week1_221...
wget https://data.broadinstitute.org/bbbc/BBBC021/BBBC021_v1_images_Week1_223...
wget https://data.broadinstitute.org/bbbc/BBBC021/BBBC021_v1_images_Week1_223...
wget https://data.broadinstitute.org/bbbc/BBBC021/BBBC021_v1_images_Week1_224...
find $(pwd) -name "*zip" | xargs -I {} unzip {}
# Clean up the zips, we don't need them anymore
find $(pwd) -name "*zip" | xargs -I {} rm -rf {}
cd ../..
# Run $(pwd) to check where you are. You should be in /project/BBBC021
wget https://data.broadinstitute.org/bbbc/BBBC021/BBBC021_v1_image.csv
wget https://data.broadinstitute.org/bbbc/BBBC021/BBBC021_v1_compound.csv
wget https://data.broadinstitute.org/bbbc/BBBC021/BBBC021_v1_moa.csv
wget https://data.broadinstitute.org/bbbc/BBBC021/analysis.cppipe
wget https://data.broadinstitute.org/bbbc/BBBC021/illum.cppipe
Understanding the MetaData
I think the metadata is best explained by the dataset itself.
You will also see that there are two pipelines: an illumination-correction pipeline and an analysis pipeline. I had to play around with the exact inputs and outputs to get this to work without errors, but here is how it fits together:
illum.cppipe
Week1/Week1_22123/
# Inputs to the Illumination AND Analysis pipelines
Week1*.tif
# Outputs of the Illumination pipeline,
# inputs to the Analysis pipeline
Week1_22123_IllumActin.npy
Week1_22123_IllumActinAvg.npy
Week1_22123_IllumDAPI.npy
Week1_22123_IllumDAPIAvg.npy
Week1_22123_IllumTubulin.npy
Week1_22123_IllumTubulinAvg.npy
# Outputs of the Analysis pipeline
overlay
labels
measurements
illum_corrected
You should end up with images in the illum_corrected, labels, and overlay directories, and CSV files in the measurements directory.
The analysis pipeline expects the output from the illumination pipeline to exist in the same directory.
Create a Week 1 Data File
The data file that comes with the dataset includes data for all weeks. We're only using Week 1, because I'm doing this on my laptop.
cd /project/BBBC021
head -n 1 BBBC021_v1_image.csv > images_week1.csv
grep Week1 BBBC021_v1_image.csv >> images_week1.csv
We will not actually use this file, but it is useful to understand the structure of the analysis.
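The head-then-grep pattern above is worth understanding on its own: the header row is copied first because a plain grep would drop it. Here's a self-contained miniature (the file and column names are made up for illustration):

```shell
# Tiny stand-in CSV demonstrating the header-preserving split pattern.
cat > demo_image.csv <<'EOF'
TableNumber,Image_PathName_DAPI
1,Week1/Week1_22123
2,Week2/Week2_24121
EOF
head -n 1 demo_image.csv > demo_week1.csv   # keep the header row
grep Week1 demo_image.csv >> demo_week1.csv # append only Week1 rows
cat demo_week1.csv
# → TableNumber,Image_PathName_DAPI
# → 1,Week1/Week1_22123
```

The same two commands work for any week, or any other column value you want to filter on.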
Run some checks
Let's make sure that we can process at least the first image.
cellprofiler --run --run-headless \
-p illum.cppipe \
-o Week1/Week1_22123 \
-i Week1/Week1_22123 \
-c -r -f 1 -l 1
You should see CellProfiler log some progress output as it processes the first image set.
This is a check, and only a check. Because the illumination pipeline computes average illumination files across the whole dataset, we shouldn't use the divide-and-conquer approach for it. When you want to run the entire pipeline for real, rerun illum.cppipe with no -f or -l flags. For troubleshooting and thinking about how you want to batch your analysis, running just the first image set is fine.
# Rerun this when you want to run the entire analysis
# It will take some time, so don't run until you're sure!
cellprofiler --run --run-headless \
-p illum.cppipe \
-o Week1/Week1_22123 \
-i Week1/Week1_22123
Run the analysis in Batch
This dataset comes with BBBC021_v1_image.csv, which is a CellProfiler CSV data file. These files are created based on the groupings in the experiment, and the exact details that go into creating them are particular to your situation. You can also use the -f and -l flags to choose the first and last image sets of each batch, or some combination of the two. Disclaimer: I am not a biologist, and actually designing these pipelines is beyond me. ;-)
cellprofiler --run --run-headless \
-p analysis.cppipe \
-i Week1/Week1_22123 \
-o Week1/Week1_22123/f1-l1 \
-c -r -f 1 -l 1
Since we can split the analysis, we want to make sure that each split gets its own output directory. This is necessary so we don't clobber the output each time!
This will take some time; on my laptop it was around 10 minutes. For fun, watch the Portainer stats while it runs!
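Putting the pieces together, the full set of splits can be scripted as a loop over -f/-l ranges, one output directory per split. This is a sketch with a made-up batch size of 24 image sets; it only echoes the commands so you can inspect them (or hand them to a job scheduler) before running anything heavy:

```shell
# Sketch: image sets 1-96 in batches of 24, one output dir per batch.
# Remove the `echo` once the printed commands look right.
for FIRST in $(seq 1 24 96); do
  LAST=$(( FIRST + 23 ))
  OUTDIR="Week1/Week1_22123/f${FIRST}-l${LAST}"
  mkdir -p "$OUTDIR"
  echo cellprofiler --run --run-headless \
    -p analysis.cppipe \
    -i Week1/Week1_22123 \
    -o "$OUTDIR" \
    -c -r -f "$FIRST" -l "$LAST"
done
```

Because every batch writes to its own f…-l… directory, the batches can run in parallel without clobbering each other, which is the whole point of the exercise.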
Bonus - Human Readable Images
HCS images often appear very dark when opened in a regular file viewer. They are fine when opened with CellProfiler, but very dark otherwise. If you would like to view your images with your system image viewer or a web browser, you can use ImageMagick to convert them.
cd /project/BBBC021/Week1/Week1_22123/f1-l1/labels/Week1_22123
find . -name '*.tiff' | sed 's/\.tiff$//' | xargs -I {} convert -auto-level -quality 100 {}.tiff {}.png
Please note that the produced .png images will not be suitable for subsequent reanalysis in CellProfiler. They are only for human viewing!
Wrap Up
That's it! I hope you have a better understanding of how to run your CellProfiler pipeline in batch!
Citations and DataSets
BBBC021v1
We used image set BBBC021v1 [Caie et al., Molecular Cancer Therapeutics, 2010], available from the Broad Bioimage Benchmark Collection [Ljosa et al., Nature Methods, 2012].
Example Human
The Example Human dataset comes straight from the https://cellprofiler.org/examples/ page.