One of the big issues for a data scientist is configuring the data science environment correctly. Sometimes this means installing a lot of packages, waiting for packages to compile, handling obscure errors, and setting everything up to work correctly, and most of the time it is a pain. But a correctly configured environment is necessary to reproduce an analysis, or when we need to share our work with others.
For these reasons, I introduced Docker into my data science workflow.
What is Docker?
Docker is a tool that simplifies the installation process for software engineers. To explain it in a very simple way (sorry, Docker gurus, for this definition): Docker creates a super lightweight virtual machine that can be started in a few milliseconds and contains everything we need to run our environment in the right way.
If you would like to read more, this is the Docker official website.
The goal of this post is to create an environment to run a very simple Jupyter notebook.
First of all we need to install Docker for our platform; refer to this tutorial to perform the correct steps.
Now we can start creating our environment. Actually, we could just pull a ready-to-use image for this; on Docker Hub there are a lot of ready-to-use images, for example:
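One well-known example is the jupyter/datascience-notebook image, which can be pulled with a single command:

```sh
# Download a ready-to-use data science image from Docker Hub
docker pull jupyter/datascience-notebook
```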
but my goal is to create my own environment from scratch!
Open your favorite text editor and start creating the Dockerfile, the file that describes how the image will be built (a full sketch follows the list of steps):
- Start from a simple python3 image, which is based on Debian.
- Then update all packages to their latest versions.
- Copy the requirements.txt that describes all the Python packages we need for our data science environment.
- Run the installation of all the packages.
- Expose the port for Jupyter.
- And run the command to start the Jupyter notebook.
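Putting these steps together, a minimal sketch of such a Dockerfile could look like this (the apt-get commands, the /tmp path, and the notebook-dir are my choices for illustration; the token 'mynotebook' matches the one used later in this post):

```dockerfile
# Start from the official Python 3 image, which is based on Debian
FROM python:3

# Update all Debian packages to their latest versions
RUN apt-get update && apt-get upgrade -y

# Copy the requirements.txt with the Python packages we need
COPY requirements.txt /tmp/requirements.txt

# Install all the Python packages with pip
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Expose the port Jupyter listens on
EXPOSE 8888

# Start the notebook server, listening on all interfaces so it is
# reachable from outside the container, with a fixed login token
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--allow-root", "--notebook-dir=/", "--NotebookApp.token=mynotebook"]
```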
Now it’s time to write the requirements.txt. This file lists all the Python packages we need and will be used by pip to install them correctly.
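For example, a minimal requirements.txt for a typical data science setup might look like this (the package list is illustrative; put whatever you need here):

```
numpy
pandas
matplotlib
scikit-learn
jupyter
```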
Ok, we are ready to build our image; the command is:
docker build -t your_container_name .
with the -t option we can tag our image, for example:
docker build -t datascience_env .
When the build process is finished, we can run our container:
docker run -p 8887:8888 -v /path_your_machine/notebook_folder/:/Documents --name datascience_env -it datascience_env
With the -v option, /path_your_machine/notebook_folder/ on our machine will be mounted into the Docker container at the /Documents path.
This is useful for saving our work and for keeping the environment separate from the notebooks. I prefer to organize my work this way, rather than creating a Docker image that contains both the environment and the notebooks.
Ok, when the container is up, we can open the Jupyter web interface at http://localhost:8887 (the host port we mapped with the -p option),
and when the token is requested, we enter ‘mynotebook’, or whatever you set in your Dockerfile, and that’s all! Now we can work in our new data science environment.
Clicking on Documents, we find all our notebooks! Note: since the folder is mounted from our machine, every change is kept even after the container is stopped.
To test this environment, I used the DBSCAN example found on the scikit-learn website. This is the link.
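For reference, a condensed sketch of that example looks roughly like this (the parameters follow the scikit-learn demo; the plotting part is omitted):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate three toy clusters and standardize the features
X, _ = make_blobs(n_samples=750, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# Run DBSCAN; points labeled -1 are treated as noise
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Estimated number of clusters: {n_clusters}")
```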
When our work is finished, we can stop the container with the command:
docker stop datascience_env
I think Docker is a very important tool for every developer and every data scientist for deploying and sharing work. From my point of view, the most important innovation Docker introduces is a way to describe (with a Dockerfile) how to correctly recreate an environment where my code can run. This way I can reproduce, every time, exactly the environment I used during my development process, and I can share the built image with everyone.