Like many people, we both love and hate notebooks such as Jupyter (https://jupyter.org/). Data science and the initial stages of model/algorithm development are creative processes, requiring lots of visualisations and quick pivoting between modelling approaches. For this rapid analysis of data and prototyping of algorithms, notebooks are excellent tools, and they are the tool of choice for many data scientists. However, they have a number of features which make them difficult to use in production:
Notebook files contain both code and outputs - these can be large (e.g. images) and also contain important business or even personal data. When used in conjunction with version control such as Git, data is by default committed to the repo. You can work round this but it is all too easy to inadvertently pass data to where it shouldn’t be. It also means that it is difficult/impossible to see exactly what changes have been made to the code from one commit to the next.
Notebook cells can run out of order - meaning that different results are possible from the same notebook - depending on what order you run the cells in.
Variables can stay in the kernel after the code which created them has been deleted. Variables can be shared between notebooks using magic commands.
Not all Python features work reliably in a notebook, e.g. multiprocessing can fail or hang in Jupyter, depending on the platform and on where the worker functions are defined.
The format of notebooks does not lend itself easily to testing - there are no intuitive test frameworks for notebooks.
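The out-of-order and leftover-variable problems above can be reproduced without a notebook at all. The following is a minimal sketch, not how Jupyter is implemented: it assumes three made-up "cells" held as strings and uses a plain dict to stand in for the kernel's namespace. The names `cells` and `run_cells` are hypothetical and exist only for this illustration.

```python
# Each string stands in for one notebook cell (hypothetical example cells).
cells = {
    "load": "values = [1, 2, 3]",
    "scale": "values = [v * 10 for v in values]",
    "total": "total = sum(values)",
}


def run_cells(order, namespace=None):
    """Execute the named cells in the given order against one shared namespace.

    The namespace dict plays the role of the kernel: whatever a cell
    assigns stays there until the "kernel" is thrown away.
    """
    namespace = {} if namespace is None else namespace
    for name in order:
        exec(cells[name], namespace)
    return namespace


# Top-to-bottom run, as the author intended.
in_order = run_cells(["load", "scale", "total"])

# Same cells, but "total" is run before "scale".
out_of_order = run_cells(["load", "total", "scale"])

# "scale" is run twice, which is easy to do by accident interactively.
rerun = run_cells(["load", "scale", "scale", "total"])

print(in_order["total"], out_of_order["total"], rerun["total"])  # 60 6 600

# Leftover state: a namespace that still holds `values` lets "total" succeed
# even though "load" is never run again (or has been deleted from the notebook).
stale = run_cells(["load"])
leftover_total = run_cells(["total"], stale)["total"]

# A fresh namespace exposes the hidden dependency straight away.
try:
    run_cells(["total"])
    fresh_kernel_failed = False
except NameError:
    fresh_kernel_failed = True
```

The same three cells give three different totals depending only on execution order, and the last two runs show why "restart and run all" is the only reliable check that a notebook still works from a clean state.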
In some cases we have used tools to run notebooks in production, but most of the time moving to standard modular code after an initial prototype has been created will make your work more testable and easier to move into production, and will probably speed up your algorithm development as well.
Using this approach has a number of advantages:
You can import your code into any notebook with a simple pip install. You can use the same tested and repeatable ELT pipeline with confidence in a number of notebooks, each exploring a different algorithm.
You can write and run tests and make use of CI tools, linting and all the other goodies software developers have created to make our code more manageable.
You can reduce your notebook’s size, so that when you’re doing presentations and demos you don’t need 1,000 lines of boilerplate before you get to the good stuff.
The final advantage of this approach, in a world of deadlines where proofs of concept far too often become production solutions, is that you productionise your code as you go. This means that when the time comes for your code to be used in production, standardising it doesn’t seem like such an insurmountable task.
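As a rough illustration of the advantages above, here is a minimal sketch of what moving logic out of a notebook might look like. Everything in it is an assumption made for the example: the package name `my_project`, the file names in the comments, and the functions `drop_missing` and `average` are hypothetical, and the two halves would normally live in separate files. The tests are written in the plain-assert style that pytest collects.

```python
# --- Hypothetical module, e.g. my_project/cleaning.py ---
from statistics import mean


def drop_missing(rows, field):
    """Return only the rows where `field` is present and not None."""
    return [row for row in rows if row.get(field) is not None]


def average(rows, field):
    """Average `field` over the rows, ignoring rows where it is missing."""
    present = drop_missing(rows, field)
    if not present:
        raise ValueError(f"no rows contain {field!r}")
    return mean(row[field] for row in present)


# --- Hypothetical test file, e.g. tests/test_cleaning.py ---
def test_drop_missing_removes_none_and_absent():
    rows = [{"price": 10}, {"price": None}, {"name": "x"}]
    assert drop_missing(rows, "price") == [{"price": 10}]


def test_average_ignores_missing():
    rows = [{"price": 10}, {"price": None}, {"price": 20}]
    assert average(rows, "price") == 15
```

Assuming the project also has packaging metadata, it could be installed into the notebook environment with `pip install -e .`, after which a notebook cell only needs `from my_project.cleaning import average` rather than a copy of the function body. The same functions are then covered by the tests and by whatever CI and linting the repository runs.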
Jake Saunders, Python developer
Equal Experts, UK
