Avoid notebooks in production
Like many people, we both love and hate notebooks such as Jupyter (https://jupyter.org/). Data science and the initial stages of model/algorithm development are creative processes, requiring lots of visualisations and quick pivoting between modelling approaches. For this rapid analysis of data and prototyping of algorithms, notebooks are excellent tools, and they are the tool of choice for many data scientists. However, they have a number of features which make them difficult to use in production.
Notebook files contain both code and outputs - outputs can be large (e.g. images) and can contain sensitive business or even personal data. When a notebook is kept under version control such as Git, that data is committed to the repo by default. You can work round this, but it is all too easy to inadvertently push data to somewhere it shouldn't be. Mixing code and outputs in one file also makes it difficult or impossible to see exactly what changes have been made to the code from one commit to the next.
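One way to reduce this risk is to strip outputs before committing. As a minimal sketch (the notebook filename here is hypothetical), the nbformat library can clear outputs and execution counts, leaving only the code in the file:

```python
import nbformat

def strip_outputs(path):
    """Remove outputs and execution counts so only code gets committed."""
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == "code":
            cell.outputs = []           # drop images, tables, printed data
            cell.execution_count = None
    nbformat.write(nb, path)

strip_outputs("analysis.ipynb")  # hypothetical notebook name
```

In practice, tools such as nbstripout can do the same thing automatically as a Git filter or pre-commit hook, so outputs never reach the repo in the first place.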
Notebook cells can run out of order, meaning that the same notebook can produce different results depending on the order in which you run the cells.
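A contrived illustration of the problem, with cell boundaries shown as comments - the value you see depends entirely on how many times, and in what order, the cells were executed:

```python
# Cell 1
counter = 0

# Cell 2 - re-running just this cell silently changes the result each time
counter += 10

# Cell 3
print(counter)  # 10 if the cells ran once, top to bottom; 20, 30, ... otherwise
```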
Variables can stay alive in the kernel after the code which created them has been deleted, and variables can even be shared between notebooks using magic commands.
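For example, IPython's built-in %store magic persists a variable from one kernel and restores it in another, creating a dependency that is invisible in either notebook's code. A sketch (these lines only run inside an IPython/Jupyter kernel; the variable name is illustrative):

```python
# In notebook A
results = {"accuracy": 0.92}
%store results        # persists the variable outside the notebook file

# In notebook B, possibly days later
%store -r results     # restores it - a hidden, untracked dependency
print(results)
```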
Not all Python features work in a notebook, e.g. the multiprocessing module does not work reliably in Jupyter, because spawned worker processes cannot import functions that are defined in notebook cells.
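A minimal sketch of the failure mode, assuming the default "spawn" start method used on Windows and macOS: this runs fine as a standalone script, but pasted into a notebook cell the workers cannot re-import the function, so the call errors or hangs:

```python
from multiprocessing import Pool

def square(x):
    # Defined in a cell, this function lives only in the kernel's __main__,
    # which freshly spawned worker processes cannot import.
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, ...] as a .py script
```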
The format of notebooks does not lend itself easily to testing - there are no intuitive test frameworks for notebooks.
In some cases we have used tools to run notebooks directly in production, but most of the time, moving to standard modular code once an initial prototype has been created will make it more testable and easier to move into production, and will probably speed up your algorithm development as well.
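As a sketch of what that move looks like (the module and function names here are hypothetical), logic extracted into an ordinary module can be imported and tested with pytest in a way a notebook cell never can:

```python
# model.py - the logic, moved out of the notebook into an importable module
def normalise(values):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# test_model.py - a plain pytest test against the module
from model import normalise

def test_normalise_endpoints():
    assert normalise([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
```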