
# Avoid notebooks in production

Like many people, we both love and hate notebooks such as Jupyter (<https://jupyter.org/>). Data science and the initial stages of model/algorithm development are creative processes, requiring lots of visualisations and quick pivoting between modelling approaches. For this rapid analysis of data and prototyping of algorithms, notebooks are excellent, and they are the tool of choice for many data scientists. However, they have a number of features that make them difficult to use in production.

* Notebook files contain both code and outputs. The outputs can be large (e.g. images) and can contain important business or even personal data. When notebooks are used with version control such as Git, that data is committed to the repo by default. You can work around this, but it is all too easy to inadvertently pass data to where it shouldn’t be. Storing outputs alongside code also makes it difficult, sometimes impossible, to see exactly what has changed in the code from one commit to the next.
* Notebook cells can run out of order, so the same notebook can produce different results depending on the order in which you run the cells.
* Variables can stay in the kernel after the code that created them has been deleted, and they can be shared between notebooks using magic commands (e.g. IPython's `%store`). A notebook can therefore depend on state that is not visible in its code.
* Not all Python features work reliably in a notebook, e.g. multiprocessing can fail or hang in Jupyter, depending on the platform and on where the worker functions are defined.
* The format of notebooks does not lend itself easily to testing - there are no intuitive test frameworks for notebooks.
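
The first two points can be checked mechanically, because a notebook file is plain JSON. The following is a minimal sketch using only the standard library; the function names `strip_outputs` and `ran_in_order` and the `example` notebook are hypothetical, and it assumes the version 4 notebook layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`). Dedicated tools such as nbstripout handle more cases than this.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with outputs and execution counts cleared."""
    # Round-tripping through JSON is a simple deep copy for JSON-compatible data.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def ran_in_order(notebook: dict) -> bool:
    """Return True if the executed code cells were run top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    executed = [n for n in counts if n is not None]
    return executed == sorted(executed)


# Hypothetical notebook in which the second code cell was run before the first.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Demo"},
        {"cell_type": "code", "metadata": {}, "source": "x = 1",
         "execution_count": 3, "outputs": []},
        {"cell_type": "code", "metadata": {}, "source": "print(x)",
         "execution_count": 2,
         "outputs": [{"output_type": "stream", "name": "stdout", "text": "1\n"}]},
    ],
}

print(ran_in_order(example))
cleaned = strip_outputs(example)
print(json.dumps(cleaned["cells"][2]["outputs"]))
```

A check like this could run as a pre-commit step, but it only reduces the risk of committing data; it does not make the notebook itself any easier to diff or test.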

In some cases we have used tools like [papermill](https://papermill.readthedocs.io/en/latest/) to run notebooks in production. Most of the time, however, moving to standard modular code once an initial prototype exists will make the code more testable and easier to move into production, and will probably speed up your algorithm development as well.
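
As an illustration of what "standard modular code" might look like, here is a minimal sketch of a normalisation step lifted out of a notebook cell into a plain function, with a test beside it. The function `min_max_normalise` and its test are hypothetical examples, not code from any project described here; in practice the function would live in an importable module and the test would be collected by a runner such as pytest.

```python
from typing import Sequence


def min_max_normalise(values: Sequence[float]) -> list[float]:
    """Scale values linearly into the range [0, 1].

    An empty input returns an empty list, and a constant input maps to
    all zeros to avoid dividing by zero.
    """
    if not values:
        return []
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    span = highest - lowest
    return [(value - lowest) / span for value in values]


def test_min_max_normalise() -> None:
    """Cover the normal case and the two edge cases handled above."""
    assert min_max_normalise([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
    assert min_max_normalise([5.0, 5.0]) == [0.0, 0.0]
    assert min_max_normalise([]) == []


test_min_max_normalise()
```

Because the function takes its inputs as arguments and returns a value, it does not depend on kernel state or on the order in which cells were run, and a regression shows up as a failing test.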

## <mark style="color:blue;">Experience report</mark>

{% hint style="info" %}
*I first came into contact with a Jupyter notebook while working on a predictive maintenance machine learning project, after a number of years as a production software developer. In this scenario, I found notebooks to be an invaluable resource. The ability to organise your code into segments with full markdown support and charts showing your thinking and output at each stage made demos and technical discussions simple and interactive. In addition, the tight integration with Amazon SageMaker and S3 meant I could work with relative freedom and with computing power on-tap while remaining in the client’s estate.*

*However, as our proof of concept got more complicated, with a multi-stage ELT pipeline, varying data normalisation techniques and so on, I found myself maintaining a block of core ELT code that was approaching 500 lines of untested spaghetti. I had tried, with some success, to functionalise it so it wasn’t just one script and I could employ some DRY principles. However, I couldn’t easily call the functions from one notebook to another, so I resorted to copy and paste. Often I would make a small change somewhere and introduce a regression that made my algorithm performance drop off a cliff, and I would lose half a day trying to figure out where I had gone wrong. Or I’d restart my code in the morning and it wouldn’t work because it relied on some globally scoped variable that I’d created and lost with my kernel the night before. If there had been tests, I could have spotted these regressions and fixed them quickly, which would have saved me far more time in lost productivity than the tests would have taken to write in the first place.*

### <mark style="color:blue;">**In retrospect, when I come to do work like this in the future, I would opt for a hybrid approach. I would write the initial code for each stage in a notebook where I could make changes in an interactive way and design an initial process that I was happy with. Then, as my code ‘solidified’, I would create an installable package in a separate Git repository where I could make use of more traditional software development practices.**</mark>

***Using this approach has a number of advantages:***

* You can import your code into any notebook with a simple pip install. You can then use the same tested and repeatable ELT pipeline with confidence across a number of notebooks with differing algorithms.
* You can write and run tests and make use of CI tools, linting and all the other goodies software developers have created to make our code more manageable.
* You can reduce your notebook’s size, so that when you’re doing presentations and demos you don’t need 1,000 lines of boilerplate before you get to the good stuff.

*The final advantage of this approach, in a world of deadlines where proofs of concept far too often become production solutions, is that you productionise your code as you go. This means that when the time comes for your code to be used in production, standardising it doesn’t seem like such an insurmountable task.*

\
[<mark style="color:blue;">**Jake Saunders**</mark>](https://www.linkedin.com/in/jake-saunders-83617741/)\
<mark style="color:blue;">**Python developer**</mark>

<mark style="color:blue;">Equal Experts, UK</mark>\
\
![](/files/j84fsc8Pl4rkbNkCUEsy)
{% endhint %}

