Collaborating effectively with Jupyter notebooks
I spend roughly half of my programming time working in Jupyter notebooks[1]. Some of it is exploring data and building models, and some is experimenting while writing new code. I like notebooks: they make exploration much easier, and for data science work, visualizing results in a notebook is simply easier than what came before.
However, opinions of notebook-style programming vary wildly, from love to mild dislike to hate. There are many tropes about notebook users, but the most common one is that a data scientist opens up a notebook, writes spaghetti Python code that kind of produces a model, and then hands it over to a "real engineer" to turn into something that can be run in production. The most common reaction is that the data scientist loves Jupyter, and the engineer hates it.
This dynamic has been picked over many times, by a wide variety of data science, data engineering, and machine learning teams[2]. Many startups have been built around trying to fix this problem.
Instead of writing an ethnography of ML engineering[3], here's a list of principles I follow when creating Jupyter notebooks for handoff, as someone who probably straddles the line between data scientist and developer.
1. Kill the kernel and rerun all cells
The bare minimum is restarting your kernel.

If you're sending a notebook to someone else, restart your kernel, rerun every cell, and verify that the results look right. This almost goes without saying.
Often you'll realize that you've deleted a cell that was important, or run cells out of order. This is on you, the notebook author, to fix. It's very common to skip this step because the notebook takes a long time to run; further down I include some advice on how to conquer that fear.
2. Don't mutate. Don't mutate. Don't mutate. (across cell boundaries)
Don't mutate variables defined in an earlier cell if you can avoid it.
One common anti-pattern for notebook users is to do something like this:
# cell 1
df = pd.read_csv("<fn>").set_index('timestamp')
...
# cell 25
df = df.reset_index().join(...).set_index('date')
...
# cell 146
df.plot()
Remember that most readers of notebooks skim out of sheer necessity: it is tiring reading someone else's code. Keeping a mental model of the variable `df` and its index across that many cells is extremely difficult.
On the other hand, treating every variable as totally immutable (giving each intermediate result its own name) is a great way to use up all your memory, since most variables you define live in the global scope and are never garbage collected:
df_a = pd.read_csv("<fn>")
df_b = df_a.groupby(['a']).response.mean()
...
My proposed solution:
- Reuse variable names inside a cell so that intermediate state can be garbage collected.
- Don't mutate a variable outside of the cell it's defined in.
- Ideally, "export" one variable per cell, and make it obvious (i.e. the last line of a cell is something like `exported_var = ...`).
A contrived example:
import seaborn as sns
iris = sns.load_dataset('iris')
mean = iris.groupby("species").petal_width.mean()
mean += iris.groupby("species").sepal_width.mean()
std = iris.groupby("species").std()[['sepal_width', 'petal_width']].mean(axis=1)
species_z = mean / std
3. Use Markdown to separate regions
Make liberal use of Markdown headers to separate and organize your work.
One of the major advantages of notebooks over regular Python code is that you can include Markdown to write explanations of the content. I'd highly recommend writing headers (e.g. `### Loading data`) to make it easier for your readers. You can also fold sections by clicking the caret on the left side of a header.
This becomes much easier once you learn the muscle memory for a few shortcuts:
- `Esc` moves you from editing mode to command mode
- `j`/`k` (or up/down) to select a different cell in command mode
- `b` to insert a cell below the currently selected one
- `a` to insert a cell above the currently selected one
- `m` to convert a code cell into a Markdown cell

Pretty quickly the muscle memory of `Esc` -> `b` -> `m` -> `Enter` becomes automatic.
4. Cache expensive computation to disk (automatically)
Make it easier to kill your kernel by caching large datasets to disk.
A common anti-pattern for notebook users is to do some slow and memory-intensive computation, and then hold the resulting data in memory and be scared of killing the kernel and losing that data.
The solution: write it to disk!
However, most notebook users avoid this not because writing to disk is infeasible, but because the tools they have for doing this are poor.
Here's a common example of loading a lot of data from disk to memory, joining it against another dataframe, then selecting a subset of it:
df_list = []
for i in range(100):
    df = pd.read_parquet(f"filename_{i}.parquet")
    df = df.join(my_other_dataset.set_index('ts'), on='ts')
    df = df[df.ts > pd.Timestamp('20200101')]
    df_list.append(df)
df = pd.concat(df_list)
Here are a few issues with this code:
- If it's slow, you won't want to run it again, so you'll avoid killing your kernel.
- `df_list` is still hanging around, holding references to each individual dataframe, so those can't be garbage collected.
Instead, use a decorator to cache this computation to disk and compute it just once:
@cache_to_disk
def compute_my_dataframe():
    df_list = []
    for i in range(100):
        df = pd.read_parquet(f"filename_{i}.parquet")
        df = df.join(my_other_dataset.set_index('ts'), on='ts')
        df = df[df.ts > pd.Timestamp('20200101')]
        df_list.append(df)
    return pd.concat(df_list)

df = compute_my_dataframe()
One issue with this approach is that if you're not careful, you can leak other data into the function implicitly; above, we're leaking the variable `my_other_dataset`. If you change how that variable is defined, you can end up with unreproducible results. In practice, though, I find this simpler to reason about than most people expect, and you can always delete the cache before rerunning the notebook for sharing.
Using a decorator means that you don't have to manage the disk writing and reading yourself, which means the barrier to doing this drops significantly. There are many useful decorators, but here's a simple one I wrote just for you:
import pathlib
import pickle
from functools import wraps

import pandas as pd


def cache_to_disk(func):
    """
    A decorator for functions with no arguments. Caches the return value
    to disk: DataFrames as Parquet, everything else via pickle.
    """
    cache_path = pathlib.Path(".")
    func_cache_path = (
        cache_path /
        func.__module__.replace(".", "/") /
        func.__name__
    ).with_suffix('.cache')
    if not func_cache_path.parent.exists():
        func_cache_path.parent.mkdir(parents=True)

    @wraps(func)
    def wrapped():
        if func_cache_path.exists():
            # Try the Parquet fast path first; fall back to pickle for
            # anything that isn't a DataFrame.
            try:
                ret = pd.read_parquet(func_cache_path)
            except Exception:
                with open(func_cache_path, 'rb') as fh:
                    ret = pickle.load(fh)
            return ret
        else:
            ret = func()
            if isinstance(ret, pd.DataFrame):
                ret.to_parquet(func_cache_path, compression='zstd')
            else:
                with open(func_cache_path, 'wb') as fh:
                    pickle.dump(ret, fh)
            return ret

    return wrapped
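When you want to force a recomputation (say, right before killing the kernel and rerunning everything for handoff), you can just delete the cached file. Here's a minimal sketch, assuming the decorator above; the `clear_cache` helper is hypothetical, not part of any library:

import pathlib

def clear_cache(func):
    """
    Hypothetical helper: delete the file written by cache_to_disk for `func`
    so the next call recomputes from scratch. Uses the same path convention
    as the decorator above (functools.wraps preserves __module__ and __name__).
    """
    cache_file = (
        pathlib.Path(".") /
        func.__module__.replace(".", "/") /
        func.__name__
    ).with_suffix('.cache')
    cache_file.unlink(missing_ok=True)  # no-op if nothing is cached yet

clear_cache(compute_my_dataframe)  # then rerun the cell to rebuild the cache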
5. Use assertions
Assert anything you confirm "by inspection" while working with the data:
assert len(df.species.unique()) == len(ALL_SPECIES)
assert df.index.name == 'blah'
assert iris.dtypes['sepal_length'] == np.float64
They help the reader build their mental model of what is guaranteed and checked, and what is not.
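If a check is one a reader might actually trip, an error message costs little. A small variant of the first assertion above, using the same assumed `df` and `ALL_SPECIES`:

n_species = df.species.nunique()
assert n_species == len(ALL_SPECIES), (
    f"expected {len(ALL_SPECIES)} species but found {n_species}; "
    "did an upstream filter drop some rows?"
)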
6. Don't import code from other notebooks (or copy and paste)
Instead of copying and pasting code (or worse), set up a local Python module to import. It's easier than you think.
Importing code from other notebooks is extremely weird. Since most notebooks are written like one long `main` script, this is like importing from one script to another.
Copying and pasting a huge block of useful snippets is also common (e.g. functions to load data, clean it, and plot it). The issue is that it's very easy to drag irrelevant code along, and your reader will have trouble telling what's used and what isn't. It also takes time and adds visual clutter.
One alternative is that if you find some code becoming useful, you can add it to your own personal little library or package, and import that code everywhere.
Most people find this pretty intimidating or a waste of time, because they assume it requires checking code in or deploying it (which tends to be what developers recommend as well). However, you can set up your package in "develop" mode, and any changes you make to its files are immediately importable.
Here's one way to set this up when not working in a cloud environment:
- Create an empty directory in an easy-to-edit place (e.g. `/home/rachit/utils/`).
- Inside that folder, create a minimal Python package as described here: minimal Python package setup (see the sketch after this list).
- Making sure that your `python` is whatever is used by your Jupyter kernel, run `python setup.py develop` in that directory. You might also be able to use `pip install -e .`; I haven't tested it.
Now, whenever you make any changes to that module, you can restart your kernel and get the updated version of that code. Using the example module described in the article:
from funniest import joke
7. Avoid using generic variable names
Generic variable names confuse readers and can also lead to subtle bugs because of the default-global context.
Like `df`, as I've done above. It's really convenient because it's easier to type, but Jupyter supports autocomplete, so you can just hit `TAB` to complete a longer variable name.
The issue with generic variable names is that (1) they make the code harder to read through and understand, and (2) they tend to get overwritten, which not only violates the "Don't mutate" rule but also confuses readers.
Here's an example:
# cell 1
df = pd.read_csv("...")

# cell 25
dfs = []
for i in range(25):
    df = pd.read_parquet(f"filename_{i}.parquet")
    dfs.append(df)
other_dataset = pd.concat(dfs)

# cell 45
df.plot()
Here, the last `df` is actually the final `df` from the loop in cell 25, i.e. whatever is inside `filename_24.parquet`. Python loops don't create their own scope, so the loop body silently overwrites `df` in the enclosing namespace, and you might not even notice. It's better to just give the first variable a more descriptive name.
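A lightly renamed version of those cells, purely as a sketch; the names `survey_results`, `monthly`, and `monthly_frames` are made up for illustration:

# cell 1
survey_results = pd.read_csv("...")

# cell 25
monthly_frames = []
for i in range(25):
    monthly = pd.read_parquet(f"filename_{i}.parquet")
    monthly_frames.append(monthly)
other_dataset = pd.concat(monthly_frames)

# cell 45
survey_results.plot()  # no longer silently plotting filename_24.parquet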
8. Don't use notebooks when you mean to use an IDE
This last one isn't advice on how to collaborate with notebooks, but it's worth including.
Real IDEs (tm) have many useful features that notebooks (at least currently) don't have:
- High quality type inference / IntelliSense / autocomplete[4]
- Easier productionization of code by splitting it into files, modules, and tests
- Access to (many) plugins that haven't yet made it over to JupyterLab, etc.
My general rule of thumb: if you're writing production code, it's easier to do it in an IDE or a text editor like Vim. If you're creating a reproducible chart, or training a small model for research use, it's better to do it in a notebook, if you can.
Summary
Have empathy for the person who is going to read your notebook; reading someone else's code (especially in a critical setting) is exhausting and hard work.
Thanks
Thanks to Stefan Gramatovici and Prastuti Singh for reading earlier drafts of this post.
Footnotes
1. The remaining half is spent between VSCode and Vim; roughly 98% of the code I write for work is in Python.
2. If you haven't already seen or read Joel Grus's talk from JupyterCon 2018, it's pretty funny. I would recommend it.
3. Some of the most common complaints: poor (unreadable) coding style due to the code being written for a single author; reliance on data without provenance (i.e. "I downloaded a file to my local home directory and am now loading it"); complex … If I get one thing across to data scientists who collaborate with engineers: the "exploratory" programming style is not production code, because production code needs to be readable by people with a range of context on what the code does.
4. I'm a big fan of Pylance inside VSCode; I realize it's a bit concerning that it's closed source, but at a high level I'm OK with it. I use many closed-source products from Microsoft daily.