PyCallJLD2 and ScikitLearn.jl

Overview

This demo shows how to use PyCallJLD2.jl to save models from ScikitLearn.jl. It borrows heavily from the saving models to disk example in the ScikitLearn.jl documentation and illustrates how this package lets you use JLD2.jl as a drop-in replacement for JLD.jl.

Setup

First, your PyCall environment must be set up correctly. Here, we set ENV["PYTHON"] to an empty string, which tells PyCall to use the Python distribution private to Julia (managed by Conda.jl), and then rebuild the PyCall package so that it points to it:

ENV["PYTHON"] = ""
using Pkg
Pkg.build("PyCall")
    Building Conda ─→ `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/8c86e48c0db1564a1d49548d3515ced5d604c408/build.log`
    Building PyCall → `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/43d304ac6f0354755f1d60730ece8c499980f7ba/build.log`
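
Because the choice of Python is fixed when PyCall is built, it can help to confirm which interpreter PyCall ended up with. The following is a minimal sketch, assuming it is run in a fresh Julia session after the rebuild (PyCall generally needs a restart to pick up a changed Python). The constants `PyCall.pyprogramname`, `PyCall.pyversion`, and `PyCall.conda` are not used elsewhere in this demo; they are taken from PyCall's own documentation as best recalled, so check them against your installed version.

```julia
using PyCall

# Path of the Python executable PyCall was built against
println("Python executable: ", PyCall.pyprogramname)

# Version of that Python installation
println("Python version:    ", PyCall.pyversion)

# `true` when PyCall is using the Conda.jl-managed (Julia-private) Python
println("Using Conda.jl:    ", PyCall.conda)
```

If the last line prints `true`, PyCall is using the Julia-private Python, which is what the empty `ENV["PYTHON"]` setting is meant to select.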

Next, we load our dependencies. To use this package, you must load PyCall, JLD2, and PyCallJLD2 in the context where you intend to save and load models:

# Load the modules into the current context
using
    PyCall,     # for PyObjects
    JLD2,       # for saving and loading
    PyCallJLD2  # for telling JLD2 how to save and load PyObjects

Because we are showing how to save and load ScikitLearn.jl objects, we will also load that package and other dependencies:

using
    ScikitLearn,            # for @sk_import
    ScikitLearn.Pipelines   # for Pipeline

Create some ScikitLearn.jl PyObjects

Now we use the ScikitLearn.jl API to import some scikit-learn model classes:

# Import some scikit-learn model classes
@sk_import decomposition: PCA
@sk_import linear_model: LinearRegression
PyObject <class 'sklearn.linear_model._base.LinearRegression'>

We can instantiate the models:

pca = PCA()
lm = LinearRegression()
LinearRegression()

and make up some random training data:

X = rand(10, 3); y = rand(10);

and chain the two models into a pipeline:

pip = Pipeline([("PCA", pca), ("LinearRegression", lm)])
ScikitLearn.Skcore.Pipeline(Tuple{Any, Any}[("PCA", PyObject PCA()), ("LinearRegression", PyObject LinearRegression())], Any[PyObject PCA(), PyObject LinearRegression()])

Just to illustrate the statefulness of the model, let us fit the pipeline to our random dataset:

fit!(pip, X, y)   # fit to some dataset
ScikitLearn.Skcore.Pipeline(Tuple{Any, Any}[("PCA", PyObject PCA()), ("LinearRegression", PyObject LinearRegression())], Any[PyObject PCA(), PyObject LinearRegression()])

and see how it fares on the same data:

score_1 = score(pip, X, y)
0.2958220359792777

Save and Load

Now we will save the model with the JLD2.save interface:

# Name the file to save and load to
model_file = "models.jld2"
# Save the pipeline
JLD2.save(model_file, "pip", pip)

And we can load the same model into another variable in this context:

pip_2 = JLD2.load(model_file, "pip")
ScikitLearn.Skcore.Pipeline(Tuple{Any, Any}[("PCA", PyObject PCA()), ("LinearRegression", PyObject LinearRegression())], Any[PyObject PCA(), PyObject LinearRegression()])

Finally, let's calculate the score again for the loaded model:

score_2 = score(pip_2, X, y)
0.2958220359792777

and verify that the score is the same as before:

score_1 == score_2
true

And voilà! The scores are identical because the fitted state of the pipeline was preserved through saving and loading.
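
A single score is a fairly coarse check, so one could also compare the predictions of the original and reloaded pipelines element by element. The block below is a self-contained sketch of that idea, not part of the original demo: it repeats the setup on fresh random data, uses `predict` from ScikitLearn.jl, and writes to a temporary file whose name (`roundtrip.jld2`) is purely hypothetical.

```julia
using PyCall, JLD2, PyCallJLD2
using ScikitLearn, ScikitLearn.Pipelines

@sk_import decomposition: PCA
@sk_import linear_model: LinearRegression

# Fit a small pipeline on random data, as in the demo
X = rand(10, 3); y = rand(10)
pip = Pipeline([("PCA", PCA()), ("LinearRegression", LinearRegression())])
fit!(pip, X, y)

# Round-trip through a temporary directory that is deleted afterwards
preds_match = mktempdir() do dir
    file = joinpath(dir, "roundtrip.jld2")   # hypothetical file name
    JLD2.save(file, "pip", pip)
    pip_2 = JLD2.load(file, "pip")
    # Compare predictions element-wise rather than through a single score
    predict(pip, X) ≈ predict(pip_2, X)
end
```

Using `≈` rather than `==` here is a design choice: it tolerates any floating-point noise, although the demo's exact score equality suggests the restored state is bit-identical.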

Note

When loading the object, you must make sure that the definitions for the unpacked data are available in the current workspace (i.e., if you start a new terminal session, you must remember to rerun the @sk_import ... statements before loading the model file).
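
As an illustration of the note above, loading in a brand-new Julia session might look like the sketch below. It assumes the file `models.jld2` written earlier still exists in the working directory, and that the same packages and `@sk_import` statements used when saving are the ones needed for loading; the variable name `pip_restored` is introduced here only for illustration.

```julia
# Fresh Julia session: reload the packages that define the saved types
using PyCall, JLD2, PyCallJLD2
using ScikitLearn, ScikitLearn.Pipelines

# Rerun the same imports that were active when the model was saved
@sk_import decomposition: PCA
@sk_import linear_model: LinearRegression

# Assumes "models.jld2" from the saving step is still on disk
model_file = "models.jld2"
if isfile(model_file)
    pip_restored = JLD2.load(model_file, "pip")
else
    @warn "Model file not found; save the pipeline first" model_file
end
```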

For the sake of this script, we will clean up after ourselves and remove the model:

rm(model_file)

This page was generated using DemoCards.jl and Literate.jl.