ScikitLearn Comparison

Source code notebook compat Author Update time

Overview

This demo compares CVIs that are implemented in this package with their equivalents in the scikit-learn Python package. We will generate some random distributions, cluster them with different algorithms, and demonstrate that the resulting metrics are equivalent.

Setup

First, we load our dependencies.

# Load all dependencies
using
    ClusterValidityIndices,     # This package
    Clustering,                 # k-means
    AdaptiveResonance,          # DDVFA
    PyCall,                     # scikit-learn interface
    Plots,                      # Plots frontend
    Distributions,              # Random distribution sampling
    LinearAlgebra               # UniformScaling
gr()                            # Use the default GR backend explicitly
theme(:dracula)                 # Change the theme for fun

Next, we will set some configuration variables that we will use throughout the script.

# The number of random distributions to sample from
n_distributions = 4
# The number of samples per distribution that we want to draw
n_samples_per = 20
# The number of seed values we will use for k-means
n_k_means = 30
# The dimensionality of the distributions and resulting feature space
dim = 2
2

After setting our variables, we will import the scikit-learn metrics package.

# Use the PyCall interface to import the metrics package
skmetrics = pyimport("sklearn.metrics")

Next, we will create a series of distributions for random sampling.

# Container for multivariate normal distributions
dists = []
# Iterate over the number of distributions we selected earlier
for i = 1:n_distributions
    # Set a random vector mean
    mean = 50*rand(dim)
    # Create a positive definite matrix, guaranteeing dominance of the diagonal
    covariance = rand(Float64, (dim, dim))
    covariance = 0.5 * (covariance' + covariance) + UniformScaling(1)
    # Add a new distribution to the list
    push!(dists, MvNormal(mean, covariance))
end

We then create a batch of samples from these distributions. While we are here, we can visualize the resulting sample space.

X = reduce(hcat, [rand(dists[i], n_samples_per) for i=1:n_distributions])
p = plot()
scatter!(p, X[1, :], X[2, :])

Cluster and Compute Metrics

Now, we are ready to do some clustering. One use of CVIs is for checking the performance of a clustering algorithm while tweaking its hyperparameters. For example, the k-means clustering algorithm is a partition/globular clustering algorithm that requires a seed number as a hyperparameter. The "best" seed number for a particular clustering problem varies and is usually unknown, and higher values are more computationally expensive.

Here, we will cluster the k-means while testing a range of seed values, comparing the equivalence of the Calinski-Harabasz (CH) metric between this package and the scikit-learn implementation.

# Create a range of k-mean seeds to test across along with CVI containers
km_range = 2:n_k_means
criterion_values_jl = zeros(length(km_range))
criterion_values_py = zeros(length(km_range))
for (i, n_clust) = enumerate(km_range)
    # Get the labels prescribed k-means to each sample
    labels_km = kmeans(X, n_clust).assignments

    # Create a Calinski-Harabasz CVI and get its criterion value
    cvi = CH()
    criterion_values_jl[i] = get_cvi!(cvi, X, labels_km)

    # Compute the criterion value with the scikit-learn implementation
    criterion_values_py[i] = skmetrics.calinski_harabasz_score(X', labels_km)
end
# Show that the values are all equivalent between implementations
isapprox(criterion_values_jl, criterion_values_py)

We can visualize the result as well by plotting the trendlines of the criterion values from each implementation.

q = plot(legend=:bottomright)
plot!(q, km_range, criterion_values_jl, marker=:d, label="Julia")
plot!(q, km_range, criterion_values_py, marker=:d, label="Python")
ylabel!(q, "Calinski-Harabasz CVI")
xlabel!(q, "K-Means Seed")

After visualizing the results and seeing that the trendlines are equivalent, we also see that unsurprisingly the best k-means seed value is k=4, since we know that we are sampling from four gaussians.

"assets/scikit-learn.png"

This page was generated using DemoCards.jl and Literate.jl.