Guide¶

Installation¶

This project is distributed as a Python package and is hosted on the PyPI package server. To use cvi, first install it using pip:

pip install cvi

You can also add the package directly from GitHub to get the latest changes between releases (or from a specific branch) with:

pip install git+https://github.com/AP6YC/cvi

Quickstart¶

This section provides a quick overview of how to use the project. For more detailed code usage, please see the Detailed Usage section.

Create a CVI object and compute the criterion value in batch with get_cvi:

# Import the library
import cvi
# Create a Calinski-Harabasz (CH) CVI object
my_cvi = cvi.CH()
# Load some data from some clustering algorithm
samples, labels = load_some_clustering_data()
# Compute the final criterion value in batch
criterion_value = my_cvi.get_cvi(samples, labels)

or do it incrementally, also with get_cvi:

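# Datasets are numpy arrays
import numpy as np
# Create a fresh CVI object, since batch and incremental updates cannot be mixed on one object
my_cvi = cvi.CH()
# Create a container for criterion values
n_samples = len(labels)
criterion_values = np.zeros(n_samples)
# Iterate over the data, storing the criterion value after each sample
for ix in range(n_samples):
    criterion_values[ix] = my_cvi.get_cvi(samples[ix, :], labels[ix])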

Detailed Usage¶

The cvi package contains a set of implemented CVIs with batch and incremental update methods. Each CVI is a standalone stateful object inheriting from a base class CVI, and all CVI functions are object methods, such as those that update parameters and return the criterion value.

Instantiate a CVI of your choice with the default constructor:

# Import the package
import cvi
# Import numpy for some data handling
import numpy as np

# Instantiate a Calinski-Harabasz (CH) CVI object
my_cvi = cvi.CH()

CVIs are instantiated by their acronyms; a list of all implemented CVIs can be found in the Implemented CVIs section.

A batch of data is assumed to be a numpy array of samples and a numpy vector of integer labels.

# Load some data
samples, labels = my_clustering_alg(some_data)

Note

The cvi package assumes the NumPy row-major convention, where rows are individual samples and columns are features. A batch dataset therefore has shape [n_samples, n_features], and its corresponding label vector has shape [n_samples].
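
As a minimal sketch of this layout, the snippet below builds a small synthetic dataset with NumPy and checks that it follows the shapes described above. The helpers make_toy_clusters and check_batch are hypothetical names used only for illustration; they are not part of cvi and simply stand in for the output of a clustering algorithm.

```python
import numpy as np


def make_toy_clusters(n_per_cluster=50, n_features=2, seed=0):
    """Hypothetical helper: two Gaussian blobs with integer labels."""
    rng = np.random.default_rng(seed)
    # Each blob is [n_per_cluster, n_features]: rows are samples, columns are features
    blob_a = rng.normal(loc=0.0, scale=0.5, size=(n_per_cluster, n_features))
    blob_b = rng.normal(loc=3.0, scale=0.5, size=(n_per_cluster, n_features))
    samples = np.vstack([blob_a, blob_b])
    # One integer label per row of samples
    labels = np.repeat(np.array([0, 1]), n_per_cluster)
    return samples, labels


def check_batch(samples, labels):
    """Hypothetical helper: validate the [n_samples, n_features] convention."""
    if samples.ndim != 2:
        raise ValueError("samples must be 2-D: [n_samples, n_features]")
    if labels.ndim != 1:
        raise ValueError("labels must be 1-D: [n_samples]")
    if samples.shape[0] != labels.shape[0]:
        raise ValueError("samples and labels must have the same number of rows")
    if not np.issubdtype(labels.dtype, np.integer):
        raise ValueError("labels must be integers")
    return samples.shape


samples, labels = make_toy_clusters()
n_samples, n_features = check_batch(samples, labels)
```

Arrays that pass this check have the layout that a batch call such as my_cvi.get_cvi(samples, labels) expects according to the note above.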

You may compute the final criterion value with a batch update all at once with CVI.get_cvi:

# Get the final criterion value in batch mode
criterion_value = my_cvi.get_cvi(samples, labels)

or you may get them incrementally with the same method, where you pass instead just a single numpy vector of features and a single integer label. The incremental methods are used automatically based upon the dimensions of the data that is passed.

# Create a container for the criterion value after each sample
n_samples = len(labels)
criterion_values = np.zeros(n_samples)

# Iterate across the data and store the criterion value over time
for ix in range(n_samples):
    sample = samples[ix, :]
    label = labels[ix]
    criterion_values[ix] = my_cvi.get_cvi(sample, label)
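
To illustrate how a single method can choose between batch and incremental updates from the dimensions of its input, here is a toy sketch. The RunningCentroid class is hypothetical and is not how cvi is implemented internally; it only tracks a mean, and it dispatches on the number of array dimensions in the way the text above describes.

```python
import numpy as np


class RunningCentroid:
    """Toy stateful object (not part of cvi) that tracks a mean vector."""

    def __init__(self):
        self.n_seen = 0
        self.mean = None

    def _update_incremental(self, sample):
        # Fold one feature vector into the running mean
        if self.mean is None:
            self.mean = np.zeros_like(sample, dtype=float)
        self.n_seen += 1
        self.mean += (sample - self.mean) / self.n_seen

    def _update_batch(self, batch):
        # Compute the mean of all rows at once, replacing any previous state
        self.n_seen = batch.shape[0]
        self.mean = batch.mean(axis=0)

    def update(self, data):
        data = np.asarray(data, dtype=float)
        if data.ndim == 1:
            # A 1-D vector is treated as a single sample
            self._update_incremental(data)
        elif data.ndim == 2:
            # A 2-D array is treated as a batch of samples in rows
            self._update_batch(data)
        else:
            raise ValueError("expected a 1-D sample or a 2-D batch")
        return self.mean.copy()


batch = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]])

# Batch path: one call with the full 2-D array
batch_result = RunningCentroid().update(batch)

# Incremental path: one call per row, on a separate object
incremental = RunningCentroid()
for row in batch:
    incremental_result = incremental.update(row)
```

In this sketch both paths end at the same mean, and each path uses its own object, mirroring the usage pattern shown above.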

Note

Currently, a given CVI object supports either batch or incremental updates, but not both; switching from batch to incremental updates with the same CVI object is not yet implemented.

Implemented CVIs¶

The following CVIs have been implemented as of the latest version of cvi:

CH: Calinski-Harabasz
cSIL: Centroid-based Silhouette
DB: Davies-Bouldin
GD43: Generalized Dunn’s Index 43
GD53: Generalized Dunn’s Index 53
PS: Partition Separation
rCIP: (Renyi’s) representative Cross Information Potential
WB: WB-index
XB: Xie-Beni
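
To make concrete what a criterion value measures, the following sketch computes the textbook batch form of the Calinski-Harabasz index with NumPy: the between-cluster dispersion divided by (k - 1), over the within-cluster dispersion divided by (n - k). The function calinski_harabasz is a hypothetical illustration written for this guide, not the cvi implementation, which may handle edge cases differently.

```python
import numpy as np


def calinski_harabasz(samples, labels):
    """Textbook CH index; an illustration, not the cvi implementation."""
    n_samples = samples.shape[0]
    cluster_ids = np.unique(labels)
    n_clusters = cluster_ids.size
    if n_clusters < 2 or n_clusters >= n_samples:
        raise ValueError("CH needs 2 <= n_clusters < n_samples")
    overall_mean = samples.mean(axis=0)
    between = 0.0  # between-cluster dispersion
    within = 0.0   # within-cluster dispersion
    for cluster_id in cluster_ids:
        members = samples[labels == cluster_id]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    # Assumes within > 0, i.e. not every sample sits exactly on its centroid
    return (between / (n_clusters - 1)) / (within / (n_samples - n_clusters))


# Two well-separated clusters of two samples each
toy_samples = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
toy_labels = np.array([0, 0, 1, 1])
ch_value = calinski_harabasz(toy_samples, toy_labels)  # 8.0 for this toy data
```

Larger values indicate clusters that are compact relative to their separation.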

Acknowledgements¶

Derivation¶

The incremental and batch CVI implementations in this package are largely derived from the following Julia language implementations by the authors of this package:

ClusterValidityIndices.jl

Authors¶

The principal authors of the cvi package are:

Sasha Petrenko - petrenkos@mst.edu
Nik Melton - nmmz76@mst.edu
