Guide

Installation

This project is distributed as a Python package and is hosted on the PyPI package server. To use cvi, first install it using pip:

pip install cvi

You can also add the package directly from GitHub to get the latest changes between releases (or from a specific branch) with:

pip install git+https://github.com/AP6YC/cvi

Quickstart

This section provides a quick overview of how to use the project. For more detailed code usage, please see the Detailed Usage section.

Create a CVI object and compute the criterion value in batch with get_cvi:

# Import the library
import cvi
# Create a Calinski-Harabasz (CH) CVI object
my_cvi = cvi.CH()
# Load some data from some clustering algorithm
samples, labels = load_some_clustering_data()
# Compute the final criterion value in batch
criterion_value = my_cvi.get_cvi(samples, labels)

or do it incrementally, also with get_cvi:

# Datasets are numpy arrays
import numpy as np
# Create a container for criterion values
n_samples = len(labels)
criterion_values = np.zeros(n_samples)
# Iterate over the data
for ix in range(n_samples):
   criterion_values = my_cvi.get_cvi(samples[ix, :], labels[ix])

Detailed Usage

The cvi package contains a set of implemented CVIs with batch and incremental update methods. Each CVI is a standalone stateful object inheriting from a base class CVI, and all CVI functions are object methods, such as those that update parameters and return the criterion value.

Instantiate a CVI of you choice with the default constructor:

# Import the package
import cvi
# Import numpy for some data handling
import numpy as np

# Instantiate a Calinski-Harabasz (CH) CVI object
my_cvi = cvi.CH()

CVIs are instantiated with their acronyms, with a list of all implemented CVIS being found in the [Implemented CVIs](#implemented-cvis) section.

A batch of data is assumed to be a numpy array of samples and a numpy vector of integer labels.

# Load some data
samples, labels = my_clustering_alg(some_data)

Note

The cvi package assumes the Numpy row-major convention where rows are individual samples and columns are features. A batch dataset is then [n_samples, n_features] large, and their corresponding labels are [n_samples] large.

You may compute the final criterion value with a batch update all at once with CVI.get_cvi

# Get the final criterion value in batch mode
criterion_value = my_cvi.get_cvi(samples, labels)

or you may get them incrementally with the same method, where you pass instead just a single numpy vector of features and a single integer label. The incremental methods are used automatically based upon the dimensions of the data that is passed.

# Create a container for the criterion value after each sample
n_samples = len(labels)
criterion_values = np.zeros(n_samples)

# Iterate across the data and store the criterion value over time
for ix in range(n_samples):
   sample = samples[ix, :]
   label = labels[ix]
   criterion_values[ix] = my_cvi.get_cvi(sample, label)

Note

Currently only using either batch or incremental methods is supported; switching from batch to incremental updates with the same is not yet implemented.

Implemented CVIs

The following CVIs have been implemented as of the latest version of cvi:

  • CH: Calinski-Harabasz

  • cSIL: Centroid-based Silhouette

  • DB: Davies-Bouldin

  • GD43: Generalized Dunn’s Index 43.

  • GD53: Generalized Dunn’s Index 53.

  • PS: Partition Separation.

  • rCIP: (Renyi’s) representative Cross Information Potential.

  • WB: WB-index.

  • XB: Xie-Beni.

Acknowledgements

Derivation

The incremental and batch CVI implementations in this package are largely derived from the following Julia language implementations by the same authors of this package:

Authors

The principal authors of the cvi pacakge are: