Mushroom Dataset

Overview

This example shows how to perform supervised training and testing with START on the UCI Mushroom dataset. The Mushroom dataset is purely categorical: every feature takes its values from a set of discrete categories. Such data is usually a challenge for machine learning models that require a numeric encoding scheme, but START learns directly on the symbols of the dataset itself. START also has a simple supervised mode that maps clusters to supervised categories, which allows for training and performance testing.
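
To make the contrast with encoding-based models concrete, here is a minimal sketch in plain Julia. It does not use OAR, and the feature names and values are made up for illustration rather than read from the dataset. It builds one-hot vectors for a few toy samples and then shows the same samples as plain lists of symbols, which is the kind of input a symbolic learner consumes.

```julia
# Toy illustration only: these feature names and values are invented for the
# sketch and are not read from mushrooms.csv.
samples = [
    Dict("cap-shape" => "convex", "odor" => "almond"),
    Dict("cap-shape" => "bell",   "odor" => "foul"),
    Dict("cap-shape" => "convex", "odor" => "foul"),
]
features = ["cap-shape", "odor"]

# A numeric model typically needs an encoding step such as one-hot vectors.
function one_hot(samples, features)
    # Collect the sorted set of values observed for each feature
    levels = Dict(f => sort(unique(s[f] for s in samples)) for f in features)
    # Concatenate one indicator vector per feature
    encode(s) = vcat([Float64.(levels[f] .== s[f]) for f in features]...)
    return [encode(s) for s in samples]
end

encoded = one_hot(samples, features)

# A symbolic learner can instead take the values themselves, one symbol per feature.
statements = [[s[f] for f in features] for s in samples]

println(encoded[1])
println(statements[1])
```

The one-hot vectors grow with the number of distinct values per feature, while the symbolic statements keep one entry per feature.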

Setup

First, we load some dependencies:

# Just load this project
using OAR

Loading the Dataset

The OAR project has an all-in-one function for loading the dataset, parsing it into statements, and inferring the resulting grammar:

# Point to the relative file location
filename = joinpath("..", "assets", "mushrooms.csv")
# All-in-one function
fs, bnf = OAR.symbolic_mushroom(filename)
typeof(fs)

OAR.DataSplitGeneric{SubArray{Vector{GSymbol{String}}, 1, Vector{Vector{GSymbol{String}}}, Tuple{Vector{Int64}}, false}, SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}}
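
The printed type indicates that both the statements and the labels are held as SubArray views indexed by integer vectors. The sketch below reproduces that pattern with plain Julia arrays. It is not OAR's own splitting code, and the data and the index vectors are hypothetical stand-ins.

```julia
# Stand-in data: four tiny "statements" with integer labels (made up for this sketch)
data = [["a", "x"], ["b", "y"], ["a", "y"], ["b", "x"]]
labels = [1, 2, 1, 2]

# Hypothetical split indices; a real split would usually shuffle them
train_idx = [1, 3, 4]
test_idx = [2]

# Views share memory with the parent arrays rather than copying them
train_x = view(data, train_idx)
train_y = view(labels, train_idx)
test_x = view(data, test_idx)
test_y = view(labels, test_idx)

println(typeof(train_y))
println(length(train_x), " training / ", length(test_x), " testing samples")
```

In the example on this page, the corresponding fields are accessed as fs.train_x, fs.train_y, fs.test_x, and fs.test_y.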

Initializing START

We use the grammar and keyword arguments to set the options of the module during initialization:

# Initialize the module with options
art = OAR.START(bnf,
    rho = 0.6,
)

START(ProtoNode[], OAR.CFG{String}(N:22, S:22, P:22, T:23), OAR.opts_START
  rho: Float64 0.6
  alpha: Float64 0.001
  beta: Float64 1.0
  epochs: Int64 1
  terminated: Bool false
, Int64[], Float64[], Float64[], Dict{String, Any}("n_categories" => 0, "n_clusters" => 0, "n_instance" => Int64[]))

We could also set or change the options after initialization with art.opts.rho = 0.7.

Training and Testing

To train the model, we use the training statements from the dataset that we loaded earlier along with their corresponding supervisory labels:

# Iterate over the training data
for ix in eachindex(fs.train_x)
    statement = fs.train_x[ix]
    label = fs.train_y[ix]
    OAR.train!(
        art,
        statement,
        y=label,
    )
end

To test the model, we use the testing data and record the label that the model assigns to each sample:

# Create a container for the output labels
clusters = zeros(Int, length(fs.test_y))
# Iterate over the testing data
for ix in eachindex(fs.test_x)
    clusters[ix] = OAR.classify(
        art,
        fs.test_x[ix],
        get_bmu=true,
    )
end

Finally, we can measure the performance of the module as the fraction of testing samples that are correctly labeled:

# Calculate testing performance
perf = OAR.AdaptiveResonance.performance(fs.test_y, clusters)

# Logging
@info "Final performance: $(perf)"
@info "n_categories: $(art.stats["n_categories"])"

[ Info: Final performance: 0.8986458760771441
[ Info: n_categories: 619
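
A score of this kind is commonly computed as the fraction of test samples whose predicted label matches the true label. The following sketch shows that calculation in plain Julia on made-up labels. It is not a reimplementation of the performance function used above, whose internals are not shown on this page.

```julia
using Statistics: mean

# Made-up labels for illustration; not taken from the Mushroom run above
y_true = [1, 2, 2, 1, 2]
y_pred = [1, 2, 1, 1, 2]

# Accuracy: fraction of positions where the prediction equals the true label
accuracy = mean(y_true .== y_pred)
# The error rate is the complementary fraction
error_rate = 1 - accuracy

println("accuracy = ", accuracy, ", error rate = ", error_rate)
```

Here four of the five toy predictions match, so the accuracy is 4/5.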

This page was generated using DemoCards.jl and Literate.jl.