Mushroom Dataset


Overview

This example shows how to do supervised training and testing with START on the UCI Mushroom dataset. The Mushroom dataset is purely categorical: every feature takes its values from a set of discrete categories. Such data is normally a challenge for other machine learning models because of the encoding schemes they require, whereas START learns directly on the symbols of the dataset itself. Furthermore, START has a simple supervised mode that maps clusters to supervised categories, which allows for supervised training and performance testing.

Setup

First, we load some dependencies:

# Just load this project
using OAR

Loading the Dataset

The OAR project has an all-in-one function for loading the dataset, parsing it into statements, and inferring the resulting grammar:

# Point to the relative file location
filename = joinpath("..", "assets", "mushrooms.csv")
# All-in-one function
fs, bnf = OAR.symbolic_mushroom(filename)
typeof(fs)
OAR.DataSplitGeneric{SubArray{Vector{GSymbol{String}}, 1, Vector{Vector{GSymbol{String}}}, Tuple{Vector{Int64}}, false}, SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}}
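To give a feel for what "parsing into statements" can mean for categorical data, here is a minimal sketch that turns a few comma-separated rows into vectors of symbols. It is not how `OAR.symbolic_mushroom` is implemented: the inline data, the column names, and the `column=value` symbol format are all assumptions made for illustration.

```julia
# Hypothetical stand-in for a few rows of a categorical CSV (header plus rows);
# the column names and values are illustrative only
raw = """
class,cap-shape,odor
p,x,p
e,b,a
e,x,n
"""

lines = split(strip(raw), '\n')
header = split(lines[1], ',')
rows = [split(line, ',') for line in lines[2:end]]

# The first column holds the supervised label
labels = [String(row[1]) for row in rows]

# Each remaining column contributes one symbol to the statement; the symbol is
# tagged with its column name so the same letter in two columns stays distinct
statements = [
    ["$(header[j])=$(row[j])" for j in 2:length(row)]
    for row in rows
]

# The set of distinct symbols seen across all statements
terminals = unique(vcat(statements...))
```

No numeric encoding is involved at any point: each sample remains a sequence of strings, which is the kind of input a symbolic learner can consume directly.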

Initializing START

We use the grammar and keyword arguments to set the options of the module during initialization:

# Initialize the module with options
art = OAR.START(bnf,
    rho = 0.6,
)
START(ProtoNode[], OAR.CFG{String}(N:22, S:22, P:22, T:23), OAR.opts_START
  rho: Float64 0.6
  alpha: Float64 0.001
  beta: Float64 1.0
  epochs: Int64 1
  terminated: Bool false
, Int64[], Float64[], Float64[], Dict{String, Any}("n_categories" => 0, "n_clusters" => 0, "n_instance" => Int64[]))

We could also set or change the options after initialization with art.opts.rho = 0.7.

Training and Testing

To train the model, we use the training statements that we loaded earlier along with their corresponding supervisory labels:

# Iterate over the training data
for ix in eachindex(fs.train_x)
    statement = fs.train_x[ix]
    label = fs.train_y[ix]
    OAR.train!(
        art,
        statement,
        y=label,
    )
end
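The overview mentions that the supervised mode maps clusters to supervised categories. The sketch below illustrates that general idea with a plain dictionary; it is an assumption about the concept, not START's actual bookkeeping, and the cluster indices and labels are made up.

```julia
# Hypothetical cluster index assigned to each training sample, and its label
cluster_of_sample = [1, 1, 2, 3, 2]
train_labels      = [1, 1, 2, 1, 2]

# Record the supervised label of each cluster the first time the cluster appears
cluster_to_label = Dict{Int, Int}()
for (c, y) in zip(cluster_of_sample, train_labels)
    get!(cluster_to_label, c, y)
end

# At test time, the best-matching cluster of a sample determines its label
test_clusters = [3, 2, 1]
predicted = [cluster_to_label[c] for c in test_clusters]
```

Under a scheme like this, several clusters can share one label, so a model may create many more clusters than there are classes.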

To test the model, we use the testing data and extract the label that the model prescribes for each sample:

# Create a container for the output labels
clusters = zeros(Int, length(fs.test_y))
# Iterate over the testing data
for ix in eachindex(fs.test_x)
    clusters[ix] = OAR.classify(
        art,
        fs.test_x[ix],
        get_bmu=true,
    )
end

Finally, we can measure the performance of the module as the fraction of testing samples that are correctly labeled:

# Calculate testing performance
perf = OAR.AdaptiveResonance.performance(fs.test_y, clusters)

# Logging
@info "Final performance: $(perf)"
@info "n_categories: $(art.stats["n_categories"])"
[ Info: Final performance: 0.9290110791957324
[ Info: n_categories: 555
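The reported value of about 0.93 is consistent with a classification accuracy, that is, the fraction of test samples whose predicted label matches the true label. Assuming that interpretation, the same quantity can be sketched in plain Julia; the label vectors here are hypothetical.

```julia
# Hypothetical true and predicted labels for five test samples
y_true = [1, 2, 2, 1, 2]
y_pred = [1, 2, 1, 1, 2]

# Fraction of samples whose predicted label matches the true label
accuracy = count(y_true .== y_pred) / length(y_true)
```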

This page was generated using DemoCards.jl and Literate.jl.