Mushroom Dataset
Overview
This example shows how to run supervised training and testing with START on the UCI Mushroom dataset. The Mushroom dataset is purely categorical: each feature takes its value from a discrete set of symbols. While such data is normally a challenge for other machine learning models, which require numeric encoding schemes, START learns directly on the symbols of the dataset itself. Furthermore, START has a simple supervised mode that maps clusters to supervised categories, which allows for training and performance testing.
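To make the symbolic view concrete, a sample in a purely categorical dataset can be treated as a vector of feature symbols that are compared by equality alone, with no numeric encoding. A minimal sketch of this idea (the feature names and values below are hypothetical illustrations, not the OAR GSymbol type):

```julia
# A hypothetical mushroom sample as raw categorical symbols:
# one string per feature, no numeric encoding required
sample = ["cap-shape=convex", "odor=almond", "gill-size=broad"]

# A symbolic learner can compare samples by symbol equality alone
other = ["cap-shape=convex", "odor=foul", "gill-size=broad"]
n_matching = count(sample .== other)
# Two of the three feature symbols agree
@assert n_matching == 2
```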
Setup
First, we load some dependencies:
# Just load this project
using OAR
Loading the Dataset
The OAR project has an all-in-one function for loading the dataset, parsing it into statements, and inferring the resulting grammar:
# Point to the relative file location
filename = joinpath("..", "assets", "mushrooms.csv")
# All-in-one function
fs, bnf = OAR.symbolic_mushroom(filename)
typeof(fs)
OAR.DataSplitGeneric{SubArray{Vector{GSymbol{String}}, 1, Vector{Vector{GSymbol{String}}}, Tuple{Vector{Int64}}, false}, SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}}
Initializing START
We use the grammar and keyword arguments to set the options of the module during initialization:
# Initialize the module with options
art = OAR.START(bnf,
    rho = 0.6,
)
START(ProtoNode[], OAR.CFG{String}(N:22, S:22, P:22, T:23), OAR.opts_START
rho: Float64 0.6
alpha: Float64 0.001
beta: Float64 1.0
epochs: Int64 1
terminated: Bool false
, Int64[], Float64[], Float64[], Dict{String, Any}("n_categories" => 0, "n_clusters" => 0, "n_instance" => Int64[]))
We could also set or change the options after initialization with art.opts.rho = 0.7.
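The vigilance parameter rho acts as a similarity threshold in ART-style modules: a sample is committed to an existing category only if its match value meets the vigilance. A toy illustration of that check (the match value here is hypothetical, not OAR's actual match computation):

```julia
# Vigilance parameter, as set at initialization
rho = 0.6

# Hypothetical match value of a sample against an existing category
match_value = 0.55

# Below vigilance: the category is rejected, and search continues
# (eventually creating a new category if none passes)
accepts = match_value >= rho
@assert accepts == false
```

Raising rho therefore tends to produce more, tighter categories; lowering it produces fewer, broader ones.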
Training and Testing
To train the model we will use the training statements portion of the dataset that we loaded earlier along with their corresponding supervisory labels:
# Iterate over the training data
for ix in eachindex(fs.train_x)
    statement = fs.train_x[ix]
    label = fs.train_y[ix]
    OAR.train!(
        art,
        statement,
        y=label,
    )
end
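The supervised mode used above boils down to associating each internal cluster with a supervisory label. A minimal sketch of that association as a plain majority-vote map (an illustration of the idea, not the actual OAR internals):

```julia
# Hypothetical cluster assignments and their supervisory labels
cluster_ids = [1, 1, 2, 2, 2, 3]
label_ids   = [1, 1, 2, 1, 2, 2]

# Map each cluster to the label it most frequently co-occurs with
function majority_map(clusters::Vector{Int}, labels::Vector{Int})
    votes = Dict{Int, Dict{Int, Int}}()
    for (c, l) in zip(clusters, labels)
        inner = get!(votes, c, Dict{Int, Int}())
        inner[l] = get(inner, l, 0) + 1
    end
    # argmax on a Dict returns the key with the maximal value
    return Dict(c => argmax(inner) for (c, inner) in votes)
end

m = majority_map(cluster_ids, label_ids)
@assert m[1] == 1  # cluster 1 is dominated by label 1
@assert m[2] == 2  # cluster 2 is dominated by label 2
```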
To test the model, we use the testing data and collect the label that the model prescribes for each sample:
# Create a container for the output labels
clusters = zeros(Int, length(fs.test_y))
# Iterate over the testing data
for ix in eachindex(fs.test_x)
    clusters[ix] = OAR.classify(
        art,
        fs.test_x[ix],
        get_bmu=true,
    )
end
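The get_bmu=true option asks the module to fall back on the best-matching unit when no category passes vigilance, rather than reporting a mismatch. Conceptually, BMU selection is just an argmax over category activations; a toy illustration (the activation values are hypothetical, not OAR's activation function):

```julia
# Hypothetical activation values for three learned categories
activations = [0.2, 0.7, 0.5]

# The best-matching unit (BMU) is the category with maximal activation
bmu = argmax(activations)
@assert bmu == 2
```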
We can finally evaluate the performance of the module as the percentage of testing samples that are correctly labeled:
# Calculate testing performance
perf = OAR.AdaptiveResonance.performance(fs.test_y, clusters)
# Logging
@info "Final performance: $(perf)"
@info "n_categories: $(art.stats["n_categories"])"
[ Info: Final performance: 0.9290110791957324
[ Info: n_categories: 555
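The performance metric reported above is the fraction of predicted labels that match the ground truth. A minimal sketch of that computation on integer label vectors (illustrating the metric, not the AdaptiveResonance.performance implementation):

```julia
# Fraction of predicted labels that match the ground-truth labels
function fraction_correct(y_true::Vector{Int}, y_pred::Vector{Int})
    length(y_true) == length(y_pred) || error("length mismatch")
    return count(y_true .== y_pred) / length(y_true)
end

# Toy example: 3 of 4 predictions match
@assert fraction_correct([1, 2, 1, 2], [1, 2, 2, 2]) == 0.75
```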
This page was generated using DemoCards.jl and Literate.jl.