Clustering on Knowledge Graphs


Overview

This script demonstrates how to use a START module to analyze biomedical knowledge graphs. Although the OAR project contains several such knowledge graphs, the Charcot-Marie-Tooth (CMT) dataset is used as the example here; the procedure is the same for the other datasets.

Setup

First, we load some dependencies:

# Import the OAR project module
using OAR

Next, we point to the location of the dataset containing the preprocessed knowledge graph statements:

# Location of the edge attributes file, formatted for Lerche parsing
edge_file = joinpath("..", "assets", "edge_attributes_lerche.txt")
"../assets/edge_attributes_lerche.txt"

Load the KG statements

statements = OAR.get_kg_statements(edge_file)
typeof(statements)
Vector{Vector{GSymbol{String}}} (alias for Array{Array{GSymbol{String}, 1}, 1})
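
As a quick sanity check on the parsed result, we can count the statements and look at the first one using base Julia alone:

# How many statements were loaded, and what does one look like?
@info "Number of statements: $(length(statements))"
@info "First statement: $(first(statements))"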

Generate a simple subject-predicate-object grammar from the statements

grammar = OAR.SPOCFG(statements)
OAR.CFG{String}(N:3, S:3, P:3, T:788)
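
Since the grammar is built from subject-predicate-object statements, each parsed statement is assumed here to be a three-element triple of grammar symbols; that assumption comes from the grammar's name rather than from the API. A small sketch destructuring one statement for inspection:

# Assuming each statement is a (subject, predicate, object) triple of
# GSymbol{String} values, destructure the first one for inspection.
subject, predicate, object = first(statements)
println("S: $subject  P: $predicate  O: $object")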

Initialize the START module

gramart = OAR.START(
    grammar,
    rho=0.05,
    terminated=false,
)
START(ProtoNode[], OAR.CFG{String}(N:3, S:3, P:3, T:788), OAR.opts_START
  rho: Float64 0.05
  alpha: Float64 0.001
  beta: Float64 1.0
  epochs: Int64 1
  terminated: Bool false
, Int64[], Float64[], Float64[], Dict{String, Any}("n_categories" => 0, "n_clusters" => 0, "n_instance" => Int64[]))
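
Only the rho and terminated keywords are used above; the printed opts_START suggests that alpha, beta, and epochs may also be settable, but that is an assumption not verified here. A sketch of constructing a second module with a stricter vigilance value using only the demonstrated keywords:

# Sketch: a second START module with a higher vigilance value.
# Only the rho and terminated keywords shown above are used here.
gramart_strict = OAR.START(
    grammar,
    rho=0.3,
    terminated=false,
)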

Train

Now we are ready to cluster the statements. We do this with the train! function without providing supervised labels, so the module learns from the samples alone.

# Process the statements
for statement in statements
    OAR.train!(gramart, statement)
end
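
The demo uses a single pass over the data. If multiple presentations are desired, the same loop can simply be repeated on a freshly initialized module; a minimal sketch under that assumption:

# Optional sketch: train a fresh module for several passes over the data
# instead of the single pass used above.
gramart_multi = OAR.START(grammar, rho=0.05, terminated=false)
n_epochs = 3
for _ in 1:n_epochs
    for statement in statements
        OAR.train!(gramart_multi, statement)
    end
end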

Analysis

We can see how the clustering went by inspecting how many categories were generated:

@info "Number of categories: $(length(gramart.protonodes))"
[ Info: Number of categories: 6
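
Because the vigilance-style parameter rho governs how readily new categories are created, a quick sweep can show its effect on the number of categories. This sketch reuses only the constructor and train! calls demonstrated above:

# Sketch: sweep the vigilance parameter and report how many categories
# each setting produces after one pass over the statements.
for rho in (0.05, 0.1, 0.3)
    local_module = OAR.START(grammar, rho=rho, terminated=false)
    for statement in statements
        OAR.train!(local_module, statement)
    end
    @info "rho = $rho -> $(length(local_module.protonodes)) categories"
end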

This page was generated using DemoCards.jl and Literate.jl.