Example: Iris Dataset

The Iris dataset is the classic "Hello World" of machine learning - perfect for getting started with Pilz.

Dataset

  • Source: UCI Machine Learning Repository
  • 150 samples (50 per class)
  • 4 features: sepal/petal length and width
  • 3 classes: Setosa, Versicolor, Virginica
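The counts above are easy to verify once a CSV with a `species` column exists (the Quick Start below produces one). The following is a minimal sketch using only the Python standard library; the helper name `summarize_iris` and the three-row `SAMPLE` are illustrative and not part of Pilz.

```python
import csv
import io
from collections import Counter


def summarize_iris(csv_text):
    """Return (row_count, per-class counts) for CSV text with a 'species' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # DictReader skips completely empty lines, so a trailing blank line is harmless.
    counts = Counter(row["species"] for row in reader)
    return sum(counts.values()), counts


# Tiny illustrative sample (one row per class), not the full dataset.
SAMPLE = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

if __name__ == "__main__":
    total, per_class = summarize_iris(SAMPLE)
    print(total, dict(per_class))
```

On the full file, `summarize_iris(open("iris.csv").read())` should report 150 rows and 50 per class if the download and header step went as intended.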

Quick Start

# Download data
curl -o iris.csv "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Add headers (the raw file has no header row, so keep every data line; drop blank lines)
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_header.csv
grep -v '^$' iris.csv >> iris_header.csv
mv iris_header.csv iris.csv

# Create DataCard
pilz create-dc --src iris.csv --out iris_dc.yaml

# Train
pilz train --datacard iris_dc.yaml --trainsettings train.yaml

# Evaluate
pilz eval --datacard iris_dc.yaml --evalsettings eval.yaml

DataCard

features:
  - name: sepal_length
    statistical: numerical
    type: float
  - name: sepal_width
    statistical: numerical
    type: float
  - name: petal_length
    statistical: numerical
    type: float
  - name: petal_width
    statistical: numerical
    type: float

target:
  feature_name: species
  values:
    - Iris-setosa
    - Iris-versicolor
    - Iris-virginica

train_files:
  - iris.csv
test_files:
  - iris.csv
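Because `train_files` and `test_files` point at the same file here, the evaluation is measured on the data the model was trained on. If a held-out estimate is wanted, one option is a stratified split before creating the DataCard. The sketch below is plain Python and independent of Pilz; the function name, the 20% test fraction, and the seed are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_index=-1, test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping class proportions roughly equal."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_index]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        # At least one test row per class, otherwise round to the nearest count.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


if __name__ == "__main__":
    # Synthetic rows: 10 per class, the label is the last column.
    rows = [[i, label] for label in ("a", "b", "c") for i in range(10)]
    train, test = stratified_split(rows)
    print(len(train), len(test))
```

The two parts could then be written to separate files (for example `iris_train.csv` and `iris_test.csv`, hypothetical names) and listed under `train_files` and `test_files` respectively.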

Settings

n: 1
out_folder: iris_model
max_depth: 10
n_dims: 2
n_cat: 3

Actual Results

ROC Curves (Excellent Separation)

The Iris dataset is well-separated - the three species are easily distinguished:

  • Setosa is linearly separable from the others
  • Versicolor and Virginica overlap slightly but are still distinguishable

Learned Rules Example

A trained tree for "Iris-setosa" might look like:

{
  "spores": [
    {
      "cut": ["petal_width <= 0.8"],
      "score": 1.0,
      "depth": "l"
    },
    {
      "cut": ["petal_width > 0.8", "petal_width <= 1.75"],
      "score": 0.0,
      "depth": "rr"
    }
  ],
  "target": "Iris-setosa"
}

This simple rule separates Setosa from the other two species with 100% accuracy:

  • If petal_width ≤ 0.8 → Setosa
  • Otherwise → Not Setosa
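To make the JSON above concrete, here is a small sketch that evaluates such a tree by hand. It assumes that the cuts within a spore are combined with AND and that the first matching spore supplies the score; that is one reading of the example, not Pilz's documented semantics, and the helper names are hypothetical.

```python
import json
import operator

OPS = {"<=": operator.le, "<": operator.lt, ">": operator.gt, ">=": operator.ge}


def cut_holds(cut, sample):
    """Evaluate one cut string such as 'petal_width <= 0.8' against a sample dict."""
    feature, op, threshold = cut.split()
    return OPS[op](sample[feature], float(threshold))


def score(tree, sample):
    """Return the score of the first spore whose cuts all hold, or None if none match."""
    for spore in tree["spores"]:
        if all(cut_holds(cut, sample) for cut in spore["cut"]):
            return spore["score"]
    return None


# The example tree from above, copied verbatim.
TREE = json.loads("""
{
  "spores": [
    {"cut": ["petal_width <= 0.8"], "score": 1.0, "depth": "l"},
    {"cut": ["petal_width > 0.8", "petal_width <= 1.75"], "score": 0.0, "depth": "rr"}
  ],
  "target": "Iris-setosa"
}
""")

if __name__ == "__main__":
    for petal_width in (0.2, 1.4, 2.5):
        print(petal_width, score(TREE, {"petal_width": petal_width}))
```

Under these assumptions a petal width of 0.2 scores 1.0 and 1.4 scores 0.0. A width of 2.5 matches neither spore, since the excerpt shows no leaf for petal_width > 1.75, so the sketch returns `None` for it.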

Output Files

iris_model/
├── Iris-setosa/0.json
├── Iris-versicolor/0.json
└── Iris-virginica/0.json

eval/
├── Iris-setosa_roc.html
├── Iris-versicolor_roc.html
├── Iris-virginica_roc.html
├── all_roc.html
├── multi_class_result.html
└── predictions.csv

Why Iris Works So Well

flowchart LR
    subgraph "Feature Distribution"
        P1[petal_length: 1.0-6.9cm]
        P2[petal_width: 0.1-2.5cm]
    end
    subgraph "Separation"
        S1[Setosa: Small petals]
        S2[Versicolor: Medium]
        S3[Virginica: Large]
    end
    P1 --> S1
    P1 --> S2
    P1 --> S3
    P2 --> S1
    P2 --> S2
    P2 --> S3
    style S1 fill:#ccffcc
    style S2 fill:#ffff99
    style S3 fill:#ffcccc

  1. Clear clusters: Each species forms a distinct group
  2. Simple rules work: A single feature (petal_width) separates most samples
  3. No feature interactions needed: Even n_dims=1 works

Expected Results

Metric          Value
AUC             ~1.0 (excellent)
Accuracy        >95%
Training time   < 1 second
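For readers unfamiliar with AUC, the sketch below computes it from its pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. How Pilz computes the value internally is not stated here, so treat this as the textbook definition rather than the tool's implementation.

```python
def auc(scores, labels):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # One-vs-rest scores for "Iris-setosa": the positive sample scores 1.0, the rest 0.0.
    print(auc([1.0, 0.0, 0.0], [1, 0, 0]))
```

A perfectly separating score, such as the Setosa rule above, gives an AUC of 1.0; a score that carries no information gives 0.5.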

Next Steps

Try these variations to learn more:

  1. n_dims=1 - See if single features are enough
  2. n_dims=3 - Try feature combinations (though not needed)
  3. n_cat=5 - More granular bins
  4. n=5 - Ensemble of 5 trees
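One way to run these variations side by side is to generate one settings file per variation. The sketch below assumes that the Settings block shown earlier is the content of `train.yaml` and that each run can use its own `out_folder`; the file names `train_<name>.yaml` and the helper `render_settings` are hypothetical. It only writes files and prints the `pilz train` command already used in the Quick Start.

```python
from pathlib import Path

# Baseline values copied from the Settings section.
BASE = {"n": 1, "out_folder": "iris_model", "max_depth": 10, "n_dims": 2, "n_cat": 3}

# One override per suggested variation.
VARIATIONS = {
    "n_dims1": {"n_dims": 1},
    "n_dims3": {"n_dims": 3},
    "n_cat5": {"n_cat": 5},
    "n5": {"n": 5},
}


def render_settings(overrides, name):
    """Return flat YAML text for the baseline with overrides applied."""
    # A separate output folder per variation keeps the trained models apart.
    settings = {**BASE, **overrides, "out_folder": f"iris_model_{name}"}
    return "".join(f"{key}: {value}\n" for key, value in settings.items())


if __name__ == "__main__":
    for name, overrides in VARIATIONS.items():
        path = Path(f"train_{name}.yaml")
        path.write_text(render_settings(overrides, name))
        print(f"pilz train --datacard iris_dc.yaml --trainsettings {path}")
```

Changing one setting per file makes it straightforward to attribute any difference in the results to that setting.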