Example: Iris Dataset
The Iris dataset is the classic "Hello World" of machine learning - perfect for getting started with Pilz.
Dataset
- Source: UCI Machine Learning Repository
- 150 samples (50 per class)
- 4 features: sepal/petal length and width
- 3 classes: Setosa, Versicolor, Virginica
Quick Start
# Download data
curl -o iris.csv "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# Add headers (iris.data has no header row, so keep every data line and drop only blank lines)
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_header.csv
grep -v '^$' iris.csv >> iris_header.csv
mv iris_header.csv iris.csv
# Create DataCard
pilz create-dc --src iris.csv --out iris_dc.yaml
# Train
pilz train --datacard iris_dc.yaml --trainsettings train.yaml
# Evaluate
pilz eval --datacard iris_dc.yaml --evalsettings eval.yaml
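The header step can also be done in Python, which makes it easy to confirm the counts listed under Dataset (150 samples, 50 per class). The following is a minimal sketch, assuming the raw download has no header row. The helper names `add_header` and `class_counts` are illustrative and not part of Pilz.

```python
import csv
import io
from collections import Counter

HEADER = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def add_header(raw_text: str) -> str:
    """Prepend the column header and drop blank lines from raw iris.data text."""
    rows = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return "\n".join([",".join(HEADER)] + rows) + "\n"


def class_counts(csv_text: str) -> Counter:
    """Count samples per species in CSV text that starts with the header above."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["species"] for row in reader)
```

To use it, read the raw file, pass its text to `add_header`, write the result back to `iris.csv`, and check that `class_counts` reports 50 rows for each of the three species.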
DataCard
features:
  - name: sepal_length
    statistical: numerical
    type: float
  - name: sepal_width
    statistical: numerical
    type: float
  - name: petal_length
    statistical: numerical
    type: float
  - name: petal_width
    statistical: numerical
    type: float
target:
  feature_name: species
  values:
    - Iris-setosa
    - Iris-versicolor
    - Iris-virginica
train_files:
  - iris.csv
test_files:
  - iris.csv
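The DataCard above lists the same file under `train_files` and `test_files`, so evaluation runs on data the model has already seen. For a held-out estimate, the CSV can be split first. The following is a minimal sketch of a stratified split using only the standard library. The function names and the output file names `iris_train.csv` and `iris_test.csv` are hypothetical choices, not Pilz conventions.

```python
import csv
import random
from collections import defaultdict


def stratified_split(rows, label_key="species", test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping class proportions roughly equal."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def split_csv(src, train_path, test_path, **kwargs):
    """Read src and write both splits as CSV files with the same header."""
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    train, test = stratified_split(rows, **kwargs)
    for path, subset in ((train_path, train), (test_path, test)):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
    return len(train), len(test)
```

Calling `split_csv("iris.csv", "iris_train.csv", "iris_test.csv")` on the prepared file yields 120 training and 30 test rows with the default 20% test fraction. The two output files can then be listed under `train_files` and `test_files` respectively.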
Settings
Actual Results
ROC Curves (Excellent Separation)
The three Iris species are well separated in feature space, which is why the ROC curves are close to ideal:
- Setosa is linearly separable from the others
- Versicolor and Virginica overlap slightly but are still distinguishable
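To make the link between separation and the ROC curve concrete, the area under the curve (AUC) for one class versus the rest can be computed as the fraction of positive/negative pairs in which the positive sample gets the higher score. The following is a minimal sketch of that pairwise definition. It is not how Pilz computes its ROC output, and the function name `roc_auc` is illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels are 1 (target class) or 0 (any other class); ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

When every positive sample scores above every negative one, as for a linearly separable class, the result is 1.0. Each positive/negative pair ranked the wrong way lowers the value, which is how the slight overlap between two classes shows up in the curve.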
Learned Rules Example
A trained tree for "Iris-setosa" might look like:
{
  "spores": [
    {
      "cut": ["petal_width <= 0.8"],
      "score": 1.0,
      "depth": "l"
    },
    {
      "cut": ["petal_width > 0.8", "petal_width <= 1.75"],
      "score": 0.0,
      "depth": "rr"
    }
  ],
  "target": "Iris-setosa"
}
On this dataset, a single cut separates Setosa from the other two species with 100% accuracy:
- If petal_width ≤ 0.8 → Setosa
- Otherwise → Not Setosa
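The rule can be applied to a sample by hand. The following is a minimal sketch that assumes a sample receives the score of the first spore whose cuts all hold. That reading is an assumption based on the JSON example above, not Pilz's documented scoring logic, and the helpers `cut_holds` and `score_sample` are hypothetical.

```python
import operator

# Comparison operators that may appear in a cut string.
_OPS = {"<=": operator.le, ">": operator.gt, "<": operator.lt, ">=": operator.ge}


def cut_holds(cut: str, sample: dict) -> bool:
    """Evaluate one cut string such as 'petal_width <= 0.8' on a sample."""
    feature, op, threshold = cut.split()
    return _OPS[op](sample[feature], float(threshold))


def score_sample(tree: dict, sample: dict):
    """Return the score of the first spore whose cuts all hold, else None."""
    for spore in tree["spores"]:
        if all(cut_holds(c, sample) for c in spore["cut"]):
            return spore["score"]
    return None


# The example tree from above, written as a Python dict.
tree = {
    "spores": [
        {"cut": ["petal_width <= 0.8"], "score": 1.0, "depth": "l"},
        {"cut": ["petal_width > 0.8", "petal_width <= 1.75"], "score": 0.0, "depth": "rr"},
    ],
    "target": "Iris-setosa",
}

setosa = {"petal_width": 0.2}
versicolor = {"petal_width": 1.4}
print(score_sample(tree, setosa))      # 1.0
print(score_sample(tree, versicolor))  # 0.0
```

A sample with `petal_width` above 1.75 matches neither spore in this excerpt, so the sketch returns `None` for it.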
Output Files
iris_model/
├── Iris-setosa/0.json
├── Iris-versicolor/0.json
└── Iris-virginica/0.json
eval/
├── Iris-setosa_roc.html
├── Iris-versicolor_roc.html
├── Iris-virginica_roc.html
├── all_roc.html
├── multi_class_result.html
└── predictions.csv
Why Iris Works So Well
flowchart LR
    subgraph "Feature Distribution"
        P1[petal_length: 1.0-6.9 cm]
        P2[petal_width: 0.1-2.5 cm]
    end
    subgraph "Separation"
        S1[Setosa: Small petals]
        S2[Versicolor: Medium]
        S3[Virginica: Large]
    end
    P1 --> S1
    P1 --> S2
    P1 --> S3
    P2 --> S1
    P2 --> S2
    P2 --> S3
    style S1 fill:#ccffcc
    style S2 fill:#ffff99
    style S3 fill:#ffcccc
- Clear clusters: Each species forms a distinct group
- Simple rules work: a single feature (petal_width) separates most samples
- No feature interactions needed: even n_dims=1 works
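The claim that one feature is enough can be checked with a one-feature threshold search (a decision stump). The following is a minimal sketch of that idea. It is not the algorithm Pilz uses, and `best_threshold` is an illustrative name. The sample values are the `petal_width` entries of the first two rows of each class in `iris.data`.

```python
def best_threshold(values, is_target):
    """Find the cut 'value <= t' that classifies the most samples correctly.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (threshold, accuracy).
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((v <= t) == y for v, y in zip(values, is_target))
        acc = correct / len(values)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# petal_width: two Setosa, two Versicolor, two Virginica samples
petal_width = [0.2, 0.2, 1.4, 1.5, 2.5, 1.9]
is_setosa = [True, True, False, False, False, False]
threshold, accuracy = best_threshold(petal_width, is_setosa)
print(threshold, accuracy)
```

On these six samples the search lands on a threshold of about 0.8 with every sample classified correctly, which matches the `petal_width <= 0.8` cut in the learned rule.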
Expected Results
| Metric | Value |
|---|---|
| AUC | ~1.0 (excellent) |
| Accuracy | >95% |
| Training time | < 1 second |
Next Steps
Try these variations to learn more:
- n_dims=1 - See if single features are enough
- n_dims=3 - Try feature combinations (not needed for Iris, but useful to compare)
- n_cat=5 - More granular bins
- n=5 - Ensemble of 5 trees