Example: Customer Churn
This example demonstrates binary classification with the Telco Customer Churn dataset.
Dataset
- Source: Kaggle Telco Customer Churn via blastchar/telco-customer-churn
- Task: Predict customer churn (Yes/No)
- Features: 19 (demographics, services, billing)
- Classes: 2 (Yes, No)
- Training samples: 5,634
- Test samples: 1,409
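The sample counts above add up to 7,043 rows, which is consistent with an 80/20 split of the single CSV file in the Kaggle dataset. The exact split procedure used for this example isn't stated, so the following is only a minimal sketch of one way to produce train.csv and test.csv: a seeded random shuffle using the Python standard library. The function name `split_csv`, the seed, and the output file names are illustrative assumptions and not part of pilz.

```python
import csv
import random


def split_csv(src, train_path, test_path, test_frac=0.2, seed=42):
    """Shuffle the rows of src and write a train/test split.

    The header row is repeated in both output files.
    Returns (n_train, n_test).
    """
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for row in reader if row]  # skip blank lines

    # Seeded shuffle so the split is reproducible
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_frac)

    for path, part in ((test_path, rows[:n_test]), (train_path, rows[n_test:])):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)

    return len(rows) - n_test, n_test
```

Called on a 7,043-row file with the default `test_frac=0.2`, this yields 5,634 training and 1,409 test rows, since `round(7043 * 0.2)` is 1409. A plain random split does not preserve the Yes/No ratio exactly; a stratified split may be preferable given the class imbalance.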
Quick Start
The config files for this example are in examples/churn/:
# 1. Download data (requires kagglehub)
pip install kagglehub
python3 -c "
import kagglehub
path = kagglehub.dataset_download('blastchar/telco-customer-churn')
print(f'Downloaded to: {path}')
"
# 2. Point the datacard to your downloaded data
# Edit examples/churn/dc_telco_customer.yaml and update:
# train_files:
# - <kagglehub_path>/train.csv
# test_files:
# - <kagglehub_path>/test.csv
# Note: the original dataset is one file; split it into train/test first.
# 3. Train
pilz train \
--datacard examples/churn/dc_telco_customer.yaml \
--trainsettings examples/churn/train_settings.yaml
# 4. Evaluate
pilz eval \
--datacard examples/churn/dc_telco_customer.yaml \
--evalsettings examples/churn/eval_settings.yaml
Or use the provided script:
DataCard Structure
features:
- name: gender
statistical: categorial
type: string
- name: SeniorCitizen
statistical: numerical
type: int
- name: Partner
statistical: categorial
type: string
- name: Dependents
statistical: categorial
type: string
- name: tenure
statistical: numerical
type: int
- name: PhoneService
statistical: categorial
type: string
- name: MultipleLines
statistical: categorial
type: string
- name: InternetService
statistical: categorial
type: string
# ... 11 more features
- name: Churn
statistical: categorial
type: string
target:
feature_name: Churn
values:
- "Yes"
- "No"
train_files:
- /path/to/train.csv
test_files:
- /path/to/test.csv
Settings (Quick Start)
n: 1 # 1 tree per class (2 trees total)
out_folder: test
max_depth: 5 # Shallow trees for speed
frac_eval_cat: 0.8
max_eval_fit: 500
min_eval_fit: 5
n_dims: 2 # Pairwise feature combinations
n_cat: 3 # 3 bins per numerical feature
calcs_per_dim: 200 # Limited calculations per dimension
Training Time
With quick-start settings on a modern laptop (Apple Silicon):
- Training: ~15 seconds
- Evaluation: < 1 second
Actual Results
Overall Accuracy: 79.3%
Per-Class Accuracy
| Class | Accuracy |
|---|---|
| No | 87.8% |
| Yes | 53.6% |
The "No" class is easier to predict (majority class with ~73% of samples). The "Yes" class is harder due to class imbalance and more varied churn reasons.
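As a rough sanity check, the overall accuracy should be close to the per-class accuracies weighted by each class's share of the test set. The sketch below uses the figures reported above; the helper names are illustrative, and the ~73% share is the approximate figure quoted in the text, not the exact class ratio of the test split.

```python
def weighted_accuracy(per_class_acc, class_share):
    """Overall accuracy as the share-weighted mean of per-class accuracies."""
    return sum(per_class_acc[c] * class_share[c] for c in per_class_acc)


def implied_share(overall, acc_major, acc_minor):
    """Majority-class share that would reproduce a given overall accuracy."""
    return (overall - acc_minor) / (acc_major - acc_minor)


per_class = {"No": 0.878, "Yes": 0.536}  # reported per-class accuracy
share = {"No": 0.73, "Yes": 0.27}        # approximate class shares from the text

estimate = weighted_accuracy(per_class, share)
baseline = max(share.values())           # always predicting the majority class
no_share = implied_share(0.793, per_class["No"], per_class["Yes"])

print(f"weighted estimate:  {estimate:.1%}")
print(f"majority baseline:  {baseline:.1%}")
print(f"implied 'No' share: {no_share:.1%}")
```

The estimate comes out near 78.6%, slightly below the reported 79.3%; the reported figure would correspond to a test split that is about 75% "No". Either way, the model sits only a few points above the ~73% majority-class baseline, which is why the Yes-class accuracy is the more informative number here.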
ROC Curve
Output Files
test/
├── Yes/0.json # Model for predicting churn "Yes"
└── No/0.json # Model for predicting churn "No"
eval/
├── Yes_roc.html
├── No_roc.html
├── all_roc.html
└── multi_class_result.html
Sample Predictions
Churn,Yes,No,predicted_Churn,correct
No,-0.88,0.92,No,1
No,-0.95,0.95,No,1
Yes,0.11,-0.37,Yes,1
No,-0.86,0.80,No,1
Yes,-0.36,0.49,No,0
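In the rows above, the predicted label coincides with whichever of the Yes and No score columns is larger, and the correct column flags whether it equals the true Churn value. The sketch below recomputes overall and per-class accuracy from rows in this format and counts how often the arg-max rule reproduces predicted_Churn. It embeds the five sample rows instead of reading a file, because the location of the predictions file isn't given here. `summarize` is a hypothetical helper, and the arg-max rule is inferred from these five rows only, not confirmed pilz behavior.

```python
import csv
import io
from collections import Counter

SAMPLE = """\
Churn,Yes,No,predicted_Churn,correct
No,-0.88,0.92,No,1
No,-0.95,0.95,No,1
Yes,0.11,-0.37,Yes,1
No,-0.86,0.80,No,1
Yes,-0.36,0.49,No,0
"""


def summarize(rows, target="Churn", classes=("Yes", "No")):
    """Return (overall accuracy, per-class accuracy, arg-max agreement count)."""
    totals, hits = Counter(), Counter()
    argmax_matches = 0
    for row in rows:
        truth = row[target]
        predicted = row[f"predicted_{target}"]
        totals[truth] += 1
        hits[truth] += predicted == truth
        # Assumed rule: the class with the highest score is the prediction
        top = max(classes, key=lambda c: float(row[c]))
        argmax_matches += top == predicted
    overall = sum(hits.values()) / sum(totals.values())
    per_class = {c: hits[c] / totals[c] for c in totals}
    return overall, per_class, argmax_matches


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
overall, per_class, argmax_matches = summarize(rows)
print(overall, per_class, argmax_matches)
# prints: 0.8 {'No': 1.0, 'Yes': 0.5} 5
```

Five rows are far too few to say anything about the model; the point is only to show how the columns relate. To run this on a full predictions file, replace `io.StringIO(SAMPLE)` with an open file handle.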
Key Findings
- Contract type is the strongest predictor
    - Month-to-month customers churn more
    - Two-year contracts have the lowest churn
- Tenure matters
    - New customers (< 12 months) churn more
    - Longer relationships = loyalty
- Internet service type interacts with contract
    - Fiber optic + month-to-month = high risk
    - DSL customers are more stable
Tips
Quick Start Settings (current)
These are reduced for fast iteration. Training takes ~15 seconds and gives ~79% accuracy.
For Better Accuracy
Increase these settings in train_settings.yaml:
n: 5 # Ensemble of 5 trees per class
max_depth: 13 # Deeper trees
n_dims: 3 # Triple feature combinations
n_cat: 5 # Finer bins
calcs_per_dim: 4000 # More thorough search
max_eval_fit: 5000 # More training samples (training set has 5,634 rows)
With these settings, expect:
- Training time: 1-5 minutes
- Accuracy: 80-85%
- Yes-class recall: significantly improved
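One plausible contributor to the longer training time is the size of the feature-combination space. The settings comments describe n_dims as pairwise or triple feature combinations; how pilz actually enumerates or samples them is not documented here, so the count below is only an upper bound under the assumption that any subset of the 19 features is a candidate. `n_combinations` is an illustrative helper.

```python
from math import comb

N_FEATURES = 19  # feature count from the Dataset section


def n_combinations(n_features, n_dims):
    """Number of distinct feature subsets of size n_dims."""
    return comb(n_features, n_dims)


for n_dims in (2, 3):
    print(f"n_dims={n_dims}: {n_combinations(N_FEATURES, n_dims)} combinations")
# n_dims=2: 171 combinations
# n_dims=3: 969 combinations
```

Going from pairs to triples multiplies the candidate space by more than five, on top of the deeper trees and the larger calcs_per_dim budget.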
Incremental Approach
- Start with max_depth=5, n_dims=2 to verify the pipeline
- Increase max_depth to 8, then 13
- Try n_dims=3 for feature interactions
- Add more trees with n=5
- For class imbalance, monitor the Yes-class accuracy
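The steps above can be sketched as a small generator of settings files, one per stage, each building on the previous one. Base values are the quick-start settings; the file names, the output directory, and the helper functions are assumptions for illustration. The script only writes files and returns the corresponding pilz train commands (using the flags from the Quick Start); it does not invoke pilz.

```python
from pathlib import Path

# Quick-start values from train_settings.yaml
BASE = {
    "n": 1,
    "out_folder": "test",
    "max_depth": 5,
    "frac_eval_cat": 0.8,
    "max_eval_fit": 500,
    "min_eval_fit": 5,
    "n_dims": 2,
    "n_cat": 3,
    "calcs_per_dim": 200,
}

# Cumulative overrides, one per step of the incremental approach
STEPS = [
    {},                 # max_depth=5, n_dims=2: verify the pipeline
    {"max_depth": 8},
    {"max_depth": 13},
    {"n_dims": 3},
    {"n": 5},
]


def staged_settings(base, steps):
    """Apply each override on top of the previous stage's settings."""
    current = dict(base)
    stages = []
    for step in steps:
        current.update(step)
        stages.append(dict(current))
    return stages


def write_stages(out_dir, datacard="examples/churn/dc_telco_customer.yaml"):
    """Write one settings file per stage and return the matching commands."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    commands = []
    for i, settings in enumerate(staged_settings(BASE, STEPS), start=1):
        path = out / f"train_settings_stage{i}.yaml"  # hypothetical file name
        # Flat "key: value" lines are valid YAML for these scalar settings
        path.write_text("".join(f"{k}: {v}\n" for k, v in settings.items()))
        commands.append(f"pilz train --datacard {datacard} --trainsettings {path}")
    return commands
```

Calling `write_stages("staged_settings")` and printing the returned list gives five commands to run in order. All stages keep out_folder: test, so a later run may overwrite an earlier model; give each stage its own out_folder if you want to compare them. Evaluate after every stage and watch the Yes-class accuracy, not just the overall figure.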