Imbalanced Data
Pilz handles heavily imbalanced datasets without SMOTE, class weights, or manual resampling. The combination of independent sampling and equal-weight grouping makes it naturally robust to skewed class distributions.
The Problem
Most ML algorithms struggle when one class dominates (e.g., 99% non-target, 1% target). The standard remedies are:
- SMOTE: Generates synthetic samples of the minority class
- Class weights: Penalizes misclassifications of the minority class more heavily
- Manual resampling: Down-sample majority or up-sample minority
Pilz avoids all of these by design.
Independent Sampling
Target and non-target rows are queried independently, each with ORDER BY RANDOM() LIMIT max_eval_fit, so each class contributes up to the same number of rows at every node:
    # src/pilz/service/darkwing.py:65-90
    def read_akt_train(self, targer_filter, train_settings, akt_filters):
        # Target rows: sampled on their own, capped at max_eval_fit
        target_df = self._get_pl_train_df(
            full_filters=full_filter_target,
            max_eval_fit=train_settings.max_eval_fit,
        )
        # Non-target rows: a separate query with the same cap
        non_target_df = self._get_pl_train_df(
            full_filters=full_filters_non_target,
            max_eval_fit=train_settings.max_eval_fit,
        )
A dataset with 1% target and 99% non-target rows therefore yields up to max_eval_fit rows per class, for example 1000 target and 1000 non-target, regardless of the original proportions. If a class has fewer than max_eval_fit rows, LIMIT simply returns all of them.
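The effect of the per-class cap can be reproduced without a database. The sketch below is only an illustration, not Pilz code: sample_per_class and the is_target field are hypothetical names, and random.sample with a min() cap stands in for ORDER BY RANDOM() LIMIT max_eval_fit.

```python
import random


def sample_per_class(rows, max_eval_fit, seed=0):
    """Draw up to max_eval_fit rows from each class, independently of the other."""
    rng = random.Random(seed)
    target = [r for r in rows if r["is_target"]]
    non_target = [r for r in rows if not r["is_target"]]
    # min() plays the role of LIMIT: a class smaller than the cap is returned whole.
    target_sample = rng.sample(target, min(max_eval_fit, len(target)))
    non_target_sample = rng.sample(non_target, min(max_eval_fit, len(non_target)))
    return target_sample, non_target_sample


# Hypothetical dataset: 1% target, 99% non-target (2,000 vs 198,000 rows).
rows = [{"id": i, "is_target": i % 100 == 0} for i in range(200_000)]
target_sample, non_target_sample = sample_per_class(rows, max_eval_fit=1000)
print(len(target_sample), len(non_target_sample))  # 1000 1000
```

Both samples have the same size even though the source data is skewed 99:1. Only when a class holds fewer rows than the cap does its sample come back smaller.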
Equal-Weight Grouping
After sampling, the data is split into count and group sets. Within the group set, each class receives the same total weight:
    # src/pilz/model/dataframes.py:380-402
    # Target side: weight = 0.5 / n_groups
    target_df_group = target_df_group.with_columns(
        pl.lit(0.5 / max(1, n_group_target)).alias("weight")
    )
    target_df_group = target_df_group.with_columns(
        (2 * pl.col("weight")).alias("target_weight")
    )
    # Non-target side: same total weight of 0.5
    non_target_df_group = non_target_df_group.with_columns(
        pl.lit(0.5 / max(1, n_group_non_target)).alias("weight")
    )
    non_target_df_group = non_target_df_group.with_columns(
        pl.lit(0.0).alias("target_weight")
    )
Each side gets a total weight of 0.5. If one side has more groups, each group on that side gets a proportionally smaller weight, so the correlation tables always reflect a balanced view.
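The arithmetic behind the weights can be checked with plain numbers. This is a minimal sketch under assumptions: group_weights is a hypothetical helper that repeats the formulas from the excerpt, and the group counts are made up for illustration.

```python
import math


def group_weights(n_group_target, n_group_non_target):
    """Return per-group (weight, target_weight) pairs for the two sides."""
    # max(1, n) mirrors the excerpt and avoids dividing by zero for an empty side.
    w_target = 0.5 / max(1, n_group_target)
    w_non_target = 0.5 / max(1, n_group_non_target)
    # Target groups carry target_weight = 2 * weight; non-target groups carry 0.0.
    return (w_target, 2 * w_target), (w_non_target, 0.0)


n_t, n_n = 40, 800  # hypothetical group counts
(w_t, tw_t), (w_n, tw_n) = group_weights(n_t, n_n)

# Each side sums to 0.5 even though the per-group weights differ by a factor of 20.
assert math.isclose(n_t * w_t, 0.5)
assert math.isclose(n_n * w_n, 0.5)
print(w_t, w_n)
```

With 40 target groups and 800 non-target groups, a single target group weighs 20 times as much as a single non-target group, which is what brings the two sides to the same total.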
Per-Node Re-Balancing
The balancing happens at every node in the tree, not just once at the root:
    # src/pilz/service/train.py:100-107
    def train_pilz(self, target_filter, path_filter, depth=""):
        # Each recursive call re-reads a fresh sample restricted to this node's path
        train_df = self.darkwing.read_akt_train(
            targer_filter=target_filter,
            train_settings=self.settings,
            akt_filters=path_filter,
        )
As the tree splits and the data becomes purer, the remaining rows are re-sampled and re-balanced at each recursive call. A deep node with only a few hundred target rows left in the original data still gets a fresh sample of both classes, and the equal-weight grouping keeps the two sides balanced.
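To illustrate the per-node behaviour, here is a toy sketch that follows a single path down a tree and re-samples at every step. It is not the Pilz implementation: read_balanced_sample, train_node, the x and y features, and the lambda split conditions are all hypothetical, and the path filter is modelled as a list of Python predicates rather than query filters.

```python
import random

MAX_EVAL_FIT = 50  # per-class cap, standing in for train_settings.max_eval_fit


def read_balanced_sample(rows, path_filter, rng):
    """Filter to the current node, then sample each class independently."""
    at_node = [r for r in rows if all(cond(r) for cond in path_filter)]
    target = [r for r in at_node if r["is_target"]]
    non_target = [r for r in at_node if not r["is_target"]]
    return (
        rng.sample(target, min(MAX_EVAL_FIT, len(target))),
        rng.sample(non_target, min(MAX_EVAL_FIT, len(non_target))),
    )


def train_node(rows, path_filter, splits, rng, sizes):
    """Record the sample sizes at this node, then recurse with a longer path filter."""
    target, non_target = read_balanced_sample(rows, path_filter, rng)
    sizes.append((len(path_filter), len(target), len(non_target)))
    if splits:
        train_node(rows, path_filter + [splits[0]], splits[1:], rng, sizes)


# Hypothetical data: roughly 1% target rows and two toy features.
rows = [{"x": i % 4, "y": i % 3, "is_target": i % 101 == 0} for i in range(100_000)]
splits = [lambda r: r["x"] == 0, lambda r: r["y"] == 1]
sizes = []
train_node(rows, [], splits, random.Random(0), sizes)
print(sizes)  # [(0, 50, 50), (1, 50, 50), (2, 50, 50)]
```

Each tuple is (depth, target rows, non-target rows). The number of target rows available shrinks from 991 at the root to 82 at depth 2, yet every node still trains on 50 rows per class because the sample is drawn again after filtering.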
The Score Function
The leaf score reflects the balance at that node:
    # src/pilz/model/dataframes.py:428-433
    def score(self) -> float:
        # An empty node has no evidence either way
        if self.non_target_df_size + self.target_df_size == 0:
            return 0.0
        return (self.target_df_size - self.non_target_df_size) / (
            self.non_target_df_size + self.target_df_size
        )
Because both sides are sampled to similar sizes, this score is meaningful regardless of the original class balance. It ranges from -1 (only non-target rows) to +1 (only target rows), and an empty node scores 0. A node with mostly target rows (after filtering) gets a high positive score.
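A few worked values make the range concrete. The function below is a standalone restatement of the formula in the excerpt, taking the two sizes as arguments instead of reading them from self; the row counts are hypothetical.

```python
def score(target_df_size, non_target_df_size):
    """Normalized difference between target and non-target row counts."""
    total = target_df_size + non_target_df_size
    if total == 0:
        return 0.0
    return (target_df_size - non_target_df_size) / total


print(score(900, 100))  # 0.8  -> mostly target rows
print(score(500, 500))  # 0.0  -> evenly mixed
print(score(0, 1000))   # -1.0 -> only non-target rows
print(score(0, 0))      # 0.0  -> empty node
```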
Comparison
| Approach | Requires tuning? | Works out of the box? | Handles 99:1 imbalance? |
|---|---|---|---|
| SMOTE | Yes (k neighbors) | No | Poor |
| Class weights | Yes (ratio) | No | Moderate |
| Manual resampling | Yes (ratio) | No | Moderate |
| Pilz | No | Yes | Yes |
Summary
| Concept | Description |
|---|---|
| Independent sampling | Each class is sampled separately, capped at max_eval_fit rows |
| Equal weight | Both sides get total weight 0.5 |
| Per-node balancing | Every tree node gets a fresh balanced sample |
| No SMOTE needed | The approach naturally handles any class ratio |
Next Steps
- Downsampling — How the sampling works in detail
- Three-Way Splits — What happens at each node
- Feature Categorization — How features are binned