Imbalanced Data

Pilz handles heavily imbalanced datasets without SMOTE, class weights, or manual resampling. The combination of independent sampling and equal-weight grouping makes it naturally robust to skewed class distributions.

The Problem

Most ML algorithms struggle when one class dominates (e.g., 99% non-target, 1% target). Standard approaches:

SMOTE: Generates synthetic samples of the minority class
Class weights: Penalizes misclassifications of the minority class more heavily
Manual resampling: Down-sample majority or up-sample minority

Pilz avoids all of these by design.

Independent Sampling

Because target and non-target rows are queried independently, each with ORDER BY RANDOM() LIMIT max_eval_fit, both classes contribute comparable numbers of rows to each node:

# src/pilz/service/darkwing.py:65-90
def read_akt_train(self, targer_filter, train_settings, akt_filters):
    target_df = self._get_pl_train_df(
        full_filters=full_filter_target,
        max_eval_fit=train_settings.max_eval_fit,
    )
    non_target_df = self._get_pl_train_df(
        full_filters=full_filters_non_target,
        max_eval_fit=train_settings.max_eval_fit,
    )

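A dataset with 1% target and 99% non-target rows yields up to max_eval_fit rows per class, regardless of the original proportions. With max_eval_fit set to 1000, for example, that is 1000 target and 1000 non-target rows, provided each class has at least that many rows available.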
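The per-class query pattern can be tried in isolation. The following is a minimal sketch, not Pilz's own code: it uses SQLite from the Python standard library as a stand-in backend, and the table name `events`, the column `is_target`, and the helper `sample_class` are hypothetical names chosen for illustration.

```python
import sqlite3


def sample_class(conn, is_target, max_eval_fit):
    """Return at most max_eval_fit randomly chosen rows of one class."""
    cur = conn.execute(
        "SELECT id, is_target FROM events WHERE is_target = ? "
        "ORDER BY RANDOM() LIMIT ?",
        (is_target, max_eval_fit),
    )
    return cur.fetchall()


# Hypothetical in-memory table: 100 target rows (1%) and 9,900 non-target rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, is_target INTEGER)")
conn.executemany(
    "INSERT INTO events (id, is_target) VALUES (?, ?)",
    [(i, 1 if i < 100 else 0) for i in range(10_000)],
)

max_eval_fit = 50
target_rows = sample_class(conn, 1, max_eval_fit)
non_target_rows = sample_class(conn, 0, max_eval_fit)
print(len(target_rows), len(non_target_rows))  # 50 50

# LIMIT is only an upper bound: a class with fewer rows returns what it has.
scarce_rows = sample_class(conn, 1, 500)
print(len(scarce_rows))  # 100
```

The last call shows the one case where the two samples differ in size: when a class holds fewer than max_eval_fit rows, the query returns every row of that class.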

Equal-Weight Grouping

After sampling, the data is split into count and group sets. In the group set, the target side and the non-target side each receive the same total weight of 0.5:

# src/pilz/model/dataframes.py:380-402
# Target side: weight = 0.5 / n_groups
target_df_group = target_df_group.with_columns(
    pl.lit(0.5 / max(1, n_group_target)).alias("weight")
)
target_df_group = target_df_group.with_columns(
    (2 * pl.col("weight")).alias("target_weight")
)

# Non-target side: same total weight of 0.5
non_target_df_group = non_target_df_group.with_columns(
    pl.lit(0.5 / max(1, n_group_non_target)).alias("weight")
)
non_target_df_group = non_target_df_group.with_columns(
    pl.lit(0.0).alias("target_weight")
)

Each side gets a total weight of 0.5. If one side has more groups, each individual group on that side gets a proportionally smaller weight, so the correlation tables always reflect a balanced view.
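A short worked example makes the arithmetic concrete. This sketch only reproduces the 0.5 / max(1, n) formula from the excerpt above in plain Python; the function name `per_group_weights` and the group counts are hypothetical.

```python
def per_group_weights(n_group_target, n_group_non_target):
    """Per-group weight on each side, so that each side sums to 0.5."""
    target = 0.5 / max(1, n_group_target)
    non_target = 0.5 / max(1, n_group_non_target)
    return target, non_target


# Hypothetical node: 4 target groups and 16 non-target groups.
n_target, n_non_target = 4, 16
w_target, w_non_target = per_group_weights(n_target, n_non_target)
print(w_target, w_non_target)  # 0.125 0.03125

total_target = n_target * w_target
total_non_target = n_non_target * w_non_target
print(total_target, total_non_target)  # 0.5 0.5
```

The side with four times as many groups gets a quarter of the weight per group, and both totals stay at 0.5. The max(1, ...) guard only avoids a division by zero when a side has no groups.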

Per-Node Re-Balancing

The balancing happens at every node in the tree, not just once at the root:

# src/pilz/service/train.py:100-107
def train_pilz(self, target_filter, path_filter, depth=""):
    train_df = self.darkwing.read_akt_train(
        targer_filter=target_filter,
        train_settings=self.settings,
        akt_filters=path_filter,
    )

As the tree splits and data becomes purer, the remaining rows are re-sampled and re-balanced at each recursive call. A deep node that has only a few hundred target rows left in the original data will still get a fresh balanced sample.
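The effect of re-sampling at each node can be illustrated without a database. The sketch below is an assumption-laden stand-in for the real pipeline: rows are plain dictionaries, the path filter is a Python predicate, and `sample_up_to` and `balanced_node_sample` are hypothetical helpers, not part of Pilz.

```python
import random


def sample_up_to(rows, max_eval_fit, rng):
    """Random sample capped at max_eval_fit, like ORDER BY RANDOM() LIMIT."""
    return rng.sample(rows, min(max_eval_fit, len(rows)))


def balanced_node_sample(rows, path_filter, max_eval_fit, rng):
    """Filter to the current node, then sample each class independently."""
    in_node = [r for r in rows if path_filter(r)]
    target = [r for r in in_node if r["is_target"]]
    non_target = [r for r in in_node if not r["is_target"]]
    return (
        sample_up_to(target, max_eval_fit, rng),
        sample_up_to(non_target, max_eval_fit, rng),
    )


# Hypothetical data: 10,000 rows, 1% target, one feature x with values 0..9.
rows = [{"x": (i // 100) % 10, "is_target": i % 100 == 0} for i in range(10_000)]
rng = random.Random(0)

# Root node: no path filter yet.
root_target, root_non_target = balanced_node_sample(rows, lambda r: True, 5, rng)
# Deeper node: only 10 target rows remain among 1,000 rows with x == 3.
deep_target, deep_non_target = balanced_node_sample(
    rows, lambda r: r["x"] == 3, 5, rng
)
print(len(root_target), len(root_non_target))  # 5 5
print(len(deep_target), len(deep_non_target))  # 5 5
```

In this toy setup the deeper node holds only 10 target rows, yet its sample is as balanced as the root's because the draw is repeated against the filtered rows.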

The Score Function

The leaf score reflects the balance at that node:

# src/pilz/model/dataframes.py:428-433
def score(self) -> float:
    if self.non_target_df_size + self.target_df_size == 0:
        return 0.0
    return (self.target_df_size - self.non_target_df_size) / (
        self.non_target_df_size + self.target_df_size
    )

Because both sides are sampled to similar sizes, this score is meaningful regardless of the original class balance. It ranges from -1 (only non-target rows) to +1 (only target rows), and an empty node scores 0. A node with mostly target rows (after filtering) gets a high positive score.
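A few sample values show how the formula behaves. This is a standalone restatement of the score expression above as a free function; the name `leaf_score` and the row counts are hypothetical.

```python
def leaf_score(target_df_size, non_target_df_size):
    """(target - non_target) / (target + non_target), or 0.0 for an empty node."""
    total = target_df_size + non_target_df_size
    if total == 0:
        return 0.0
    return (target_df_size - non_target_df_size) / total


print(leaf_score(900, 100))  # 0.8
print(leaf_score(500, 500))  # 0.0
print(leaf_score(0, 250))    # -1.0
print(leaf_score(0, 0))      # 0.0
```

An evenly split node and an empty node both score 0.0, so the score alone does not distinguish the two cases.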

Comparison

Approach	Requires tuning?	Works out of the box?	Handles 99:1 imbalance?
SMOTE	Yes (k neighbors)	No	Poor
Class weights	Yes (ratio)	No	Moderate
Manual resampling	Yes (ratio)	No	Moderate
Pilz	No	Yes	Yes

Summary

Concept	Description
Independent sampling	Each class sampled separately with `LIMIT`
Equal weight	Both sides get total weight 0.5
Per-node balancing	Every tree node gets a fresh balanced sample
No SMOTE needed	The approach naturally handles any class ratio

Next Steps

Downsampling — How the sampling works in detail
Three-Way Splits — What happens at each node
Feature Categorization — How features are binned