Imbalanced Data

Pilz handles heavily imbalanced datasets without SMOTE, class weights, or manual resampling. The combination of independent sampling and equal-weight grouping makes it naturally robust to skewed class distributions.

The Problem

Most ML algorithms struggle when one class dominates (e.g., 99% non-target, 1% target). Standard approaches:

  • SMOTE: Generates synthetic samples of the minority class
  • Class weights: Penalizes misclassifications of the minority class more heavily
  • Manual resampling: Down-samples the majority class or up-samples the minority class

Pilz avoids all of these by design.

Independent Sampling

Target and non-target rows are queried independently, each with ORDER BY RANDOM() LIMIT max_eval_fit, so each class contributes at most max_eval_fit rows to a node, no matter how many rows it has in the underlying data:

# src/pilz/service/darkwing.py:65-90
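def read_akt_train(self, targer_filter, train_settings, akt_filters):
    # Excerpt: the code that builds full_filter_target and
    # full_filters_non_target is omitted here.
    # Target rows: sampled on their own, capped at max_eval_fit.
    target_df = self._get_pl_train_df(
        full_filters=full_filter_target,
        max_eval_fit=train_settings.max_eval_fit,
    )
    # Non-target rows: a second, independent query with the same cap.
    non_target_df = self._get_pl_train_df(
        full_filters=full_filters_non_target,
        max_eval_fit=train_settings.max_eval_fit,
    )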

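A dataset with 1% target and 99% non-target rows therefore yields the same number of rows for each class, for example 1000 target and 1000 non-target with max_eval_fit = 1000, regardless of the original proportions. This holds as long as each class has at least max_eval_fit rows available; a class with fewer rows contributes all of them.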
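The effect of querying each class separately can be reproduced outside Pilz. The following is a minimal sketch, not Pilz's implementation: it assumes an in-memory SQLite table named samples with an is_target column, and the helper sample_class is hypothetical. Only the ORDER BY RANDOM() LIMIT pattern comes from the text above.

```python
import sqlite3


def sample_class(conn, is_target, max_eval_fit):
    """Return up to max_eval_fit random rows of a single class."""
    cur = conn.execute(
        "SELECT id, is_target FROM samples WHERE is_target = ? "
        "ORDER BY RANDOM() LIMIT ?",
        (is_target, max_eval_fit),
    )
    return cur.fetchall()


# Hypothetical toy data: 20,000 rows, 1% of them target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, is_target INTEGER)")
conn.executemany(
    "INSERT INTO samples (id, is_target) VALUES (?, ?)",
    [(i, 1 if i % 100 == 0 else 0) for i in range(20_000)],
)

# Two independent queries, one per class, with the same cap.
target_rows = sample_class(conn, 1, 100)
non_target_rows = sample_class(conn, 0, 100)

print(len(target_rows), len(non_target_rows))  # 100 100
```

Although the table holds 200 target rows and 19,800 non-target rows, both samples have 100 rows because the cap is applied per class rather than to the combined data.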

Equal-Weight Grouping

After sampling, the data is split into a count set and a group set. Within the group set, the target side and the non-target side each receive the same total weight:

# src/pilz/model/dataframes.py:380-402
# Target side: weight = 0.5 / n_groups
target_df_group = target_df_group.with_columns(
    pl.lit(0.5 / max(1, n_group_target)).alias("weight")
)
target_df_group = target_df_group.with_columns(
    (2 * pl.col("weight")).alias("target_weight")
)

# Non-target side: same total weight of 0.5
non_target_df_group = non_target_df_group.with_columns(
    pl.lit(0.5 / max(1, n_group_non_target)).alias("weight")
)
non_target_df_group = non_target_df_group.with_columns(
    pl.lit(0.0).alias("target_weight")
)

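Each side gets a total weight of 0.5. If one side has more groups, each group on that side gets a proportionally smaller weight, so the correlation tables always reflect a balanced view. The target_weight column is twice the weight on the target side and zero on the non-target side, so it sums to 1.0 over the target groups.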
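The arithmetic can be checked with a small plain-Python sketch. It mirrors the formulas in the excerpt above (0.5 / max(1, n)) but does not use Polars, and the function group_weights and the group counts are made up for illustration.

```python
import math


def group_weights(n_group_target, n_group_non_target):
    """Build per-group weights so that each side sums to 0.5."""
    w_target = 0.5 / max(1, n_group_target)
    w_non_target = 0.5 / max(1, n_group_non_target)
    target = [
        {"weight": w_target, "target_weight": 2 * w_target}
        for _ in range(n_group_target)
    ]
    non_target = [
        {"weight": w_non_target, "target_weight": 0.0}
        for _ in range(n_group_non_target)
    ]
    return target, non_target


# Hypothetical sizes: 200 target groups against 1000 non-target groups.
target, non_target = group_weights(200, 1000)

total_target = sum(g["weight"] for g in target)
total_non_target = sum(g["weight"] for g in non_target)
total_target_weight = sum(g["target_weight"] for g in target + non_target)

# Both sides carry the same total weight despite the 1:5 size difference.
print(math.isclose(total_target, 0.5))         # True
print(math.isclose(total_non_target, 0.5))     # True
print(math.isclose(total_target_weight, 1.0))  # True
```

Each target group weighs five times as much as a non-target group here, which is what keeps the two sides balanced when their group counts differ.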

Per-Node Re-Balancing

The balancing happens at every node in the tree, not just once at the root:

# src/pilz/service/train.py:100-107
def train_pilz(self, target_filter, path_filter, depth=""):
    train_df = self.darkwing.read_akt_train(
        targer_filter=target_filter,
        train_settings=self.settings,
        akt_filters=path_filter,
    )

As the tree splits and the data becomes purer, the rows that match the path filter are re-sampled and re-balanced at each recursive call. A deep node that has only a few hundred target rows left in the original data still gets a fresh sample, drawn from those remaining rows, and is balanced again.
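The recursion can be illustrated with a toy version. This is a sketch under stated assumptions, not the logic of train_pilz: the row layout, the median split on a single feature x, and the names sample and train_node are all hypothetical. The one idea taken from the text is that every node filters the full data by its path and then draws a new capped sample per class.

```python
import random


def sample(rows, k, rng):
    """Draw up to k rows without replacement (stand-in for ORDER BY RANDOM() LIMIT k)."""
    return rng.sample(rows, min(k, len(rows)))


def train_node(rows, path_filters, max_eval_fit, rng, depth=0, max_depth=2):
    # Start from the full data and keep only rows matching the path so far.
    remaining = [r for r in rows if all(f(r) for f in path_filters)]
    # Re-sample each class independently at this node.
    target = sample([r for r in remaining if r["is_target"]], max_eval_fit, rng)
    non_target = sample([r for r in remaining if not r["is_target"]], max_eval_fit, rng)
    node = {
        "depth": depth,
        "n_target": len(target),
        "n_non_target": len(non_target),
        "children": [],
    }
    if depth < max_depth and target and non_target:
        # Hypothetical split rule: median of x over the sampled rows.
        xs = sorted(r["x"] for r in target + non_target)
        threshold = xs[len(xs) // 2]
        splits = (
            lambda r, t=threshold: r["x"] < t,
            lambda r, t=threshold: r["x"] >= t,
        )
        for extra in splits:
            node["children"].append(
                train_node(rows, path_filters + [extra], max_eval_fit, rng,
                           depth + 1, max_depth)
            )
    return node


rng = random.Random(0)
# Toy data: 10,000 rows, 1% target.
rows = [{"x": rng.random(), "is_target": i % 100 == 0} for i in range(10_000)]
root = train_node(rows, [], max_eval_fit=50, rng=rng)

print(root["n_target"], root["n_non_target"])  # 50 50
```

Child nodes see fewer target rows than the root, so their target samples may fall below the cap, while the non-target side is still cut down to max_eval_fit instead of swamping the node.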

The Score Function

The leaf score reflects the balance at that node:

# src/pilz/model/dataframes.py:428-433
def score(self) -> float:
    if self.non_target_df_size + self.target_df_size == 0:
        return 0.0
    return (self.target_df_size - self.non_target_df_size) / (
        self.non_target_df_size + self.target_df_size
    )

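The score ranges from -1.0 (only non-target rows) to 1.0 (only target rows), and is 0.0 when both sides are the same size or the node is empty. Because both sides are sampled to similar sizes, this score is meaningful regardless of the original class balance. A node with mostly target rows (after filtering) gets a high positive score.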
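A standalone version of the same formula makes the range easy to verify. The free function below takes the two sizes as arguments instead of reading them from self, and the sizes passed in are illustrative only.

```python
def score(target_df_size, non_target_df_size):
    """Same formula as the excerpt above, written as a free function."""
    total = target_df_size + non_target_df_size
    if total == 0:
        return 0.0
    return (target_df_size - non_target_df_size) / total


print(score(900, 100))  # 0.8
print(score(500, 500))  # 0.0
print(score(0, 1000))   # -1.0
print(score(0, 0))      # 0.0
```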

Comparison

Approach          | Requires tuning?  | Works out of the box? | Handles 99:1 imbalance?
SMOTE             | Yes (k neighbors) | No                    | Poor
Class weights     | Yes (ratio)       | No                    | Moderate
Manual resampling | Yes (ratio)       | No                    | Moderate
Pilz              | No                | Yes                   | Yes

Summary

Concept              | Description
Independent sampling | Each class sampled separately with LIMIT
Equal weight         | Both sides get total weight 0.5
Per-node balancing   | Every tree node gets a fresh balanced sample
No SMOTE needed      | The approach naturally handles any class ratio

Next Steps