Training Internals

This chapter provides a deep dive into how Pilz training works under the hood. If you want to understand every detail of the algorithm, read on.

Architecture Overview

flowchart TB
    subgraph Input
        DC[DataCard]
        TS[TrainSettings]
    end
    subgraph Training_Service
        M[Main Loop] --> T[Train Tree]
        T --> C[Categorize]
        T --> CO[Counter]
        T --> R[Recurse]
    end
    subgraph Data_Service
        DW[Darkwing] --> DB[DuckDB]
        DW --> P[Polars]
    end
    DC --> TS
    TS --> M
    M --> DW
    DW --> DB
    DW --> P
    style M fill:#e0f0ff
    style C fill:#ccffcc
    style CO fill:#ffff99
    style R fill:#ffcccc

Data Flow

sequenceDiagram
    participant CLI
    participant Train
    participant Darkwing
    participant DuckDB
    participant Pilz
    CLI->>Train: run()
    Train->>Train: for each target, for n trees
    Train->>Darkwing: read_akt_train(target_filter, path_filters)
    Darkwing->>DuckDB: SELECT WHERE target AND filters
    DuckDB-->>Darkwing: DataFrame
    Darkwing-->>Train: TrainDataframes
    Train->>Train: cater() - categorize features
    Train->>Train: counter() - find best split
    Train->>Train: recurse() - build subtree
    Train->>Pilz: create Pilz object
    Train-->>CLI: save JSON

The Main Loop

The run() method iterates over each target class, training n trees per target:

# src/pilz/service/train.py:55-80
def run(self):
    for target in self.dc.target:
        target_filter = Filter(target=target)
        pilze: dict[str, list[Spore]] = {}
        for tree_index in range(self.settings.n):
            spores = self.train_pilz(
                target_filter=target_filter,
                path_filter=[],
                depth="",
            )
            pilz = Pilz(spores=spores, target=target)
            self._save_pilz(pilz, tree_index)

The train_pilz Function

This is the core recursive function that builds a single tree:

# src/pilz/service/train.py:100-151
def train_pilz(self, target_filter, path_filter, depth=""):
    train_df = self.darkwing.read_akt_train(
        target_filter=target_filter,
        train_settings=self.settings,
        akt_filters=path_filter,
    )

    if train_df.is_final_size() or len(depth) >= self.settings.max_depth:
        return self.make_spore(path_filter=path_filter, depth=depth, train_df=train_df)

    self.cater(train_df=train_df)
    left_filter, neutral_filter, right_filter = self.counter(train_df=train_df)

    if left_filter is None and right_filter is None:
        return self.make_spore(path_filter=path_filter, depth=depth, train_df=train_df)

    left_spores = self.train_pilz(target_filter, path_filter + [left_filter], depth + "l") if left_filter else []
    neutral_spores = self.train_pilz(target_filter, path_filter + [neutral_filter], depth + "n") if neutral_filter else []
    right_spores = self.train_pilz(target_filter, path_filter + [right_filter], depth + "r") if right_filter else []

    return left_spores + neutral_spores + right_spores

Each recursive call:

1. Reads a fresh balanced sample for this node via read_akt_train() (see Downsampling)
2. Stops and creates a leaf if too few samples remain or the maximum depth is reached
3. Categorizes features via cater() (see Feature Categorization)
4. Finds the best split via counter() (see Multi-Dimensional Splits)
5. Creates a leaf if counter() returns neither a left nor a right filter
6. Recurses on the left, neutral, and right branches (see Three-Way Splits)
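
The steps above can be sketched on toy data. This is a minimal illustration, not the Pilz implementation: it filters an in-memory list instead of re-reading a balanced sample per node, and the split rule (cutting the value range into thirds) is a hypothetical stand-in for cater() and counter(). The names Leaf, leaf_score, and build are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    depth: str    # branch letters from the root, e.g. "lr"
    score: float  # (target - non_target) / total


def leaf_score(rows: list[tuple[float, bool]]) -> float:
    target = sum(1 for _, is_target in rows if is_target)
    non_target = len(rows) - target
    total = target + non_target
    return 0.0 if total == 0 else (target - non_target) / total


def build(rows, depth="", min_size=2, max_depth=2) -> list[Leaf]:
    target = sum(1 for _, is_target in rows if is_target)
    non_target = len(rows) - target
    # Stop: either class is too small, or the path is already max_depth long.
    if target < min_size or non_target < min_size or len(depth) >= max_depth:
        return [Leaf(depth, leaf_score(rows))]

    # Hypothetical split rule: cut the value range into thirds.
    values = [value for value, _ in rows]
    lo, hi = min(values), max(values)
    if lo == hi:  # no split possible, so this node becomes a leaf
        return [Leaf(depth, leaf_score(rows))]
    cut1 = lo + (hi - lo) / 3
    cut2 = lo + 2 * (hi - lo) / 3
    branches = (
        ([r for r in rows if r[0] < cut1], "l"),
        ([r for r in rows if cut1 <= r[0] < cut2], "n"),
        ([r for r in rows if r[0] >= cut2], "r"),
    )

    # Recurse on each non-empty branch; the depth string records the path.
    leaves: list[Leaf] = []
    for branch_rows, letter in branches:
        if branch_rows:
            leaves += build(branch_rows, depth + letter, min_size, max_depth)
    return leaves


rows = [(float(v), v >= 5) for v in range(10)]
leaves = build(rows)
```

With these ten rows the root splits once and each branch immediately becomes a leaf, because every branch has fewer than two rows of one class. The result is a flat list of leaves, mirroring how train_pilz() concatenates the left, neutral, and right results.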

The cater Function

Categorizes all features into n_cat bins each:

# src/pilz/service/train.py:192-202
def cater(self, train_df: TrainDataframes):
    for feat in self.dc.train_features:
        categorized_feature = self.feat_cater(
            feat=feat, train_df=train_df, n=self.settings.n_cat
        )
        if not categorized_feature.is_diff_to_low():
            train_df.train_features.append(categorized_feature)

Features that don't differentiate well enough are excluded by is_diff_to_low():

# src/pilz/model/dataframes.py:326-331
def is_diff_to_low(self, threshold: float = 0.90) -> bool:
    max_wert = self.diff_df["max_proportion"].max()
    min_prop_of_max = self.diff_df.filter(
        pl.col("max_proportion") == max_wert
    )["proportion", "proportion_right"].min_horizontal()[0]
    return min_prop_of_max > threshold
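
To make the binning idea concrete, here is a small pure-Python sketch. The binning strategy inside feat_cater() is not shown in this chapter, so the equal-frequency rule below is an assumption, and quantile_bins and target_share_per_bin are hypothetical helper names, not part of Pilz.

```python
def quantile_bins(values: list[float], n_cat: int) -> list[int]:
    """Assign each value a bin index in [0, n_cat) by rank (equal-frequency)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_cat // len(values)
    return bins


def target_share_per_bin(
    bins: list[int], is_target: list[bool], n_cat: int
) -> list[float]:
    """Fraction of target rows in each bin (0.0 for empty bins)."""
    totals = [0] * n_cat
    hits = [0] * n_cat
    for bin_index, flag in zip(bins, is_target):
        totals[bin_index] += 1
        hits[bin_index] += int(flag)
    return [h / n if n else 0.0 for h, n in zip(hits, totals)]


values = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0]
labels = [False, False, True, False, True, False]
bins = quantile_bins(values, n_cat=3)
shares = target_share_per_bin(bins, labels, n_cat=3)
```

A feature whose bins all show roughly the same target share carries little information for a split, which is the kind of feature a filter such as is_diff_to_low() is meant to drop. Note that this rank-based rule can place tied values in different bins.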

The counter Function

Finds the best feature or combination to split on:

# src/pilz/service/train.py:153-190
def counter(self, train_df: TrainDataframes) -> tuple[Filter, Filter | None, Filter]:
    sorted_train_feats = sorted(
        train_df.train_features,
        key=lambda x: x.calc_diff(),
        reverse=True,
    )
    best_feat = sorted_train_feats[0]
    best_feat_diff = best_feat.calc_diff()

    for dim in range(2, self.settings.n_dims + 1):
        counter = 0
        for comb in itertools.combinations(sorted_train_feats, r=dim):
            counter += 1
            akt_feature = CombinedCategorizedFeature(
                comb,
                non_target_size=train_df.n_count_non_target,
                target_size=train_df.n_count_target,
                neutral_faktor=self.settings.neutral_faktor,
            )
            if akt_feature.calc_diff() > best_feat_diff:
                best_feat = akt_feature
                best_feat_diff = akt_feature.calc_diff()
            if (
                self.settings.calcs_per_dim
                and counter > self.settings.calcs_per_dim
            ):
                break

    return best_feat.get_left_right_filter()
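
The search pattern in counter() can be isolated in a short sketch. The function capped_search and the toy scoring function are assumptions for illustration; the real score comes from calc_diff() on CombinedCategorizedFeature, which is not reproduced here.

```python
import itertools
from typing import Callable, Optional


def capped_search(
    single_scores: dict[str, float],
    combined_score: Callable[[tuple[str, ...]], float],
    n_dims: int,
    calcs_per_dim: Optional[int] = None,
):
    # Rank single features best-first, as counter() does with calc_diff().
    ranked = sorted(single_scores, key=single_scores.get, reverse=True)
    best: tuple[str, ...] = (ranked[0],)
    best_score = single_scores[ranked[0]]
    evaluated = 0

    for dim in range(2, n_dims + 1):
        count = 0
        for comb in itertools.combinations(ranked, r=dim):
            count += 1
            evaluated += 1
            score = combined_score(comb)
            if score > best_score:
                best, best_score = comb, score
            # Same cap check as counter(): it runs after the evaluation.
            if calcs_per_dim and count > calcs_per_dim:
                break
    return best, best_score, evaluated


scores = {"a": 0.5, "b": 0.4, "c": 0.1, "d": 0.05}


def toy_combined(comb: tuple[str, ...]) -> float:
    # Hypothetical scoring rule, chosen only to make the example concrete.
    return 0.6 * sum(scores[name] for name in comb)


capped = capped_search(scores, toy_combined, n_dims=2, calcs_per_dim=2)
full = capped_search(scores, toy_combined, n_dims=3)
```

Two details follow from the loop as written. First, itertools.combinations yields tuples in the order of its input, so sorting features best-first means a capped search spends its budget on combinations of the strongest single features. Second, because the cap is checked after the evaluation with a strict `>`, each dimension evaluates up to calcs_per_dim + 1 combinations.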

Leaf Creation

Recursion stops when either class has fewer than min_size rows left or when the maximum depth is reached. The size check is is_final_size():

# src/pilz/model/dataframes.py:435-439
def is_final_size(self) -> bool:
    return (
        self.target_df_size < self.min_size
        or self.non_target_df_size < self.min_size
    )

A leaf is created with its score:

# src/pilz/service/train.py:85-98
def make_spore(self, path_filter, depth, train_df):
    score = train_df.score()
    return [Spore(
        cut=[fil.sql() for fil in path_filter],
        score=score,
        depth=depth,
    )]
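
A leaf's cut is the list of SQL fragments collected along its path. The sketch below shows how such fragments could be joined into one WHERE clause. The fields of the real Filter class are not shown in this chapter, so RangeFilter and where_clause are hypothetical names and the half-open range condition is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeFilter:
    """Hypothetical stand-in for a path filter on one numeric column."""

    column: str
    low: float
    high: float

    def sql(self) -> str:
        return f'"{self.column}" >= {self.low} AND "{self.column}" < {self.high}'


def where_clause(path: list[RangeFilter]) -> str:
    """AND together every filter on the path; the root has no filters."""
    if not path:
        return "TRUE"
    return " AND ".join(f"({f.sql()})" for f in path)


path = [RangeFilter("age", 18.0, 30.0), RangeFilter("income", 0.0, 1000.0)]
clause = where_clause(path)
```

Wrapping each fragment in parentheses keeps the conditions independent when a single filter itself contains AND or OR.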

The score is the normalized difference between the target and non-target row counts at this leaf:

# src/pilz/model/dataframes.py:428-433
def score(self) -> float:
    if self.non_target_df_size + self.target_df_size == 0:
        return 0.0
    return (self.target_df_size - self.non_target_df_size) / (
        self.non_target_df_size + self.target_df_size
    )

A score of 1.0 means all rows are target, -1.0 means all are non-target, and 0.0 means a perfect balance.
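
A worked example makes the range of the score easier to read. The standalone function below restates the formula from score() on plain integers; target_rate is a helper added for this example and relies only on the identity score = 2p - 1, where p is the share of target rows.

```python
def leaf_score(target_n: int, non_target_n: int) -> float:
    total = target_n + non_target_n
    if total == 0:
        return 0.0
    return (target_n - non_target_n) / total


def target_rate(score: float) -> float:
    """Convert a score in [-1, 1] back to the share of target rows in [0, 1]."""
    return (score + 1) / 2


print(leaf_score(30, 10))               # 0.5
print(target_rate(leaf_score(30, 10)))  # 0.75
print(leaf_score(10, 0))                # 1.0
print(leaf_score(0, 10))                # -1.0
print(leaf_score(5, 5))                 # 0.0
```

A leaf with 30 target rows and 10 non-target rows therefore scores 0.5, which corresponds to a 75% target share. An empty leaf also scores 0.0, so a score of zero alone does not distinguish "balanced" from "no data".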

Data Caching

Darkwing caches the concatenated training files in memory so that they are loaded only once:

# src/pilz/service/darkwing.py:45-60
def get_cached_train_df(self) -> pl.DataFrame:
    if self._cached_df is None:
        self._cached_df = pl.concat(
            [self._load_file(f) for f in self.dc.train_files]
        )
    return self._cached_df

flowchart LR
    subgraph "First Call"
        F1[Request] --> L[Load from CSV]
        L --> C[Cache in memory]
    end
    subgraph "Subsequent Calls"
        S1[Request] --> H[Check Cache]
        H -->|"Hit"| R1[Return cached]
        H -->|"Miss"| L
    end
    style C fill:#ccffcc
    style R1 fill:#ccffcc
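
The lazy-loading pattern can be tried without Polars or any files. In this sketch, CachedLoader and fake_load are hypothetical names, and plain lists stand in for DataFrames; only the "load on first request, reuse afterwards" behavior is taken from the code above.

```python
from typing import Callable, Iterable, Optional


class CachedLoader:
    """Loads all files on the first request and reuses the result afterwards."""

    def __init__(self, files: Iterable[str], load_file: Callable[[str], list]):
        self.files = list(files)
        self._load_file = load_file
        self._cached: Optional[list] = None
        self.load_calls = 0  # counts real loads, for demonstration only

    def get(self) -> list:
        if self._cached is None:  # cache miss: load and concatenate once
            rows: list = []
            for name in self.files:
                self.load_calls += 1
                rows.extend(self._load_file(name))
            self._cached = rows
        return self._cached  # cache hit: same object every time


def fake_load(name: str) -> list:
    # Stand-in for reading a file; returns one row per file.
    return [f"{name}:row"]


loader = CachedLoader(["a.csv", "b.csv"], fake_load)
first = loader.get()
second = loader.get()
```

Both calls return the same object, and the files are read only during the first call. The usual caveat of this pattern applies: callers share the cached object, so mutating it in place would affect every later caller.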

Summary

| Component | File | Description |
|---|---|---|
| run() | train.py:55 | Main loop: for each target, for n trees |
| train_pilz() | train.py:100 | Recursive tree building (read → cater → counter → recurse) |
| cater() | train.py:192 | Feature categorization into n_cat bins |
| feat_cater() | train.py:204 | Dispatches to numerical or categorical binning |
| counter() | train.py:153 | Finds best split (single or multi-dimensional) |
| make_spore() | train.py:85 | Creates leaf node with score |
| is_final_size() | dataframes.py:435 | Stopping criterion for recursion |
| score() | dataframes.py:428 | Normalized target/non-target difference at a leaf |
| get_cached_train_df() | darkwing.py:45 | Data caching layer |

Next Steps