Training Internals
This chapter walks through how Pilz training works internally: the main loop, the recursive tree builder, feature categorization, split search, leaf scoring, and data caching.
Architecture Overview
Data Flow
The Main Loop
The run() method iterates over each target class, training n trees per target:
# src/pilz/service/train.py:55-80
def run(self):
    for target in self.dc.target:
        target_filter = Filter(target=target)
        pilze: dict[str, list[Spore]] = {}
        for tree_index in range(self.settings.n):
            spores = self.train_pilz(
                target_filter=target_filter,
                path_filter=[],
                depth="",
            )
            pilz = Pilz(spores=spores, target=target)
            self._save_pilz(pilz, tree_index)
The train_pilz Function
This is the core recursive function that builds a single tree:
# src/pilz/service/train.py:100-151
def train_pilz(self, target_filter, path_filter, depth=""):
    # Fresh sample for this node, restricted by the filters on the path so far
    train_df = self.darkwing.read_akt_train(
        target_filter=target_filter,
        train_settings=self.settings,
        akt_filters=path_filter,
    )
    # Stop: too few rows in either class, or maximum depth reached
    if train_df.is_final_size() or len(depth) >= self.settings.max_depth:
        return self.make_spore(path_filter=path_filter, depth=depth, train_df=train_df)
    self.cater(train_df=train_df)
    left_filter, neutral_filter, right_filter = self.counter(train_df=train_df)
    # Stop: no usable split was found
    if left_filter is None and right_filter is None:
        return self.make_spore(path_filter=path_filter, depth=depth, train_df=train_df)
    # Recurse into each branch; depth records the path as a string of "l", "n", "r"
    left_spores = self.train_pilz(target_filter, path_filter + [left_filter], depth + "l") if left_filter else []
    neutral_spores = self.train_pilz(target_filter, path_filter + [neutral_filter], depth + "n") if neutral_filter else []
    right_spores = self.train_pilz(target_filter, path_filter + [right_filter], depth + "r") if right_filter else []
    return left_spores + neutral_spores + right_spores
Each recursive call:
1. Reads a fresh balanced sample for this node via read_akt_train() (see Downsampling)
2. Stops and returns a leaf if either class has too few samples or the maximum depth is reached
3. Categorizes features via cater() (see Feature Categorization)
4. Finds the best split via counter() (see Multi-Dimensional Splits), and returns a leaf if no left or right filter is found
5. Recurses on the Left, Neutral, and Right branches (see Three-Way Splits)
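The steps above can be sketched on toy data. The following is a minimal illustration of the three-way recursion and the depth string only, not the Pilz implementation: the `Leaf` class, the mean-plus-band split rule, and the `min_size` and `max_depth` defaults are assumptions made for this example, and the real function re-reads a fresh sample at every node instead of partitioning one list.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    depth: str  # path from the root, e.g. "ln" = left, then neutral
    size: int   # number of rows that reached this leaf


def build(rows: list[float], depth: str = "", min_size: int = 4, max_depth: int = 3) -> list[Leaf]:
    """Toy three-way recursion: split around the mean with a neutral band."""
    # Stop on size or depth, mirroring the first stopping check
    if len(rows) < min_size or len(depth) >= max_depth:
        return [Leaf(depth, len(rows))]
    mean = sum(rows) / len(rows)
    band = 0.1 * (max(rows) - min(rows))  # hypothetical neutral zone
    left = [r for r in rows if r < mean - band]
    neutral = [r for r in rows if mean - band <= r <= mean + band]
    right = [r for r in rows if r > mean + band]
    # Stop when no left or right branch exists, mirroring the second check
    if not left and not right:
        return [Leaf(depth, len(rows))]
    leaves: list[Leaf] = []
    for branch, suffix in ((left, "l"), (neutral, "n"), (right, "r")):
        if branch:  # empty branches are skipped, like a missing filter
            leaves += build(branch, depth + suffix, min_size, max_depth)
    return leaves


leaves = build([float(i) for i in range(20)])
```

Every row ends up in exactly one leaf, and each leaf's `depth` string spells out the path taken, which is the same convention `train_pilz` uses with `"l"`, `"n"`, and `"r"`.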
The cater Function
cater() categorizes every training feature into n_cat bins:
# src/pilz/service/train.py:192-202
def cater(self, train_df: TrainDataframes):
    for feat in self.dc.train_features:
        categorized_feature = self.feat_cater(
            feat=feat, train_df=train_df, n=self.settings.n_cat
        )
        if not categorized_feature.is_diff_to_low():
            train_df.train_features.append(categorized_feature)
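This chapter does not show how feat_cater() assigns bins. As a rough illustration of what "n_cat bins" can mean for a numerical feature, here is a minimal sketch of equal-frequency (rank-based) binning. The function name `quantile_bins` and the choice of equal-frequency binning are assumptions for this example; the actual binning rule may differ.

```python
def quantile_bins(values: list[float], n_cat: int) -> list[int]:
    """Assign each value a bin index in [0, n_cat) based on its rank."""
    # Indices of the values, ordered from smallest to largest value
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        # Integer arithmetic spreads the ranks evenly over n_cat bins
        bins[idx] = rank * n_cat // len(values)
    return bins


values = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0]
bins = quantile_bins(values, n_cat=3)  # [1, 0, 2, 1, 2, 0]
```

Because this sketch bins by rank, equal values can land in different bins; a production implementation would more likely bin by value thresholds.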
Features that don't differentiate well enough are excluded by is_diff_to_low():
# src/pilz/model/dataframes.py:326-331
def is_diff_to_low(self, threshold: float = 0.90) -> bool:
    max_wert = self.diff_df["max_proportion"].max()
    min_prop_of_max = self.diff_df.filter(
        pl.col("max_proportion") == max_wert
    )["proportion", "proportion_right"].min_horizontal()[0]
    return min_prop_of_max > threshold
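The same check can be written over plain dictionaries to make the control flow easier to follow. This is a line-by-line restatement, not the library's code. The column meanings are not defined in this chapter; one plausible reading, stated here as an assumption, is that `proportion` and `proportion_right` are the shares of the two classes falling into a bin, so the check fires when a single bin holds more than 90% of both classes.

```python
def is_diff_too_low(rows: list[dict[str, float]], threshold: float = 0.90) -> bool:
    """Plain-Python restatement of the check; rows stand in for diff_df."""
    max_wert = max(r["max_proportion"] for r in rows)
    # First row whose max_proportion equals the overall maximum
    dominant = next(r for r in rows if r["max_proportion"] == max_wert)
    # Smaller of the two proportions in that row
    return min(dominant["proportion"], dominant["proportion_right"]) > threshold


# Nearly constant feature: one bin dominates both columns
flat = [
    {"max_proportion": 0.95, "proportion": 0.95, "proportion_right": 0.93},
    {"max_proportion": 0.07, "proportion": 0.05, "proportion_right": 0.07},
]
# Informative feature: the two columns disagree strongly
useful = [
    {"max_proportion": 0.60, "proportion": 0.60, "proportion_right": 0.20},
    {"max_proportion": 0.80, "proportion": 0.40, "proportion_right": 0.80},
]
flat_excluded = is_diff_too_low(flat)      # True
useful_excluded = is_diff_too_low(useful)  # False
```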
The counter Function
counter() finds the best single feature or feature combination to split on:
# src/pilz/service/train.py:153-190
def counter(self, train_df: TrainDataframes) -> tuple[Filter, Filter | None, Filter]:
    sorted_train_feats = sorted(
        train_df.train_features,
        key=lambda x: x.calc_diff(),
        reverse=True,
    )
    best_feat = sorted_train_feats[0]
    best_feat_diff = best_feat.calc_diff()
    for dim in range(2, self.settings.n_dims + 1):
        counter = 0
        for comb in itertools.combinations(sorted_train_feats, r=dim):
            counter += 1
            akt_feature = CombinedCategorizedFeature(
                comb,
                non_target_size=train_df.n_count_non_target,
                target_size=train_df.n_count_target,
                neutral_faktor=self.settings.neutral_faktor,
            )
            if akt_feature.calc_diff() > best_feat_diff:
                best_feat = akt_feature
                best_feat_diff = akt_feature.calc_diff()
            if (
                self.settings.calcs_per_dim
                and counter > self.settings.calcs_per_dim
            ):
                break
    return best_feat.get_left_right_filter()
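The search pattern in counter() can be isolated in a small, runnable sketch. The `search` function, the `toy_diff` scoring rule, and the feature names below are invented for illustration; in Pilz the score of a combination comes from CombinedCategorizedFeature.calc_diff(), which is not reproduced here. The sketch also assumes the early `break` belongs to the inner loop over combinations.

```python
import itertools
from typing import Callable, Sequence


def search(
    feats: Sequence[str],
    diff: Callable[[tuple[str, ...]], float],
    n_dims: int,
    calcs_per_dim: int = 0,
) -> tuple[tuple[str, ...], float]:
    """Best single feature first, then capped search over combinations."""
    ranked = sorted(feats, key=lambda f: diff((f,)), reverse=True)
    best, best_diff = (ranked[0],), diff((ranked[0],))
    for dim in range(2, n_dims + 1):
        tried = 0
        # combinations() follows the ranked order, so strong features come first
        for comb in itertools.combinations(ranked, r=dim):
            tried += 1
            d = diff(comb)
            if d > best_diff:
                best, best_diff = comb, d
            if calcs_per_dim and tried > calcs_per_dim:
                break
    return best, best_diff


weights = {"a": 0.5, "b": 0.4, "c": 0.1}


def toy_diff(comb: tuple[str, ...]) -> float:
    # Hypothetical rule: sum of weights, discounted per extra dimension
    return sum(weights[f] for f in comb) * 0.7 ** (len(comb) - 1)


best, best_diff = search(list(weights), toy_diff, n_dims=2)
```

Two properties of this pattern are worth noting. Sorting before generating combinations means a cap on evaluations favors combinations of the individually strongest features. And because the cap is tested with `>` after the evaluation, up to `calcs_per_dim + 1` combinations are scored per dimension.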
Leaf Creation
Recursion stops when either class has fewer than min_size rows or the maximum depth is reached. The size check is:
# src/pilz/model/dataframes.py:435-439
def is_final_size(self) -> bool:
    return (
        self.target_df_size < self.min_size
        or self.non_target_df_size < self.min_size
    )
A leaf is created with its score:
# src/pilz/service/train.py:85-98
def make_spore(self, path_filter, depth, train_df):
    score = train_df.score()
    return [Spore(
        cut=[fil.sql() for fil in path_filter],
        score=score,
        depth=depth,
    )]
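Since a leaf stores its path as a list of SQL fragments in `cut`, those fragments can in principle be combined into one predicate that selects the rows belonging to the leaf. The sketch below illustrates that idea only. The `Spore` dataclass is a stand-in with the three fields shown above, `where_clause` is a hypothetical helper, and the example fragments are made up; the real output format of `fil.sql()` is not shown in this chapter.

```python
from dataclasses import dataclass


@dataclass
class Spore:  # stand-in for the real class, same three fields
    cut: list[str]
    score: float
    depth: str


def where_clause(spore: Spore) -> str:
    """Join the path filters with AND; a root leaf matches every row."""
    if not spore.cut:
        return "TRUE"
    return " AND ".join(f"({fragment})" for fragment in spore.cut)


# Made-up fragments, for illustration only
leaf = Spore(cut=["age < 30", "income >= 50000"], score=0.4, depth="lr")
predicate = where_clause(leaf)  # "(age < 30) AND (income >= 50000)"
```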
The score is the normalized difference between the target and non-target row counts at this leaf:
# src/pilz/model/dataframes.py:428-433
def score(self) -> float:
    if self.non_target_df_size + self.target_df_size == 0:
        return 0.0
    return (self.target_df_size - self.non_target_df_size) / (
        self.non_target_df_size + self.target_df_size
    )
A score of 1.0 means all rows are target, -1.0 means all are non-target, and 0.0 means the two classes are perfectly balanced (or the leaf is empty).
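A short worked example makes the scale concrete. If p is the fraction of target rows in the leaf sample, the score equals 2p - 1, so it can be mapped back to a fraction with (score + 1) / 2. The function names below are chosen for this example and are not part of Pilz.

```python
def leaf_score(target_size: int, non_target_size: int) -> float:
    """Same formula as score(), written as a free function."""
    total = target_size + non_target_size
    if total == 0:
        return 0.0
    return (target_size - non_target_size) / total


def target_fraction(score: float) -> float:
    """Invert score = 2p - 1 to recover the target fraction p."""
    return (score + 1) / 2


s = leaf_score(target_size=30, non_target_size=10)  # (30 - 10) / 40 = 0.5
p = target_fraction(s)                              # 0.75, i.e. 30 of 40 rows
```

The inversion is not meaningful for an empty leaf, where the score is defined as 0.0 by convention rather than computed from counts.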
Data Caching
Darkwing provides caching to avoid reloading data:
# src/pilz/service/darkwing.py:45-60
def get_cached_train_df(self) -> pl.DataFrame:
    if self._cached_df is None:
        self._cached_df = pl.concat(
            [self._load_file(f) for f in self.dc.train_files]
        )
    return self._cached_df
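The method above follows the lazy-initialization pattern: load on first access, then return the stored object. A self-contained sketch of the same pattern, using plain lists instead of Polars frames, shows that each file is loaded only once. `LazyRows`, `fake_load`, and the file names are invented for this example.

```python
from typing import Callable


class LazyRows:
    """Loads all files on first access, then serves the in-memory copy."""

    def __init__(self, files: list[str], load_file: Callable[[str], list[dict]]):
        self.files = files
        self.load_file = load_file
        self._cached = None  # filled by the first call to get()

    def get(self) -> list[dict]:
        if self._cached is None:
            self._cached = [row for f in self.files for row in self.load_file(f)]
        return self._cached


loaded: list[str] = []  # records every load, to make the caching visible


def fake_load(path: str) -> list[dict]:
    loaded.append(path)
    return [{"source": path}]


cache = LazyRows(["a.parquet", "b.parquet"], fake_load)
first = cache.get()
second = cache.get()  # served from memory, no further loads
```

One consequence of this pattern is that callers share the same object, so in-place modification by one caller is visible to all others.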
Summary
| Component | File | Description |
|---|---|---|
| run() | train.py:55 | Main loop: for each target, for n trees |
| train_pilz() | train.py:100 | Recursive tree building (read → cater → counter → recurse) |
| cater() | train.py:192 | Feature categorization into n_cat bins |
| feat_cater() | train.py:204 | Dispatches to numerical or categorical binning |
| counter() | train.py:153 | Finds best split (single or multi-dimensional) |
| make_spore() | train.py:85 | Creates leaf node with score |
| is_final_size() | dataframes.py:435 | Size-based stopping criterion for recursion |
| score() | dataframes.py:428 | Normalized target vs. non-target difference at a leaf, in [-1, 1] |
| get_cached_train_df() | darkwing.py:45 | Data caching layer |
Next Steps
- SQL Rules for Deployment — Deploy models to production
- Settings Reference — All parameters
- Troubleshooting — Common issues