Downsampling¶
At every node, Pilz samples target and non-target rows independently, capping each side at max_eval_fit rows. This keeps the per-node training data bounded regardless of the original dataset size.
How It Works¶
Training data is read via two separate SQL queries — one for target rows, one for non-target rows — each limited to max_eval_fit rows:
# src/pilz/service/darkwing.py:65-104
def read_akt_train(self, targer_filter, train_settings, akt_filters):
    full_filters_target_list = [targer_filter.combine] + [f.combine for f in akt_filters]
    full_filter_target = And(*full_filters_target_list)
    target_df = self._get_pl_train_df(
        full_filters=full_filter_target,
        max_eval_fit=train_settings.max_eval_fit,
    )
    full_filters_non_target_list = [Not(targer_filter.combine)] + [f.combine for f in akt_filters]
    full_filters_non_target = And(*full_filters_non_target_list)
    non_target_df = self._get_pl_train_df(
        full_filters=full_filters_non_target,
        max_eval_fit=train_settings.max_eval_fit,
    )
    return TrainDataframes(
        target_df=target_df,
        non_target_df=non_target_df,
        frac_eval_cat=train_settings.frac_eval_cat,
        min_size=train_settings.min_eval_fit,
        neutral_faktor=train_settings.neutral_faktor,
    )
Each SQL query uses ORDER BY RANDOM() LIMIT max_eval_fit to get a random subset:
# src/pilz/service/darkwing.py:92-104
def _get_pl_train_df(self, full_filters, max_eval_fit):
    # DuckDB resolves the name `df` in the query from this local variable
    df = self.get_cached_train_df()
    sql_str = SympyToSqlHelper.to_sql_where(full_filters)
    sql = f"""
        SELECT {", ".join(self.dc.feature_names_sql_save)}
        FROM df
        WHERE {sql_str}
        ORDER BY RANDOM()
        LIMIT {max_eval_fit};
    """
    return duckdb.sql(sql).pl()
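The two-query pattern above can be tried in isolation. The following is a minimal sketch, not Pilz code: it swaps DuckDB for the standard-library sqlite3 module (which also supports ORDER BY RANDOM() with LIMIT), and the table train, its columns, the helper sample_side, and both filter strings are hypothetical names chosen for illustration.

```python
import sqlite3


def sample_side(conn, where_sql, max_eval_fit):
    """Return up to max_eval_fit randomly chosen rows matching where_sql."""
    # where_sql is hard-coded, trusted text in this sketch;
    # never interpolate untrusted input into SQL like this.
    sql = f"""
        SELECT x, is_target
        FROM train
        WHERE {where_sql}
        ORDER BY RANDOM()
        LIMIT ?
    """
    return conn.execute(sql, (max_eval_fit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (x INTEGER, is_target INTEGER)")
# 1,000 rows, 10% of them target rows
conn.executemany(
    "INSERT INTO train VALUES (?, ?)",
    [(i, 1 if i % 10 == 0 else 0) for i in range(1000)],
)

target_filter = "is_target = 1"  # hypothetical target definition
node_filter = "x >= 200"         # hypothetical filter accumulated for the current node

# One query per side: the target filter, or its negation, ANDed with the node filter.
target_rows = sample_side(conn, f"({target_filter}) AND ({node_filter})", 50)
non_target_rows = sample_side(conn, f"NOT ({target_filter}) AND ({node_filter})", 50)
```

Because each side has its own LIMIT, a rare target class is never crowded out by the majority class. If a side has fewer matching rows than the limit, the query simply returns all of them.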
Split into Count and Group Sets¶
Each side is further split into two parts controlled by frac_eval_cat:
- Count set (frac_eval_cat): Used for bin evaluation (value_counts)
- Group set (1 - frac_eval_cat): Used for building correlation tables with weights
# src/pilz/model/dataframes.py:365-406
class TrainDataframes:
    def __init__(self, target_df, non_target_df, frac_eval_cat, min_size, neutral_faktor):
        self.target_df_size = target_df.height
        self.non_target_df_size = non_target_df.height
        self.n_count_target, n_group_target = self._calc_split(
            target_df.height, frac_eval_cat
        )
        self.target_df_count = target_df.head(self.n_count_target)
        target_df_group = target_df.tail(-self.n_count_target)
        self.n_count_non_target, n_group_non_target = self._calc_split(
            non_target_df.height, frac_eval_cat
        )
        self.non_target_df_count = non_target_df.head(self.n_count_non_target)
        non_target_df_group = non_target_df.tail(-self.n_count_non_target)
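The head/tail split can be illustrated with plain Python lists. This is a sketch under stated assumptions: calc_split is a hypothetical stand-in for _calc_split, whose body is not shown in the excerpt, and it assumes the count share is rounded down. The actual rounding and any use of min_size may differ.

```python
def calc_split(n_rows, frac_eval_cat):
    """Hypothetical stand-in for _calc_split: floor the count share, rest goes to group."""
    n_count = int(n_rows * frac_eval_cat)
    return n_count, n_rows - n_count


def split_count_group(rows, frac_eval_cat):
    """Split already-shuffled rows into a count set and a group set."""
    n_count, _n_group = calc_split(len(rows), frac_eval_cat)
    count_set = rows[:n_count]  # analogous to df.head(n_count)
    group_set = rows[n_count:]  # analogous to df.tail(-n_count) for n_count > 0
    return count_set, group_set


sampled = list(range(10))  # stands in for one side's sampled rows
count_set, group_set = split_count_group(sampled, 0.3)
```

Since the rows arrive in random order from the query, taking the first part and the remainder yields two disjoint random subsets without any additional shuffling.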
flowchart LR
    subgraph Original
        O1[100K rows, Target: 10K, Non-target: 90K]
    end
    subgraph Downsampled
        D1[10K rows, Target: 5K, Non-target: 5K]
    end
    O1 -->|"Balance target"| D1
    O1 -->|"Balance non-target"| D1
    style D1 fill:#ffff99
Configuration¶
The downsampling behavior is configured through two settings:
# src/pilz/model/settings.py:22-38
max_eval_fit: int = Field(
    description="Maximum rows used for training at each node",
    default=1000,
)
frac_eval_cat: float = Field(
    description="Fraction of data used for bin evaluation; "
    "the rest is used for grouping with weights",
    default=0.5,
)
- max_eval_fit: Caps how many rows are sampled per side (target and non-target) at each node. Lower values train faster but give less precise estimates.
- frac_eval_cat: Fraction of the sampled rows used for count-based bin evaluation; the remainder is used for weight-based grouping.
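To make the effect of max_eval_fit concrete, the sketch below computes how many rows each side contributes and how the class balance shifts. The helpers sampled_sizes and target_share are hypothetical and not part of Pilz; they only restate the LIMIT semantics described earlier.

```python
def sampled_sizes(n_target, n_non_target, max_eval_fit):
    """Rows read per side: the LIMIT caps each side independently."""
    return min(n_target, max_eval_fit), min(n_non_target, max_eval_fit)


def target_share(n_target, n_non_target):
    """Fraction of target rows among all rows (0.0 for an empty node)."""
    total = n_target + n_non_target
    return n_target / total if total else 0.0


# Class sizes from the flowchart above: 10K target rows, 90K non-target rows.
n_target, n_non_target = 10_000, 90_000

before = target_share(n_target, n_non_target)
with_default = sampled_sizes(n_target, n_non_target, max_eval_fit=1000)
with_5k_cap = sampled_sizes(n_target, n_non_target, max_eval_fit=5000)
after = target_share(*with_default)
```

With the default cap of 1,000, both sides contribute 1,000 rows, so the target share rises from 10% to 50%. A cap of 5,000 reproduces the 5K/5K figure in the flowchart. When one side has fewer rows than the cap, it contributes all of them and the sample is no longer perfectly balanced.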
Summary¶
| Concept | Description |
|---|---|
| Independent sampling | Target and non-target queried separately with LIMIT |
| Random ordering | ORDER BY RANDOM() ensures unbiased samples |
| Two-part split | Count set for bin eval, group set for correlation tables |
| Per-node sampling | Each tree node gets a fresh random sample |
Next Steps¶
- Imbalanced Data — How this enables handling skewed distributions
- Three-Way Splits — What happens at each node
- Training Internals — Full algorithm