
Downsampling

At every node, Pilz samples target and non-target rows independently. Because each side is capped at max_eval_fit rows, the training data per node stays bounded regardless of the original dataset size.

How It Works

Training data is read via two separate SQL queries — one for target rows, one for non-target rows — each limited to max_eval_fit rows:

# src/pilz/service/darkwing.py:65-104
def read_akt_train(self, targer_filter, train_settings, akt_filters):
    full_filters_target_list = [targer_filter.combine] + [f.combine for f in akt_filters]
    full_filter_target = And(*full_filters_target_list)

    target_df = self._get_pl_train_df(
        full_filters=full_filter_target,
        max_eval_fit=train_settings.max_eval_fit,
    )

    full_filters_non_target_list = [Not(targer_filter.combine)] + [f.combine for f in akt_filters]
    full_filters_non_target = And(*full_filters_non_target_list)

    non_target_df = self._get_pl_train_df(
        full_filters=full_filters_non_target,
        max_eval_fit=train_settings.max_eval_fit,
    )

    return TrainDataframes(
        target_df=target_df,
        non_target_df=non_target_df,
        frac_eval_cat=train_settings.frac_eval_cat,
        min_size=train_settings.min_eval_fit,
        neutral_faktor=train_settings.neutral_faktor,
    )

Each SQL query uses ORDER BY RANDOM() LIMIT max_eval_fit to get a random subset:

# src/pilz/service/darkwing.py:92-104
def _get_pl_train_df(self, full_filters, max_eval_fit):
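    # DuckDB resolves the table name `df` in the SQL below to this local
    # DataFrame (replacement scan), so the variable must keep this name.
    df = self.get_cached_train_df()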
    sql_str = SympyToSqlHelper.to_sql_where(full_filters)

    sql = f"""
            SELECT {", ".join(self.dc.feature_names_sql_save)}
            FROM df
            WHERE {sql_str}
            ORDER BY RANDOM()
            LIMIT {max_eval_fit};
          """
    return duckdb.sql(sql).pl()
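
The sampling pattern above can be tried in isolation. The following is a minimal sketch, not Pilz code: it uses the standard-library sqlite3 module as a stand-in for DuckDB, and the table layout, the is_target column, and the sample_side helper are all hypothetical names chosen for illustration. It shows that when one side has fewer rows than the limit, the query simply returns all of them.

```python
import sqlite3


def sample_side(conn, where_sql, max_eval_fit):
    """Return up to max_eval_fit random row ids matching where_sql.

    Hypothetical helper: where_sql is a trusted constant in this sketch,
    and the row limit is bound as a query parameter.
    """
    query = f"SELECT id FROM df WHERE {where_sql} ORDER BY RANDOM() LIMIT ?"
    return [row[0] for row in conn.execute(query, (max_eval_fit,))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (id INTEGER PRIMARY KEY, is_target INTEGER)")
# Toy data (assumption): ids 0..29 are target rows, ids 30..999 are not.
conn.executemany(
    "INSERT INTO df (id, is_target) VALUES (?, ?)",
    [(i, 1 if i < 30 else 0) for i in range(1000)],
)

# One query per side, each with the same row limit.
target_ids = sample_side(conn, "is_target = 1", max_eval_fit=100)
non_target_ids = sample_side(conn, "NOT (is_target = 1)", max_eval_fit=100)

print(len(target_ids), len(non_target_ids))  # 30 100
```

Only 30 target rows exist, so the target query returns all of them, while the non-target side is cut down to 100 random rows.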

Split into Count and Group Sets

Each side is further split into two parts controlled by frac_eval_cat:

  • Count set (frac_eval_cat): Used for bin evaluation (value_counts)
  • Group set (1 - frac_eval_cat): Used for building correlation tables with weights

# src/pilz/model/dataframes.py:365-406
class TrainDataframes:
    def __init__(self, target_df, non_target_df, frac_eval_cat, min_size, neutral_faktor):
        self.target_df_size = target_df.height
        self.non_target_df_size = non_target_df.height

        self.n_count_target, n_group_target = self._calc_split(
            target_df.height, frac_eval_cat
        )
        self.target_df_count = target_df.head(self.n_count_target)
        target_df_group = target_df.tail(-self.n_count_target)

        self.n_count_non_target, n_group_non_target = self._calc_split(
            non_target_df.height, frac_eval_cat
        )
        self.non_target_df_count = non_target_df.head(self.n_count_non_target)
        non_target_df_group = non_target_df.tail(-self.n_count_non_target)
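
The head/tail split can be illustrated with plain Python lists. This is a sketch under stated assumptions: the body of _calc_split is not shown in the excerpt, so calc_split below is a hypothetical stand-in that truncates n_rows * frac_eval_cat to an integer; the real rounding rule may differ.

```python
def calc_split(n_rows, frac_eval_cat):
    """Hypothetical stand-in for _calc_split: returns (n_count, n_group)."""
    n_count = int(n_rows * frac_eval_cat)  # assumed rounding: truncate
    return n_count, n_rows - n_count


def split_count_group(rows, frac_eval_cat):
    """Split one side's sample into a count set and a group set."""
    n_count, n_group = calc_split(len(rows), frac_eval_cat)
    count_set = rows[:n_count]  # first n_count rows, like head(n_count)
    group_set = rows[n_count:]  # remaining rows, like tail(-n_count) for n_count > 0
    assert len(group_set) == n_group
    return count_set, group_set


sampled = list(range(8))  # stands in for one side's sampled rows
count_set, group_set = split_count_group(sampled, frac_eval_cat=0.25)

print(count_set)  # [0, 1]
print(group_set)  # [2, 3, 4, 5, 6, 7]
```

Because the rows arrive in random order from the sampling query, taking the first n_count rows as the count set is itself a random split; no second shuffle is needed.
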

Figure (flowchart): the original data (100K rows; target: 10K, non-target: 90K) is downsampled to 10K rows (target: 5K, non-target: 5K). The target and non-target sides are balanced separately.

Configuration

The downsampling behavior is configured through two settings:

# src/pilz/model/settings.py:22-38
max_eval_fit: int = Field(
    description="Maximum rows used for training at each node",
    default=1000,
)
frac_eval_cat: float = Field(
    description="Fraction of data used for bin evaluation; "
    "the rest is used for grouping with weights",
    default=0.5,
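)

  • max_eval_fit: Caps the rows sampled for each side (target and non-target) at every node, so a node trains on at most 2 × max_eval_fit rows. Lower values train faster but give noisier estimates.
  • frac_eval_cat: Fraction of each side's sample that goes to the count set (bin evaluation); the remainder goes to the group set (weighted correlation tables).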
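
To see how the two settings interact, the sketch below computes the per-node set sizes for a given number of available rows. It is illustrative only: TrainSettingsSketch and node_sizes are hypothetical names, only the two default values are taken from the settings shown above, and the truncating rounding is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TrainSettingsSketch:
    """Illustrative stand-in; only the two defaults come from the settings above."""
    max_eval_fit: int = 1000
    frac_eval_cat: float = 0.5


def node_sizes(n_target, n_non_target, settings):
    """Return the sampled, count-set, and group-set sizes for each side."""
    sizes = {}
    for side, n_available in (("target", n_target), ("non_target", n_non_target)):
        n_sampled = min(n_available, settings.max_eval_fit)  # effect of LIMIT
        n_count = int(n_sampled * settings.frac_eval_cat)    # assumed rounding
        sizes[side] = {
            "sampled": n_sampled,
            "count": n_count,
            "group": n_sampled - n_count,
        }
    return sizes


defaults = node_sizes(10_000, 90_000, TrainSettingsSketch())
print(defaults["target"])      # {'sampled': 1000, 'count': 500, 'group': 500}
print(defaults["non_target"])  # {'sampled': 1000, 'count': 500, 'group': 500}

small = node_sizes(300, 90_000, TrainSettingsSketch(frac_eval_cat=0.25))
print(small["target"])      # {'sampled': 300, 'count': 75, 'group': 225}
print(small["non_target"])  # {'sampled': 1000, 'count': 250, 'group': 750}
```

With the defaults, a node with plenty of rows on both sides trains on 2,000 rows in total. When one side has fewer rows than max_eval_fit, that side is used in full and the sample is no longer balanced.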

Summary

| Concept | Description |
|---|---|
| Independent sampling | Target and non-target queried separately with LIMIT |
| Random ordering | ORDER BY RANDOM() ensures unbiased samples |
| Two-part split | Count set for bin eval, group set for correlation tables |
| Per-node sampling | Each tree node gets a fresh random sample |

Next Steps