API Reference

Core Functions

pyslicekit.api.evaluate(model: Any, df: DataFrame, y_true: Any, y_pred: Any, slice_cols: List[str], metric: str = 'accuracy', min_samples: int = 30, depth: int = 2, render_visuals: bool = True, **render_kwargs: Any) → List[SliceResult]

Evaluate a machine learning model across different slices (subgroups) of your data to discover hidden areas of poor performance.

This function is the main entry point of PySliceKit. It splits your data into subgroups based on the columns you provide, scores your model on each subgroup, and highlights the subgroups where the model performs worse than it does on the full dataset.

import pyslicekit

# Find the segments where your model underperforms
results = pyslicekit.evaluate(
    model=my_model,
    df=my_dataframe,
    y_true=y_actuals,
    y_pred=y_predictions,
    slice_cols=["Age", "Geography"],
    metric="accuracy",
    depth=2,
    render_visuals=True,
    top_n=15
)

Parameters:

  • model (Any) – Your trained machine learning model. It only needs a standard .predict() method. PySliceKit never trains the model; it only evaluates it.

  • df (pd.DataFrame) – Your feature dataset, containing the columns you want to slice on (such as Age, Income, or City).

  • y_true (array-like) – The ground-truth labels or target values.

  • y_pred (array-like) – The answers your model predicted.

  • slice_cols (List[str]) – A list of column names from df that you want to investigate, e.g. ["Age", "Geography"].

  • metric (str, optional) – The metric used to score each segment. Examples: "accuracy", "f1", "mae", "rmse". Default is "accuracy".

  • min_samples (int, optional) – The minimum number of rows a segment needs for its metric to be considered reliable. A segment with fewer rows than this is still reported but flagged with a low-sample warning. Default is 30.

  • depth (int, optional) – How many columns may be combined to define a segment. 1 evaluates each column on its own (Age, then Geography). 2 crosses them and evaluates "Age AND Geography" together. Default is 2.

  • render_visuals (bool, optional) – Whether to automatically draw the heatmap and bar charts. Default is True.

  • **render_kwargs (Any) – Extra keyword arguments passed to the chart renderer. For example, top_n=15 shows only the 15 worst segments in the bar chart (the default top_n is 15), and figsize_heatmap=(12, 6) changes the size of the heatmap figure.
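One way to picture the depth parameter is as an enumeration of column combinations. The sketch below is an illustration only, not PySliceKit's actual implementation: column_combinations is a hypothetical helper, and it assumes that depth=2 covers single columns as well as pairs.

```python
from itertools import combinations
from typing import List, Tuple


def column_combinations(slice_cols: List[str], depth: int) -> List[Tuple[str, ...]]:
    """Hypothetical helper: list the column groupings examined up to `depth`."""
    combos: List[Tuple[str, ...]] = []
    # Assumption: depth=2 covers single columns as well as pairs.
    for size in range(1, min(depth, len(slice_cols)) + 1):
        combos.extend(combinations(slice_cols, size))
    return combos


print(column_combinations(["Age", "Geography"], depth=1))
print(column_combinations(["Age", "Geography"], depth=2))
```

Each grouping is then split further by the distinct values found in those columns, so the number of segments can grow quickly as depth and the number of columns increase.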

Returns:

A list of result objects, one per segment evaluated, sorted so that the worst-performing segments come first.

Return type:

List[SliceResult]
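The returned list often needs a little triage before you act on it. The sketch below shows one possible filter; it relies only on the SliceResult attributes documented on this page (is_underperforming, is_significant, low_n, abs_gap, label). The actionable helper and the stand-in demo objects are hypothetical and exist only so the example runs without the library.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def actionable(results: Iterable[Any]) -> List[Any]:
    """Keep underperforming, significant segments with enough rows, largest gap first."""
    kept = [
        r for r in results
        if r.is_underperforming and r.is_significant and not r.low_n
    ]
    return sorted(kept, key=lambda r: r.abs_gap, reverse=True)


# Stand-in objects for illustration; real code would pass the list from evaluate().
demo = [
    SimpleNamespace(label="Age=Q1", is_underperforming=True,
                    is_significant=True, low_n=False, abs_gap=0.08),
    SimpleNamespace(label="Geography=north", is_underperforming=True,
                    is_significant=False, low_n=True, abs_gap=0.20),
    SimpleNamespace(label="Age=Q4", is_underperforming=True,
                    is_significant=True, low_n=False, abs_gap=0.15),
]
for r in actionable(demo):
    print(r.label, r.abs_gap)
```

Filtering on low_n and is_significant first avoids chasing a large gap that comes from only a handful of rows.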

Data Types

class pyslicekit.types.SliceResult(slice_def: List[Tuple[str, Any]], n: int, metric_name: str, metric_value: float, overall_metric: float, gap: float, is_significant: bool = False, low_n: bool = False, p_value: float | None = None, test_used: str | None = None, extra: Dict[str, Any] = <factory>)

Holds the evaluation result for a single data segment.

A segment is defined by one or more (column, value) pairs. For example: [("gender", "female"), ("region", "north")]

slice_def

The column-value pairs that define this segment. Single-column slice: [("gender", "female")]. Two-column slice: [("gender", "female"), ("age_bin", "Q1")].

Type:

list of (column, value) tuples

n

Number of rows in this segment.

Type:

int

metric_name

The metric computed (e.g. “accuracy”, “mae”).

Type:

str

metric_value

The metric value for this segment.

Type:

float

overall_metric

The metric value across the full test set (baseline).

Type:

float

gap

metric_value - overall_metric. Sign interpretation depends on MetricDirection:

  • HIGHER_IS_BETTER → negative gap = segment underperforms

  • LOWER_IS_BETTER → positive gap = segment underperforms

Type:

float
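The sign rule for gap can be written out in a few lines. This is a minimal sketch: Direction is a stand-in enum that borrows the two member names mentioned above, and underperforms is a hypothetical helper, not the library's own is_underperforming code.

```python
from enum import Enum


class Direction(Enum):
    """Stand-in for the library's MetricDirection (member names taken from the text above)."""
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"


def underperforms(gap: float, direction: Direction) -> bool:
    """Apply the sign rule, where gap = metric_value - overall_metric."""
    if direction is Direction.HIGHER_IS_BETTER:
        return gap < 0
    return gap > 0


# Accuracy of 0.72 on a segment vs 0.80 overall: negative gap, segment is worse.
print(underperforms(0.72 - 0.80, Direction.HIGHER_IS_BETTER))
# MAE of 4.1 on a segment vs 3.5 overall: positive gap, segment is worse.
print(underperforms(4.1 - 3.5, Direction.LOWER_IS_BETTER))
```

Both calls print True, because in each case the segment is worse than the baseline once the metric direction is taken into account.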

is_significant

True if the gap is statistically significant (p < 0.05). Set to False when n < 30 (test unreliable at small n).

Type:

bool

low_n

True when n < min_samples. Result is included but flagged. Renderer displays a warning overlay on these cells.

Type:

bool

p_value

The p-value from the significance test. None when the test could not be run (e.g. n=0, all same label).

Type:

float or None

test_used

Name of the statistical test applied: “proportion_z”, “fisher_exact”, “bootstrap_ci”, or None.

Type:

str or None
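This page names the tests but does not spell out their formulas. As a rough illustration of what a "proportion_z" test can look like, the sketch below runs a one-sample z-test of a segment's accuracy against the overall accuracy. Whether PySliceKit uses this exact variant is an assumption, and proportion_z_p_value is a hypothetical helper.

```python
import math
from typing import Optional


def proportion_z_p_value(segment_acc: float, overall_acc: float, n: int) -> Optional[float]:
    """Two-sided p-value for a one-sample proportion z-test (illustrative only)."""
    if n == 0 or overall_acc in (0.0, 1.0):
        return None  # standard error is undefined, so no test can be run
    std_err = math.sqrt(overall_acc * (1.0 - overall_acc) / n)
    z = (segment_acc - overall_acc) / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


p = proportion_z_p_value(segment_acc=0.40, overall_acc=0.50, n=100)
print(round(p, 4))
```

With 100 rows, a drop from 0.50 to 0.40 corresponds to z of about -2, which falls just under the p < 0.05 threshold mentioned for is_significant.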

extra

Reserved for future use (confidence intervals, etc.).

Type:

dict

property abs_gap: float

Absolute gap — used for sort ordering.

property direction: MetricDirection

Looks up the metric direction from the registry.

property is_underperforming: bool

True when the segment genuinely performs worse than baseline, taking metric direction into account.

property label: str

Human-readable segment label, e.g. ‘gender=female & age_bin=Q1’. Used by the renderer for axis labels and CSV column headers.
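The label format shown in the example above can be reproduced from slice_def with a single join. This is a sketch of the idea rather than the library's code; make_label is a hypothetical name.

```python
from typing import Any, List, Tuple


def make_label(slice_def: List[Tuple[str, Any]]) -> str:
    """Hypothetical helper: join (column, value) pairs as 'col=value & col=value'."""
    return " & ".join(f"{column}={value}" for column, value in slice_def)


print(make_label([("gender", "female"), ("age_bin", "Q1")]))
# prints: gender=female & age_bin=Q1
```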

Exporters

pyslicekit.exporter.to_csv(results: List[SliceResult], filepath: str) → None

Export your entire slice evaluation into a clean, easy-to-read CSV file.

from pyslicekit.exporter import to_csv

# `results` is the list returned by pyslicekit.evaluate()
# Save your findings to share with colleagues
to_csv(results, "audit_results.csv")

Parameters:

  • results (List[SliceResult]) – The list of results returned by evaluate().

  • filepath (str) – Path of the file to write (e.g. "my_results.csv").

pyslicekit.exporter.to_json(results: List[SliceResult], filepath: str) → None

Export your slice evaluation into a structured JSON file.

This is useful when you want to feed the results into a web dashboard or another automated system.

from pyslicekit.exporter import to_json

# `results` is the list returned by pyslicekit.evaluate()
# Save as JSON for your web app
to_json(results, "audit_results.json")

Parameters:

  • results (List[SliceResult]) – The list of results returned by evaluate().

  • filepath (str) – Path of the file to write (e.g. "my_results.json").

Exceptions

exception pyslicekit.exceptions.PySliceKitError

Base class for all pyslicekit errors.

Catch this to handle any library error generically:

import pyslicekit
from pyslicekit.exceptions import PySliceKitError

try:
    results = pyslicekit.evaluate(...)
except PySliceKitError as e:
    print(f"pyslicekit failed: {e}")

exception pyslicekit.exceptions.PySliceKitValidationError

Raised when the inputs to evaluate() fail validation.

Common causes:

  • y_true and y_pred have different lengths

  • slice_cols contains column names not present in df

  • metric name is not in SUPPORTED_METRICS

  • model has no predict() method

  • df is empty

The error message always names the specific problem.
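If you would rather catch these problems before calling evaluate(), a small pre-flight check along the following lines can help. It is a sketch that mirrors three of the causes listed above. find_input_problems is a hypothetical helper, and EXAMPLE_METRICS is a stand-in built from the metric names mentioned on this page, so the real SUPPORTED_METRICS may contain different entries.

```python
from typing import List, Sequence, Set

# Stand-in for the library's SUPPORTED_METRICS; the real contents may differ.
EXAMPLE_METRICS: Set[str] = {"accuracy", "f1", "mae", "rmse"}


def find_input_problems(
    y_true: Sequence,
    y_pred: Sequence,
    slice_cols: List[str],
    df_columns: List[str],
    metric: str,
) -> List[str]:
    """Hypothetical pre-flight check mirroring the common causes listed above."""
    problems: List[str] = []
    if len(y_true) != len(y_pred):
        problems.append(
            f"length mismatch: y_true has {len(y_true)}, y_pred has {len(y_pred)}"
        )
    missing = [c for c in slice_cols if c not in df_columns]
    if missing:
        problems.append(f"columns not in df: {missing}")
    if metric not in EXAMPLE_METRICS:
        problems.append(f"metric {metric!r} is not supported")
    return problems


print(find_input_problems([1, 0, 1], [1, 0], ["Age", "City"],
                          ["Age", "Geography"], "made_up_metric"))
```

With a pandas DataFrame, list(df.columns) would supply the df_columns argument.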

What triggers this (Example):

# ❌ WRONG: Passing a metric that doesn't exist
pyslicekit.evaluate(..., metric="made_up_metric")
# Raises: PySliceKitValidationError("Metric 'made_up_metric' is not supported.")

# ❌ WRONG: y_true and y_pred lengths don't match
pyslicekit.evaluate(..., y_true=[1, 0, 1], y_pred=[1, 0])
# Raises: PySliceKitValidationError("Length mismatch: y_true has 3, y_pred has 2")

exception pyslicekit.exceptions.PySliceKitNoSegmentsError

Raised when slicing produces zero usable segments.

This happens when every candidate segment has n < min_samples and there is nothing left to evaluate.

Includes a suggestion to lower min_samples or change slice_cols.

What triggers this (Example):

# ❌ WRONG: Setting min_samples too high for a small dataset
# If your df only has 100 rows, and you ask for min_samples=200,
# all segments will be dropped!
pyslicekit.evaluate(..., df=small_df, min_samples=200)
# Raises: PySliceKitNoSegmentsError("All candidate segments were dropped...")
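To see ahead of time whether a given min_samples leaves anything to evaluate, you can count the rows per segment yourself. The sketch below does this for a single slicing column using only the standard library. segment_sizes and usable_segments are hypothetical helpers, and the geography list is made-up data.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def segment_sizes(values: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Count rows per distinct value of one slicing column."""
    return dict(Counter(values))


def usable_segments(values: Iterable[Hashable], min_samples: int) -> Dict[Hashable, int]:
    """Keep only the segments that have at least `min_samples` rows."""
    return {k: n for k, n in segment_sizes(values).items() if n >= min_samples}


geography = ["north"] * 60 + ["south"] * 40  # a made-up 100-row column
print(usable_segments(geography, min_samples=30))
print(usable_segments(geography, min_samples=200))
```

The second call returns an empty dict, which is the situation that leads to PySliceKitNoSegmentsError. On a pandas DataFrame, df["Geography"].value_counts() gives the same per-segment counts.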