API Reference
Core Functions
- pyslicekit.api.evaluate(model: Any, df: DataFrame, y_true: Any, y_pred: Any, slice_cols: List[str], metric: str = 'accuracy', min_samples: int = 30, depth: int = 2, render_visuals: bool = True, **render_kwargs: Any) → List[SliceResult]
Evaluate a machine learning model across different slices (subgroups) of your data to discover hidden areas of poor performance.
This function is the main engine of PySliceKit. It takes your data, automatically chunks it up into subgroups based on the columns you provide, tests your model on those specific groups, and highlights the ones where your model is secretly failing.
    import pyslicekit

    # Find the exact segments where your model underperforms!
    results = pyslicekit.evaluate(
        model=my_model,
        df=my_dataframe,
        y_true=y_actuals,
        y_pred=y_predictions,
        slice_cols=["Age", "Geography"],
        metric="accuracy",
        depth=2,
        render_visuals=True,
        top_n=15,
    )
Parameters:
- model (Any) – Your trained machine learning model. It just needs a standard .predict() method. We never train your model, we only test it!
- df (pd.DataFrame) – Your feature dataset. This is the data that contains the columns you want to slice (like Age, Income, City, etc.).
- y_true (array-like) – The actual, correct answers (the ground truth).
- y_pred (array-like) – The answers your model predicted.
- slice_cols (List[str]) – A list of column names from your df that you want to investigate, e.g. ["Age", "Geography"].
- metric (str, optional) – The metric used to measure success. Examples: "accuracy", "f1", "mae", "rmse". Default is "accuracy".
- min_samples (int, optional) – The minimum number of rows a group needs for its result to be considered reliable. A group with fewer rows than this is still shown, but flagged with a low-sample warning. Default is 30.
- depth (int, optional) – How many columns to combine into one slice. 1 means we check Age, then we check Geography. 2 means we cross them and check "Age AND Geography" together. Default is 2.
- render_visuals (bool, optional) – Whether to automatically draw the heatmap and bar charts for you. Default is True.
- **render_kwargs (Any) – Extra options for chart drawing. For example: top_n=15 to show only the 15 worst segments in the bar chart (the default top_n is 15), or figsize_heatmap=(12, 6) to change the size of the heatmap figure.
- Returns:
A list of result objects, one for each segment tested, sorted so that the worst-performing segments come first.
- Return type:
List[SliceResult]
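To make the depth parameter concrete, here is a minimal sketch of how column combinations could be enumerated. The helper name candidate_column_sets is hypothetical and not part of PySliceKit, and the sketch assumes that depth=2 covers single columns as well as pairs, which the description above does not state explicitly.

```python
from itertools import combinations
from typing import List, Tuple


def candidate_column_sets(slice_cols: List[str], depth: int) -> List[Tuple[str, ...]]:
    """Hypothetical helper: list the column combinations to slice on.

    Assumption: every combination size from 1 up to `depth` is included.
    """
    column_sets: List[Tuple[str, ...]] = []
    for size in range(1, min(depth, len(slice_cols)) + 1):
        column_sets.extend(combinations(slice_cols, size))
    return column_sets


print(candidate_column_sets(["Age", "Geography"], depth=1))
# [('Age',), ('Geography',)]
print(candidate_column_sets(["Age", "Geography"], depth=2))
# [('Age',), ('Geography',), ('Age', 'Geography')]
```

The number of combinations grows quickly with the number of columns, so a small depth keeps the number of segments (and the rows available per segment) manageable.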
Data Types
- class pyslicekit.types.SliceResult(slice_def: List[Tuple[str, Any]], n: int, metric_name: str, metric_value: float, overall_metric: float, gap: float, is_significant: bool = False, low_n: bool = False, p_value: float | None = None, test_used: str | None = None, extra: Dict[str, Any] = <factory>)
Holds the evaluation result for a single data segment.
A segment is defined by one or more (column, value) pairs. For example: [("gender", "female"), ("region", "north")]
- slice_def
The column-value pairs that define this segment.
Single-column slice: [("gender", "female")]
Two-column slice: [("gender", "female"), ("age_bin", "Q1")]
- Type:
list of (column, value) tuples
- n
Number of rows in this segment.
- Type:
int
- metric_name
The metric computed (e.g. "accuracy", "mae").
- Type:
str
- metric_value
The metric value for this segment.
- Type:
float
- overall_metric
The metric value across the full test set (baseline).
- Type:
float
- gap
metric_value - overall_metric. Sign interpretation depends on MetricDirection:
HIGHER_IS_BETTER → negative gap = segment underperforms
LOWER_IS_BETTER → positive gap = segment underperforms
- Type:
float
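The sign rule for gap can be sketched as follows. The MetricDirection enum here is a stand-in whose member names are taken from the description above, and underperforms is a hypothetical helper, not a library function.

```python
from enum import Enum


class MetricDirection(Enum):
    """Stand-in for the library's MetricDirection (member names from the text above)."""
    HIGHER_IS_BETTER = "higher_is_better"
    LOWER_IS_BETTER = "lower_is_better"


def underperforms(metric_value: float, overall_metric: float,
                  direction: MetricDirection) -> bool:
    """Hypothetical helper: apply the sign rule to gap = metric_value - overall_metric."""
    gap = metric_value - overall_metric
    if direction is MetricDirection.HIGHER_IS_BETTER:
        return gap < 0  # e.g. accuracy below the baseline
    return gap > 0      # e.g. error above the baseline


print(underperforms(0.72, 0.85, MetricDirection.HIGHER_IS_BETTER))  # True
print(underperforms(4.1, 3.0, MetricDirection.LOWER_IS_BETTER))     # True
print(underperforms(0.90, 0.85, MetricDirection.HIGHER_IS_BETTER))  # False
```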
- is_significant
True if the gap is statistically significant (p < 0.05). Set to False when n < 30 (test unreliable at small n).
- Type:
bool
- low_n
True when n < min_samples. Result is included but flagged. Renderer displays a warning overlay on these cells.
- Type:
bool
- p_value
The p-value from the significance test. None when the test could not be run (e.g. n=0, all same label).
- Type:
float or None
- test_used
Name of the statistical test applied: "proportion_z", "fisher_exact", "bootstrap_ci", or None.
- Type:
str or None
- extra
Reserved for future use (confidence intervals, etc.).
- Type:
dict
- property abs_gap: float
Absolute gap — used for sort ordering.
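As an illustration of ordering by absolute gap, the sketch below ranks a few rows from largest to smallest abs_gap. Row is a minimal stand-in for SliceResult, and the descending order is an assumption; the library may apply additional tie-breakers or filters.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Minimal stand-in for SliceResult, carrying only what this sketch needs."""
    label: str
    gap: float

    @property
    def abs_gap(self) -> float:
        return abs(self.gap)


rows = [
    Row("region=north", -0.02),
    Row("gender=female", -0.11),
    Row("age_bin=Q1", 0.06),
]

# Largest absolute gap first (assumed ordering).
ranked = sorted(rows, key=lambda r: r.abs_gap, reverse=True)
print([r.label for r in ranked])
# ['gender=female', 'age_bin=Q1', 'region=north']
```

Note that a large absolute gap does not by itself mean a segment is worse than baseline; the sign of gap and the metric direction decide that.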
- property direction: MetricDirection
Looks up the metric direction from the registry.
- property is_underperforming: bool
True when the segment genuinely performs worse than baseline, taking metric direction into account.
- property label: str
Human-readable segment label, e.g. ‘gender=female & age_bin=Q1’. Used by the renderer for axis labels and CSV column headers.
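A label in the format shown above can be built from slice_def with a simple join. The function make_label is a hypothetical illustration, not the library's implementation.

```python
from typing import Any, List, Tuple


def make_label(slice_def: List[Tuple[str, Any]]) -> str:
    """Hypothetical helper: join (column, value) pairs as 'col=value & col=value'."""
    return " & ".join(f"{column}={value}" for column, value in slice_def)


print(make_label([("gender", "female"), ("age_bin", "Q1")]))
# gender=female & age_bin=Q1
```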
Exporters
- pyslicekit.exporter.to_csv(results: List[SliceResult], filepath: str) → None
Export your entire slice evaluation into a clean, easy-to-read CSV file.
    import pyslicekit
    from pyslicekit.exporter import to_csv, to_json

    # Save your findings to show your manager or colleagues
    to_csv(results, "audit_results.csv")
Parameters:
- results (List[SliceResult]) – The list of results returned by evaluate().
- filepath (str) – The path where the file should be saved (e.g. "my_results.csv").
- pyslicekit.exporter.to_json(results: List[SliceResult], filepath: str) → None
Export your slice evaluation into a structured JSON file.
This is perfect if you want to take the results and feed them into a web dashboard or another automated system.
    import pyslicekit
    from pyslicekit.exporter import to_csv, to_json

    # Save as JSON for your web app
    to_json(results, "audit_results.json")
Parameters:
- results (List[SliceResult]) – The list of results returned by evaluate().
- filepath (str) – The path where the file should be saved (e.g. "my_results.json").
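If a downstream system reads the JSON back, keep in mind that JSON has no tuple type. The sketch below uses a stand-in dataclass and the standard json module to show the round trip; the field selection and layout are assumptions and may not match the schema that to_json actually writes.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, List, Tuple


@dataclass
class Row:
    """Stand-in with a few SliceResult fields; not the library's export schema."""
    slice_def: List[Tuple[str, Any]]
    n: int
    metric_value: float
    gap: float
    low_n: bool = False


rows = [Row([("gender", "female")], 120, 0.74, -0.11)]

payload = json.dumps([asdict(r) for r in rows], indent=2)
restored = json.loads(payload)

# Tuples are written as JSON arrays, so they come back as lists.
print(restored[0]["slice_def"])  # [['gender', 'female']]
```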
Exceptions
- exception pyslicekit.exceptions.PySliceKitError
Base class for all pyslicekit errors.
Catch this to handle any library error generically:
    import pyslicekit
    from pyslicekit.exceptions import PySliceKitError

    try:
        results = pyslicekit.evaluate(...)
    except PySliceKitError as e:
        print(f"pyslicekit failed: {e}")
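Because this is the base class, the more specific exceptions in this section can be handled first, with the base class as a fallback. The sketch below uses stand-in classes so that it runs without the library; it assumes the specific exceptions derive directly from PySliceKitError, and the handle function is hypothetical.

```python
class PySliceKitError(Exception):
    """Stand-in for the library's base exception."""


class PySliceKitValidationError(PySliceKitError):
    """Stand-in: assumed to derive from the base class."""


class PySliceKitNoSegmentsError(PySliceKitError):
    """Stand-in: assumed to derive from the base class."""


def handle(exc: Exception) -> str:
    """Hypothetical dispatcher: most specific handlers first, base class last."""
    try:
        raise exc
    except PySliceKitNoSegmentsError:
        return "no segments: lower min_samples or change slice_cols"
    except PySliceKitValidationError:
        return "bad input: fix the arguments"
    except PySliceKitError:
        return "other library error"


print(handle(PySliceKitValidationError("Metric 'x' is not supported.")))
# bad input: fix the arguments
```

Python tries except clauses in order, so placing the base class first would shadow the specific handlers.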
- exception pyslicekit.exceptions.PySliceKitValidationError
Raised when the inputs to evaluate() fail validation.
Common causes:
- y_true and y_pred have different lengths
- slice_cols contains column names not present in df
- metric name is not in SUPPORTED_METRICS
- model has no predict() method
- df is empty
The error message always names the specific problem.
What triggers this (Example):
    # ❌ WRONG: Passing a metric that doesn't exist
    pyslicekit.evaluate(..., metric="made_up_metric")
    # Raises: PySliceKitValidationError("Metric 'made_up_metric' is not supported.")

    # ❌ WRONG: y_true and y_pred lengths don't match
    pyslicekit.evaluate(..., y_true=[1, 0, 1], y_pred=[1, 0])
    # Raises: PySliceKitValidationError("Length mismatch: y_true has 3, y_pred has 2")
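If you want to catch these problems before calling evaluate(), a pre-flight check along the following lines may help. This is a sketch, not the library's validator: preflight_problems is a hypothetical function, it takes the column names and row count directly instead of a DataFrame so that it runs with the standard library only, and it skips the metric check because the contents of SUPPORTED_METRICS are not listed here.

```python
from typing import Any, List, Sequence


def preflight_problems(model: Any, columns: Sequence[str], n_rows: int,
                       y_true: Sequence[Any], y_pred: Sequence[Any],
                       slice_cols: List[str]) -> List[str]:
    """Hypothetical check mirroring the common causes listed above."""
    problems: List[str] = []
    if not callable(getattr(model, "predict", None)):
        problems.append("model has no predict() method")
    if n_rows == 0:
        problems.append("df is empty")
    if len(y_true) != len(y_pred):
        problems.append(
            f"length mismatch: y_true has {len(y_true)}, y_pred has {len(y_pred)}"
        )
    missing = [c for c in slice_cols if c not in columns]
    if missing:
        problems.append(f"slice_cols not in df: {missing}")
    return problems


class DummyModel:
    """Toy model with the predict() method the check looks for."""
    def predict(self, X):
        return [0 for _ in X]


issues = preflight_problems(
    DummyModel(), ["Age", "Geography"], 3, [1, 0, 1], [1, 0], ["Age", "City"]
)
for issue in issues:
    print(issue)
```

With a pandas DataFrame, you would pass list(df.columns) and len(df) for the columns and n_rows arguments.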
- exception pyslicekit.exceptions.PySliceKitNoSegmentsError
Raised when slicing produces zero usable segments.
This happens when every candidate segment has n < min_samples and there is nothing left to evaluate.
Includes a suggestion to lower min_samples or change slice_cols.
What triggers this (Example):
    # ❌ WRONG: Setting min_samples too high for a small dataset
    # If your df only has 100 rows, and you ask for min_samples=200,
    # all segments will be dropped!
    pyslicekit.evaluate(..., df=small_df, min_samples=200)
    # Raises: PySliceKitNoSegmentsError("All candidate segments were dropped...")
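One way to pick a workable min_samples is to count the rows per segment before running the evaluation. The sketch below does this for a single column with collections.Counter; usable_segments is a hypothetical helper, and the 100-row column is made-up illustration data.

```python
from collections import Counter
from typing import Dict, Hashable, List


def usable_segments(values: List[Hashable], min_samples: int) -> Dict[Hashable, int]:
    """Hypothetical helper: segment sizes that reach min_samples."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n >= min_samples}


# Illustration data: one column of a 100-row dataset.
geography = ["north"] * 60 + ["south"] * 35 + ["east"] * 5

print(usable_segments(geography, min_samples=30))   # {'north': 60, 'south': 35}
print(usable_segments(geography, min_samples=200))  # {}
```

An empty result for the threshold you plan to use is a sign that min_samples should be lowered or that different slice_cols are needed.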