User Guide
==========

This guide teaches you everything you need to use PySliceKit confidently —
from the core idea, through two complete walkthroughs, to the statistical
machinery running under the hood. Read it once and you will never misread
a result again.

----

The Problem with Global Metrics
--------------------------------

Imagine you deploy a pricing model for real estate, and it boasts an overall
Mean Absolute Error (MAE) of $40,000. On paper, it sounds robust. But when
you dig into the data, you discover that for houses older than 50 years near
the ocean, the MAE balloons to $120,000.

Relying purely on "global" metrics masks critical algorithmic bias, data drift,
and localized underfitting. Finding these edge cases manually by writing endless
Pandas ``groupby()`` statements is tedious, non-scalable, and statistically
dangerous — you might mistake random noise in a small sample for a real problem,
or miss a real problem because you never thought to look there.

The PySliceKit Solution
-----------------------

**PySliceKit** acts as an automated detective for your models. Instead of you
manually guessing where your model might fail, PySliceKit does five things
automatically:

1. **Bins numeric columns:** Converts continuous columns like Age or Income
   into human-readable quartile labels (``Q1(18–34)``, ``Q2(34–52)``, …)
   so you never have to write ``pd.cut()`` yourself.

2. **Cross-products features:** Combines columns together up to a configurable
   depth — so ``Age`` and ``Geography`` become ``Age=Q1 & Geography=North``,
   ``Age=Q1 & Geography=South``, and so on.

3. **Applies statistical rigor:** Runs the right hypothesis test automatically
   — Z-Test, Fisher's Exact, or Bootstrap CI — to ensure a drop in performance
   is a mathematically real failure, not just noise from a small sample.

4. **Flags low-sample segments:** Any segment below ``min_samples`` is still
   shown, but is visually hatched so you know to treat it with caution.

5. **Enforces a visual contract:** In every chart PySliceKit produces,
   **Red always means bad** — regardless of whether your metric is Accuracy
   (higher is better) or MAE (lower is better). You never have to remember
   which direction the metric goes.

----

How PySliceKit Processes Your Data
------------------------------------

Understanding what happens inside ``pyslicekit.evaluate()`` makes results much easier
to interpret. Here is the exact sequence of steps:

**Step 1 — Validation.**
PySliceKit checks every input before doing any work. If ``y_true`` and
``y_pred`` have different lengths, if a ``slice_col`` does not exist in your
DataFrame, or if the metric string is not supported, you get a specific
``PySliceKitValidationError`` that names the exact problem.

**Step 2 — Column pre-processing.**
Each column in ``slice_cols`` is inspected. Numeric columns (integer or float
dtype) are automatically binned into quartiles using ``pd.qcut``. Categorical
or string columns are used as-is. Columns with more than 20 unique values
trigger a ``UserWarning`` — they will still be processed, but they will produce
many segments. You may want to group them first.

**Step 3 — Segment construction.**
PySliceKit generates every combination of column values up to ``depth``
levels deep. At ``depth=1``, each unique value in each column becomes a
segment. At ``depth=2``, every pair of values across columns is also a
segment. The total number of segments grows quickly, so ``depth`` is capped
at 2 in the current version.

**Step 4 — Metric and gap computation.**
For each segment, PySliceKit computes your chosen metric (e.g. MAE, accuracy,
F1) on only the rows in that segment. It then subtracts the overall dataset
metric to produce a signed **gap**:

.. code-block:: text

    gap = segment_metric − overall_metric

A gap of ``+0.149`` on MAE means this segment's error is 0.149 units *higher*
than the baseline — which is bad, because lower MAE is better. A gap of
``-0.092`` on F1 means this segment's F1 is 0.092 points *lower* than the
baseline — which is also bad, because higher F1 is better. PySliceKit
understands this distinction automatically.

**Step 5 — Statistical significance testing.**
Each gap is tested to determine whether it is a genuine structural failure or
just random noise. The test chosen depends on the task type and sample size.
This is explained in full in the section below.

**Step 6 — Sorting and rendering.**
Results are sorted by absolute gap, worst first. The renderer then produces
two figures: a heatmap (single-column slices only) and a ranked bar chart
(all slices). Both figures are returned, and optionally saved to disk.

----

Understanding the Gap Sign
---------------------------

This is the single most common source of confusion. The gap is always
``segment_metric − overall_metric``, but what "bad" means depends on the
metric direction:

.. list-table::
   :header-rows: 1
   :widths: 20 20 30 30

   * - Metric
     - Direction
     - Positive gap means…
     - Negative gap means…
   * - ``accuracy``
     - Higher is better
     - Segment **outperforms** (green)
     - Segment **underperforms** (red)
   * - ``f1``, ``f1_macro``, ``f1_weighted``
     - Higher is better
     - Segment **outperforms** (green)
     - Segment **underperforms** (red)
   * - ``precision``, ``recall``
     - Higher is better
     - Segment **outperforms** (green)
     - Segment **underperforms** (red)
   * - ``r2``
     - Higher is better
     - Segment **outperforms** (green)
     - Segment **underperforms** (red)
   * - ``mae``, ``rmse``, ``mse``
     - **Lower is better**
     - Segment **underperforms** (red)
     - Segment **outperforms** (green)

PySliceKit stores the direction for every metric in an internal registry
(``METRIC_REGISTRY`` in ``types.py``). The renderer reads this registry
so the colour scale is always correct — you never need to configure it.

----

How PySliceKit Decides if a Gap is Real
-----------------------------------------

A gap is just a number. Before you act on it, you need to know whether it
reflects a genuine structural weakness in your model, or whether it could
have appeared by chance because the segment is small.

PySliceKit runs a hypothesis test on every segment automatically. The test
chosen depends on two factors: the task type (classification vs regression)
and the segment size. Here is the complete decision tree:

.. code-block:: text

    Is the metric value NaN?
    └── Yes → Cannot test. No marker shown.
    Is n < min_samples?
    └── Yes → Marked ⚠ (low-n). Test is skipped as unreliable.
    Is the task regression (mae, rmse, mse, r2)?
    └── Yes → Bootstrap Confidence Interval (1,000 resamples)
    Is n >= 30?
    └── Yes → Two-Proportion Z-Test
    Is n < 30?
    └── Yes → Fisher's Exact Test

A segment marked with ``*`` passed its test at p < 0.05. A segment with no
``*`` either did not pass, or the test was skipped.


Test 1 — Two-Proportion Z-Test (classification, n ≥ 30)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you have 30 or more samples in a segment and you are running a
classification task, PySliceKit runs a **two-proportion z-test**.

The intuition: it treats every row as a binary outcome — "did the model get
this right?" It then asks: *"Is the proportion of correct predictions in
this segment statistically different from the overall proportion?"*

The formula for the z-statistic is:

.. code-block:: text

    z = (p_segment − p_overall) / sqrt(p_overall × (1 − p_overall) / n)

where ``p_segment`` is the fraction of correct predictions in the segment,
``p_overall`` is the fraction correct on the full test set, and ``n`` is the
segment size. A two-tailed p-value is computed from the standard normal
distribution. If p < 0.05, the segment is marked ``*``.

The z-test is fast, analytically exact for large n, and the right default
for any segment with 30 or more samples. Below 30, its normal approximation
starts to break down — which is why PySliceKit switches to Fisher's Exact
for small segments.


Test 2 — Fisher's Exact Test (classification, n < 30)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a classification segment has fewer than 30 samples, PySliceKit
automatically switches to **Fisher's Exact Test**.

Fisher's Exact makes no distributional assumptions. It works directly with
counts in a 2×2 contingency table and is valid even for very small samples:

.. code-block:: text

    ┌──────────────────┬─────────┬───────────┐
    │                  │ Correct │ Incorrect │
    ├──────────────────┼─────────┼───────────┤
    │ Segment (actual) │    a    │     b     │
    │ Expected at p₀   │    c    │     d     │
    └──────────────────┴─────────┴───────────┘

where ``c`` and ``d`` are derived from the overall accuracy × n. The exact
probability of this table (or a more extreme one) is computed directly.

The trade-off: Fisher's is more reliable than the z-test at small n, but even
Fisher's has limited power when n is below about 10. This is why segments
below ``min_samples`` are flagged ⚠ regardless of which test is used — the
result is included so you can see it, but you should collect more data before
acting on it.


Test 3 — Bootstrap Confidence Interval (regression)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For regression metrics (MAE, RMSE, MSE, R²), there is no clean "proportion
correct" framing, so z-tests and Fisher's Exact do not apply. PySliceKit
instead uses a **bootstrap confidence interval**.

The procedure:

1. Resample the segment's rows 1,000 times with replacement.
2. Compute the chosen metric on each resample.
3. Build a 95% confidence interval from the 2.5th and 97.5th percentiles
   of those 1,000 values.
4. If the overall dataset metric falls *outside* that interval, the gap is
   statistically significant (marked ``*``).

A pseudo p-value is also computed: the fraction of bootstrap samples where
the metric was at least as extreme as the overall metric. This is stored in
``SliceResult.p_value`` and is exported with ``pyslicekit.to_csv()`` and ``pyslicekit.to_json()``.

The bootstrap approach is distribution-free and works for any regression
metric. Its main cost is computational — 1,000 resamples per segment — but
this is acceptable for typical audit dataset sizes.

----

Reading the Charts
-------------------

PySliceKit always produces exactly two figures. Here is how to read each one.


The Heatmap
~~~~~~~~~~~

The heatmap shows **single-column slices only** (``depth=1`` results). Each
row in the heatmap corresponds to one of your ``slice_cols``. Each cell
within a row corresponds to one unique value (or quartile bin) of that column.

Looking at the California Housing heatmap above:

- **Row label** (left side): the column name — ``AveRooms``, ``HouseAge``.
- **Cell label** (top line in cell): the value or bin — ``Q4(6.01–133)``.
- **Metric value** (bottom line in cell): the MAE for that segment — ``0.400``.
- **n=**: the number of rows in that segment.
- **Cell colour**: red means the segment underperforms the baseline; green
  means it outperforms; grey means it is near the baseline (gap < 2%).
- **Hatching** (diagonal lines): the segment has fewer rows than
  ``min_samples`` — treat with caution.
- **Asterisk \*** after the value: the gap is statistically significant
  at p < 0.05.

Two-column cross-product segments (``depth=2``) do **not** appear in the
heatmap, because a pair of columns cannot be laid out cleanly on a 2D grid.
They appear in the bar chart instead.


The Bar Chart
~~~~~~~~~~~~~

The bar chart shows the **top N segments ranked by absolute gap**, across
all depths. It is sorted worst-first, so the most urgent problems are always
at the top.

Looking at the California Housing bar chart above:

- **Y-axis labels**: the full segment definition — e.g.
  ``HouseAge=Q4(37–52) & AveRooms=Q4(6.01–133)``. Two-column segments
  appear here even though they are absent from the heatmap.
- **Bar length**: the magnitude of the gap. Bars extending to the right are
  positive gaps; bars extending to the left are negative gaps.
- **Bar colour**: same rule as the heatmap — red is always bad, green is
  always good, regardless of metric direction.
- **Text inside the bar**: the gap value (e.g. ``+0.149``) and the sample
  size (e.g. ``n=129``).
- **⚠ after n=**: this segment is below ``min_samples``.
- **\* after the segment label**: the gap is statistically significant.
- **Dashed vertical line at 0**: the baseline. Everything to the right of
  this line has a positive gap; everything to the left has a negative gap.
  Whether positive or negative is "bad" depends on the metric — but the
  colour tells you immediately.

----

Walkthrough 1: Regression (California Housing)
-----------------------------------------------

Let's see how this works on a real dataset. We train a Random Forest on the
California Housing dataset and evaluate it with MAE.

.. code-block:: python

    import pandas as pd
    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    import pyslicekit

    # Load data
    cali = fetch_california_housing(as_frame=True)
    df = cali.frame
    X = df.drop(columns=['MedHouseVal'])
    y = df['MedHouseVal']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train model
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Audit the model across HouseAge and AveRooms segments
    results = pyslicekit.evaluate(
        model=model,
        df=X_test,
        y_true=y_test,
        y_pred=y_pred,
        slice_cols=['HouseAge', 'AveRooms'],
        metric='mae',
        min_samples=50,
        depth=2
    )

    # Inspect the top 3 worst segments programmatically
    for r in results[:3]:
        print(r)

**Heatmap output:**

.. image:: _static/california_heatmap.png
   :alt: California Housing per-segment performance heatmap
   :align: center
   :width: 100%

**Bar chart output:**

.. image:: _static/california_bar.png
   :alt: California Housing top segments by gap
   :align: center
   :width: 100%

**What did the library just find?**

The overall MAE across the test set is **0.328**. Now look at the bar chart.

The worst segment is ``HouseAge=Q4(37–52) & AveRooms=Q4(6.01–133)`` with a
gap of ``+0.149`` and n=129. Because this is MAE (lower is better), a positive
gap means this segment's error is 0.149 units *higher* than the baseline —
so the model's actual MAE on these old, large-roomed houses is
0.328 + 0.149 = **0.477**. That is 45% worse than the overall figure.
The absence of a ``*`` here means the gap did not reach statistical
significance — possibly because n=129, while large, has high variance in
this particular segment. You would investigate further.

The fourth segment in the bar chart, ``HouseAge=Q3(29–37) & AveRooms=Q2(4.4–5.19)``,
has a gap of ``-0.094`` and n=295. Because this is MAE, a *negative* gap is
actually good — the model performs better than baseline on these houses.
The renderer colours it green automatically.

On the heatmap, the ``AveRooms=Q4(6.01–133)`` cell is red (MAE=0.400, which
is 0.072 above the 0.328 baseline), while ``AveRooms=Q2(4.4–5.19)`` is green
(MAE=0.274). This tells you that average room count is a meaningful slice
dimension — the model systematically struggles more on high-room properties.

**Exporting the results:**

.. code-block:: python

    import pyslicekit

    # Save for stakeholder review
    pyslicekit.to_csv(results, "california_housing_audit.csv")
    pyslicekit.to_json(results, "california_housing_audit.json")
    print("Exported to CSV and JSON successfully!")

----

Walkthrough 2: Classification (Breast Cancer)
----------------------------------------------

Now let's audit a Logistic Regression model on a binary classification task,
using F1 score as the metric.

.. code-block:: python

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    import pyslicekit

    cancer = load_breast_cancer(as_frame=True)
    df_c = cancer.frame
    X_c = df_c.drop(columns=['target'])
    y_c = df_c['target']

    X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
        X_c, y_c, test_size=0.25, random_state=42
    )

    clf = LogisticRegression(max_iter=5000)
    clf.fit(X_train_c, y_train_c)
    y_pred_c = clf.predict(X_test_c)

    # Audit using F1 Score
    results_clf = pyslicekit.evaluate(
        model=clf,
        df=X_test_c,
        y_true=y_test_c,
        y_pred=y_pred_c,
        slice_cols=['mean radius', 'mean texture'],
        metric='f1',
        min_samples=10,
        depth=2
    )

**Heatmap output:**

.. image:: _static/cancer_heatmap.png
   :alt: Breast Cancer per-segment performance heatmap
   :align: center
   :width: 100%

**Bar chart output:**

.. image:: _static/cancer_bar.png
   :alt: Breast Cancer top segments by gap
   :align: center
   :width: 100%

**What did the library just find?**

The overall F1 across the test set is **0.972**. Now look at the heatmap.

The ``mean radius=Q3(13.5–15.9) *`` cell is deep red with F1=0.895 and n=35.
The ``*`` tells you this gap is statistically significant — a two-proportion
z-test (n=35 ≥ 30) confirmed that the drop from 0.972 to 0.895 is not random
noise. This is a genuine blind spot in the model for mid-range tumour radii.

Notice that ``mean radius=Q4(15.9–25.2)`` shows ``n/a`` with a light grey
background. This means F1 could not be computed for that segment — likely
because the segment contained only one class in the test split, making binary
F1 undefined. PySliceKit surfaces this as ``NaN`` rather than crashing, and
the renderer fills the cell with a "no data" colour.

On the bar chart, the heavily hatched bars (diagonal lines) are segments below
``min_samples=10``. The worst of these — ``mean radius=Q3 & mean texture=Q3``
with gap ``-0.305`` and n=8 — looks alarming, but the hatching and ⚠ marker
tell you this result is based on only 8 samples. Do not act on it without
more data. The ``*`` segments without hatching (like ``mean texture=Q3(19.3–22.4) *``
with n=35) are the ones worth investigating immediately.

----

Working with SliceResult Objects
----------------------------------

``pyslicekit.evaluate()`` returns a ``List[SliceResult]``, sorted worst-first by
absolute gap. Every field on ``SliceResult`` is accessible directly:

.. code-block:: python

    results = pyslicekit.evaluate(...)

    # The worst-performing segment
    worst = results[0]

    print(worst.label)          # "mean radius=Q3(13.5–15.9)"
    print(worst.n)              # 35
    print(worst.metric_value)   # 0.8952...
    print(worst.overall_metric) # 0.972...
    print(worst.gap)            # -0.0768...
    print(worst.is_significant) # True
    print(worst.p_value)        # 0.0083...
    print(worst.test_used)      # "proportion_z"
    print(worst.low_n)          # False
    print(worst.slice_def)      # [("mean radius", "Q3(13.5–15.9)")]

    # Filter to only statistically significant underperformers
    flagged = [
        r for r in results
        if r.is_significant and r.is_underperforming
    ]

    # Filter to only segments large enough to trust
    trusted = [r for r in results if not r.low_n]

The ``is_underperforming`` property handles metric direction for you — it
returns ``True`` when the segment is genuinely worse than baseline, regardless
of whether the gap is positive or negative.

----

Choosing the Right Parameters
-------------------------------

``metric``
~~~~~~~~~~

Choose the metric that matches how you evaluate your model in production.
If you care about raw error magnitude, use ``mae`` or ``rmse``. If you care
about classification quality, use ``accuracy`` for balanced classes or ``f1``
for imbalanced ones. A full table of supported metrics is in the API reference.

Do not use ``accuracy`` for imbalanced classification problems — a model
that always predicts the majority class can score 95% accuracy while being
completely useless. Use ``f1``, ``f1_weighted``, or ``recall`` instead.

``min_samples``
~~~~~~~~~~~~~~~

``min_samples`` controls the minimum number of rows a segment must contain
to be included in results. Segments below this threshold are included but
flagged as low-n (⚠ in the bar chart, hatching in the heatmap).

- **Too high** (e.g. 200): you may drop many valid segments and see
  ``PySliceKitNoSegmentsError``.
- **Too low** (e.g. 5): you will see many hatched, statistically unreliable
  results cluttering the charts.
- **Recommended starting point**: 30 for classification (the z-test threshold),
  50 for regression (bootstrap CI needs enough variance to be meaningful).

``depth``
~~~~~~~~~

- ``depth=1``: check each column independently. Fast. Good for an initial scan.
- ``depth=2``: also check every pair of columns. Finds cross-cutting failures
  that depth=1 misses entirely — a model that works fine on Age alone and fine
  on Geography alone, but fails specifically for young people in the North.

Start with ``depth=1`` if you have many columns. Move to ``depth=2`` on the
2–3 columns that looked most interesting.

``render_visuals``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pass ``render_visuals=False`` to skip chart generation entirely (useful when
calling ``pyslicekit.evaluate()`` in automated pipelines).

.. code-block:: python

    # Headless pipeline — no charts, just data
    results=pyslicekit.evaluate(
        model=model,
        df=X_test,
        y_true=y_test,
        y_pred=y_pred,
        slice_cols=['Age', 'Geography'],
        metric='accuracy',
        render_visuals=False,
    )
    pyslicekit.to_json(results, "pipeline_output.json")

----

Common Mistakes and How to Fix Them
--------------------------------------

**"All candidate segments were dropped" (PySliceKitNoSegmentsError)**

Every segment fell below ``min_samples``. Either lower ``min_samples`` or
choose columns with higher-cardinality groups. A column with 3 unique values
and a dataset of 100 rows means each segment averages only ~33 rows — right
at the default floor.

**"I get a high-cardinality UserWarning"**

A categorical column has more than 20 unique values. The library will still
run, but you will get many segments (one per unique value), most of which will
be low-n. Consider grouping the column into coarser buckets before passing it
to ``pyslicekit.evaluate()``.

.. code-block:: python

    # Instead of raw city (500 unique values), group into regions first
    df['region'] = df['city'].map(city_to_region_dict)
    results = pyslicekit.evaluate(..., slice_cols=['region', 'age_group'], ...)

**"The heatmap is blank / shows a placeholder message"**

This appears when all your segments are from ``depth=2`` cross-products.
The heatmap only displays ``depth=1`` slices (single-column segments) because
two-column segments cannot be placed on a 2D grid cleanly. Run with
``depth=1`` first to populate the heatmap, then run with ``depth=2`` for
the bar chart's cross-product rows.

**"depth=3 raises a PySliceKitValidationError"**

Depth 3 is intentionally not supported in V1. A three-column cross-product
of columns with 4 values each produces 64 segments, most of which will be
low-n on any real dataset, and the chart becomes unreadable. Use ``depth=2``
and iterate on the columns that matter.

**"metric_value is NaN for some segments"**

This is expected for segments where the metric is structurally undefined —
for example, a segment where ``y_true`` contains only one class makes binary
F1 undefined. PySliceKit catches this gracefully and surfaces it as NaN
rather than raising. The cell appears light grey ("no data") in the heatmap.

----

Exporting Results
------------------

Both exporters write every field of ``SliceResult`` to disk.

.. code-block:: python

    import pyslicekit
    from pyslicekit.exporter import to_csv, to_json
    pyslicekit.to_csv(results, "audit.csv")
    pyslicekit.to_json(results, "audit.json")

The CSV columns are: ``segment``, ``n``, ``metric``, ``metric_value``,
``overall_metric``, ``gap``, ``is_significant``, ``low_n``, ``p_value``,
``test_used``.

The JSON includes an additional ``slice_def`` field containing the raw list
of ``[column, value]`` pairs, which is useful for programmatic downstream
processing.

----

Next Steps
-----------

- See the :doc:`api` page for the complete parameter reference for
  ``pyslicekit.evaluate()``, ``pyslicekit.to_csv()``, ``pyslicekit.to_json()``, and the ``SliceResult``
  data class.
- See the :doc:`getting_started` page for installation, dependencies, and
  the full list of supported metric strings.