Generalized Additive Models¶

Penalized regression spline GAMs via P-IRLS matching R’s mgcv::gam().

Generalized Additive Models (GAM).

Provides smooth term specification and basis construction for penalized regression spline GAMs following Wood (2017).

Usage:

from pystatistics.gam import gam, s

sol = gam(
    y,
    smooths=[s('x1', k=15), s('x2', bs='tp')],
    smooth_data={'x1': x1, 'x2': x2},
)
print(sol.summary())

class pystatistics.gam.GAMParams(coefficients, fitted_values, linear_predictor, residuals, edf, total_edf, lambdas, s_scales, covariance, scale, gcv, ubre, reml_score, deviance, null_deviance, log_likelihood, aic, n_obs, family_name, link_name, dispersion_fixed, converged, outer_converged, n_iter, method, backend_name)[source]¶

Bases: object

Parameters from a fitted GAM.

Carries the full numerical output of the stable (augmented-QR) penalized iteratively re-weighted least squares fitting procedure.

Parameters:

coefficients (ndarray[tuple[Any, ...], dtype[floating[Any]]])
fitted_values (ndarray[tuple[Any, ...], dtype[floating[Any]]])
linear_predictor (ndarray[tuple[Any, ...], dtype[floating[Any]]])
residuals (ndarray[tuple[Any, ...], dtype[floating[Any]]])
edf (ndarray[tuple[Any, ...], dtype[floating[Any]]])
total_edf (float)
lambdas (ndarray[tuple[Any, ...], dtype[floating[Any]]])
s_scales (ndarray[tuple[Any, ...], dtype[floating[Any]]])
covariance (ndarray[tuple[Any, ...], dtype[floating[Any]]])
scale (float)
gcv (float)
ubre (float)
reml_score (float | None)
deviance (float)
null_deviance (float)
log_likelihood (float)
aic (float)
n_obs (int)
family_name (str)
link_name (str)
dispersion_fixed (bool)
converged (bool)
outer_converged (bool)
n_iter (int)
method (str)
backend_name (str)

coefficients¶

Full coefficient vector (parametric + constrained basis coefficients; smooth coefficients live in the sum-to-zero constrained parameterisation, so a smooth declared with k basis functions contributes k - 1 coefficients — same as mgcv).

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

fitted_values¶

Response-scale predictions (mu_hat).

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

linear_predictor¶

Link-scale predictions (eta_hat).

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

residuals¶

Working residuals (y - mu_hat).

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

edf¶

Effective degrees of freedom per smooth term.

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

total_edf¶

Trace of the influence matrix (parametric + smooths).

Type:: float

lambdas¶

Selected (or user-fixed) smoothing parameters, one per smooth. For cr smooths these are directly comparable to mgcv’s sp (identical penalty construction and scaling); for tp smooths the fit is function-space identical to mgcv’s but the penalty parameterisation (and hence the sp units) differs by the eigenbasis convention — compare tp fits via EDF/fitted values, not raw sp.

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

s_scales¶

Per-smooth S.scale penalty normalisation factors (divide lambdas by these to get function-space units).

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

covariance¶

Bayesian posterior covariance of the coefficients (scale * (X'WX + S_lambda)^{-1}, mgcv’s Vp).

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

scale¶

Estimated or fixed dispersion parameter.

Type:: float

gcv¶

GCV score (scale-free families: the UBRE score is the criterion actually minimised; both are reported).

Type:: float

ubre¶

UBRE / scaled-AIC score.

Type:: float

reml_score¶

Laplace REML criterion at the selected lambdas — only populated when method='REML'; None otherwise.

Type:: float | None

deviance¶

Model deviance.

Type:: float

null_deviance¶

Null-model deviance (intercept only).

Type:: float

log_likelihood¶

Maximized log-likelihood.

Type:: float

aic¶

Akaike information criterion, classical GAM convention -2 loglik + 2 (total_edf [+1 if scale estimated]). NOTE: mgcv >= 1.8.x corrects the df for smoothing-parameter uncertainty (Wood, Pya & Saefken 2016, ‘edf2’), so mgcv’s AIC() is systematically slightly larger; documented, not hidden.

Type:: float

n_obs¶

Number of observations used in fitting.

Type:: int

family_name¶

Name of the exponential family (e.g. ‘gaussian’).

Type:: str

link_name¶

Name of the link function (e.g. ‘identity’).

Type:: str

dispersion_fixed¶

True when the family’s dispersion is fixed at 1 (binomial, poisson, negative.binomial) — controls z-vs-t and Chi.sq-vs-F conventions downstream.

Type:: bool

converged¶

Whether the P-IRLS algorithm converged.

Type:: bool

outer_converged¶

Whether the smoothing-parameter search converged away from its bounds (always True when sp was user-fixed).

Type:: bool

n_iter¶

Number of P-IRLS iterations in the final fit.

Type:: int

method¶

Smoothing parameter selection method (‘GCV’ or ‘REML’).

Type:: str

backend_name¶

Execution-path disclosure (always 'cpu_qr_pirls'; the GAM module is CPU-only).

Type:: str

coefficients: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶

fitted_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶

linear_predictor: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶

residuals: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶

edf: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶

total_edf: float¶

lambdas: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶

s_scales: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶

covariance: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶

scale: float¶

gcv: float¶

ubre: float¶

reml_score: float | None¶

deviance: float¶

null_deviance: float¶

log_likelihood: float¶

aic: float¶

n_obs: int¶

family_name: str¶

link_name: str¶

dispersion_fixed: bool¶

converged: bool¶

outer_converged: bool¶

n_iter: int¶

method: str¶

backend_name: str¶

class pystatistics.gam.SmoothInfo(term_name, var_name, basis_type, k, edf, ref_df, chi_sq, p_value, coef_indices, lambda_, s_scale)[source]¶

Bases: object

Information about a single smooth term in a fitted GAM.

One SmoothInfo is produced per s() term after fitting. It records the basis metadata and the approximate significance test for the smooth.

Parameters:

term_name (str)
var_name (str)
basis_type (str)
k (int)
edf (float)
ref_df (float)
chi_sq (float)
p_value (float)
coef_indices (tuple[int, int])
lambda_ (float)
s_scale (float)

term_name¶

Display name, e.g. 's(x1)'.

Type:: str

var_name¶

Bare predictor name, e.g. 'x1'.

Type:: str

basis_type¶

Basis identifier: 'cr' (cubic regression spline) or 'tp' (thin plate regression spline).

Type:: str

k¶

The basis dimension as declared in s(x, k=...) (the smooth contributes k - 1 coefficients after the sum-to-zero identifiability constraint, same as mgcv).

Type:: int

edf¶

Effective degrees of freedom for this term.

Type:: float

ref_df¶

Reference degrees of freedom, tr(2H - HH) over the term’s block (mgcv’s Ref.df convention).

Type:: float

chi_sq¶

Approximate significance statistic. For fixed-dispersion families this is the chi-squared-referenced quadratic form beta' pinv(V_beta) beta; for estimated-dispersion families it is that form divided by Ref.df, i.e. an F STATISTIC (the summary labels the column accordingly, matching mgcv). A simplified form of mgcv’s Wood (2013) test — agrees with mgcv’s reported statistic to ~0.2% on validated cases, not bit-identical.

Type:: float

p_value¶

Approximate p-value for the term.

Type:: float

coef_indices¶

(start, end) slice into the full coefficient vector identifying this term’s (constrained) coefficients.

Type:: tuple[int, int]

lambda_¶

Selected smoothing parameter for this term.

Type:: float

s_scale¶

Penalty normalisation factor for this term (lambda_ / s_scale is in function-space units).

Type:: float

term_name: str¶

var_name: str¶

basis_type: str¶

k: int¶

edf: float¶

ref_df: float¶

chi_sq: float¶

p_value: float¶

coef_indices: tuple[int, int]¶

lambda_: float¶

s_scale: float¶

pystatistics.gam.s(var_name, k=10, bs='cr')[source]¶

Convenience constructor for smooth terms, matching mgcv::s().

Usage:

from pystatistics.gam import s, gam
result = gam(y, X, smooths=[s('x1', k=15), s('x2')])

Parameters:

var_name (str) – Name of the predictor variable.
k (int) – Basis dimension (default 10).
bs (str) – Basis type: 'cr' or 'tp' (default 'cr').

Returns:

A SmoothTerm specification object.

Return type:

SmoothTerm

class pystatistics.gam.SmoothTerm(var_name, k=10, bs='cr')[source]¶

Bases: object

Specification for a smooth term s(x) in a GAM formula.

A pure, immutable-by-convention specification: variable name, basis type and basis dimension. All fit-derived quantities (basis matrices, penalties, constraint reparameterisations) live inside the fit — a SmoothTerm can safely be reused across multiple gam() calls and datasets without stale-state hazards.

Parameters:

var_name (str)
k (int)
bs (str)

var_name¶: Name of the predictor variable.

k¶: Basis dimension (>= 3), exactly as mgcv’s s(x, k=...). The fitted smooth contributes k - 1 coefficients after its sum-to-zero identifiability constraint (same as mgcv).

bs¶: Basis type: 'cr' (cubic regression spline, default) or 'tp' (thin plate regression spline).

__init__(var_name, k=10, bs='cr')[source]¶

Create a smooth term specification.

Parameters:

var_name (str) – Name of the predictor variable.
k (int) – Basis dimension (default 10, matching mgcv). Must be >= 3.
bs (str) – Basis type – 'cr' for cubic regression spline (default) or 'tp' for thin plate regression spline.

Raises:

ValidationError – If var_name is empty, k is out of range, or bs is not a recognised basis type.

Return type:

None

var_name¶

k¶

bs¶

pystatistics.gam.gam(y, X=None, *, smooths=None, smooth_data=None, family='gaussian', method='GCV', sp=None, tol=1e-08, max_iter=200, names=None)[source]¶

Fit a Generalized Additive Model.

Matches R’s mgcv::gam() for common cases. A GAM models the response as:

g(E[y]) = beta_0 + f_1(x_1) + f_2(x_2) + ... + X @ beta

where each f_j is a smooth function estimated from the data using penalized regression splines, with per-smooth sum-to-zero identifiability constraints (a smooth declared with k basis functions contributes k - 1 coefficients, exactly as mgcv).

Parameters:

y (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) – Response variable, 1-D array of n observations.
X (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None) – Parametric design matrix (n, p). May include an intercept column. If None, an intercept-only parametric part is used.
smooths (list[SmoothTerm] | None) – Smooth term specifications from s().
smooth_data (dict[str, _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]] | None) – Mapping from smooth variable names to (n,) arrays.
family (str | Family) – Exponential family, e.g. 'gaussian', 'binomial', 'poisson', 'Gamma'. method='REML' supports the fixed-dispersion families and the Gaussian-identity model.
method (str) – Smoothing-parameter selection criterion: 'GCV' (mgcv GCV.Cp semantics — UBRE is used automatically for known-scale families, as mgcv does) or 'REML' (Laplace, Wood 2011).
sp (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None) – Optional fixed smoothing parameters, one per smooth (mgcv’s sp=). Skips selection; use for reproducing a specific fit.
tol (float) – P-IRLS convergence tolerance (relative penalized deviance).
max_iter (int) – Maximum P-IRLS iterations.
names (list[str] | None) – Optional display names for the parametric coefficients.

Returns:

A GAMSolution.

Raises:

ValidationError – invalid inputs, unknown family/method/basis, k exceeding the unique values of a smooth variable, or a sp vector of the wrong length/sign.
ConvergenceError – a genuinely divergent P-IRLS (loud, never a silent answer).

Return type:

GAMSolution

class pystatistics.gam.GAMSolution(_result, _smooth_infos, _names=None)[source]¶

Bases: SolutionReprMixin

User-facing GAM result with convenient properties and R-style output.

Wraps a Result[GAMParams] and provides:

Coefficient access.
Smooth-term information (EDF, approximate significance tests).
R-style summary() matching mgcv::summary.gam().

Parameters:

_result (Result[GAMParams])
_smooth_infos (list[SmoothInfo])
_names (list[str] | None)

__init__(_result, _smooth_infos, _names=None)[source]¶

Initialise the solution wrapper.

Parameters:

_result (Result[GAMParams]) – The generic Result envelope carrying GAMParams.
_smooth_infos (list[SmoothInfo]) – One SmoothInfo per smooth term.
_names (list[str] | None) – Optional names for parametric coefficients.

Return type:

None

property params: GAMParams¶: The underlying frozen parameter payload.

property coefficients: ndarray[tuple[Any, ...], dtype[_ScalarT]]¶: Full coefficient vector (parametric + basis).

property fitted_values: ndarray[tuple[Any, ...], dtype[_ScalarT]]¶: Response-scale predictions.

property residuals: ndarray[tuple[Any, ...], dtype[_ScalarT]]¶: Working residuals (y - mu_hat).

property edf: ndarray[tuple[Any, ...], dtype[_ScalarT]]¶: Effective degrees of freedom per smooth term.

property total_edf: float¶: Total effective degrees of freedom.

property smooth_terms: list[SmoothInfo]¶: Information about each smooth term.

property deviance: float¶: Model deviance.

property null_deviance: float¶: Null-model deviance.

property deviance_explained: float¶

1 - dev / null_dev.

Type:: Proportion of null deviance explained

property aic: float¶: Akaike information criterion.

property gcv: float¶: GCV score.

property scale: float¶: Estimated dispersion parameter.

property ubre: float¶: UBRE / scaled-AIC score (the criterion mgcv’s GCV.Cp minimises for known-scale families).

property lambdas: ndarray[tuple[Any, ...], dtype[_ScalarT]]¶

Selected (or user-fixed) smoothing parameters, one per smooth.

Reported in the same penalty scaling mgcv reports sp in; divide by params.s_scales for function-space units.

property covariance: ndarray[tuple[Any, ...], dtype[_ScalarT]]¶: Bayesian posterior covariance of the coefficients (mgcv Vp).

property se: ndarray[tuple[Any, ...], dtype[_ScalarT]]¶: Standard errors for ALL coefficients (posterior, mgcv-style).

property reml_score: float | None¶: Laplace REML criterion (None unless method='REML').

property converged: bool¶: Whether the P-IRLS algorithm converged.

property outer_converged: bool¶: Whether the smoothing-parameter search converged off-boundary.

property n_iter: int¶: Number of P-IRLS iterations executed.

property r_squared_adj: float¶

Adjusted R-squared based on deviance explained.

Matches the formula used by mgcv:

R_adj = 1 - (deviance / df_resid) / (null_deviance / df_null)

summary()[source]¶

R-style summary matching summary(gam(...)).

Includes:

Family and link function.
Parametric coefficient table with standard errors and tests.
Approximate significance tests for each smooth term.
Adjusted R-squared, deviance explained, GCV, and scale.

Returns:: Multi-line summary string.
Return type:: str