Generalized Additive Models

Penalized regression spline GAMs via P-IRLS matching R’s mgcv::gam().

Generalized Additive Models (GAM).

Provides smooth term specification and basis construction for penalized regression spline GAMs following Wood (2017).

Usage:

from pystatistics.gam import gam, s

sol = gam(
    y,
    smooths=[s('x1', k=15), s('x2', bs='tp')],
    smooth_data={'x1': x1, 'x2': x2},
)
print(sol.summary())
class pystatistics.gam.GAMParams(coefficients, fitted_values, linear_predictor, residuals, edf, total_edf, lambdas, s_scales, covariance, scale, gcv, ubre, reml_score, deviance, null_deviance, log_likelihood, aic, n_obs, family_name, link_name, dispersion_fixed, converged, outer_converged, n_iter, method, backend_name)[source]

Bases: object

Parameters from a fitted GAM.

Carries the full numerical output of the stable (augmented-QR) penalized iteratively re-weighted least squares fitting procedure.

Parameters:
coefficients

Full coefficient vector (parametric + constrained basis coefficients; smooth coefficients live in the sum-to-zero constrained parameterisation, so a smooth declared with k basis functions contributes k - 1 coefficients — same as mgcv).

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

fitted_values

Response-scale predictions (mu_hat).

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

linear_predictor

Link-scale predictions (eta_hat).

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

residuals

Working residuals (y - mu_hat).

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

edf

Effective degrees of freedom per smooth term.

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

total_edf

Trace of the influence matrix (parametric + smooths).

Type:

float

lambdas

Selected (or user-fixed) smoothing parameters, one per smooth. For cr smooths these are directly comparable to mgcv’s sp (identical penalty construction and scaling); for tp smooths the fit is function-space identical to mgcv’s but the penalty parameterisation (and hence the sp units) differs by the eigenbasis convention — compare tp fits via EDF/fitted values, not raw sp.

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

s_scales

Per-smooth S.scale penalty normalisation factors (divide lambdas by these to get function-space units).

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

covariance

Bayesian posterior covariance of the coefficients (scale * (X'WX + S_lambda)^{-1}, mgcv’s Vp).

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.floating[Any]]]

scale

Estimated or fixed dispersion parameter.

Type:

float

gcv

GCV score (scale-free families: the UBRE score is the criterion actually minimised; both are reported).

Type:

float

ubre

UBRE / scaled-AIC score.

Type:

float

reml_score

Laplace REML criterion at the selected lambdas — only populated when method='REML'; None otherwise.

Type:

float | None

deviance

Model deviance.

Type:

float

null_deviance

Null-model deviance (intercept only).

Type:

float

log_likelihood

Maximized log-likelihood.

Type:

float

aic

Akaike information criterion, classical GAM convention -2 loglik + 2 (total_edf [+1 if scale estimated]). NOTE: mgcv >= 1.8.x corrects the df for smoothing-parameter uncertainty (Wood, Pya & Saefken 2016, ‘edf2’), so mgcv’s AIC() is systematically slightly larger; documented, not hidden.

Type:

float

n_obs

Number of observations used in fitting.

Type:

int

family_name

Name of the exponential family (e.g. ‘gaussian’).

Type:

str

Name of the link function (e.g. ‘identity’).

Type:

str

dispersion_fixed

True when the family’s dispersion is fixed at 1 (binomial, poisson, negative.binomial) — controls z-vs-t and Chi.sq-vs-F conventions downstream.

Type:

bool

converged

Whether the P-IRLS algorithm converged.

Type:

bool

outer_converged

Whether the smoothing-parameter search converged away from its bounds (always True when sp was user-fixed).

Type:

bool

n_iter

Number of P-IRLS iterations in the final fit.

Type:

int

method

Smoothing parameter selection method (‘GCV’ or ‘REML’).

Type:

str

backend_name

Execution-path disclosure (always 'cpu_qr_pirls'; the GAM module is CPU-only).

Type:

str

coefficients: ndarray[tuple[Any, ...], dtype[floating[Any]]]
fitted_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]
linear_predictor: ndarray[tuple[Any, ...], dtype[floating[Any]]]
residuals: ndarray[tuple[Any, ...], dtype[floating[Any]]]
edf: ndarray[tuple[Any, ...], dtype[floating[Any]]]
total_edf: float
lambdas: ndarray[tuple[Any, ...], dtype[floating[Any]]]
s_scales: ndarray[tuple[Any, ...], dtype[floating[Any]]]
covariance: ndarray[tuple[Any, ...], dtype[floating[Any]]]
scale: float
gcv: float
ubre: float
reml_score: float | None
deviance: float
null_deviance: float
log_likelihood: float
aic: float
n_obs: int
family_name: str
link_name: str
dispersion_fixed: bool
converged: bool
outer_converged: bool
n_iter: int
method: str
backend_name: str
class pystatistics.gam.SmoothInfo(term_name, var_name, basis_type, k, edf, ref_df, chi_sq, p_value, coef_indices, lambda_, s_scale)[source]

Bases: object

Information about a single smooth term in a fitted GAM.

One SmoothInfo is produced per s() term after fitting. It records the basis metadata and the approximate significance test for the smooth.

Parameters:
term_name

Display name, e.g. 's(x1)'.

Type:

str

var_name

Bare predictor name, e.g. 'x1'.

Type:

str

basis_type

Basis identifier: 'cr' (cubic regression spline) or 'tp' (thin plate regression spline).

Type:

str

k

The basis dimension as declared in s(x, k=...) (the smooth contributes k - 1 coefficients after the sum-to-zero identifiability constraint, same as mgcv).

Type:

int

edf

Effective degrees of freedom for this term.

Type:

float

ref_df

Reference degrees of freedom, tr(2H - HH) over the term’s block (mgcv’s Ref.df convention).

Type:

float

chi_sq

Approximate significance statistic. For fixed-dispersion families this is the chi-squared-referenced quadratic form beta' pinv(V_beta) beta; for estimated-dispersion families it is that form divided by Ref.df, i.e. an F STATISTIC (the summary labels the column accordingly, matching mgcv). A simplified form of mgcv’s Wood (2013) test — agrees with mgcv’s reported statistic to ~0.2% on validated cases, not bit-identical.

Type:

float

p_value

Approximate p-value for the term.

Type:

float

coef_indices

(start, end) slice into the full coefficient vector identifying this term’s (constrained) coefficients.

Type:

tuple[int, int]

lambda_

Selected smoothing parameter for this term.

Type:

float

s_scale

Penalty normalisation factor for this term (lambda_ / s_scale is in function-space units).

Type:

float

term_name: str
var_name: str
basis_type: str
k: int
edf: float
ref_df: float
chi_sq: float
p_value: float
coef_indices: tuple[int, int]
lambda_: float
s_scale: float
pystatistics.gam.s(var_name, k=10, bs='cr')[source]

Convenience constructor for smooth terms, matching mgcv::s().

Usage:

from pystatistics.gam import s, gam
result = gam(y, X, smooths=[s('x1', k=15), s('x2')])
Parameters:
  • var_name (str) – Name of the predictor variable.

  • k (int) – Basis dimension (default 10).

  • bs (str) – Basis type: 'cr' or 'tp' (default 'cr').

Returns:

A SmoothTerm specification object.

Return type:

SmoothTerm

class pystatistics.gam.SmoothTerm(var_name, k=10, bs='cr')[source]

Bases: object

Specification for a smooth term s(x) in a GAM formula.

A pure, immutable-by-convention specification: variable name, basis type and basis dimension. All fit-derived quantities (basis matrices, penalties, constraint reparameterisations) live inside the fit — a SmoothTerm can safely be reused across multiple gam() calls and datasets without stale-state hazards.

Parameters:
var_name

Name of the predictor variable.

k

Basis dimension (>= 3), exactly as mgcv’s s(x, k=...). The fitted smooth contributes k - 1 coefficients after its sum-to-zero identifiability constraint (same as mgcv).

bs

Basis type: 'cr' (cubic regression spline, default) or 'tp' (thin plate regression spline).

__init__(var_name, k=10, bs='cr')[source]

Create a smooth term specification.

Parameters:
  • var_name (str) – Name of the predictor variable.

  • k (int) – Basis dimension (default 10, matching mgcv). Must be >= 3.

  • bs (str) – Basis type – 'cr' for cubic regression spline (default) or 'tp' for thin plate regression spline.

Raises:

ValidationError – If var_name is empty, k is out of range, or bs is not a recognised basis type.

Return type:

None

var_name
k
bs
pystatistics.gam.gam(y, X=None, *, smooths=None, smooth_data=None, family='gaussian', method='GCV', sp=None, tol=1e-08, max_iter=200, names=None)[source]

Fit a Generalized Additive Model.

Matches R’s mgcv::gam() for common cases. A GAM models the response as:

g(E[y]) = beta_0 + f_1(x_1) + f_2(x_2) + ... + X @ beta

where each f_j is a smooth function estimated from the data using penalized regression splines, with per-smooth sum-to-zero identifiability constraints (a smooth declared with k basis functions contributes k - 1 coefficients, exactly as mgcv).

Parameters:
  • y (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) – Response variable, 1-D array of n observations.

  • X (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None) – Parametric design matrix (n, p). May include an intercept column. If None, an intercept-only parametric part is used.

  • smooths (list[SmoothTerm] | None) – Smooth term specifications from s().

  • smooth_data (dict[str, _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]] | None) – Mapping from smooth variable names to (n,) arrays.

  • family (str | Family) – Exponential family, e.g. 'gaussian', 'binomial', 'poisson', 'Gamma'. method='REML' supports the fixed-dispersion families and the Gaussian-identity model.

  • method (str) – Smoothing-parameter selection criterion: 'GCV' (mgcv GCV.Cp semantics — UBRE is used automatically for known-scale families, as mgcv does) or 'REML' (Laplace, Wood 2011).

  • sp (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None) – Optional fixed smoothing parameters, one per smooth (mgcv’s sp=). Skips selection; use for reproducing a specific fit.

  • tol (float) – P-IRLS convergence tolerance (relative penalized deviance).

  • max_iter (int) – Maximum P-IRLS iterations.

  • names (list[str] | None) – Optional display names for the parametric coefficients.

Returns:

A GAMSolution.

Raises:
  • ValidationError – invalid inputs, unknown family/method/basis, k exceeding the unique values of a smooth variable, or a sp vector of the wrong length/sign.

  • ConvergenceError – a genuinely divergent P-IRLS (loud, never a silent answer).

Return type:

GAMSolution

class pystatistics.gam.GAMSolution(_result, _smooth_infos, _names=None)[source]

Bases: SolutionReprMixin

User-facing GAM result with convenient properties and R-style output.

Wraps a Result[GAMParams] and provides:

  • Coefficient access.

  • Smooth-term information (EDF, approximate significance tests).

  • R-style summary() matching mgcv::summary.gam().

Parameters:
__init__(_result, _smooth_infos, _names=None)[source]

Initialise the solution wrapper.

Parameters:
  • _result (Result[GAMParams]) – The generic Result envelope carrying GAMParams.

  • _smooth_infos (list[SmoothInfo]) – One SmoothInfo per smooth term.

  • _names (list[str] | None) – Optional names for parametric coefficients.

Return type:

None

property params: GAMParams

The underlying frozen parameter payload.

property coefficients: ndarray[tuple[Any, ...], dtype[_ScalarT]]

Full coefficient vector (parametric + basis).

property fitted_values: ndarray[tuple[Any, ...], dtype[_ScalarT]]

Response-scale predictions.

property residuals: ndarray[tuple[Any, ...], dtype[_ScalarT]]

Working residuals (y - mu_hat).

property edf: ndarray[tuple[Any, ...], dtype[_ScalarT]]

Effective degrees of freedom per smooth term.

property total_edf: float

Total effective degrees of freedom.

property smooth_terms: list[SmoothInfo]

Information about each smooth term.

property deviance: float

Model deviance.

property null_deviance: float

Null-model deviance.

property deviance_explained: float

1 - dev / null_dev.

Type:

Proportion of null deviance explained

property aic: float

Akaike information criterion.

property gcv: float

GCV score.

property scale: float

Estimated dispersion parameter.

property ubre: float

UBRE / scaled-AIC score (the criterion mgcv’s GCV.Cp minimises for known-scale families).

property lambdas: ndarray[tuple[Any, ...], dtype[_ScalarT]]

Selected (or user-fixed) smoothing parameters, one per smooth.

Reported in the same penalty scaling mgcv reports sp in; divide by params.s_scales for function-space units.

property covariance: ndarray[tuple[Any, ...], dtype[_ScalarT]]

Bayesian posterior covariance of the coefficients (mgcv Vp).

property se: ndarray[tuple[Any, ...], dtype[_ScalarT]]

Standard errors for ALL coefficients (posterior, mgcv-style).

property reml_score: float | None

Laplace REML criterion (None unless method='REML').

property converged: bool

Whether the P-IRLS algorithm converged.

property outer_converged: bool

Whether the smoothing-parameter search converged off-boundary.

property n_iter: int

Number of P-IRLS iterations executed.

property r_squared_adj: float

Adjusted R-squared based on deviance explained.

Matches the formula used by mgcv:

R_adj = 1 - (deviance / df_resid) / (null_deviance / df_null)
summary()[source]

R-style summary matching summary(gam(...)).

Includes:

  • Family and link function.

  • Parametric coefficient table with standard errors and tests.

  • Approximate significance tests for each smooth term.

  • Adjusted R-squared, deviance explained, GCV, and scale.

Returns:

Multi-line summary string.

Return type:

str