Regression

Linear and generalized linear models. OLS via QR (CPU) or Cholesky (GPU). GLM via IRLS with Gaussian, Binomial, and Poisson families.

Usage:

from pystatistics import DataSource
from pystatistics.regression import Design, fit

# OLS from DataSource
ds = DataSource.from_file("data.csv")
design = Design.from_datasource(ds, y='target')
result = fit(design)

# OLS from arrays (convenience)
result = fit(X, y)

# GLM (logistic regression)
result = fit(X, y, family='binomial')

# GLM (Poisson regression)
result = fit(X, y, family='poisson')

pystatistics.regression.fit(X_or_design, y=None, *, family=None, backend='auto', force=False, tol=1e-08, max_iter=25)[source]

Fit a linear or generalized linear model.

When family is None (the default), fits an ordinary least squares linear model (LM) via QR decomposition (CPU) or Cholesky (GPU). When family is specified, fits a GLM via IRLS (Iteratively Reweighted Least Squares).

Accepts EITHER:
  1. A Design object (from DataSource or arrays)

  2. Raw X and y arrays (convenience)

Parameters:
  • X_or_design (ArrayLike | Design) – Design object or X matrix

  • y (ArrayLike | None) – Response vector (required if X_or_design is array)

  • family (str | Family | None) – GLM family specification. None for OLS, or a string ('gaussian', 'binomial', 'poisson') or a Family instance.

  • backend (Literal['auto', 'cpu', 'gpu', 'cpu_qr', 'cpu_svd', 'gpu_qr']) – One of 'auto', 'cpu', 'gpu', 'cpu_qr', 'cpu_svd', 'gpu_qr'.

  • force (bool) – If True, proceed with GPU Cholesky even on ill-conditioned matrices. Has no effect on CPU backends.

  • tol (float) – Convergence tolerance for IRLS (GLM only). Default 1e-8 matches R’s glm.control().

  • max_iter (int) – Maximum IRLS iterations (GLM only). Default 25 matches R’s glm.control().

Returns:

LinearSolution when family is None. GLMSolution when family is specified.

Return type:

LinearSolution | GLMSolution

Examples

# OLS
>>> result = fit(X, y)
>>> result = fit(X, y, backend='gpu')

# Logistic regression
>>> result = fit(X, y, family='binomial')

# Poisson regression
>>> result = fit(X, y, family='poisson')

# Gaussian GLM (equivalent to OLS)
>>> result = fit(X, y, family='gaussian')
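
Conceptually, the IRLS loop used for GLMs alternates weighted least squares solves until the deviance stabilizes. Below is a minimal NumPy sketch for the binomial family with a logit link; it is illustrative only, not the package's backend code, and the function name irls_logistic is invented for the example.

import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=25):
    """Minimal IRLS sketch for logistic regression (illustrative, not the library code)."""
    n, p = X.shape
    beta = np.zeros(p)
    dev_old = np.inf
    for _ in range(max_iter):
        eta = X @ beta                                   # linear predictor
        mu = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-10, 1 - 1e-10)
        w = mu * (1.0 - mu)                              # IRLS weights for the logit link
        z = eta + (y - mu) / w                           # working response
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ z)       # weighted least squares step
        dev = -2.0 * np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
        if abs(dev - dev_old) / (abs(dev) + 0.1) < tol:  # R-style relative deviance test
            break
        dev_old = dev
    return beta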

class pystatistics.regression.Design(_X, _y, _n, _p, _source=None)[source]

Bases: object

Regression design matrix specification.

Wraps a DataSource and provides X, y for regression. Immutable after construction.

Construction:

Design.from_datasource(ds, y='target')           # X = all other columns
Design.from_datasource(ds, x=['a', 'b'], y='c')  # X = specified columns
Design.from_datasource(ds)                       # Uses ds['X'] and ds['y']
Design.from_arrays(X, y)                         # Direct from arrays

Parameters:
classmethod from_datasource(source, *, x=None, y=None)[source]

Build Design from DataSource.

Parameters:
  • source (DataSource) – The DataSource

  • x (str | list[str] | None) – Predictor column(s). If None and source has 'X', uses that. If None and y is specified, uses all columns except y.

  • y (str | None) – Response column. If None, uses 'y' from source.

Returns:

Design ready for regression

Return type:

Design

Assumes good faith: garbage in, garbage out.

classmethod from_arrays(X, y)[source]

Build Design directly from arrays.

Parameters:
Return type:

Design

property X: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Design matrix (n x p).

property y: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Response vector (n,).

property n: int

Number of observations.

property p: int

Number of predictors.

property source: DataSource | None

Original DataSource, if available.

supports(capability)[source]

Check if underlying data supports a capability.

Parameters:

capability (str)

Return type:

bool

XtX()[source]

Compute X'X (for standard errors).

Return type:

ndarray[tuple[Any, ...], dtype[floating[Any]]]

Xty()[source]

Compute X'y.

Return type:

ndarray[tuple[Any, ...], dtype[floating[Any]]]
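
To make concrete how these helpers feed a normal-equations (Cholesky-style) solve, here is a hedged NumPy sketch. It assumes a well-conditioned design and synthetic data; it is not the GPU backend's actual code path.

import numpy as np
from pystatistics.regression import Design

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

design = Design.from_arrays(X, y)
A = design.XtX()                  # X'X, shape (p, p)
b = design.Xty()                  # X'y, shape (p,)

# Solve the normal equations A @ beta = b via Cholesky.
L = np.linalg.cholesky(A)
beta = np.linalg.solve(L.T, np.linalg.solve(L, b))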

class pystatistics.regression.DataSource(_data, _capabilities, _metadata=<factory>)[source]

Bases: object

Universal data container. Domain-agnostic.

Construct via factory classmethods, not directly.

The lumber yard analogy: DataSource has data (logs). It doesn’t know or care what you’re building—furniture (regression), paper (MVN MLE), or two-by-fours (survival analysis).

Parameters:
keys()[source]

Return the names of all available arrays.

Returns:

frozenset of array names

Return type:

frozenset[str]

Example

>>> ds = DataSource.from_arrays(X=X, y=y)
>>> ds.keys()
frozenset({'X', 'y'})
property n_observations: int

Number of statistical units (rows).

property metadata: dict[str, Any]

Domain-agnostic metadata.

supports(capability)[source]

Check if this DataSource supports a capability.

Parameters:

capability (str) – Use constants from pystatistics.core.capabilities

Returns:

True if supported, False otherwise

Return type:

bool

Note

Unknown capabilities return False, never raise.

classmethod from_arrays(*, X=None, y=None, data=None, columns=None, **named_arrays)[source]

Construct from NumPy arrays.

Parameters:
Return type:

DataSource

classmethod from_file(path, *, columns=None)[source]

Construct from file (CSV, NPY).

Parameters:
Return type:

DataSource

classmethod from_dataframe(df, *, source_path=None)[source]

Construct from pandas DataFrame.

Parameters:
  • df (pd.DataFrame)

  • source_path (str | None)

Return type:

DataSource

classmethod from_tensors(*, X=None, y=None, **named_tensors)[source]

Construct from PyTorch tensors (already on GPU).

Parameters:
  • X (torch.Tensor | None)

  • y (torch.Tensor | None)

  • named_tensors (torch.Tensor)

Return type:

DataSource

classmethod build(*args, **kwargs)[source]

Convenience factory that dispatches to appropriate from_* method.

Examples

DataSource.build(X=X, y=y)      # from_arrays
DataSource.build("data.csv")    # from_file

Return type:

DataSource

class pystatistics.regression.LinearSolution(_result, _design, _standard_errors=None, _t_statistics=None, _p_values=None)[source]

Bases: object

User-facing regression results.

Wraps the backend Result and provides convenient accessors for all regression outputs including standard errors and t-statistics.

Parameters:
property coefficients: ndarray[tuple[Any, ...], dtype[floating[Any]]]
property residuals: ndarray[tuple[Any, ...], dtype[floating[Any]]]
property fitted_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]
property rss: float
property tss: float
property r_squared: float
property adjusted_r_squared: float
property residual_std_error: float
property standard_errors: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Standard errors of coefficients.

For CPU QR backend: uses R^{-1} from QR decomposition to compute (X'X)^{-1} = R^{-1} R^{-T}, matching R’s backsolve(R, diag(p)) exactly.

For GPU backend (no R available): falls back to np.linalg.inv(X'X).

For rank-deficient matrices, aliased coefficients get NaN standard errors.
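
A rough NumPy rendering of the QR-based formula above, on synthetic data; illustrative only, the library's code path may differ.

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # design with intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=50)

Q, R = np.linalg.qr(X)                               # thin QR; R is (p, p) upper triangular
coef = np.linalg.solve(R, Q.T @ y)
resid = y - X @ coef
sigma2 = resid @ resid / (X.shape[0] - X.shape[1])   # RSS / df_residual

R_inv = np.linalg.solve(R, np.eye(R.shape[1]))       # akin to R's backsolve(R, diag(p))
XtX_inv = R_inv @ R_inv.T                            # (X'X)^{-1} = R^{-1} R^{-T}
standard_errors = np.sqrt(sigma2 * np.diag(XtX_inv))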

property t_statistics: ndarray[tuple[Any, ...], dtype[floating[Any]]]

t-statistics for coefficients.

property p_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Two-sided p-values for coefficient t-tests.

property rank: int
property df_residual: int
property info: dict[str, Any]
property timing: dict[str, float] | None
property backend_name: str
property warnings: tuple[str, ...]
summary()[source]

Generate R-style summary output.

Return type:

str
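
For example, after a fit:

>>> result = fit(X, y)
>>> print(result.summary())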

class pystatistics.regression.LinearParams(coefficients, residuals, fitted_values, rss, tss, rank, df_residual)[source]

Bases: object

Parameter payload for linear regression.

This is the immutable data computed by backends.

Parameters:
coefficients: ndarray[tuple[Any, ...], dtype[floating[Any]]]
residuals: ndarray[tuple[Any, ...], dtype[floating[Any]]]
fitted_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]
rss: float
tss: float
rank: int
df_residual: int
class pystatistics.regression.GLMSolution(_result, _design, _standard_errors=None, _test_statistics=None, _p_values=None)[source]

Bases: object

User-facing GLM results.

Wraps the backend Result and provides convenient accessors for all GLM-specific outputs including deviance, AIC, and multiple residual types.

Parameters:
property coefficients: ndarray[tuple[Any, ...], dtype[floating[Any]]]
property fitted_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Fitted values on the response scale (μ).

property linear_predictor: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Linear predictor η = X @ β.

property residuals_deviance: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Deviance residuals (signed).

property residuals_pearson: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Pearson residuals: (y - μ) / sqrt(V(μ)).

property residuals_working: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Working residuals from the final IRLS iteration.

property residuals_response: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Response residuals: y - μ.

property deviance: float

Residual deviance.

property null_deviance: float

Null deviance (intercept-only model).

property aic: float

Akaike Information Criterion.

property bic: float

Bayesian Information Criterion.

property dispersion: float

Dispersion parameter.

Fixed at 1.0 for Binomial and Poisson. Estimated from data for Gaussian.

property rank: int
property df_residual: int
property df_null: int
property converged: bool
property n_iter: int
property family_name: str
property standard_errors: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Standard errors of coefficients.

Computed as sqrt(dispersion * diag((X'WX)^{-1})), where W holds the final IRLS weights. Uses the QR factorization from the final iteration when available (CPU) and falls back to direct inversion (GPU).
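
A back-of-the-envelope NumPy version of that formula in its direct-inversion form; the design, weights, and dispersion below are synthetic stand-ins, not values produced by the package.

import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
mu = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * X[:, 1])))   # fitted means from a hypothetical logistic fit
w = mu * (1.0 - mu)                                 # final IRLS weights for binomial/logit
phi = 1.0                                           # dispersion fixed at 1 for binomial

XtWX = X.T @ (w[:, None] * X)
cov = phi * np.linalg.inv(XtWX)                     # the direct-inversion fallback
standard_errors = np.sqrt(np.diag(cov))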

property test_statistics: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Wald test statistics (z for fixed dispersion, t for estimated).

property p_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Two-sided p-values.

Uses z-distribution for Binomial/Poisson (fixed dispersion=1). Uses t-distribution for Gaussian (estimated dispersion).
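
For reference, the mapping from Wald statistics to two-sided p-values can be sketched with scipy.stats; stat and df below are placeholder values, not package output.

import numpy as np
from scipy import stats

stat = np.array([2.1, -0.4])                       # e.g. coefficients / standard_errors
df = 97                                            # residual degrees of freedom

p_fixed = 2.0 * stats.norm.sf(np.abs(stat))        # Binomial / Poisson: z-test
p_estimated = 2.0 * stats.t.sf(np.abs(stat), df)   # Gaussian: t-test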

property info: dict[str, Any]
property timing: dict[str, float] | None
property backend_name: str
property warnings: tuple[str, ...]
summary()[source]

Generate R-style GLM summary output.

Return type:

str

class pystatistics.regression.GLMParams(coefficients, fitted_values, linear_predictor, residuals_working, residuals_deviance, residuals_pearson, residuals_response, deviance, null_deviance, aic, dispersion, rank, df_residual, df_null, n_iter, converged, family_name, link_name)[source]

Bases: object

Parameter payload for GLM (IRLS).

This is the immutable data computed by GLM backends.

Parameters:
coefficients: ndarray[tuple[Any, ...], dtype[floating[Any]]]
fitted_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]
linear_predictor: ndarray[tuple[Any, ...], dtype[floating[Any]]]
residuals_working: ndarray[tuple[Any, ...], dtype[floating[Any]]]
residuals_deviance: ndarray[tuple[Any, ...], dtype[floating[Any]]]
residuals_pearson: ndarray[tuple[Any, ...], dtype[floating[Any]]]
residuals_response: ndarray[tuple[Any, ...], dtype[floating[Any]]]
deviance: float
null_deviance: float
aic: float
dispersion: float
rank: int
df_residual: int
df_null: int
n_iter: int
converged: bool
family_name: str
link_name: str
class pystatistics.regression.Family(link=None)[source]

Bases: ABC

GLM family specification.

Defines the relationship between the mean and variance of the response distribution, along with a link function.

Parameters:

link (str | Link | None)

abstract property name: str
abstractmethod variance(mu)[source]

Variance function V(μ).

Parameters:

mu (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

ndarray[tuple[Any, ...], dtype[_ScalarT]]

abstractmethod deviance(y, mu, wt)[source]

Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).

The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.

Parameters:
Return type:

float

abstractmethod initialize(y)[source]

Initialize μ from y for IRLS starting values.

Must return values in the valid range for the link function.

Parameters:

y (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

ndarray[tuple[Any, ...], dtype[_ScalarT]]

property dispersion_is_fixed: bool

Whether the dispersion parameter is known a priori.

True for Binomial (φ=1) and Poisson (φ=1). False for Gaussian (φ=σ² estimated from data).

abstractmethod log_likelihood(y, mu, wt, dispersion)[source]

Compute the log-likelihood for AIC.

This must match R’s family$aic() / (-2) for consistency.

Parameters:
Return type:

float

aic(y, mu, wt, rank, dispersion)[source]

Compute AIC = -2 * loglik + 2 * rank.

For families with estimated dispersion (Gaussian), R computes AIC differently. Subclasses may override.

Parameters:
Return type:

float

class pystatistics.regression.Gaussian(link=None)[source]

Bases: Family

Gaussian (Normal) family. Default link: identity.

V(μ) = 1
Deviance = Σ wt_i * (y_i - μ_i)²  (= RSS for identity link)

Parameters:

link (str | Link | None)

property name: str
variance(mu)[source]

Variance function V(μ).

Parameters:

mu (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

ndarray[tuple[Any, ...], dtype[_ScalarT]]

initialize(y)[source]

Initialize μ from y for IRLS starting values.

Must return values in the valid range for the link function.

Parameters:

y (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

ndarray[tuple[Any, ...], dtype[_ScalarT]]

deviance(y, mu, wt)[source]

Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).

The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.

Parameters:
Return type:

float

log_likelihood(y, mu, wt, dispersion)[source]

Compute the log-likelihood for AIC.

This must match R’s family$aic() / (-2) for consistency.

Parameters:
Return type:

float

aic(y, mu, wt, rank, dispersion)[source]

Compute AIC matching R’s gaussian family.

R’s gaussian()$aic uses MLE dispersion (dev/n, not dev/df) and adds +2 for the dispersion parameter. The formula is:

AIC = -2 * loglik(sigma_mle) + 2 + 2 * rank

where sigma_mle = sqrt(deviance / n), and the +2 comes from the gaussian family counting the dispersion as an extra parameter.

Parameters:
Return type:

float
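
A small NumPy sketch of that AIC formula, assuming deviance, n, and rank are already in hand (the helper name gaussian_aic is invented for the example):

import numpy as np

def gaussian_aic(deviance, n, rank):
    """AIC with MLE dispersion (deviance / n), plus 2 for the dispersion parameter."""
    loglik = -0.5 * n * (np.log(2.0 * np.pi * deviance / n) + 1.0)
    return -2.0 * loglik + 2.0 + 2.0 * rank

# e.g. gaussian_aic(deviance=12.5, n=100, rank=3)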

property dispersion_is_fixed: bool

Whether the dispersion parameter is known a priori.

True for Binomial (φ=1) and Poisson (φ=1). False for Gaussian (φ=σ² estimated from data).

class pystatistics.regression.Binomial(link=None)[source]

Bases: Family

Binomial family. Default link: logit.

V(μ) = μ(1-μ)
Deviance = 2 * Σ wt_i * [y_i log(y_i/μ_i) + (1-y_i) log((1-y_i)/(1-μ_i))]

Parameters:

link (str | Link | None)
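
As an illustration of the variance and deviance formulas above (a sketch, not the package's implementation), using scipy.special.xlogy so the y = 0 and y = 1 endpoints contribute 0 * log(0) = 0:

import numpy as np
from scipy.special import xlogy

def binomial_variance(mu):
    return mu * (1.0 - mu)                # V(μ) = μ(1-μ)

def binomial_deviance(y, mu, wt):
    d = xlogy(y, y / mu) + xlogy(1.0 - y, (1.0 - y) / (1.0 - mu))
    return 2.0 * np.sum(wt * d)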

property name: str
variance(mu)[source]

Variance function V(μ).

Parameters:

mu (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

ndarray[tuple[Any, ...], dtype[_ScalarT]]

initialize(y)[source]

Initialize μ from y for IRLS starting values.

Must return values in the valid range for the link function.

Parameters:

y (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

ndarray[tuple[Any, ...], dtype[_ScalarT]]

deviance(y, mu, wt)[source]

Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).

The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.

Parameters:
Return type:

float

log_likelihood(y, mu, wt, dispersion)[source]

Compute the log-likelihood for AIC.

This must match R’s family$aic() / (-2) for consistency.

Parameters:
Return type:

float

property dispersion_is_fixed: bool

Whether the dispersion parameter is known a priori.

True for Binomial (φ=1) and Poisson (φ=1). False for Gaussian (φ=σ² estimated from data).

class pystatistics.regression.Poisson(link=None)[source]

Bases: Family

Poisson family. Default link: log.

V(μ) = μ
Deviance = 2 * Σ wt_i * [y_i log(y_i/μ_i) - (y_i - μ_i)]

Parameters:

link (str | Link | None)
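
And the analogous sketch for the Poisson formulas above (illustrative only; xlogy handles y = 0 counts as 0 * log(0) = 0):

import numpy as np
from scipy.special import xlogy

def poisson_deviance(y, mu, wt):
    d = xlogy(y, y / mu) - (y - mu)
    return 2.0 * np.sum(wt * d)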

property name: str
variance(mu)[source]

Variance function V(μ).

Parameters:

mu (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

ndarray[tuple[Any, ...], dtype[_ScalarT]]

initialize(y)[source]

Initialize μ from y for IRLS starting values.

Must return values in the valid range for the link function.

Parameters:

y (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

ndarray[tuple[Any, ...], dtype[_ScalarT]]

deviance(y, mu, wt)[source]

Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).

The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.

Parameters:
Return type:

float

log_likelihood(y, mu, wt, dispersion)[source]

Compute the log-likelihood for AIC.

This must match R’s family$aic() / (-2) for consistency.

Parameters:
Return type:

float

property dispersion_is_fixed: bool

Whether the dispersion parameter is known a priori.

True for Binomial (φ=1) and Poisson (φ=1). False for Gaussian (φ=σ² estimated from data).