Regression¶
Linear and generalized linear models. OLS via QR (CPU) or Cholesky (GPU). GLM via IRLS with Gaussian, Binomial, and Poisson families.
- Usage:
from pystatistics import DataSource
from pystatistics.regression import Design, fit

# OLS from DataSource
ds = DataSource.from_file("data.csv")
design = Design.from_datasource(ds, y="target")
result = fit(design)

# OLS from arrays (convenience)
result = fit(X, y)

# GLM (logistic regression)
result = fit(X, y, family="binomial")

# GLM (Poisson regression)
result = fit(X, y, family="poisson")
- pystatistics.regression.fit(X_or_design, y=None, *, family=None, backend='auto', force=False, tol=1e-08, max_iter=25)[source]¶
Fit a linear or generalized linear model.
When family is None (default), fits ordinary least squares (LM) via QR decomposition or GPU Cholesky. When family is specified, fits a GLM via IRLS (Iteratively Reweighted Least Squares).
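As an illustration of what the IRLS loop computes, here is a minimal NumPy sketch for the binomial family with logit link. This is not the package's implementation; the deviance-based stopping rule mirrors the R-style criterion that tol refers to.

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=25):
    """Minimal IRLS for logistic regression (binomial family, logit link)."""
    n, p = X.shape
    beta = np.zeros(p)
    dev_old = np.inf
    for _ in range(max_iter):
        eta = X @ beta                                    # linear predictor
        mu = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-10, 1 - 1e-10)
        w = mu * (1.0 - mu)                               # IRLS weights for logit link
        z = eta + (y - mu) / w                            # working response
        # Weighted least squares step: solve (X'WX) beta = X'Wz
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        # Binomial deviance of the updated fit (0/1 responses)
        mu_new = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-10, 1 - 1e-10)
        dev = -2.0 * np.sum(np.where(y == 1, np.log(mu_new), np.log1p(-mu_new)))
        # R-style relative-change convergence test
        if abs(dev - dev_old) / (abs(dev) + 0.1) < tol:
            break
        dev_old = dev
    return beta
```

In the real API the equivalent fit is fit(X, y, family='binomial').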
- Accepts EITHER:
A Design object (from DataSource or arrays)
Raw X and y arrays (convenience)
- Parameters:
X_or_design (ArrayLike | Design) – Design object or X matrix
y (ArrayLike | None) – Response vector (required if X_or_design is array)
family (str | Family | None) – GLM family specification. None for OLS, or a string ('gaussian', 'binomial', 'poisson') or Family instance.
backend (Literal['auto', 'cpu', 'gpu', 'cpu_qr', 'cpu_svd', 'gpu_qr']) – 'auto', 'cpu', 'gpu', 'cpu_qr', 'cpu_svd', 'gpu_qr'
force (bool) – If True, proceed with GPU Cholesky even on ill-conditioned matrices. Has no effect on CPU backends.
tol (float) – Convergence tolerance for IRLS (GLM only). Default 1e-8 matches R’s glm.control().
max_iter (int) – Maximum IRLS iterations (GLM only). Default 25 matches R’s glm.control().
- Returns:
LinearSolution when family is None. GLMSolution when family is specified.
- Return type:
LinearSolution | GLMSolution
Examples
# OLS
>>> result = fit(X, y)
>>> result = fit(X, y, backend='gpu')

# Logistic regression
>>> result = fit(X, y, family='binomial')

# Poisson regression
>>> result = fit(X, y, family='poisson')

# Gaussian GLM (equivalent to OLS)
>>> result = fit(X, y, family='gaussian')
- class pystatistics.regression.Design(_X, _y, _n, _p, _source=None)[source]¶
Bases: object

Regression design matrix specification.
Wraps a DataSource and provides X, y for regression. Immutable after construction.
- Construction:
Design.from_datasource(ds, y='target')           # X = all other columns
Design.from_datasource(ds, x=['a', 'b'], y='c')  # X = specified columns
Design.from_datasource(ds)                       # Uses ds['X'] and ds['y']
Design.from_arrays(X, y)                         # Direct from arrays
- Parameters:
- classmethod from_datasource(source, *, x=None, y=None)[source]¶
Build Design from DataSource.
- Parameters:
source (DataSource) – The DataSource
x (str | list[str] | None) – Predictor column(s). If None and source has 'X', uses that. If None and y is specified, uses all columns except y.
y (str | None) – Response column. If None, uses 'y' from source.
- Returns:
Design ready for regression
- Return type:
Design
Assumes good faith: garbage in, garbage out.
- property source: DataSource | None¶
Original DataSource, if available.
- class pystatistics.regression.DataSource(_data, _capabilities, _metadata=<factory>)[source]¶
Bases: object

Universal data container. Domain-agnostic.
Construct via factory classmethods, not directly.
The lumber yard analogy: DataSource has data (logs). It doesn’t know or care what you’re building—furniture (regression), paper (MVN MLE), or two-by-fours (survival analysis).
- keys()[source]¶
Return the names of all available arrays.
Example
>>> ds = DataSource.from_arrays(X=X, y=y)
>>> ds.keys()
frozenset({'X', 'y'})
- supports(capability)[source]¶
Check if this DataSource supports a capability.
- Parameters:
capability (str) – Use constants from pystatistics.core.capabilities
- Returns:
True if supported, False otherwise
- Return type:
bool
Note
Unknown capabilities return False, never raise.
- classmethod from_arrays(*, X=None, y=None, data=None, columns=None, **named_arrays)[source]¶
Construct from NumPy arrays.
- classmethod from_file(path, *, columns=None)[source]¶
Construct from file (CSV, NPY).
- Parameters:
- Return type:
DataSource
- classmethod from_dataframe(df, *, source_path=None)[source]¶
Construct from pandas DataFrame.
- Parameters:
df (pd.DataFrame)
source_path (str | None)
- Return type:
DataSource
- classmethod from_tensors(*, X=None, y=None, **named_tensors)[source]¶
Construct from PyTorch tensors (already on GPU).
- Parameters:
X (torch.Tensor | None)
y (torch.Tensor | None)
named_tensors (torch.Tensor)
- Return type:
DataSource
- class pystatistics.regression.LinearSolution(_result, _design, _standard_errors=None, _t_statistics=None, _p_values=None)[source]¶
Bases: object

User-facing regression results.
Wraps the backend Result and provides convenient accessors for all regression outputs including standard errors and t-statistics.
- Parameters:
- property standard_errors: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Standard errors of coefficients.
For CPU QR backend: uses R^{-1} from QR decomposition to compute (X'X)^{-1} = R^{-1} R^{-T}, matching R's backsolve(R, diag(p)) exactly.
For GPU backend (no R available): falls back to np.linalg.inv(X'X).
For rank-deficient matrices, aliased coefficients get NaN standard errors.
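The two routes agree numerically; here is a standalone NumPy sketch (illustrative, not the library's code) showing that with X = QR, (X'X)^{-1} = R^{-1} R^{-T} gives the same standard errors as the direct inverse.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

# OLS fit and residual variance estimate
beta, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = rss[0] / (X.shape[0] - X.shape[1])

# Route 1 (CPU QR): (X'X)^{-1} = R^{-1} R^{-T}
Q, R = np.linalg.qr(X)
Rinv = np.linalg.solve(R, np.eye(R.shape[0]))  # triangular backsolve
se_qr = np.sqrt(sigma2 * np.diag(Rinv @ Rinv.T))

# Route 2 (GPU fallback): direct inverse of X'X
se_direct = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
```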
- property t_statistics: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
t-statistics for coefficients.
- class pystatistics.regression.LinearParams(coefficients, residuals, fitted_values, rss, tss, rank, df_residual)[source]¶
Bases: object

Parameter payload for linear regression.
This is the immutable data computed by backends.
- Parameters:
- class pystatistics.regression.GLMSolution(_result, _design, _standard_errors=None, _test_statistics=None, _p_values=None)[source]¶
Bases: object

User-facing GLM results.
Wraps the backend Result and provides convenient accessors for all GLM-specific outputs including deviance, AIC, and multiple residual types.
- Parameters:
- property fitted_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Fitted values on the response scale (μ).
- property linear_predictor: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Linear predictor η = X @ β.
- property residuals_deviance: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Deviance residuals (signed).
- property residuals_pearson: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Pearson residuals: (y - μ) / sqrt(V(μ)).
- property residuals_working: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Working residuals from the final IRLS iteration.
- property residuals_response: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Response residuals: y - μ.
- property dispersion: float¶
Dispersion parameter.
Fixed at 1.0 for Binomial and Poisson. Estimated from data for Gaussian.
- property standard_errors: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Standard errors of coefficients.
Computed as sqrt(dispersion * diag((X'WX)^{-1})) where W are the final IRLS weights. Uses QR from the final iteration when available (CPU), falls back to direct inversion (GPU).
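A minimal NumPy sketch of that formula (glm_standard_errors is a hypothetical helper for illustration, not part of the package):

```python
import numpy as np

def glm_standard_errors(X, w, dispersion=1.0):
    """sqrt(dispersion * diag((X'WX)^{-1})), with w the final IRLS weights."""
    WX = X * w[:, None]
    cov = dispersion * np.linalg.inv(X.T @ WX)
    return np.sqrt(np.diag(cov))
```

With unit weights and dispersion = σ², this reduces to the familiar OLS standard errors.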
- property test_statistics: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Wald test statistics (z for fixed dispersion, t for estimated).
- class pystatistics.regression.GLMParams(coefficients, fitted_values, linear_predictor, residuals_working, residuals_deviance, residuals_pearson, residuals_response, deviance, null_deviance, aic, dispersion, rank, df_residual, df_null, n_iter, converged, family_name, link_name)[source]¶
Bases: object

Parameter payload for GLM (IRLS).
This is the immutable data computed by GLM backends.
- Parameters:
coefficients (ndarray[tuple[Any, ...], dtype[floating[Any]]])
fitted_values (ndarray[tuple[Any, ...], dtype[floating[Any]]])
linear_predictor (ndarray[tuple[Any, ...], dtype[floating[Any]]])
residuals_working (ndarray[tuple[Any, ...], dtype[floating[Any]]])
residuals_deviance (ndarray[tuple[Any, ...], dtype[floating[Any]]])
residuals_pearson (ndarray[tuple[Any, ...], dtype[floating[Any]]])
residuals_response (ndarray[tuple[Any, ...], dtype[floating[Any]]])
deviance (float)
null_deviance (float)
aic (float)
dispersion (float)
rank (int)
df_residual (int)
df_null (int)
n_iter (int)
converged (bool)
family_name (str)
link_name (str)
- class pystatistics.regression.Family(link=None)[source]¶
Bases: ABC

GLM family specification.
Defines the relationship between the mean and variance of the response distribution, along with a link function.
- Parameters:
link (str | Link | None)
- property link: Link¶
- abstractmethod deviance(y, mu, wt)[source]¶
Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).
The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.
- abstractmethod initialize(y)[source]¶
Initialize μ from y for IRLS starting values.
Must return values in the valid range for the link function.
- property dispersion_is_fixed: bool¶
Whether the dispersion parameter is known a priori.
True for Binomial (φ=1) and Poisson (φ=1). False for Gaussian (φ=σ² estimated from data).
- abstractmethod log_likelihood(y, mu, wt, dispersion)[source]¶
Compute the log-likelihood for AIC.
This must match R’s family$aic() / (-2) for consistency.
- class pystatistics.regression.Gaussian(link=None)[source]¶
Bases: Family

Gaussian (Normal) family. Default link: identity.

V(μ) = 1
Deviance = Σ wt_i * (y_i - μ_i)² (= RSS for identity link)
- Parameters:
link (str | Link | None)
- initialize(y)[source]¶
Initialize μ from y for IRLS starting values.
Must return values in the valid range for the link function.
- deviance(y, mu, wt)[source]¶
Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).
The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.
- log_likelihood(y, mu, wt, dispersion)[source]¶
Compute the log-likelihood for AIC.
This must match R’s family$aic() / (-2) for consistency.
- aic(y, mu, wt, rank, dispersion)[source]¶
Compute AIC matching R’s gaussian family.
R’s gaussian()$aic uses MLE dispersion (dev/n, not dev/df) and adds +2 for the dispersion parameter. The formula is:
AIC = -2 * loglik(sigma_mle) + 2 + 2 * rank
where sigma_mle = sqrt(deviance / n), and the +2 comes from the gaussian family counting the dispersion as an extra parameter.
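That bookkeeping can be written out directly (illustrative helper following the formula above, not the package's implementation):

```python
import numpy as np

def gaussian_aic(deviance, n, rank):
    """AIC = -2 * loglik(sigma_mle) + 2 + 2 * rank, with sigma_mle^2 = deviance / n.
    The extra +2 counts the estimated dispersion as a parameter, as in R."""
    sigma2_mle = deviance / n
    # Gaussian log-likelihood at the MLE variance: -n/2 * (log(2*pi*sigma2) + 1)
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2_mle) + 1.0)
    return -2.0 * loglik + 2.0 + 2.0 * rank
```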
- class pystatistics.regression.Binomial(link=None)[source]¶
Bases: Family

Binomial family. Default link: logit.

V(μ) = μ(1-μ)
Deviance = 2 * Σ wt_i * [y_i log(y_i/μ_i) + (1-y_i) log((1-y_i)/(1-μ_i))]
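A NumPy sketch of this sum (illustrative only); the xlogy-style helper enforces the 0·log 0 = 0 convention needed at y = 0 and y = 1:

```python
import numpy as np

def binomial_deviance(y, mu, wt):
    """2 * sum(wt * [y*log(y/mu) + (1-y)*log((1-y)/(1-mu))]) for 0/1 responses."""
    def xlogy(x, v):
        # x * log(v), with x == 0 contributing exactly 0
        out = np.zeros_like(v, dtype=float)
        nz = x != 0
        out[nz] = x[nz] * np.log(v[nz])
        return out
    d = xlogy(y, y / mu) + xlogy(1 - y, (1 - y) / (1 - mu))
    return 2.0 * np.sum(wt * d)
```

For binary y this equals -2 Σ wt_i [y_i log μ_i + (1-y_i) log(1-μ_i)].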
- Parameters:
link (str | Link | None)
- initialize(y)[source]¶
Initialize μ from y for IRLS starting values.
Must return values in the valid range for the link function.
- deviance(y, mu, wt)[source]¶
Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).
The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.
- class pystatistics.regression.Poisson(link=None)[source]¶
Bases: Family

Poisson family. Default link: log.

V(μ) = μ
Deviance = 2 * Σ wt_i * [y_i log(y_i/μ_i) - (y_i - μ_i)]
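The same 0·log 0 = 0 convention applies for zero counts; an illustrative NumPy sketch (not the package's code):

```python
import numpy as np

def poisson_deviance(y, mu, wt):
    """2 * sum(wt * [y*log(y/mu) - (y - mu)]), with y*log(y/mu) = 0 when y = 0."""
    ratio = np.where(y > 0, y, 1.0) / mu   # placeholder 1.0 avoids log(0) warnings
    d = np.where(y > 0, y * np.log(ratio), 0.0) - (y - mu)
    return 2.0 * np.sum(wt * d)
```

At y = μ every term vanishes, so a saturated fit has zero deviance.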
- Parameters:
link (str | Link | None)
- initialize(y)[source]¶
Initialize μ from y for IRLS starting values.
Must return values in the valid range for the link function.
- deviance(y, mu, wt)[source]¶
Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).
The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.