Regression¶
Linear and generalized linear models. OLS via QR (CPU) or Cholesky (GPU). GLM via IRLS with Gaussian, Binomial, and Poisson families.
- Usage:
from pystatistics import DataSource
from pystatistics.regression import Design, fit

# OLS from DataSource
ds = DataSource.from_file("data.csv")
design = Design.from_datasource(ds, y="target")
result = fit(design)

# OLS from arrays (convenience)
result = fit(X, y)

# GLM (logistic regression)
result = fit(X, y, family="binomial")

# GLM (Poisson regression)
result = fit(X, y, family="poisson")
- pystatistics.regression.fit(X_or_design, y=None, *, family=None, backend='auto', force=False, tol=1e-08, max_iter=25)[source]¶
Fit a linear or generalized linear model.
When family is None (default), fits ordinary least squares (LM) via QR decomposition or GPU Cholesky. When family is specified, fits a GLM via IRLS (Iteratively Reweighted Least Squares).
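As an illustration of what the IRLS loop computes, here is a minimal NumPy sketch for the binomial family with logit link. This is not the package's implementation; the deviance-based stopping rule mirrors the R-style criterion that tol refers to.

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=25):
    """Minimal IRLS for logistic regression (binomial family, logit link)."""
    n, p = X.shape
    beta = np.zeros(p)
    dev_old = np.inf
    for _ in range(max_iter):
        eta = X @ beta                                    # linear predictor
        mu = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-10, 1 - 1e-10)
        w = mu * (1.0 - mu)                               # IRLS weights for logit link
        z = eta + (y - mu) / w                            # working response
        # Weighted least squares step: solve (X'WX) beta = X'Wz
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        # Binomial deviance of the updated fit (0/1 responses)
        mu_new = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-10, 1 - 1e-10)
        dev = -2.0 * np.sum(np.where(y == 1, np.log(mu_new), np.log1p(-mu_new)))
        # R-style relative-change convergence test
        if abs(dev - dev_old) / (abs(dev) + 0.1) < tol:
            break
        dev_old = dev
    return beta
```

In the real API the equivalent fit is fit(X, y, family='binomial').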
- Accepts EITHER:
A Design object (from DataSource or arrays)
Raw X and y arrays (convenience)
- Parameters:
X_or_design (ArrayLike | Design) – Design object or X matrix
y (ArrayLike | None) – Response vector (required if X_or_design is array)
family (str | Family | None) – GLM family specification. None for OLS, or a string ('gaussian', 'binomial', 'poisson') or Family instance.
backend (Literal['auto', 'cpu', 'gpu', 'cpu_qr', 'cpu_svd', 'gpu_qr']) – 'auto', 'cpu', 'gpu', 'cpu_qr', 'cpu_svd', 'gpu_qr'
force (bool) – If True, proceed with GPU Cholesky even on ill-conditioned matrices. Has no effect on CPU backends.
tol (float) – Convergence tolerance for IRLS (GLM only). Default 1e-8 matches R’s glm.control().
max_iter (int) – Maximum IRLS iterations (GLM only). Default 25 matches R’s glm.control().
- Returns:
LinearSolution when family is None. GLMSolution when family is specified.
- Return type:
LinearSolution | GLMSolution
Examples
# OLS
>>> result = fit(X, y)
>>> result = fit(X, y, backend='gpu')

# Logistic regression
>>> result = fit(X, y, family='binomial')

# Poisson regression
>>> result = fit(X, y, family='poisson')

# Gaussian GLM (equivalent to OLS)
>>> result = fit(X, y, family='gaussian')
- class pystatistics.regression.Design(_X, _y, _n, _p, _source=None)[source]¶
Bases: object

Regression design matrix specification.
Wraps a DataSource and provides X, y for regression. Immutable after construction.
- Construction:
Design.from_datasource(ds, y='target')           # X = all other columns
Design.from_datasource(ds, x=['a', 'b'], y='c')  # X = specified columns
Design.from_datasource(ds)                       # Uses ds['X'] and ds['y']
Design.from_arrays(X, y)                         # Direct from arrays
- Parameters:
- classmethod from_datasource(source, *, x=None, y=None)[source]¶
Build Design from DataSource.
- Parameters:
source (DataSource) – The DataSource
x (str | list[str] | None) – Predictor column(s). If None and source has 'X', uses that. If None and y is specified, uses all columns except y.
y (str | None) – Response column. If None, uses 'y' from source.
- Returns:
Design ready for regression
- Return type:
Design
Assumes good faith: garbage in, garbage out.
- property source: DataSource | None¶
Original DataSource, if available.
- class pystatistics.regression.DataSource(_data, _capabilities, _metadata=<factory>)[source]¶
Bases: object

Universal data container. Domain-agnostic.
Construct via factory classmethods, not directly.
The lumber yard analogy: DataSource has data (logs). It doesn’t know or care what you’re building—furniture (regression), paper (MVN MLE), or two-by-fours (survival analysis).
- keys()[source]¶
Return the names of all available arrays.
Example
>>> ds = DataSource.from_arrays(X=X, y=y)
>>> ds.keys()
frozenset({'X', 'y'})
- supports(capability)[source]¶
Check if this DataSource supports a capability.
- Parameters:
capability (str) – Use constants from pystatistics.core.capabilities
- Returns:
True if supported, False otherwise
- Return type:
bool
Note
Unknown capabilities return False, never raise.
- classmethod from_arrays(*, X=None, y=None, data=None, columns=None, **named_arrays)[source]¶
Construct from NumPy arrays.
- classmethod from_file(path, *, columns=None)[source]¶
Construct from file (CSV, NPY).
- Parameters:
- Return type:
DataSource
- classmethod from_dataframe(df, *, source_path=None)[source]¶
Construct from pandas DataFrame.
- Parameters:
df (pd.DataFrame)
source_path (str | None)
- Return type:
DataSource
- classmethod from_tensors(*, X=None, y=None, **named_tensors)[source]¶
Construct from PyTorch tensors (already on GPU).
- Parameters:
X (torch.Tensor | None)
y (torch.Tensor | None)
named_tensors (torch.Tensor)
- Return type:
DataSource
- class pystatistics.regression.LinearSolution(_result, _design, _standard_errors=None, _t_statistics=None, _p_values=None)[source]¶
Bases: object

User-facing regression results.
Wraps the backend Result and provides convenient accessors for all regression outputs including standard errors and t-statistics.
- Parameters:
- property standard_errors: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Standard errors of coefficients.
For CPU QR backend: uses R^{-1} from QR decomposition to compute (X'X)^{-1} = R^{-1} R^{-T}, matching R's backsolve(R, diag(p)) exactly.
For GPU backend (no R available): falls back to np.linalg.inv(X'X).
For rank-deficient matrices, aliased coefficients get NaN standard errors.
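The two routes agree numerically; here is a standalone NumPy sketch (illustrative, not the library's code) showing that with X = QR, (X'X)^{-1} = R^{-1} R^{-T} gives the same standard errors as the direct inverse.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

# OLS fit and residual variance estimate
beta, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = rss[0] / (X.shape[0] - X.shape[1])

# Route 1 (CPU QR): (X'X)^{-1} = R^{-1} R^{-T}
Q, R = np.linalg.qr(X)
Rinv = np.linalg.solve(R, np.eye(R.shape[0]))  # triangular backsolve
se_qr = np.sqrt(sigma2 * np.diag(Rinv @ Rinv.T))

# Route 2 (GPU fallback): direct inverse of X'X
se_direct = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
```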
- property t_statistics: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
t-statistics for coefficients.
- class pystatistics.regression.LinearParams(coefficients, residuals, fitted_values, rss, tss, rank, df_residual)[source]¶
Bases: object

Parameter payload for linear regression.
This is the immutable data computed by backends.
- Parameters:
- class pystatistics.regression.GLMSolution(_result, _design, _standard_errors=None, _test_statistics=None, _p_values=None)[source]¶
Bases: object

User-facing GLM results.
Wraps the backend Result and provides convenient accessors for all GLM-specific outputs including deviance, AIC, and multiple residual types.
- Parameters:
- property fitted_values: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Fitted values on the response scale (μ).
- property linear_predictor: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Linear predictor η = X @ β.
- property residuals_deviance: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Deviance residuals (signed).
- property residuals_pearson: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Pearson residuals: (y - μ) / sqrt(V(μ)).
- property residuals_working: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Working residuals from the final IRLS iteration.
- property residuals_response: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Response residuals: y - μ.
- property dispersion: float¶
Dispersion parameter.
Fixed at 1.0 for Binomial and Poisson. Estimated from data for Gaussian.
- property standard_errors: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Standard errors of coefficients.
Computed as sqrt(dispersion * diag((X'WX)^{-1})) where W are the final IRLS weights. Uses QR from the final iteration when available (CPU), falls back to direct inversion (GPU).
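A minimal NumPy sketch of that formula (glm_standard_errors is a hypothetical helper for illustration, not part of the package):

```python
import numpy as np

def glm_standard_errors(X, w, dispersion=1.0):
    """sqrt(dispersion * diag((X'WX)^{-1})), with w the final IRLS weights."""
    WX = X * w[:, None]
    cov = dispersion * np.linalg.inv(X.T @ WX)
    return np.sqrt(np.diag(cov))
```

With unit weights and dispersion = σ², this reduces to the familiar OLS standard errors.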
- property test_statistics: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Wald test statistics (z for fixed dispersion, t for estimated).
- class pystatistics.regression.GLMParams(coefficients, fitted_values, linear_predictor, residuals_working, residuals_deviance, residuals_pearson, residuals_response, deviance, null_deviance, aic, dispersion, rank, df_residual, df_null, n_iter, converged, family_name, link_name)[source]¶
Bases: object

Parameter payload for GLM (IRLS).
This is the immutable data computed by GLM backends.
- Parameters:
coefficients (ndarray[tuple[Any, ...], dtype[floating[Any]]])
fitted_values (ndarray[tuple[Any, ...], dtype[floating[Any]]])
linear_predictor (ndarray[tuple[Any, ...], dtype[floating[Any]]])
residuals_working (ndarray[tuple[Any, ...], dtype[floating[Any]]])
residuals_deviance (ndarray[tuple[Any, ...], dtype[floating[Any]]])
residuals_pearson (ndarray[tuple[Any, ...], dtype[floating[Any]]])
residuals_response (ndarray[tuple[Any, ...], dtype[floating[Any]]])
deviance (float)
null_deviance (float)
aic (float)
dispersion (float)
rank (int)
df_residual (int)
df_null (int)
n_iter (int)
converged (bool)
family_name (str)
link_name (str)
- class pystatistics.regression.Family(link=None)[source]¶
Bases: ABC

GLM family specification.
Defines the relationship between the mean and variance of the response distribution, along with a link function.
- Parameters:
link (str | Link | None)
- property link: Link¶
- abstractmethod deviance(y, mu, wt)[source]¶
Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).
The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.
- abstractmethod initialize(y)[source]¶
Initialize μ from y for IRLS starting values.
Must return values in the valid range for the link function.
- property dispersion_is_fixed: bool¶
Whether the dispersion parameter is known a priori.
True for Binomial (φ=1) and Poisson (φ=1). False for Gaussian (φ=σ² estimated from data).
- abstractmethod log_likelihood(y, mu, wt, dispersion)[source]¶
Compute the log-likelihood for AIC.
This must match R’s family$aic() / (-2) for consistency.
- class pystatistics.regression.Gaussian(link=None)[source]¶
Bases: Family

Gaussian (Normal) family. Default link: identity.

V(μ) = 1
Deviance = Σ wt_i * (y_i - μ_i)² (= RSS for identity link)
- Parameters:
link (str | Link | None)
- initialize(y)[source]¶
Initialize μ from y for IRLS starting values.
Must return values in the valid range for the link function.
- deviance(y, mu, wt)[source]¶
Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).
The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.
- log_likelihood(y, mu, wt, dispersion)[source]¶
Compute the log-likelihood for AIC.
This must match R’s family$aic() / (-2) for consistency.
- aic(y, mu, wt, rank, dispersion)[source]¶
Compute AIC matching R’s gaussian family.
R’s gaussian()$aic uses MLE dispersion (dev/n, not dev/df) and adds +2 for the dispersion parameter. The formula is:
AIC = -2 * loglik(sigma_mle) + 2 + 2 * rank
where sigma_mle = sqrt(deviance / n), and the +2 comes from the gaussian family counting the dispersion as an extra parameter.
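That bookkeeping can be written out directly (illustrative helper following the formula above, not the package's implementation):

```python
import numpy as np

def gaussian_aic(deviance, n, rank):
    """AIC = -2 * loglik(sigma_mle) + 2 + 2 * rank, with sigma_mle^2 = deviance / n.
    The extra +2 counts the estimated dispersion as a parameter, as in R."""
    sigma2_mle = deviance / n
    # Gaussian log-likelihood at the MLE variance: -n/2 * (log(2*pi*sigma2) + 1)
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2_mle) + 1.0)
    return -2.0 * loglik + 2.0 + 2.0 * rank
```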
- class pystatistics.regression.Binomial(link=None)[source]¶
Bases: Family

Binomial family. Default link: logit.

V(μ) = μ(1-μ)
Deviance = 2 * Σ wt_i * [y_i log(y_i/μ_i) + (1-y_i) log((1-y_i)/(1-μ_i))]
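A NumPy sketch of this sum (illustrative only); the xlogy-style helper enforces the 0·log 0 = 0 convention needed at y = 0 and y = 1:

```python
import numpy as np

def binomial_deviance(y, mu, wt):
    """2 * sum(wt * [y*log(y/mu) + (1-y)*log((1-y)/(1-mu))]) for 0/1 responses."""
    def xlogy(x, v):
        # x * log(v), with x == 0 contributing exactly 0
        out = np.zeros_like(v, dtype=float)
        nz = x != 0
        out[nz] = x[nz] * np.log(v[nz])
        return out
    d = xlogy(y, y / mu) + xlogy(1 - y, (1 - y) / (1 - mu))
    return 2.0 * np.sum(wt * d)
```

For binary y this equals -2 Σ wt_i [y_i log μ_i + (1-y_i) log(1-μ_i)].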
- Parameters:
link (str | Link | None)
- initialize(y)[source]¶
Initialize μ from y for IRLS starting values.
Must return values in the valid range for the link function.
- deviance(y, mu, wt)[source]¶
Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).
The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.
- class pystatistics.regression.Poisson(link=None)[source]¶
Bases: Family

Poisson family. Default link: log.

V(μ) = μ
Deviance = 2 * Σ wt_i * [y_i log(y_i/μ_i) - (y_i - μ_i)]
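The same 0·log 0 = 0 convention applies for zero counts; an illustrative NumPy sketch (not the package's code):

```python
import numpy as np

def poisson_deviance(y, mu, wt):
    """2 * sum(wt * [y*log(y/mu) - (y - mu)]), with y*log(y/mu) = 0 when y = 0."""
    ratio = np.where(y > 0, y, 1.0) / mu   # placeholder 1.0 avoids log(0) warnings
    d = np.where(y > 0, y * np.log(ratio), 0.0) - (y - mu)
    return 2.0 * np.sum(wt * d)
```

At y = μ every term vanishes, so a saturated fit has zero deviance.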
- Parameters:
link (str | Link | None)
- initialize(y)[source]¶
Initialize μ from y for IRLS starting values.
Must return values in the valid range for the link function.
- deviance(y, mu, wt)[source]¶
Compute total deviance: 2 * Σ wt_i * d(y_i, μ_i).
The deviance is twice the difference between the saturated log-likelihood and the model log-likelihood.