Generalized Estimating Equations¶
GEE for clustered and longitudinal data matching R’s geepack::geeglm().
GEE: Generalized Estimating Equations for clustered/correlated data.
Fits regression models to clustered data (repeated measures on subjects, patients within clinics, etc.) using working correlation structures and sandwich variance estimation for valid inference even under misspecification.
- Supports four working correlation structures:
Independence: R = I (recovers GLM with sandwich SE)
Exchangeable: R_ij = alpha (compound symmetry)
AR(1): R_ij = alpha^|i-j| (autoregressive)
Unstructured: R_ij estimated freely (equal cluster sizes only)
- Supports four GLM families via pystatistics:
Gaussian (identity link)
Binomial (logit link)
Poisson (log link)
Gamma (inverse link)
Validates against: R geepack::geeglm()
- class pystatsbio.gee.GEESolution(result)[source]¶
Bases:
SolutionReprMixinPublic result of a GEE fit — a Solution wrapping
Result[GEEParams].Exposes every fit output as a read-only property plus the uniform
.backend_name/.timing/.warnings/.infometadata and a Jupyter_repr_html_(viaSolutionReprMixin).- Parameters:
result (Result[GEEParams])
- class pystatsbio.gee.GEEParams(coefficients, naive_se, robust_se, naive_vcov, robust_vcov, z_values, p_values, fitted_values, residuals, correlation_type, correlation_params, scale, n_clusters, n_obs, family_name, link_name, converged, n_iter, names)[source]¶
Bases:
objectComputed payload of a GEE fit.
- Parameters:
- coefficients¶
Regression coefficients (length p).
- Type:
NDArray
- naive_se¶
Model-based (naive) standard errors.
- Type:
NDArray
- robust_se¶
Sandwich (robust) standard errors.
- Type:
NDArray
- naive_vcov¶
Model-based variance-covariance matrix (p x p).
- Type:
NDArray
- robust_vcov¶
Sandwich variance-covariance matrix (p x p).
- Type:
NDArray
- z_values¶
Wald z-statistics: coefficients / robust_se.
- Type:
NDArray
- p_values¶
Two-sided p-values from z (normal approximation).
- Type:
NDArray
- fitted_values¶
Predicted means (mu_hat) for all observations.
- Type:
NDArray
- residuals¶
Pearson residuals: (y - mu_hat) / sqrt(V(mu_hat)).
- Type:
NDArray
- pystatsbio.gee.gee(y, X, cluster_id, *, family='gaussian', corr_structure='exchangeable', names=None, scale_fix=None, tol=1e-06, max_iter=50, backend=None)[source]¶
Fit a Generalized Estimating Equations (GEE) model.
GEE extends GLMs to handle correlated/clustered data by specifying a working correlation structure and using sandwich variance estimation for valid inference even if the working correlation is wrong.
- Parameters:
y (ArrayLike) – Response variable (1-D array, n observations).
X (ArrayLike) – Design matrix (n x p). Should include an intercept column if an intercept is desired.
cluster_id (ArrayLike) – Cluster/subject identifiers (1-D array, n observations). Observations with the same cluster_id are treated as correlated.
family (str) – GLM family: ‘gaussian’, ‘binomial’, ‘poisson’, or ‘gamma’.
corr_structure (str) – Working correlation: ‘independence’, ‘exchangeable’, ‘ar1’, or ‘unstructured’.
names (list[str] | None) – Optional coefficient names (length p). If None, defaults to ‘x0’, ‘x1’, … in output.
scale_fix (float | None) – If not None, fix the dispersion parameter at this value instead of estimating it.
tol (float) – Convergence tolerance for relative coefficient change.
max_iter (int) – Maximum GEE iterations.
backend (str | None) – Execution target (device and precision), following the pystatistics convention:
'cpu'(float64, the reference path),'gpu'(float32),'gpu_fp64'(float64, CUDA only — raises on Apple Silicon/MPS, which has no float64), or'auto'(CUDA if present, else CPU).Noneresolves from the input: a numpy array →'cpu', a GPUtorch.Tensor→'gpu'. Precision lives in the backend string; there is no separateuse_fp64flag.
- Returns:
Solution wrapping the GEE coefficients, robust SE, correlation parameters, convergence diagnostics, and a summary method, with the uniform .backend_name/.timing/.warnings/.info accessors.
- Return type:
- Raises:
ValidationError – If inputs fail validation (mismatched lengths, non-finite values, invalid family or correlation structure, fewer than 2 clusters).
Examples
>>> import numpy as np >>> from pystatsbio.gee import gee >>> n_clusters, cluster_size = 50, 5 >>> n = n_clusters * cluster_size >>> cluster_id = np.repeat(np.arange(n_clusters), cluster_size) >>> X = np.column_stack([np.ones(n), np.random.randn(n)]) >>> y = X @ np.array([1.0, 0.5]) + np.random.randn(n) >>> result = gee(y, X, cluster_id, family='gaussian') >>> print(result.summary())
- class pystatsbio.gee.CorrStructure[source]¶
Bases:
ABCBase class for GEE working correlation structures.
Subclasses must implement: -
nameproperty: human-readable structure name -working_corr(cluster_size): build the R_i matrix -estimate(pearson_resids, phi): update parameters from residuals -paramsproperty: current parameter values as a dict- abstractmethod working_corr(cluster_size)[source]¶
Return the working correlation matrix R_i for a cluster.
- Parameters:
cluster_size (int) – Number of observations in this cluster.
- Returns:
Correlation matrix of shape (cluster_size, cluster_size).
- Return type:
NDArray
- class pystatsbio.gee.IndependenceCorr[source]¶
Bases:
CorrStructureIndependence working correlation: R = I.
Assumes all within-cluster observations are uncorrelated. No parameters to estimate. This is the simplest structure and recovers standard GLM-like estimates (though with sandwich SE).
- working_corr(cluster_size)[source]¶
Return identity matrix.
- Parameters:
cluster_size (int) – Number of observations in this cluster.
- Returns:
Identity matrix of shape (cluster_size, cluster_size).
- Return type:
NDArray
- class pystatsbio.gee.ExchangeableCorr[source]¶
Bases:
CorrStructureExchangeable (compound symmetry) working correlation.
R_ij = alpha for i != j, R_ii = 1.
- Alpha is estimated as:
alpha = [sum_i sum_{j<k} r_ij * r_ik] / [sum_i n_i*(n_i-1)/2 - p] / phi
where r_ij are Pearson residuals, phi is the dispersion, and p is the number of regression parameters.
References
Liang, K.-Y. & Zeger, S. L. (1986). Biometrika, 73(1), 13-22.
- working_corr(cluster_size)[source]¶
Build exchangeable correlation matrix.
- Parameters:
cluster_size (int) – Number of observations in this cluster.
- Returns:
Matrix with 1 on diagonal and alpha off-diagonal.
- Return type:
NDArray
- class pystatsbio.gee.AR1Corr[source]¶
Bases:
CorrStructureAR(1) working correlation: R_ij = alpha^|i-j|.
Assumes observations within a cluster are ordered (e.g., by time) and adjacent observations have correlation alpha.
- Alpha is estimated as:
alpha = [sum_i sum_t r_{i,t} * r_{i,t+1}] / [sum_i (n_i - 1) - p] / phi
References
Liang, K.-Y. & Zeger, S. L. (1986). Biometrika, 73(1), 13-22.
- working_corr(cluster_size)[source]¶
Build AR(1) correlation matrix.
- Parameters:
cluster_size (int) – Number of observations in this cluster.
- Returns:
Toeplitz matrix with R_ij = alpha^|i-j|.
- Return type:
NDArray
- class pystatsbio.gee.UnstructuredCorr[source]¶
Bases:
CorrStructureUnstructured working correlation.
R_jk is estimated freely for each pair (j, k). This is only feasible when all clusters have the same size, because each R_jk requires data from all clusters at positions j and k.
R_jk = sum_i r_ij * r_ik / (K - p) / phi
where K is the number of clusters and p is the number of parameters.
- Raises:
ValidationError – If clusters have unequal sizes during estimation.
References
Liang, K.-Y. & Zeger, S. L. (1986). Biometrika, 73(1), 13-22.
- working_corr(cluster_size)[source]¶
Return the estimated unstructured correlation matrix.
If not yet estimated, returns the identity matrix.
- Parameters:
cluster_size (int) – Number of observations in this cluster.
- Returns:
Correlation matrix of shape (cluster_size, cluster_size).
- Return type:
NDArray