Generalized Estimating Equations

GEE for clustered and longitudinal data matching R’s geepack::geeglm().

GEE: Generalized Estimating Equations for clustered/correlated data.

Fits regression models to clustered data (repeated measures on subjects, patients within clinics, etc.) using working correlation structures and sandwich variance estimation for valid inference even under misspecification.

Supports four working correlation structures:
  • Independence: R = I (recovers GLM with sandwich SE)

  • Exchangeable: R_ij = alpha (compound symmetry)

  • AR(1): R_ij = alpha^|i-j| (autoregressive)

  • Unstructured: R_ij estimated freely (equal cluster sizes only)

Supports four GLM families via pystatistics:
  • Gaussian (identity link)

  • Binomial (logit link)

  • Poisson (log link)

  • Gamma (inverse link)

Validates against: R geepack::geeglm()

class pystatsbio.gee.GEESolution(result)[source]

Bases: SolutionReprMixin

Public result of a GEE fit — a Solution wrapping Result[GEEParams].

Exposes every fit output as a read-only property plus the uniform .backend_name / .timing / .warnings / .info metadata and a Jupyter _repr_html_ (via SolutionReprMixin).

Parameters:

result (Result[GEEParams])

property backend_name: str
property timing: dict[str, float] | None
property warnings: tuple[str, ...]
property info: dict
property coefficients: ndarray[tuple[Any, ...], dtype[_ScalarT]]
property naive_se: ndarray[tuple[Any, ...], dtype[_ScalarT]]
property robust_se: ndarray[tuple[Any, ...], dtype[_ScalarT]]
property naive_vcov: ndarray[tuple[Any, ...], dtype[_ScalarT]]
property robust_vcov: ndarray[tuple[Any, ...], dtype[_ScalarT]]
property z_values: ndarray[tuple[Any, ...], dtype[_ScalarT]]
property p_values: ndarray[tuple[Any, ...], dtype[_ScalarT]]
property fitted_values: ndarray[tuple[Any, ...], dtype[_ScalarT]]
property residuals: ndarray[tuple[Any, ...], dtype[_ScalarT]]
property correlation_type: str
property correlation_params: dict
property scale: float
property n_clusters: int
property n_obs: int
property family_name: str
property converged: bool
property n_iter: int
property names: tuple[str, ...] | None
property coef: dict[str, float]

Coefficient dict (name -> value).

Uses provided names, or ‘x0’, ‘x1’, … if names were not given.

summary()[source]

R-style summary matching geepack::geeglm() output.

Return type:

str

class pystatsbio.gee.GEEParams(coefficients, naive_se, robust_se, naive_vcov, robust_vcov, z_values, p_values, fitted_values, residuals, correlation_type, correlation_params, scale, n_clusters, n_obs, family_name, link_name, converged, n_iter, names)[source]

Bases: object

Computed payload of a GEE fit.

Parameters:
coefficients

Regression coefficients (length p).

Type:

NDArray

naive_se

Model-based (naive) standard errors.

Type:

NDArray

robust_se

Sandwich (robust) standard errors.

Type:

NDArray

naive_vcov

Model-based variance-covariance matrix (p x p).

Type:

NDArray

robust_vcov

Sandwich variance-covariance matrix (p x p).

Type:

NDArray

z_values

Wald z-statistics: coefficients / robust_se.

Type:

NDArray

p_values

Two-sided p-values from z (normal approximation).

Type:

NDArray

fitted_values

Predicted means (mu_hat) for all observations.

Type:

NDArray

residuals

Pearson residuals: (y - mu_hat) / sqrt(V(mu_hat)).

Type:

NDArray

correlation_type

Working correlation structure name.

Type:

str

correlation_params

Estimated correlation parameters (e.g. {‘alpha’: 0.35}).

Type:

dict

scale

Estimated dispersion parameter (phi).

Type:

float

n_clusters

Number of clusters in the data.

Type:

int

n_obs

Total number of observations.

Type:

int

family_name

GLM family name (e.g. ‘gaussian’, ‘binomial’).

Type:

str

Link function name (e.g. ‘identity’, ‘logit’).

Type:

str

converged

Whether the iterative algorithm converged.

Type:

bool

n_iter

Number of iterations performed.

Type:

int

names

Coefficient names, if provided.

Type:

tuple[str, …] | None

coefficients: ndarray[tuple[Any, ...], dtype[_ScalarT]]
naive_se: ndarray[tuple[Any, ...], dtype[_ScalarT]]
robust_se: ndarray[tuple[Any, ...], dtype[_ScalarT]]
naive_vcov: ndarray[tuple[Any, ...], dtype[_ScalarT]]
robust_vcov: ndarray[tuple[Any, ...], dtype[_ScalarT]]
z_values: ndarray[tuple[Any, ...], dtype[_ScalarT]]
p_values: ndarray[tuple[Any, ...], dtype[_ScalarT]]
fitted_values: ndarray[tuple[Any, ...], dtype[_ScalarT]]
residuals: ndarray[tuple[Any, ...], dtype[_ScalarT]]
correlation_type: str
correlation_params: dict
scale: float
n_clusters: int
n_obs: int
family_name: str
link_name: str
converged: bool
n_iter: int
names: tuple[str, ...] | None
pystatsbio.gee.gee(y, X, cluster_id, *, family='gaussian', corr_structure='exchangeable', names=None, scale_fix=None, tol=1e-06, max_iter=50, backend=None)[source]

Fit a Generalized Estimating Equations (GEE) model.

GEE extends GLMs to handle correlated/clustered data by specifying a working correlation structure and using sandwich variance estimation for valid inference even if the working correlation is wrong.

Parameters:
  • y (ArrayLike) – Response variable (1-D array, n observations).

  • X (ArrayLike) – Design matrix (n x p). Should include an intercept column if an intercept is desired.

  • cluster_id (ArrayLike) – Cluster/subject identifiers (1-D array, n observations). Observations with the same cluster_id are treated as correlated.

  • family (str) – GLM family: ‘gaussian’, ‘binomial’, ‘poisson’, or ‘gamma’.

  • corr_structure (str) – Working correlation: ‘independence’, ‘exchangeable’, ‘ar1’, or ‘unstructured’.

  • names (list[str] | None) – Optional coefficient names (length p). If None, defaults to ‘x0’, ‘x1’, … in output.

  • scale_fix (float | None) – If not None, fix the dispersion parameter at this value instead of estimating it.

  • tol (float) – Convergence tolerance for relative coefficient change.

  • max_iter (int) – Maximum GEE iterations.

  • backend (str | None) – Execution target (device and precision), following the pystatistics convention: 'cpu' (float64, the reference path), 'gpu' (float32), 'gpu_fp64' (float64, CUDA only — raises on Apple Silicon/MPS, which has no float64), or 'auto' (CUDA if present, else CPU). None resolves from the input: a numpy array → 'cpu', a GPU torch.Tensor'gpu'. Precision lives in the backend string; there is no separate use_fp64 flag.

Returns:

Solution wrapping the GEE coefficients, robust SE, correlation parameters, convergence diagnostics, and a summary method, with the uniform .backend_name/.timing/.warnings/.info accessors.

Return type:

GEESolution

Raises:

ValidationError – If inputs fail validation (mismatched lengths, non-finite values, invalid family or correlation structure, fewer than 2 clusters).

Examples

>>> import numpy as np
>>> from pystatsbio.gee import gee
>>> n_clusters, cluster_size = 50, 5
>>> n = n_clusters * cluster_size
>>> cluster_id = np.repeat(np.arange(n_clusters), cluster_size)
>>> X = np.column_stack([np.ones(n), np.random.randn(n)])
>>> y = X @ np.array([1.0, 0.5]) + np.random.randn(n)
>>> result = gee(y, X, cluster_id, family='gaussian')
>>> print(result.summary())
class pystatsbio.gee.CorrStructure[source]

Bases: ABC

Base class for GEE working correlation structures.

Subclasses must implement: - name property: human-readable structure name - working_corr(cluster_size): build the R_i matrix - estimate(pearson_resids, phi): update parameters from residuals - params property: current parameter values as a dict

abstract property name: str

Human-readable correlation structure name.

abstractmethod working_corr(cluster_size)[source]

Return the working correlation matrix R_i for a cluster.

Parameters:

cluster_size (int) – Number of observations in this cluster.

Returns:

Correlation matrix of shape (cluster_size, cluster_size).

Return type:

NDArray

abstractmethod estimate(pearson_resids, phi, n_params)[source]

Estimate correlation parameters from standardized Pearson residuals.

Updates internal state in-place. Each element of pearson_resids is a vector of Pearson residuals for one cluster.

Parameters:
  • pearson_resids (list[NDArray]) – Per-cluster Pearson residual vectors.

  • phi (float) – Current dispersion estimate.

  • n_params (int) – Number of regression parameters (p), used in denominator.

Return type:

None

abstract property params: dict[str, float]

Current parameter values as a dict.

class pystatsbio.gee.IndependenceCorr[source]

Bases: CorrStructure

Independence working correlation: R = I.

Assumes all within-cluster observations are uncorrelated. No parameters to estimate. This is the simplest structure and recovers standard GLM-like estimates (though with sandwich SE).

property name: str

Human-readable correlation structure name.

working_corr(cluster_size)[source]

Return identity matrix.

Parameters:

cluster_size (int) – Number of observations in this cluster.

Returns:

Identity matrix of shape (cluster_size, cluster_size).

Return type:

NDArray

estimate(pearson_resids, phi, n_params)[source]

No-op: independence has no parameters to estimate.

Parameters:
Return type:

None

property params: dict[str, float]

no correlation parameters.

Type:

Empty dict

class pystatsbio.gee.ExchangeableCorr[source]

Bases: CorrStructure

Exchangeable (compound symmetry) working correlation.

R_ij = alpha for i != j, R_ii = 1.

Alpha is estimated as:

alpha = [sum_i sum_{j<k} r_ij * r_ik] / [sum_i n_i*(n_i-1)/2 - p] / phi

where r_ij are Pearson residuals, phi is the dispersion, and p is the number of regression parameters.

References

Liang, K.-Y. & Zeger, S. L. (1986). Biometrika, 73(1), 13-22.

property name: str

Human-readable correlation structure name.

working_corr(cluster_size)[source]

Build exchangeable correlation matrix.

Parameters:

cluster_size (int) – Number of observations in this cluster.

Returns:

Matrix with 1 on diagonal and alpha off-diagonal.

Return type:

NDArray

estimate(pearson_resids, phi, n_params)[source]

Estimate alpha from cross-products of Pearson residuals.

Parameters:
  • pearson_resids (list[NDArray]) – Per-cluster Pearson residual vectors.

  • phi (float) – Current dispersion estimate.

  • n_params (int) – Number of regression parameters (p).

Return type:

None

property params: dict[str, float]

Current alpha estimate.

class pystatsbio.gee.AR1Corr[source]

Bases: CorrStructure

AR(1) working correlation: R_ij = alpha^|i-j|.

Assumes observations within a cluster are ordered (e.g., by time) and adjacent observations have correlation alpha.

Alpha is estimated as:

alpha = [sum_i sum_t r_{i,t} * r_{i,t+1}] / [sum_i (n_i - 1) - p] / phi

References

Liang, K.-Y. & Zeger, S. L. (1986). Biometrika, 73(1), 13-22.

property name: str

Human-readable correlation structure name.

working_corr(cluster_size)[source]

Build AR(1) correlation matrix.

Parameters:

cluster_size (int) – Number of observations in this cluster.

Returns:

Toeplitz matrix with R_ij = alpha^|i-j|.

Return type:

NDArray

estimate(pearson_resids, phi, n_params)[source]

Estimate alpha from consecutive residual products.

Parameters:
  • pearson_resids (list[NDArray]) – Per-cluster Pearson residual vectors.

  • phi (float) – Current dispersion estimate.

  • n_params (int) – Number of regression parameters (p).

Return type:

None

property params: dict[str, float]

Current alpha estimate.

class pystatsbio.gee.UnstructuredCorr[source]

Bases: CorrStructure

Unstructured working correlation.

R_jk is estimated freely for each pair (j, k). This is only feasible when all clusters have the same size, because each R_jk requires data from all clusters at positions j and k.

R_jk = sum_i r_ij * r_ik / (K - p) / phi

where K is the number of clusters and p is the number of parameters.

Raises:

ValidationError – If clusters have unequal sizes during estimation.

References

Liang, K.-Y. & Zeger, S. L. (1986). Biometrika, 73(1), 13-22.

property name: str

Human-readable correlation structure name.

working_corr(cluster_size)[source]

Return the estimated unstructured correlation matrix.

If not yet estimated, returns the identity matrix.

Parameters:

cluster_size (int) – Number of observations in this cluster.

Returns:

Correlation matrix of shape (cluster_size, cluster_size).

Return type:

NDArray

estimate(pearson_resids, phi, n_params)[source]

Estimate all pairwise correlations from residuals.

Parameters:
  • pearson_resids (list[NDArray]) – Per-cluster Pearson residual vectors. All must have the same length.

  • phi (float) – Current dispersion estimate.

  • n_params (int) – Number of regression parameters (p).

Raises:

ValidationError – If cluster sizes are not all equal.

Return type:

None

property params: dict[str, float]

Current correlation matrix entries as a flat dict.