Open Source

PyStatistics

GPU-accelerated statistical computing for Python

CPU results validated against R to near machine precision (rtol = 1e-10). Free forever.

What It Is

PyStatistics is a comprehensive statistical computing library for Python that maintains two parallel computational paths:

  • CPU backends are validated against R to near machine precision (rtol = 1e-10). When a CPU result disagrees with R, PyStatistics has a bug.
  • GPU backends prioritize throughput and scalability using FP32 arithmetic, validated against CPU backends with documented tolerances.
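
To make the two tolerance tiers concrete, here is a minimal sketch of a relative-tolerance check at rtol = 1e-10, and of why an FP32 result cannot meet it. It uses only NumPy, not PyStatistics. The reference value and the FP32 tolerance of 1e-6 are illustrative assumptions, not the library's documented settings.

```python
import numpy as np

# Illustrative reference value, standing in for a number exported from R.
r_reference = np.float64(0.123456789012345)

# A float64 result that differs only by rounding noise passes rtol = 1e-10.
cpu_result = r_reference * (1 + 1e-14)
np.testing.assert_allclose(cpu_result, r_reference, rtol=1e-10)

# Rounding the same number to FP32 keeps roughly 7 significant digits,
# so it fails the CPU-tier tolerance ...
gpu_result = np.float32(r_reference)
fp32_passes_cpu_tier = np.isclose(gpu_result, r_reference, rtol=1e-10, atol=0.0)

# ... but passes a looser tolerance (1e-6 is an assumed value for illustration).
fp32_passes_gpu_tier = np.isclose(gpu_result, r_reference, rtol=1e-6, atol=0.0)

print(fp32_passes_cpu_tier, fp32_passes_gpu_tier)
```

This is why the GPU tier is checked against the CPU tier with its own tolerances rather than against R directly.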

The library covers a broad range of classical statistics: regression, survival analysis, ANOVA, mixed models, bootstrap methods, hypothesis testing, descriptive statistics, and multivariate normal MLE with missing data.

Design Principles

1. Correctness > Fidelity > Performance > Convenience
2. Fail fast, fail loud — no silent fallbacks
3. Explicit over implicit — require parameters, don't assume intent
4. Two-tier validation — CPU vs R, then GPU vs CPU
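
As an illustration of principle 2, the sketch below shows what "no silent fallbacks" can look like for backend selection. The function `resolve_backend` and the constant `VALID_BACKENDS` are hypothetical names invented for this example. They are not part of the PyStatistics API, and the library's actual checks may differ.

```python
import importlib.util

# Hypothetical set of backend names, for illustration only.
VALID_BACKENDS = ("cpu", "gpu")


def resolve_backend(name: str) -> str:
    """Return the requested backend name or raise; never substitute another."""
    if name not in VALID_BACKENDS:
        raise ValueError(f"Unknown backend {name!r}; expected one of {VALID_BACKENDS}")
    if name == "gpu" and importlib.util.find_spec("torch") is None:
        # Fail loud: the caller asked for GPU, so do not quietly run on CPU.
        raise RuntimeError("backend='gpu' requires PyTorch, which is not installed")
    return name


print(resolve_backend("cpu"))
```

The point of the design is that a caller who asks for `gpu` either gets GPU execution or gets an error, never CPU numbers presented as GPU results.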

Installation

Coming soon to PyPI. For now, install from source:

pip install git+https://github.com/sgcx-org/pystatistics.git

With GPU support (requires PyTorch):

pip install "pystatistics[gpu] @ git+https://github.com/sgcx-org/pystatistics.git"

Quick Start

Linear Regression

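import numpy as np
from pystatistics.regression import fit

# Simulate 1000 observations of 5 predictors with known coefficients
rng = np.random.default_rng(0)  # seeded for reproducibility
X = rng.standard_normal((1000, 5))
y = X @ np.array([1, 2, 3, -1, 0.5]) + rng.standard_normal(1000) * 0.1
result = fit(X, y)
print(result.summary())

# Logistic regression
y_binary = (X @ np.array([1, -1, 0.5, 0, 0]) + rng.standard_normal(1000) > 0).astype(float)
result = fit(X, y_binary, family='binomial')

# GPU acceleration (any model); requires the GPU install shown above
result = fit(X, y, backend='gpu')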

Hypothesis Testing

from pystatistics.hypothesis import t_test, p_adjust

result = t_test([1,2,3,4,5], [3,4,5,6,7])
print(result.statistic, result.p_value, result.conf_int)
print(result.summary())  # R-style output

# Multiple testing correction
p_adjusted = p_adjust([0.01, 0.04, 0.03, 0.005], method='BH')
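
For readers who want to see what the 'BH' correction computes, here is a minimal NumPy sketch of the Benjamini-Hochberg adjustment, following the same definition as R's p.adjust(method="BH"). The helper name `bh_adjust` is made up for this example. It is an independent illustration, not PyStatistics' implementation.

```python
import numpy as np


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values for a 1-D sequence of p-values."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)[::-1]               # indices from largest to smallest p
    ranks = np.arange(n, 0, -1)               # matching ranks n, n-1, ..., 1
    scaled = p[order] * n / ranks             # p_(i) * n / i
    monotone = np.minimum.accumulate(scaled)  # running minimum from the largest p down
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(monotone, 1.0)  # cap at 1, restore input order
    return adjusted


print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

For the four p-values used above, the adjusted values work out to approximately 0.02, 0.04, 0.04, and 0.02.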

Survival Analysis

from pystatistics.survival import kaplan_meier, coxph
import numpy as np

time = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
event = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

km = kaplan_meier(time, event)
print(km.survival, km.se, km.ci_lower, km.ci_upper)

X = np.column_stack([np.random.randn(10)])
cox = coxph(time, event, X)
print(cox.coefficients, cox.hazard_ratios)
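
The Kaplan-Meier estimate for the small dataset above can be reproduced by hand, which is a handy sanity check. The sketch below uses only NumPy and relies on the times being distinct and already sorted. It is not the library's implementation, and `km.survival` may be laid out differently (for example, reported only at event times).

```python
import numpy as np

time = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
event = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# With distinct, sorted times, the number at risk just before each time
# is simply n, n-1, ..., 1.
at_risk = np.arange(len(time), 0, -1)

# Each event multiplies survival by (1 - 1/at_risk); censored rows multiply by 1.
survival = np.cumprod(1.0 - event / at_risk)

for t, s in zip(time, survival):
    print(t, round(float(s), 4))
```

The survival curve drops only at event times (1, 3, 4, 6, 7, 9, 10) and stays flat at the censored times (2, 5, 8).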

Modules

Every module follows the same architecture: DataSource → Design → fit() → Backend.solve() → Result
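
As a rough picture of that pipeline, the toy sketch below wires the five stages together for ordinary least squares. Every class body, field, and the function `toy_fit` are hypothetical, written only to show how the stages hand off to each other. The real PyStatistics classes are likely to differ in names, fields, and behavior.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class DataSource:  # hypothetical: holds the raw inputs as given
    X: object
    y: object


@dataclass(frozen=True)
class Design:  # hypothetical: validated, numeric model inputs
    X: np.ndarray
    y: np.ndarray

    @classmethod
    def from_source(cls, source):
        X = np.asarray(source.X, dtype=float)
        y = np.asarray(source.y, dtype=float)
        if X.ndim != 2 or y.shape != (X.shape[0],):
            raise ValueError("X must be (n, p) and y must be (n,)")
        return cls(X, y)


@dataclass(frozen=True)
class Result:  # hypothetical: what the caller gets back
    coefficients: np.ndarray


class CPUBackend:  # hypothetical: one solver behind a common solve() method
    def solve(self, design):
        coef, *_ = np.linalg.lstsq(design.X, design.y, rcond=None)
        return Result(coef)


def toy_fit(X, y, backend=None):
    """DataSource -> Design -> Backend.solve() -> Result, in miniature."""
    design = Design.from_source(DataSource(X, y))
    return (backend or CPUBackend()).solve(design)


print(toy_fit(np.eye(3), [1.0, 2.0, 3.0]).coefficients)
```

Keeping validation in the design step and numerics in the backend is what lets a CPU and a GPU solver share everything else.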

regression

Linear & generalized linear models. OLS; logistic and Poisson via IRLS. CPU QR, GPU Cholesky.
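
To show the difference between the two factorizations named here, this NumPy-only sketch solves the same least-squares problem both ways. It illustrates the general techniques, not PyStatistics' backends, and the simulated data is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(200)

# QR route: factor X = QR, then solve R beta = Q^T y.
Q, R = np.linalg.qr(X)  # reduced QR: Q is (200, 3), R is (3, 3)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Cholesky route: form the normal equations X^T X beta = X^T y,
# factor X^T X = L L^T, then solve L z = X^T y and L^T beta = z.
# (np.linalg.solve is a general solver; a dedicated triangular solver
# would normally be used for these steps.)
L = np.linalg.cholesky(X.T @ X)
z = np.linalg.solve(L, X.T @ y)
beta_chol = np.linalg.solve(L.T, z)

print(np.max(np.abs(beta_qr - beta_chol)))
```

QR works on X directly and so avoids squaring its condition number, which suits a precision-first path. The Cholesky route is cheaper and consists mostly of matrix products, which tends to suit GPUs, at some cost in accuracy on ill-conditioned designs.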

API Reference →

descriptive

Mean, SD, correlation, covariance, quantiles (all 9 R types), skewness, kurtosis.
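
The nine R quantile types are the nine Hyndman-Fan definitions. As a point of reference, NumPy (1.22 or later) exposes the same nine through the `method` argument of `np.quantile`. The sketch below shows that mapping. It uses NumPy only and says nothing about how PyStatistics names or selects the types.

```python
import numpy as np

# NumPy method names, keyed by the matching R quantile() type.
R_TYPE_TO_NUMPY_METHOD = {
    1: "inverted_cdf",
    2: "averaged_inverted_cdf",
    3: "closest_observation",
    4: "interpolated_inverted_cdf",
    5: "hazen",
    6: "weibull",
    7: "linear",  # the default in both R and NumPy
    8: "median_unbiased",
    9: "normal_unbiased",
}

x = np.arange(1, 11, dtype=float)  # the values 1, 2, ..., 10
for r_type, method in R_TYPE_TO_NUMPY_METHOD.items():
    print(r_type, method, np.quantile(x, 0.25, method=method))
```

On small samples the types can disagree noticeably, which is why matching R requires matching the type as well as the probability.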

API Reference →

hypothesis

t-test, chi-squared, Fisher exact, Wilcoxon, KS, proportions, F-test, p.adjust.

API Reference →

montecarlo

Bootstrap (ordinary, balanced, parametric), permutation tests, 5 CI methods, batched GPU solver.
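
For orientation, here is a minimal NumPy sketch of an ordinary bootstrap with a percentile confidence interval for the mean. The sample data, the 2000 replicates, and the 95% level are arbitrary choices for the example. It does not use the montecarlo module and is not a description of its API.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=200)

n_boot = 2000
# Ordinary bootstrap: resample n observations with replacement, n_boot times.
idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
boot_means = sample[idx].mean(axis=1)  # all replicates in one batched operation

# Percentile interval: read the limits straight off the bootstrap distribution.
ci_lower, ci_upper = np.quantile(boot_means, [0.025, 0.975])
print(sample.mean(), ci_lower, ci_upper)
```

Computing every replicate as one array operation, rather than looping, is the same idea that makes a batched GPU solver attractive for bootstrap workloads.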

API Reference →

survival

Kaplan-Meier, log-rank test, Cox PH (CPU), discrete-time survival (GPU).

API Reference →

anova

One-way, factorial, ANCOVA, repeated measures. Type I/II/III SS. Tukey, Bonferroni, Dunnett.
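
As a reminder of what the simplest case computes, this sketch works out a one-way ANOVA F statistic by hand with NumPy. The three groups are made-up numbers, and the code is independent of the anova module.

```python
import numpy as np

# Three made-up groups of three observations each.
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = all_values.size - len(groups)

# F is the ratio of the two mean squares.
f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(f_statistic)
```

Here the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6, giving F = 13.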

API Reference →

mixed

LMM & GLMM. Random intercepts/slopes, nested/crossed, REML/ML, Satterthwaite df.

API Reference →

mvnmle

Multivariate normal MLE with missing data. Direct & EM algorithms. Little's MCAR test.

API Reference →