PyStatistics

What It Is

PyStatistics is a comprehensive statistical computing library for Python that maintains two parallel computational paths:

- CPU backends are validated against R to near machine precision (rtol = 1e-10). Any CPU result that disagrees with R beyond that tolerance is treated as a bug in PyStatistics.
- GPU backends prioritize throughput and scalability using FP32 arithmetic and are validated against the CPU backends with documented tolerances.
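
To see why the two paths need different tolerances, the sketch below emulates FP32 accumulation in pure Python and compares it with an FP64 reference. It is an illustration only, not part of the PyStatistics test suite: the helper names are made up for this example, and `GPU_RTOL = 1e-4` is an assumed placeholder, since the library's documented GPU tolerances are not listed here.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mean_fp64(values):
    """Reference mean using an accurately rounded FP64 sum."""
    return math.fsum(values) / len(values)


def mean_fp32(values):
    """Mean accumulated in emulated FP32, rounding after every operation."""
    total = 0.0
    for v in values:
        total = to_fp32(total + to_fp32(v))
    return to_fp32(total / len(values))


values = [0.1 * i for i in range(1, 1001)]
reference = mean_fp64(values)
candidate = mean_fp32(values)

CPU_RTOL = 1e-10  # tolerance stated above for CPU vs R
GPU_RTOL = 1e-4   # assumed placeholder, not the library's documented value

fp32_meets_cpu_tol = math.isclose(candidate, reference, rel_tol=CPU_RTOL)
fp32_meets_gpu_tol = math.isclose(candidate, reference, rel_tol=GPU_RTOL)

print("FP32 within CPU tolerance:", fp32_meets_cpu_tol)
print("FP32 within assumed GPU tolerance:", fp32_meets_gpu_tol)
```

The FP32 result cannot satisfy the 1e-10 tolerance because single precision carries only about seven significant decimal digits. That is why the GPU tier is checked against the CPU tier with its own, looser tolerances rather than directly against R.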

The library covers a broad range of classical statistics: regression, survival analysis, ANOVA, mixed models, bootstrap methods, hypothesis testing, descriptive statistics, and multivariate normal MLE with missing data.

Design Principles

1. Correctness > Fidelity > Performance > Convenience
2. Fail fast, fail loud — no silent fallbacks
3. Explicit over implicit — require parameters, don't assume intent
4. Two-tier validation — CPU vs R, then GPU vs CPU
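
As a rough illustration of principles 2 and 3, a backend selector written in this spirit would raise when the requested backend is unavailable instead of quietly switching to another one. The function `resolve_backend` and the exception `BackendUnavailableError` are hypothetical names for this sketch and are not taken from the PyStatistics API.

```python
class BackendUnavailableError(RuntimeError):
    """Hypothetical error: the requested backend cannot be used."""


def resolve_backend(backend: str, gpu_available: bool) -> str:
    """Return the backend name unchanged, or raise. Never substitute another backend."""
    if backend not in ("cpu", "gpu"):
        # Explicit over implicit: unknown values are rejected, not guessed at.
        raise ValueError(f"backend must be 'cpu' or 'gpu', got {backend!r}")
    if backend == "gpu" and not gpu_available:
        # Fail loud: no silent fallback to the CPU path.
        raise BackendUnavailableError("GPU backend requested but no GPU is available")
    return backend


print(resolve_backend("cpu", gpu_available=False))
try:
    resolve_backend("gpu", gpu_available=False)
except BackendUnavailableError as exc:
    print("error:", exc)
```

The point of the design is that a caller who asked for GPU throughput finds out immediately when they are not getting it, rather than discovering a slow CPU run later.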

Installation

Coming soon to PyPI. For now, install from source:

pip install git+https://github.com/sgcx-org/pystatistics.git

With GPU support (requires PyTorch):

pip install "pystatistics[gpu] @ git+https://github.com/sgcx-org/pystatistics.git"

Quick Start

Linear Regression

from pystatistics.regression import fit
import numpy as np

X = np.random.randn(1000, 5)
y = X @ [1, 2, 3, -1, 0.5] + np.random.randn(1000) * 0.1
result = fit(X, y)
print(result.summary())

# Logistic regression
y_binary = (X @ [1, -1, 0.5, 0, 0] + np.random.randn(1000) > 0).astype(float)
result = fit(X, y_binary, family='binomial')

# GPU acceleration (any model)
result = fit(X, y, backend='gpu')

Hypothesis Testing

from pystatistics.hypothesis import t_test, p_adjust

result = t_test([1,2,3,4,5], [3,4,5,6,7])
print(result.statistic, result.p_value, result.conf_int)
print(result.summary())  # R-style output

# Multiple testing correction
p_adjusted = p_adjust([0.01, 0.04, 0.03, 0.005], method='BH')
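
For readers who want to see what the 'BH' correction computes, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment, following the same definition as R's `p.adjust(method = "BH")`. The function `bh_adjust` is written for this example. It is not the PyStatistics implementation, whose internals are not shown here.

```python
from typing import List, Sequence


def bh_adjust(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Visit the p-values from largest to smallest.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        # Scale by n / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
print([round(p, 6) for p in adjusted])
```

For the four p-values used above, the adjusted values work out to approximately 0.02, 0.04, 0.04, and 0.02 in the original order.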

Survival Analysis

from pystatistics.survival import kaplan_meier, coxph
import numpy as np

time = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
event = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

km = kaplan_meier(time, event)
print(km.survival, km.se, km.ci_lower, km.ci_upper)

X = np.random.randn(10, 1)  # one covariate, shape (10, 1)
cox = coxph(time, event, X)
print(cox.coefficients, cox.hazard_ratios)
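
The Kaplan-Meier estimate for the small dataset above can be reproduced by hand with the product-limit formula: at each event time, multiply the running survival by (1 - deaths / number at risk). The sketch below does this in pure Python. The function `km_estimate` is written for this example, and it reports the curve only at event times, which may differ from how `kaplan_meier` lays out its output.

```python
from typing import List, Sequence, Tuple


def km_estimate(time: Sequence[float], event: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (event_time, survival) pairs from the product-limit estimator."""
    if len(time) != len(event):
        raise ValueError("time and event must have the same length")
    survival = 1.0
    curve = []
    for t in sorted(set(time)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for ti in time if ti >= t)
        # Observed events (not censorings) exactly at time t.
        deaths = sum(1 for ti, ei in zip(time, event) if ti == t and ei == 1)
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve


time = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
event = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
for t, s in km_estimate(time, event):
    print(t, round(s, 4))
```

Censored observations (event = 0) never lower the curve directly. They only shrink the risk set for later event times, which is why the drop at time 3 is 1/8 rather than 1/9.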

Modules

Every module follows the same architecture: DataSource → Design → fit() → Backend.solve() → Result
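
The pipeline above can be pictured with a toy example. The sketch below mirrors the stage names from the arrow diagram, but every class body, field, and signature is an assumption made for illustration. The real PyStatistics classes are not shown in this document and will differ.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class DataSource:
    """Raw, unvalidated input (hypothetical fields)."""
    x: Sequence[float]
    y: Sequence[float]


@dataclass(frozen=True)
class Design:
    """Validated, immutable problem definition."""
    x: Tuple[float, ...]
    y: Tuple[float, ...]

    @classmethod
    def from_source(cls, source: DataSource) -> "Design":
        if len(source.x) != len(source.y):
            raise ValueError("x and y must have the same length")
        if len(source.x) < 2:
            raise ValueError("need at least two observations")
        return cls(tuple(map(float, source.x)), tuple(map(float, source.y)))


@dataclass(frozen=True)
class Result:
    intercept: float
    slope: float


class CPUBackend:
    """Toy backend: closed-form simple linear regression."""

    def solve(self, design: Design) -> Result:
        n = len(design.x)
        mean_x = sum(design.x) / n
        mean_y = sum(design.y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in design.x)
        if sxx == 0.0:
            raise ValueError("x has zero variance; slope is undefined")
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(design.x, design.y))
        slope = sxy / sxx
        return Result(intercept=mean_y - slope * mean_x, slope=slope)


def fit(source: DataSource, backend: CPUBackend) -> Result:
    """DataSource -> Design -> Backend.solve() -> Result."""
    design = Design.from_source(source)
    return backend.solve(design)


result = fit(DataSource(x=[0, 1, 2, 3], y=[1, 3, 5, 7]), CPUBackend())
print(result)
```

Separating validation (Design) from computation (Backend.solve) is what would let a CPU and a GPU backend consume the same validated problem and be compared against each other.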

regression

descriptive

hypothesis

montecarlo

survival

anova

mixed

mvnmle