Multivariate Normal MLE¶
Maximum likelihood estimation for multivariate normal distributions with missing data. Direct BFGS and EM algorithms. Little’s MCAR test. Missing data pattern analysis.
Multivariate normal maximum likelihood estimation with missing data.
- Public API:
mlest(data, …) -> MVNSolution
- pystatistics.mvnmle.mlest(data_or_design, *, algorithm='direct', backend=None, method=None, tol=None, max_iter=None, regularize=True, verbose=False)[source]¶
Maximum likelihood estimation for multivariate normal with missing data.
- Accepts EITHER:
An MVNDesign object
Raw data array or DataFrame (convenience)
- Parameters:
data_or_design (array-like or MVNDesign) – Data matrix with NaN for missing values, or MVNDesign object.
algorithm (str) –
Estimation algorithm: - ‘direct’ (default): BFGS optimization on the log-likelihood,
using R-exact inverse Cholesky parameterization.
’em’: Expectation-Maximization algorithm. Typically slower to converge but guaranteed monotone likelihood increase.
’monotone’: Closed-form MLE for monotone missingness patterns (Anderson 1957). Raises ValidationError if the data are not monotone — users should check with
pystatistics.mvnmle.is_monotone()first, or use EM/direct for general patterns. When applicable, this is orders of magnitude faster than iterative algorithms.
backend (str or None) – Backend selection. Default None → ‘cpu’ (R-reference path, validated for regulated-industry use). Explicit values: ‘cpu’, ‘gpu’, or ‘auto’ to prefer GPU when available.
method (str or None) – Optimization method for direct algorithm. If None, auto-selected by backend. Ignored for EM.
tol (float or None) – Convergence tolerance. If None, uses algorithm-appropriate default: direct = 1e-5 (gradient tolerance), em = 1e-4 (parameter change).
max_iter (int or None) – Maximum iterations. If None, uses algorithm-appropriate default: direct = 100, em = 1000.
verbose (bool) – Print progress information.
regularize (bool)
- Return type:
Examples
>>> from pystatistics.mvnmle import mlest, datasets >>> result = mlest(datasets.apple) >>> result_em = mlest(datasets.apple, algorithm='em') >>> print(result.muhat) >>> print(result.loglik)
- class pystatistics.mvnmle.MVNDesign(_data, _n, _p)[source]¶
Bases:
objectDesign for multivariate normal MLE with missing data.
Wraps a data matrix (n observations x p variables) that may contain NaN values representing missing data. Immutable after construction.
- Construction:
MVNDesign.from_array(data) MVNDesign.from_datasource(ds, columns=[‘a’, ‘b’, ‘c’])
- classmethod from_array(data)[source]¶
Build MVNDesign from array-like data.
- Parameters:
data (array-like) – 2D data matrix. Can be numpy array, pandas DataFrame, or any array-like with .values attribute.
- Return type:
- classmethod from_datasource(source, *, columns=None)[source]¶
Build MVNDesign from a DataSource.
- Parameters:
source (DataSource) – Data source providing columns
columns (list of str, optional) – Column names to include. If None, uses all columns.
- Return type:
- class pystatistics.mvnmle.MVNSolution(_result, _design)[source]¶
Bases:
objectUser-facing MVN MLE results.
Wraps the backend Result and provides convenient accessors for all MVN estimation outputs.
- property correlation_matrix: ndarray[tuple[Any, ...], dtype[floating[Any]]]¶
Correlation matrix derived from estimated covariance.
- class pystatistics.mvnmle.MVNParams(muhat, sigmahat, loglik, n_iter, converged, gradient_norm=None)[source]¶
Bases:
objectParameter payload for MVN MLE.
Immutable data computed by backends.
- Parameters:
- pystatistics.mvnmle.analyze_patterns(data)[source]¶
Analyze missingness patterns in the data.
- Parameters:
data (array-like) – Data matrix with missing values as np.nan. Can be NumPy array or pandas DataFrame.
- Returns:
Patterns sorted by frequency (most common first).
- Return type:
List[PatternInfo]
- pystatistics.mvnmle.pattern_summary(patterns, data_shape=None)[source]¶
Generate summary statistics for missingness patterns.
- Parameters:
patterns (List[PatternInfo]) – Pattern information from analyze_patterns()
data_shape (Optional[Tuple[int, int]]) – Original data shape (n_obs, n_vars)
- Return type:
- class pystatistics.mvnmle.PatternInfo(pattern_id, observed_indices, missing_indices, n_cases, data, pattern_vector)[source]¶
Bases:
objectInformation about a single missingness pattern.
- Parameters:
- class pystatistics.mvnmle.PatternSummary(n_patterns, total_cases, overall_missing_rate, most_common_pattern, complete_cases, complete_cases_percent, variable_missing_rates)[source]¶
Bases:
objectSummary statistics for all missingness patterns in a dataset.
- Parameters:
- most_common_pattern: PatternInfo¶
- pystatistics.mvnmle.little_mcar_test(data, alpha=0.05, backend=None, algorithm='em', regularize=True, condition_threshold=1000000000000.0, drop_all_missing_rows=True, verbose=False)[source]¶
Little’s test for Missing Completely at Random (MCAR).
- Parameters:
data (array-like, shape (n_observations, n_variables)) – Data matrix with missing values as np.nan.
alpha (float, default=0.05) – Significance level.
backend (str or None, default None) – Backend for the ML estimation step (the dominant cost). Default None → ‘cpu’ (R-reference path). Explicit: ‘cpu’, ‘gpu’, or ‘auto’ to prefer GPU when available. The per-pattern test-statistic accumulation runs on CPU regardless — it’s O(P × v³) for tiny v and is not the bottleneck.
algorithm (str, default 'em') – ML algorithm forwarded to mlest: ‘em’ (EM, default) or ‘direct’ (BFGS). EM is the default because BFGS convergence scales poorly with the number of missingness patterns — on realistic tabular data (e.g. 13 vars × 107 patterns) BFGS can take 400+ seconds while EM finishes in under a second. Little’s statistic depends only on the final ML estimates, not on which algorithm produced them.
regularize (bool, default True) – When an observed-variable covariance submatrix is ill-conditioned (common on real tabular data — iris, wine, breast_cancer all trip the default threshold), fall back to the Moore-Penrose pseudo-inverse with a warning rather than raising. Matches R’s
BaylorEdPsych::LittleMCARandmisty::na.testdefaults. Pass False for strict fail-fast behaviour.condition_threshold (float, default 1e12) – Maximum condition number before a per-pattern covariance submatrix is considered ill-conditioned.
drop_all_missing_rows (bool, default True) – Drop rows with no observed values before fitting. Such rows contribute nothing to either the MLE or the chi-square statistic; at realistic missingness rates on low-dimensional data they show up routinely and shouldn’t block the test. A warning reports how many rows were dropped.
verbose (bool, default False) – Print detailed progress.
- Return type:
- class pystatistics.mvnmle.MCARTestResult(statistic, df, p_value, rejected, alpha, patterns, n_patterns, n_patterns_used, ml_mean, ml_cov, convergence_warnings, method='Little (MLE plug-in)')[source]¶
Bases:
objectResult of an MCAR test.
Canonical result type for
little_mcar_test(Little 1988, MLE plug-in). The dataclass is also reused by downstream packages that ship their own MCAR-test variants (e.g. Lacuna’s MoM plug-in atlacuna.analysis.mcar.mom_mcar_test) — themethodfield identifies which estimator produced the result so a caller can disambiguate without tracking the calling function.Notes
ml_mean/ml_covcarry the plug-in estimates used in the chi-square statistic.- Parameters:
- patterns: List[PatternInfo]¶
- pystatistics.mvnmle.monotone_permutation(data)[source]¶
Find a variable ordering that makes the missingness pattern monotone, or return
Noneif no such ordering exists.Algorithm: order columns by number of missing values ascending. The pattern is monotone under that order iff each column’s set of missing-row indices is a subset of the next column’s set. Equivalent to “at each prefix, the set of rows with at least one missing among the prefix columns is the set of rows where the last prefix column is missing.”
- Parameters:
data (array-like, shape (n, v))
- Returns:
order – Permutation that yields monotone patterns, or
None.- Return type:
ndarray of int64, shape (v,) or None
- pystatistics.mvnmle.mlest_monotone_closed_form(data)[source]¶
Closed-form MVN MLE for monotone missingness.
Works by ordering variables as per
monotone_permutationso that missingness is a nested sequence, then performing a chain of OLS regressions. Specifically, with variables ordered $X_1, X_2, ldots, X_v$ such that every observation with $X_k$ observed also has $X_1, ldots, X_{k-1}$ observed:$hatmu_1, hatsigma_{11}$ are the sample mean and variance of $X_1$ over all observations (it is never missing under the monotone order).
For $k = 2, ldots, v$: using observations where $X_k$ is observed, fit OLS regression $X_k = beta_{k,0} + sum_{j<k} beta_{k,j} X_j + epsilon_k$ with $mathrm{Var}(epsilon_k) = rho_k^2$. Back- substitute into the full $(mu, Sigma)$ via
$hatmu_k = beta_{k,0} + beta_{k,1:}^top hatmu_{1:k-1}$ $hatSigma_{k, 1:k-1} = beta_{k,1:}^top hatSigma_{1:k-1, 1:k-1}$ $hatSigma_{k, k} = rho_k^2 + beta_{k,1:}^top hatSigma_{1:k-1, 1:k-1} beta_{k,1:}$
Undo the variable permutation to return parameters in the user’s original variable order.
Matches Anderson (1957). Runs in $O(v^3 n)$ deterministic work; no iteration.
- Parameters:
data (array-like, shape (n, v))
- Returns:
mu (ndarray (v,))
sigma (ndarray (v, v))
n_effective (int) – Number of rows that participated in at least one regression (all rows that aren’t all-missing).
- Raises:
ValidationError – If the data are not monotone, or if a regression step has fewer observations than predictors (identifiability failure).
- Return type:
tuple[ndarray[tuple[Any, …], dtype[_ScalarT]], ndarray[tuple[Any, …], dtype[_ScalarT]], int]