Multivariate Normal MLE

Maximum likelihood estimation for multivariate normal distributions with missing data. Direct BFGS and EM algorithms. Little’s MCAR test. Missing data pattern analysis.

Multivariate normal maximum likelihood estimation with missing data.

Public API:

mlest(data, …) -> MVNSolution

pystatistics.mvnmle.mlest(data_or_design, *, algorithm='direct', backend=None, method=None, tol=None, max_iter=None, regularize=True, verbose=False)[source]

Maximum likelihood estimation for multivariate normal with missing data.

Accepts EITHER:
  1. An MVNDesign object

  2. Raw data array or DataFrame (convenience)

Parameters:
  • data_or_design (array-like or MVNDesign) – Data matrix with NaN for missing values, or MVNDesign object.

  • algorithm (str) –

    Estimation algorithm: - ‘direct’ (default): BFGS optimization on the log-likelihood,

    using R-exact inverse Cholesky parameterization.

    • ’em’: Expectation-Maximization algorithm. Typically slower to converge but guaranteed monotone likelihood increase.

    • ’monotone’: Closed-form MLE for monotone missingness patterns (Anderson 1957). Raises ValidationError if the data are not monotone — users should check with pystatistics.mvnmle.is_monotone() first, or use EM/direct for general patterns. When applicable, this is orders of magnitude faster than iterative algorithms.

  • backend (str or None) – Backend selection. Default None → ‘cpu’ (R-reference path, validated for regulated-industry use). Explicit values: ‘cpu’, ‘gpu’, or ‘auto’ to prefer GPU when available.

  • method (str or None) – Optimization method for direct algorithm. If None, auto-selected by backend. Ignored for EM.

  • tol (float or None) – Convergence tolerance. If None, uses algorithm-appropriate default: direct = 1e-5 (gradient tolerance), em = 1e-4 (parameter change).

  • max_iter (int or None) – Maximum iterations. If None, uses algorithm-appropriate default: direct = 100, em = 1000.

  • verbose (bool) – Print progress information.

  • regularize (bool)

Return type:

MVNSolution

Examples

>>> from pystatistics.mvnmle import mlest, datasets
>>> result = mlest(datasets.apple)
>>> result_em = mlest(datasets.apple, algorithm='em')
>>> print(result.muhat)
>>> print(result.loglik)
class pystatistics.mvnmle.MVNDesign(_data, _n, _p)[source]

Bases: object

Design for multivariate normal MLE with missing data.

Wraps a data matrix (n observations x p variables) that may contain NaN values representing missing data. Immutable after construction.

Construction:

MVNDesign.from_array(data) MVNDesign.from_datasource(ds, columns=[‘a’, ‘b’, ‘c’])

Parameters:
classmethod from_array(data)[source]

Build MVNDesign from array-like data.

Parameters:

data (array-like) – 2D data matrix. Can be numpy array, pandas DataFrame, or any array-like with .values attribute.

Return type:

MVNDesign

classmethod from_datasource(source, *, columns=None)[source]

Build MVNDesign from a DataSource.

Parameters:
  • source (DataSource) – Data source providing columns

  • columns (list of str, optional) – Column names to include. If None, uses all columns.

Return type:

MVNDesign

property data: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Data matrix (n x p), may contain NaN.

property n: int

Number of observations.

property p: int

Number of variables.

property n_missing: int

Total number of missing values.

property missing_rate: float

Overall missing rate (0.0 to 1.0).

property has_missing: bool

Whether data has any missing values.

class pystatistics.mvnmle.MVNSolution(_result, _design)[source]

Bases: object

User-facing MVN MLE results.

Wraps the backend Result and provides convenient accessors for all MVN estimation outputs.

Parameters:
property muhat: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Estimated mean vector.

property sigmahat: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Estimated covariance matrix.

property loglik: float

Log-likelihood at the estimated parameters.

property converged: bool

Whether the optimization converged.

property n_iter: int

Number of iterations.

property gradient_norm: float | None

Final gradient norm, if available.

property correlation_matrix: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Correlation matrix derived from estimated covariance.

property standard_deviations: ndarray[tuple[Any, ...], dtype[floating[Any]]]

Standard deviations from estimated covariance.

property aic: float

Akaike Information Criterion.

property bic: float

Bayesian Information Criterion.

property info: dict[str, Any]

Backend metadata.

property timing: dict[str, float] | None

Execution timing breakdown.

property backend_name: str

Name of the backend that produced this result.

property warnings: tuple[str, ...]

Non-fatal warnings from computation.

summary()[source]

Generate summary output.

Return type:

str

to_dict()[source]

Convert to dictionary for serialization.

Return type:

dict[str, Any]

class pystatistics.mvnmle.MVNParams(muhat, sigmahat, loglik, n_iter, converged, gradient_norm=None)[source]

Bases: object

Parameter payload for MVN MLE.

Immutable data computed by backends.

Parameters:
muhat: ndarray[tuple[Any, ...], dtype[floating[Any]]]
sigmahat: ndarray[tuple[Any, ...], dtype[floating[Any]]]
loglik: float
n_iter: int
converged: bool
gradient_norm: float | None = None
pystatistics.mvnmle.analyze_patterns(data)[source]

Analyze missingness patterns in the data.

Parameters:

data (array-like) – Data matrix with missing values as np.nan. Can be NumPy array or pandas DataFrame.

Returns:

Patterns sorted by frequency (most common first).

Return type:

List[PatternInfo]

pystatistics.mvnmle.pattern_summary(patterns, data_shape=None)[source]

Generate summary statistics for missingness patterns.

Parameters:
  • patterns (List[PatternInfo]) – Pattern information from analyze_patterns()

  • data_shape (Optional[Tuple[int, int]]) – Original data shape (n_obs, n_vars)

Return type:

PatternSummary

class pystatistics.mvnmle.PatternInfo(pattern_id, observed_indices, missing_indices, n_cases, data, pattern_vector)[source]

Bases: object

Information about a single missingness pattern.

Parameters:
pattern_id: int
observed_indices: ndarray
missing_indices: ndarray
n_cases: int
data: ndarray
pattern_vector: ndarray
property n_observed: int
property n_missing: int
property percent_cases: float
class pystatistics.mvnmle.PatternSummary(n_patterns, total_cases, overall_missing_rate, most_common_pattern, complete_cases, complete_cases_percent, variable_missing_rates)[source]

Bases: object

Summary statistics for all missingness patterns in a dataset.

Parameters:
n_patterns: int
total_cases: int
overall_missing_rate: float
most_common_pattern: PatternInfo
complete_cases: int
complete_cases_percent: float
variable_missing_rates: Dict[int, float]
pystatistics.mvnmle.little_mcar_test(data, alpha=0.05, backend=None, algorithm='em', regularize=True, condition_threshold=1000000000000.0, drop_all_missing_rows=True, verbose=False)[source]

Little’s test for Missing Completely at Random (MCAR).

Parameters:
  • data (array-like, shape (n_observations, n_variables)) – Data matrix with missing values as np.nan.

  • alpha (float, default=0.05) – Significance level.

  • backend (str or None, default None) – Backend for the ML estimation step (the dominant cost). Default None → ‘cpu’ (R-reference path). Explicit: ‘cpu’, ‘gpu’, or ‘auto’ to prefer GPU when available. The per-pattern test-statistic accumulation runs on CPU regardless — it’s O(P × v³) for tiny v and is not the bottleneck.

  • algorithm (str, default 'em') – ML algorithm forwarded to mlest: ‘em’ (EM, default) or ‘direct’ (BFGS). EM is the default because BFGS convergence scales poorly with the number of missingness patterns — on realistic tabular data (e.g. 13 vars × 107 patterns) BFGS can take 400+ seconds while EM finishes in under a second. Little’s statistic depends only on the final ML estimates, not on which algorithm produced them.

  • regularize (bool, default True) – When an observed-variable covariance submatrix is ill-conditioned (common on real tabular data — iris, wine, breast_cancer all trip the default threshold), fall back to the Moore-Penrose pseudo-inverse with a warning rather than raising. Matches R’s BaylorEdPsych::LittleMCAR and misty::na.test defaults. Pass False for strict fail-fast behaviour.

  • condition_threshold (float, default 1e12) – Maximum condition number before a per-pattern covariance submatrix is considered ill-conditioned.

  • drop_all_missing_rows (bool, default True) – Drop rows with no observed values before fitting. Such rows contribute nothing to either the MLE or the chi-square statistic; at realistic missingness rates on low-dimensional data they show up routinely and shouldn’t block the test. A warning reports how many rows were dropped.

  • verbose (bool, default False) – Print detailed progress.

Return type:

MCARTestResult

class pystatistics.mvnmle.MCARTestResult(statistic, df, p_value, rejected, alpha, patterns, n_patterns, n_patterns_used, ml_mean, ml_cov, convergence_warnings, method='Little (MLE plug-in)')[source]

Bases: object

Result of an MCAR test.

Canonical result type for little_mcar_test (Little 1988, MLE plug-in). The dataclass is also reused by downstream packages that ship their own MCAR-test variants (e.g. Lacuna’s MoM plug-in at lacuna.analysis.mcar.mom_mcar_test) — the method field identifies which estimator produced the result so a caller can disambiguate without tracking the calling function.

Notes

ml_mean / ml_cov carry the plug-in estimates used in the chi-square statistic.

Parameters:
statistic: float
df: int
p_value: float
rejected: bool
alpha: float
patterns: List[PatternInfo]
n_patterns: int
n_patterns_used: int
ml_mean: ndarray
ml_cov: ndarray
convergence_warnings: List[str]
method: str = 'Little (MLE plug-in)'
summary()[source]

Generate human-readable summary of test results.

Return type:

str

pystatistics.mvnmle.is_monotone(data)[source]

Return True iff the missingness pattern is monotone.

Parameters:

data (array-like, shape (n, v)) – Data matrix with NaN for missing values.

Return type:

bool

pystatistics.mvnmle.monotone_permutation(data)[source]

Find a variable ordering that makes the missingness pattern monotone, or return None if no such ordering exists.

Algorithm: order columns by number of missing values ascending. The pattern is monotone under that order iff each column’s set of missing-row indices is a subset of the next column’s set. Equivalent to “at each prefix, the set of rows with at least one missing among the prefix columns is the set of rows where the last prefix column is missing.”

Parameters:

data (array-like, shape (n, v))

Returns:

order – Permutation that yields monotone patterns, or None.

Return type:

ndarray of int64, shape (v,) or None

pystatistics.mvnmle.mlest_monotone_closed_form(data)[source]

Closed-form MVN MLE for monotone missingness.

Works by ordering variables as per monotone_permutation so that missingness is a nested sequence, then performing a chain of OLS regressions. Specifically, with variables ordered $X_1, X_2, ldots, X_v$ such that every observation with $X_k$ observed also has $X_1, ldots, X_{k-1}$ observed:

  1. $hatmu_1, hatsigma_{11}$ are the sample mean and variance of $X_1$ over all observations (it is never missing under the monotone order).

  2. For $k = 2, ldots, v$: using observations where $X_k$ is observed, fit OLS regression $X_k = beta_{k,0} + sum_{j<k} beta_{k,j} X_j + epsilon_k$ with $mathrm{Var}(epsilon_k) = rho_k^2$. Back- substitute into the full $(mu, Sigma)$ via

    $hatmu_k = beta_{k,0} + beta_{k,1:}^top hatmu_{1:k-1}$ $hatSigma_{k, 1:k-1} = beta_{k,1:}^top hatSigma_{1:k-1, 1:k-1}$ $hatSigma_{k, k} = rho_k^2 + beta_{k,1:}^top hatSigma_{1:k-1, 1:k-1} beta_{k,1:}$

  3. Undo the variable permutation to return parameters in the user’s original variable order.

Matches Anderson (1957). Runs in $O(v^3 n)$ deterministic work; no iteration.

Parameters:

data (array-like, shape (n, v))

Returns:

  • mu (ndarray (v,))

  • sigma (ndarray (v, v))

  • n_effective (int) – Number of rows that participated in at least one regression (all rows that aren’t all-missing).

Raises:

ValidationError – If the data are not monotone, or if a regression step has fewer observations than predictors (identifiability failure).

Return type:

tuple[ndarray[tuple[Any, …], dtype[_ScalarT]], ndarray[tuple[Any, …], dtype[_ScalarT]], int]