Mixture Distribution

Mixture distributions are useful when a statistical population contains two or more unobserved subpopulations. DMixLearn allows any model to be specified for the components of the mixture distribution. Assuming we have an observation x of data type T (of any, possibly heterogeneous, form), the data generating process for a K-component mixture model is given by

\[\begin{split} \begin{array}{ll} z &\sim \text{Categorical}(\boldsymbol{\pi}) \\ x \vert z = k &\sim f_k(x \vert \theta_k) \end{array}\end{split}\]

where \(\pi_k\) is the probability that x is drawn from component distribution \(f_k(x \vert \theta_k)\), so the entries of \(\boldsymbol{\pi}\) are non-negative and sum to 1. For more details see Mixture Distribution.
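The two-step generating process above can be sketched with plain NumPy. This is only an illustration: the Gaussian components, the function name sample_mixture, and the parameter values are assumptions made for the example and are not part of DMixLearn.

```python
import numpy as np

def sample_mixture(pi, means, sds, size, seed=None):
    """Draw `size` iid samples from a Gaussian mixture (illustrative only)."""
    rng = np.random.RandomState(seed)
    # Step 1: draw latent component labels z ~ Categorical(pi).
    z = rng.choice(len(pi), size=size, p=pi)
    # Step 2: draw x | z = k from component k (here a Gaussian, by assumption).
    x = rng.normal(loc=means[z], scale=sds[z])
    return z, x

pi = np.array([0.3, 0.7])        # mixture weights, sum to 1.0
means = np.array([-2.0, 3.0])    # hypothetical component means
sds = np.array([1.0, 0.5])       # hypothetical component standard deviations
z, x = sample_mixture(pi, means, sds, size=1000, seed=1)
```

In practice z is unobserved; only x would be available to an estimator.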

MixtureDistribution

class dml.stats.mixture.MixtureDistribution(components, w, name=None, keys=(None, None))

MixtureDistribution object defined by component distributions and weights.

components

List of component distributions (data type T).

Type:: Sequence[SequenceEncodableProbabilityDistribution]

w

Mixture weights assigned from args (w).

Type:: ndarray[float]

zw

Element-wise flag: True where the corresponding weight is 0.0, else False.

Type:: ndarray[bool]

log_w

Log of the weights (w), set to -np.inf where zw is True.

Type:: ndarray[float]
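The relationship between w, zw, and log_w described above can be reproduced with a few lines of NumPy. This is a sketch of the documented behaviour, not the library's actual code; masking the zero weights before taking the log is one way to avoid a divide-by-zero warning.

```python
import numpy as np

w = np.array([0.5, 0.0, 0.5])        # example weights with one zero entry

zw = (w == 0.0)                      # True where a weight is exactly 0.0
log_w = np.full(w.shape, -np.inf)    # start from -inf everywhere
log_w[~zw] = np.log(w[~zw])          # take logs only of the non-zero weights
```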

num_components

Number of components in MixtureDistribution instance.

Type:: int

name

String name of the MixtureDistribution object.

Type:: Optional[str]

keys

Set keys for the weights and component distributions.

Type:: Tuple[Optional[str], Optional[str]]

__init__(components, w, name=None, keys=(None, None))

MixtureDistribution object.

Parameters:

components (Sequence[SequenceEncodableProbabilityDistribution]) – Component distributions.
w (ndarray[float]) – Mixture weights, must sum to 1.0.
name (Optional[str]) – Assign string name to MixtureDistribution object.
keys (Tuple[Optional[str], Optional[str]]) – Set keys for the weights and component distributions.

component_log_density(x)

Evaluate component-wise log-density of Mixture distribution at observation x.

Returns num_components dim array with \(\log{\left(f_k(x)\right)}\) in each entry.

Parameters:: x (T) – Single observation from mixture distribution. T is data type of components.
Returns:: Component-wise log-density at x.
Return type:: np.ndarray

density(x)

Evaluate density of Mixture distribution at observation x.

See log_density for details.

Parameters:: x (T) – Single observation from mixture distribution. T is data type of components.
Returns:: Density at x.
Return type:: float

dist_to_encoder()

Create DataSequenceEncoder object for SequenceEncodableProbabilityDistribution instance.

Return type:: MixtureDataEncoder
Returns:: DataSequenceEncoder

estimator(pseudo_count=None)

Create a ParameterEstimator for corresponding SequenceEncodableProbabilityDistribution.

Parameters:: pseudo_count (Optional[float]) – Regularize sufficient statistics in estimation step.
Return type:: MixtureEstimator
Returns:: ParameterEstimator

log_density(x)

Evaluate log-density of Mixture distribution at observation x.

\[\log{f(x)} = \log{\left(\sum_{k=1}^{K} f_k(x) \pi_k\right)}.\]

Parameters:: x (T) – Single observation from mixture distribution. T is data type of components.
Returns:: Log-density at x.
Return type:: float
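A numerically stable way to evaluate this formula is the log-sum-exp trick applied to \(\log f_k(x) + \log \pi_k\). The sketch below assumes Gaussian components and uses hypothetical helper names (gaussian_logpdf, mixture_log_density); it does not reproduce the DMixLearn implementation.

```python
import numpy as np

def gaussian_logpdf(x, mean, sd):
    """Log-density of N(mean, sd^2) at x (broadcasts over arrays)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def mixture_log_density(x, w, means, sds):
    """log f(x) = log sum_k f_k(x) pi_k, via the log-sum-exp trick."""
    w = np.asarray(w, dtype=float)
    nz = w > 0.0                                   # zero-weight components contribute nothing
    terms = gaussian_logpdf(x, means[nz], sds[nz]) + np.log(w[nz])
    m = terms.max()                                # subtract the max before exponentiating
    return float(m + np.log(np.sum(np.exp(terms - m))))

w = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
sds = np.array([1.0, 0.5])
ll = mixture_log_density(0.5, w, means, sds)
```

Working in log space keeps the result finite even far in the tails, where the individual densities \(f_k(x)\) underflow to zero.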

posterior(x)

Obtain the posterior distribution for each mixture component at observation x.

\[f(z=k \vert x ) = \frac{f_k(x) \pi_k}{\sum_{j=1}^{K} f_j(x) \pi_j}\]

Parameters:: x (T) – Single observation from mixture distribution. T is data type of components.
Returns:: Posterior distribution at observation x.
Return type:: np.ndarray
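The posterior is the per-component joint density normalized over the components. A minimal sketch, again assuming Gaussian components with strictly positive weights and hypothetical helper names, computes it in log space and normalizes at the end:

```python
import numpy as np

def gaussian_logpdf(x, mean, sd):
    """Log-density of N(mean, sd^2) at x (broadcasts over arrays)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def mixture_posterior(x, w, means, sds):
    """Posterior probability of each component given a single observation x."""
    log_joint = gaussian_logpdf(x, means, sds) + np.log(w)   # log f_k(x) + log pi_k
    log_joint = log_joint - log_joint.max()                  # stabilize before exp
    post = np.exp(log_joint)
    return post / post.sum()                                 # entries sum to 1

w = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
sds = np.array([1.0, 0.5])
post = mixture_posterior(2.5, w, means, sds)
```

The returned array has one entry per component, matching the num_components-length output described for posterior.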

sampler(seed=None)

Create a DistributionSampler object for a given ProbabilityDistribution.

Parameters:: seed (Optional[int]) – Set seed for drawing samples from distribution.
Return type:: MixtureSampler

seq_component_log_density(x)

Vectorized evaluation of component_log_density.

Parameters:: x (MixtureEncodedDataSequence) – EncodedDataSequence for mixture component.
Returns:: 2-d numpy array of floats with shape (sz, K), where sz is the number of iid observations in the encoded sequence x and K is the number of mixture components.
Return type:: np.ndarray
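The (sz, K) layout can be illustrated with NumPy broadcasting. In this sketch a plain 1-d array stands in for the encoded sequence, which is an assumption; the actual MixtureEncodedDataSequence format is not described here. Reducing the matrix across columns then gives one log-density per observation.

```python
import numpy as np

def gaussian_logpdf(x, mean, sd):
    """Log-density of N(mean, sd^2) at x (broadcasts over arrays)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sd) - 0.5 * ((x - mean) / sd) ** 2

w = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
sds = np.array([1.0, 0.5])
x = np.array([-2.1, 0.0, 2.9, 3.3])            # sz = 4 observations

# Component-wise log-densities: rows index observations, columns index components.
comp_ll = gaussian_logpdf(x[:, None], means[None, :], sds[None, :])   # shape (sz, K)

# Row-wise log-sum-exp of log f_k(x_i) + log pi_k gives the mixture log-density.
joint = comp_ll + np.log(w)[None, :]
m = joint.max(axis=1, keepdims=True)
seq_ll = (m + np.log(np.exp(joint - m).sum(axis=1, keepdims=True)))[:, 0]   # shape (sz,)
```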

seq_log_density(x)

Vectorized evaluation of the log density.

Parameters:: x (EncodedDataSequence) – EncodedDataSequence for corresponding SequenceEncodableProbabilityDistribution.
Return type:: ndarray
Returns:: Log-density of each observation in the encoded sequence.

seq_posterior(x)

Vectorized evaluation of posterior.

Parameters:: x (MixtureEncodedDataSequence) – EncodedDataSequence for mixture component.
Returns:: Posterior of each observation in encoded sequence.
Return type:: np.ndarray

MixtureEstimator

class dml.stats.mixture.MixtureEstimator(estimators, fixed_weights=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))

MixtureEstimator object used to estimate MixtureDistribution from aggregated sufficient statistics.

estimators

Sequence of ParameterEstimator objects for the mixture components.

Type:: Sequence[ParameterEstimator]

fixed_weights

Treat mixture weights as fixed values. Must sum to 1.0.

Type:: Optional[np.ndarray]

suff_stat

Weights of the mixture. Must sum to 1.0.

Type:: Optional[np.ndarray]

pseudo_count

Used to re-weight the member variable sufficient statistics in estimation.

Type:: Optional[float]

name

Name for MixtureEstimator object.

Type:: Optional[str]

keys

Keys for the weights and component distributions.

Type:: Tuple[Optional[str], Optional[str]]

__init__(estimators, fixed_weights=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))

MixtureEstimator object.

Parameters:

estimators (Sequence[ParameterEstimator]) – Sequence of ParameterEstimator objects for the mixture components.
fixed_weights (Optional[Union[Sequence[float], np.ndarray]]) – Set fixed values for mixture weights.
suff_stat (Optional[np.ndarray]) – Numpy array of floats with length equal to length of estimators.
pseudo_count (Optional[float]) – Used to re-weight the member variable sufficient statistics in estimation.
name (Optional[str]) – Set a name for the MixtureEstimator object.
keys (Tuple[Optional[str], Optional[str]]) – Set keys for the weights and component distributions.

accumulator_factory()

Create SequenceEncodableStatisticAccumulator object.

Return type:: MixtureAccumulatorFactory

estimate(nobs, suff_stat)

Estimate SequenceEncodableProbabilityDistribution for sufficient statistics.

Parameters:

nobs (Optional[float]) – Weighted number of observations.
suff_stat (Tuple[int, np.ndarray, np.ndarray, np.ndarray]) – Sufficient statistics for Dirichlet distribution.

Return type:: MixtureDistribution
Returns:: SequenceEncodableProbabilityDistribution
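For intuition, the weight update in a standard EM M-step normalizes the accumulated per-component responsibilities. The sketch below shows that textbook update together with one common smoothing convention for a pseudo count (spreading it evenly over the components). Both the function name estimate_weights and this particular use of pseudo_count are assumptions for illustration; the exact rule applied by MixtureEstimator is not stated in this reference.

```python
import numpy as np

def estimate_weights(comp_counts, pseudo_count=None):
    """Textbook M-step for mixture weights (illustrative, not the library code).

    comp_counts: expected number of observations assigned to each component.
    pseudo_count: optional total smoothing mass, split evenly over components.
    """
    counts = np.asarray(comp_counts, dtype=float)
    if pseudo_count is not None:
        counts = counts + pseudo_count / counts.size
    return counts / counts.sum()

w_hat = estimate_weights([30.0, 70.0])                     # plain normalization
w_reg = estimate_weights([0.0, 100.0], pseudo_count=2.0)   # smoothed away from zero
```

With smoothing, a component that received no observations keeps a small positive weight instead of collapsing to 0.0.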

MixtureSampler

class dml.stats.mixture.MixtureSampler(dist, seed=None)

MixtureSampler used to generate samples from instance of MixtureDistribution.

dist

MixtureDistribution to draw samples from.

Type:: MixtureDistribution

rng

Seeded RandomState for sampling.

Type:: RandomState

comp_samplers

List of DistributionSampler objects for each mixture component.

Type:: Sequence[DistributionSampler]

sample(size=None)

Draw iid samples from a mixture distribution.

The data type drawn from ‘comp_samplers’ is type T, corresponding to the data type of the mixture components.

If size is None, a single sample (of data type T) is drawn and returned. If size is not None, size iid mixture samples are drawn and returned as a List[T].

Parameters:: size (Optional[int]) – Number of iid samples to draw.
Return type:: Union[Sequence[Any], Any]
Returns:: Data type T or Sequence[T].
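The size convention described above (a bare value when size is None, a list otherwise) can be mimicked with a small stand-in class. ToyMixtureSampler is hypothetical and uses Gaussian components so that T is float; it only illustrates the return-type behaviour and the role of a seeded RandomState, not the real MixtureSampler.

```python
import numpy as np

class ToyMixtureSampler:
    """Hypothetical stand-in for a mixture sampler with Gaussian components."""

    def __init__(self, w, means, sds, seed=None):
        self.w = np.asarray(w, dtype=float)
        self.means = np.asarray(means, dtype=float)
        self.sds = np.asarray(sds, dtype=float)
        self.rng = np.random.RandomState(seed)    # seeded generator for reproducibility

    def _draw_one(self):
        k = self.rng.choice(self.w.size, p=self.w)             # pick a component
        return float(self.rng.normal(self.means[k], self.sds[k]))

    def sample(self, size=None):
        if size is None:
            return self._draw_one()                            # single value of type T
        return [self._draw_one() for _ in range(size)]         # list of T

single = ToyMixtureSampler([0.3, 0.7], [-2.0, 3.0], [1.0, 0.5], seed=0).sample()
batch = ToyMixtureSampler([0.3, 0.7], [-2.0, 3.0], [1.0, 0.5], seed=0).sample(size=5)
```

Two samplers built with the same seed produce the same draws, which is useful for reproducible tests.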