Mixture Distribution

Mixture distributions are useful when a statistical population contains two or more unobserved subpopulations. DMixLearn allows any model to be specified for the components of the mixture distribution. Given an observation x of data type T (any heterogeneous form), the data-generating process for a K-component mixture model is

\[\begin{split} \begin{array}{ll} z &\sim \boldsymbol{\pi} \\ x \vert z = k &\sim f_k(x \vert \theta_k) \end{array}\end{split}\]

where \(z \in \{1, \dots, K\}\) is the unobserved component label and \(\pi_k\) is the probability that x is drawn from component distribution \(f_k(x \vert \theta_k)\). For more details see Mixture Distribution.
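
Summing the joint density over the unobserved label z gives the marginal density of x, which is the quantity that density and log_density evaluate. The weights are non-negative and sum to 1.0:

\[f(x) = \sum_{k=1}^{K} P(z = k)\, f_k(x \vert \theta_k) = \sum_{k=1}^{K} \pi_k f_k(x \vert \theta_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1.\]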

MixtureDistribution

class dml.stats.mixture.MixtureDistribution(components, w, name=None, keys=(None, None))

MixtureDistribution object defined by component distributions and weights.

components

List of component distributions (data type T).

Type:

Sequence[SequenceEncodableProbabilityDistribution]

w

Mixture weights, taken from the constructor argument w.

Type:

ndarray[float]

zw

Element-wise mask: True where the corresponding weight is 0.0, otherwise False.

Type:

ndarray[bool]

log_w

Log of the weights (w). Entries are set to -np.inf where zw is True.

Type:

ndarray[float]
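
The relationship between w, zw, and log_w can be illustrated with a small standalone sketch. It uses plain Python lists rather than NumPy arrays and is not the library's code; the example weights are hypothetical. Masking the zero weights first avoids evaluating the logarithm at 0.0.

```python
import math

# Hypothetical mixture weights with one empty component
w = [0.5, 0.0, 0.5]

# Mask of zero weights, mirroring the role of zw
zw = [wk == 0.0 for wk in w]

# Take logs only where the weight is positive; zero weights map to -inf
log_w = [-math.inf if is_zero else math.log(wk) for wk, is_zero in zip(w, zw)]
```

A component with log weight -inf contributes nothing to the mixture density, which is the intended effect of a zero weight.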

num_components

Number of components in MixtureDistribution instance.

Type:

int

name

Name of the MixtureDistribution object.

Type:

Optional[str]

keys

Keys for the weights and component distributions.

Type:

Tuple[Optional[str], Optional[str]]

__init__(components, w, name=None, keys=(None, None))

MixtureDistribution object.

Parameters:
  • components (Sequence[SequenceEncodableProbabilityDistribution]) – Component distributions.

  • w (ndarray[float]) – Mixture weights, must sum to 1.0.

  • name (Optional[str]) – Assign string name to MixtureDistribution object.

  • keys (Tuple[Optional[str], Optional[str]]) – Set keys for the weights and component distributions.

component_log_density(x)

Evaluate the component-wise log-density of the mixture distribution at observation x.

Returns an array of length num_components whose k-th entry is \(\log{\left(f_k(x)\right)}\).

Parameters:

x (T) – Single observation from mixture distribution. T is data type of components.

Returns:

Component-wise log-density at x.

Return type:

np.ndarray

density(x)

Evaluate density of Mixture distribution at observation x.

See log_density for details.

Parameters:

x (T) – Single observation from mixture distribution. T is data type of components.

Returns:

Density at x.

Return type:

float

dist_to_encoder()

Create DataSequenceEncoder object for SequenceEncodableProbabilityDistribution instance.

Return type:

MixtureDataEncoder

Returns:

DataSequenceEncoder

estimator(pseudo_count=None)

Create a ParameterEstimator for corresponding SequenceEncodableProbabilityDistribution.

Parameters:

pseudo_count (Optional[float]) – Regularize sufficient statistics in estimation step.

Return type:

MixtureEstimator

Returns:

ParameterEstimator

log_density(x)

Evaluate log-density of Mixture distribution at observation x.

\[\log{f(x)} = \log{\left(\sum_{k=1}^{K} f_k(x) \pi_k\right)}.\]

Parameters:

x (T) – Single observation from mixture distribution. T is data type of components.

Returns:

Log-density at x.

Return type:

float
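
The log-density formula above is usually evaluated with the log-sum-exp trick, so that very small component densities do not underflow to zero. The following is a minimal standalone sketch of that technique under assumed Gaussian components. The helper names normal_log_pdf and mixture_log_density are hypothetical and are not part of the library, and the library's own implementation may differ.

```python
import math

def normal_log_pdf(x, mu, sigma):
    """Log-density of a univariate normal; stands in for log f_k(x)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def mixture_log_density(component_log_densities, log_w):
    """Compute log(sum_k f_k(x) * pi_k) from log f_k(x) and log pi_k."""
    # terms[k] = log f_k(x) + log pi_k
    terms = [ll + lw for ll, lw in zip(component_log_densities, log_w)]
    m = max(terms)
    if m == -math.inf:
        # Every component has zero weight or zero density at x
        return -math.inf
    # Subtract the maximum before exponentiating to avoid underflow
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Hypothetical two-component example evaluated at x = 0.0
x = 0.0
log_w = [math.log(0.5), math.log(0.5)]
comp_ll = [normal_log_pdf(x, -1.0, 1.0), normal_log_pdf(x, 1.0, 1.0)]
value = mixture_log_density(comp_ll, log_w)
```

Because both components are equally weighted and symmetric about x = 0.0, the result equals the log-density of either component at that point.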

posterior(x)

Obtain the posterior distribution for each mixture component at observation x.

\[f(z=k \vert x ) = \frac{f_k(x) \pi_k}{\sum_{j=1}^{K} f_j(x) \pi_j}\]

Parameters:

x (T) – Single observation from mixture distribution. T is data type of components.

Returns:

Posterior distribution at observation x.

Return type:

np.ndarray
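
As an illustration of the posterior formula, the sketch below normalizes the weighted component densities in log space. It is a standalone example, not the library's code. The function name mixture_posterior and the input values are hypothetical, and it assumes at least one component has positive weight and positive density at x.

```python
import math

def mixture_posterior(component_log_densities, log_w):
    """Return P(z = k | x) for each component k as a list of floats."""
    # terms[k] = log f_k(x) + log pi_k (unnormalized log posterior)
    terms = [ll + lw for ll, lw in zip(component_log_densities, log_w)]
    # Shift by the maximum so the largest term exponentiates to 1.0
    m = max(terms)
    unnormalized = [math.exp(t - m) for t in terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical values: f_1(x) = 0.2, f_2(x) = 0.1, equal weights
comp_ll = [math.log(0.2), math.log(0.1)]
log_w = [math.log(0.5), math.log(0.5)]
post = mixture_posterior(comp_ll, log_w)
```

With equal weights the posterior is proportional to the component densities, so the first component receives about two thirds of the probability mass in this example.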

sampler(seed=None)

Create a DistributionSampler object for a given ProbabilityDistribution.

Parameters:

seed (Optional[int]) – Set seed for drawing samples from distribution.

Return type:

MixtureSampler

seq_component_log_density(x)

Vectorized evaluation of component_log_density.

Parameters:

x (MixtureEncodedDataSequence) – EncodedDataSequence for mixture component.

Returns:

2-D NumPy array of floats with shape (sz, K), where sz is the number of iid observations in the encoded sequence x and K is the number of mixture components.

Return type:

np.ndarray
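
To make the (sz, K) layout concrete, the sketch below builds the same table for a list of raw observations, with one row per observation and one column per component. It is an assumption-laden illustration: it uses nested lists instead of a NumPy array, works on raw values rather than an encoded sequence, assumes Gaussian components, and the function names are hypothetical.

```python
import math

def normal_log_pdf(x, mu, sigma):
    """Log-density of a univariate normal; stands in for log f_k(x)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def seq_component_log_density_sketch(xs, params):
    """Return table[i][k] = log f_k(xs[i]) for every observation and component."""
    return [[normal_log_pdf(x, mu, sigma) for (mu, sigma) in params] for x in xs]

# Hypothetical data: sz = 3 observations, K = 2 components given as (mu, sigma)
xs = [-1.0, 0.0, 2.5]
params = [(-1.0, 1.0), (1.0, 2.0)]
table = seq_component_log_density_sketch(xs, params)
shape = (len(table), len(table[0]))
```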

seq_log_density(x)

Vectorized evaluation of the log density.

Parameters:

x (EncodedDataSequence) – EncodedDataSequence for corresponding SequenceEncodableProbabilityDistribution.

Return type:

ndarray

Returns:

np.ndarray

seq_posterior(x)

Vectorized evaluation of posterior.

Parameters:

x (MixtureEncodedDataSequence) – EncodedDataSequence for mixture component.

Returns:

Posterior of each observation in encoded sequence.

Return type:

np.ndarray

MixtureEstimator

class dml.stats.mixture.MixtureEstimator(estimators, fixed_weights=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))

MixtureEstimator object used to estimate MixtureDistribution from aggregated sufficient statistics.

estimators

Sequence of ParameterEstimator objects for the mixture components.

Type:

Sequence[ParameterEstimator]

fixed_weights

Treat mixture weights as fixed values. Must sum to 1.0.

Type:

Optional[np.ndarray]

suff_stat

Weights of the mixture. Must sum to 1.0.

Type:

Optional[np.ndarray]

pseudo_count

Used to re-weight the stored sufficient statistics (suff_stat) during estimation.

Type:

Optional[float]

name

Name for MixtureEstimator object.

Type:

Optional[str]

keys

Keys for the weights and component distributions.

Type:

Tuple[Optional[str], Optional[str]]

__init__(estimators, fixed_weights=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))

MixtureEstimator object.

Parameters:
  • estimators (Sequence[ParameterEstimator]) – Sequence of ParameterEstimator objects for the mixture components.

  • fixed_weights (Optional[Union[Sequence[float], np.ndarray]]) – Set fixed values for mixture weights.

  • suff_stat (Optional[np.ndarray]) – NumPy array of floats with length equal to the length of estimators.

  • pseudo_count (Optional[float]) – Used to re-weight the stored sufficient statistics (suff_stat) during estimation.

  • name (Optional[str]) – Name for the MixtureEstimator object.

  • keys (Tuple[Optional[str], Optional[str]]) – Set keys for the weights and component distributions.

accumulator_factory()

Create SequenceEncodableStatisticAccumulator object.

Return type:

MixtureAccumulatorFactory

estimate(nobs, suff_stat)

Estimate a SequenceEncodableProbabilityDistribution from aggregated sufficient statistics.

Parameters:
  • nobs (Optional[float]) – Weighted number of observations.

  • suff_stat (Tuple[int, np.ndarray, np.ndarray, np.ndarray]) – Sufficient statistics for Dirichlet distribution.

Return type:

MixtureDistribution

Returns:

SequenceEncodableProbabilityDistribution
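
For intuition about how mixture weights can be estimated from aggregated statistics, the sketch below normalizes per-component counts, which are the summed posterior probabilities over all observations. This is one common convention and not necessarily what MixtureEstimator does. The function estimate_weights_sketch, the way pseudo_count is spread evenly across components, and the example counts are all assumptions made for illustration.

```python
def estimate_weights_sketch(comp_counts, pseudo_count=None, fixed_weights=None):
    """Estimate mixture weights from per-component expected counts."""
    # Fixed weights bypass estimation entirely
    if fixed_weights is not None:
        return list(fixed_weights)
    k = len(comp_counts)
    if pseudo_count is not None:
        # Assumed regularization: spread the pseudo-count evenly over components
        adjusted = [c + pseudo_count / k for c in comp_counts]
    else:
        adjusted = list(comp_counts)
    total = sum(adjusted)
    # Normalize so the weights sum to 1.0
    return [a / total for a in adjusted]

# Hypothetical expected counts for a two-component mixture
plain = estimate_weights_sketch([30.0, 70.0])
smoothed = estimate_weights_sketch([0.0, 10.0], pseudo_count=1.0)
```

In this sketch a positive pseudo-count keeps a component with zero observed count from receiving a weight of exactly 0.0.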

MixtureSampler

class dml.stats.mixture.MixtureSampler(dist, seed=None)

MixtureSampler used to generate samples from instance of MixtureDistribution.

dist

MixtureDistribution to draw samples from.

Type:

MixtureDistribution

rng

Seeded RandomState for sampling.

Type:

RandomState

comp_samplers

List of DistributionSampler objects for each mixture component.

Type:

Sequence[DistributionSampler]

sample(size=None)

Draw iid samples from a mixture distribution.

The data type drawn from ‘comp_samplers’ is type T, corresponding to the data type of the mixture components.

If size is None, a single sample (of data type T) is drawn and returned. If size is not None, size iid samples are drawn from the mixture and returned as a Sequence with data type List[T].

Parameters:

size (Optional[int]) – Number of iid samples to draw.

Return type:

Union[Sequence[Any], Any]

Returns:

Data type T or Sequence[T].
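
The size semantics described above can be sketched with a small standalone class. It is not the library's MixtureSampler: the class SketchMixtureSampler is hypothetical, Python's random.Random stands in for the seeded RandomState, and the component samplers are assumed to be callables that accept the random generator.

```python
import random

class SketchMixtureSampler:
    """Toy sampler: pick a component by weight, then sample from it."""

    def __init__(self, weights, comp_samplers, seed=None):
        self.weights = list(weights)
        self.comp_samplers = list(comp_samplers)
        self.rng = random.Random(seed)

    def _draw_one(self):
        # Choose a component index k with probability weights[k]
        k = self.rng.choices(range(len(self.weights)), weights=self.weights, k=1)[0]
        # Draw the observation from the chosen component
        return self.comp_samplers[k](self.rng)

    def sample(self, size=None):
        # size=None returns one draw; an integer size returns a list of draws
        if size is None:
            return self._draw_one()
        return [self._draw_one() for _ in range(size)]

# Hypothetical components: two Gaussians with well-separated means
components = [lambda r: r.gauss(-5.0, 1.0), lambda r: r.gauss(5.0, 1.0)]
sampler = SketchMixtureSampler([0.3, 0.7], components, seed=1)
single = sampler.sample()
batch = sampler.sample(size=5)
```

Constructing two samplers with the same seed yields identical draws, which is the purpose of the seed argument.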