Mixture Distribution
Mixture distributions are useful when a statistical population contains two or more unobserved subpopulations. DMixLearn allows any model to be specified for the components of the mixture distribution. Assuming we have an observation x of data type T (any heterogeneous form), the data generating process for a K-component mixture model is given by
\[f(x) = \sum_{k=1}^{K} \pi_k f_k(x \vert \theta_k),\]
where \(\pi_k\) represents the probability of x being drawn from component distribution \(f_k(x \vert \theta_k)\). For more details see Mixture Distribution.
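As a concrete illustration of a mixture density as a weighted sum of component densities, the following minimal sketch evaluates a two-component Gaussian mixture in plain Python. It does not use DMixLearn; the helper names `normal_pdf` and `mixture_density` and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, params):
    """Weighted sum of component densities: sum_k pi_k * f_k(x | theta_k)."""
    return sum(w * normal_pdf(x, mu, sigma)
               for w, (mu, sigma) in zip(weights, params))


# Hypothetical mixture: weights pi_k and (mean, standard deviation) pairs theta_k.
weights = [0.3, 0.7]
params = [(0.0, 1.0), (4.0, 1.0)]

value = mixture_density(0.0, weights, params)
```

Each weight scales the contribution of its component, so a point near the mean of a low-weight component can still have a smaller mixture density than a point near the mean of a high-weight component.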
MixtureDistribution
- class dml.stats.mixture.MixtureDistribution(components, w, name=None, keys=(None, None))
MixtureDistribution object defined by component distributions and weights.
- components
List of component distributions (data type T).
- Type:
Sequence[SequenceEncodableProbabilityDistribution]
- w
Mixture weights assigned from args (w).
- Type:
ndarray[float]
- zw
True if a weight is 0.0, else False.
- Type:
ndarray[bool]
- log_w
Log of weights (w), set to -np.inf where zw is True.
- Type:
ndarray[float]
- num_components
Number of components in MixtureDistribution instance.
- Type:
int
- name
String name of the MixtureDistribution object.
- Type:
Optional[str]
- keys
Set keys for the weights and component distributions.
- Type:
Tuple[Optional[str], Optional[str]]
- __init__(components, w, name=None, keys=(None, None))
MixtureDistribution object.
- Parameters:
components (Sequence[SequenceEncodableProbabilityDistribution]) – Component distributions.
w (ndarray[float]) – Mixture weights, must sum to 1.0.
name (Optional[str]) – Assign string name to MixtureDistribution object.
keys (Tuple[Optional[str], Optional[str]]) – Set keys for the weights and component distributions.
- component_log_density(x)
Evaluate component-wise log-density of Mixture distribution at observation x.
Returns num_components dim array with \(\log{\left(f_k(x)\right)}\) in each entry.
- Parameters:
x (T) – Single observation from mixture distribution. T is data type of components.
- Returns:
Component-wise log-density at x.
- Return type:
np.ndarray
- density(x)
Evaluate density of Mixture distribution at observation x.
See log_density for details.
- Parameters:
x (T) – Single observation from mixture distribution. T is data type of components.
- Returns:
Density at x.
- Return type:
float
- dist_to_encoder()
Create DataSequenceEncoder object for SequenceEncodableProbabilityDistribution instance.
- Return type:
MixtureDataEncoder
- Returns:
DataSequenceEncoder
- estimator(pseudo_count=None)
Create a ParameterEstimator for corresponding SequenceEncodableProbabilityDistribution.
- Parameters:
pseudo_count (Optional[float]) – Regularize sufficient statistics in estimation step.
- Return type:
- Returns:
ParameterEstimator
- log_density(x)
Evaluate log-density of Mixture distribution at observation x.
\[\log{f(x)} = \log{\left(\sum_{k=1}^{K} f_k(x) \pi_k\right)}.\]
- Parameters:
x (T) – Single observation from mixture distribution. T is data type of components.
- Returns:
Log-density at x.
- Return type:
float
- posterior(x)
Obtain the posterior distribution for each mixture component at observation x.
\[f(z=k \vert x) = \frac{f_k(x) \pi_k}{\sum_{j=1}^{K} f_j(x) \pi_j}\]
- Parameters:
x (T) – Single observation from mixture distribution. T is data type of components.
- Returns:
Posterior distribution at observation x.
- Return type:
np.ndarray
- sampler(seed=None)
Create a DistributionSampler object for a given ProbabilityDistribution.
- Parameters:
seed (Optional[int]) – Set seed for drawing samples from distribution.
- Return type:
- seq_component_log_density(x)
Vectorized evaluation of component_log_density.
- Parameters:
x (MixtureEncodedDataSequence) – EncodedDataSequence for mixture component.
- Returns:
2-d numpy array of floats having shape (sz,K), where sz is the number of iid obs in encoded sequence x, and K is the number of mixture components.
- Return type:
np.ndarray
- seq_log_density(x)
Vectorized evaluation of the log density.
- Parameters:
x (EncodedDataSequence) – EncodedDataSequence for corresponding SequenceEncodableProbabilityDistribution.
- Return type:
ndarray
- Returns:
np.ndarray
- seq_posterior(x)
Vectorized evaluation of posterior.
- Parameters:
x (MixtureEncodedDataSequence) – EncodedDataSequence for mixture component.
- Returns:
Posterior of each observation in encoded sequence.
- Return type:
np.ndarray
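The log-density and posterior formulas documented for this class can be sketched in NumPy as follows. This is an illustrative stand-in under stated assumptions, not the DMixLearn implementation: the components are assumed to be univariate Gaussians, the function names mirror the documented methods only loosely, and the log-sum-exp stabilization is a common numerical technique whose use by the library is not confirmed here. The handling of zero weights follows the documented `zw` and `log_w` attributes.

```python
import numpy as np


def component_log_density(x, means, sds):
    """Log-density of each (assumed Gaussian) component; shape (sz, K)."""
    x = np.asarray(x, dtype=float)[:, None]
    return (-0.5 * ((x - means) / sds) ** 2
            - np.log(sds) - 0.5 * np.log(2.0 * np.pi))


def log_density_and_posterior(x, w, means, sds):
    """Return log f(x) with shape (sz,) and posteriors with shape (sz, K)."""
    w = np.asarray(w, dtype=float)
    zw = (w == 0.0)                        # True where a weight is exactly zero
    log_w = np.full(w.shape, -np.inf)      # log-weights, -inf where zw is True
    log_w[~zw] = np.log(w[~zw])

    ll = component_log_density(x, means, sds) + log_w   # log(f_k(x) * pi_k)
    m = ll.max(axis=1, keepdims=True)                   # row-wise max for stability
    log_f = m[:, 0] + np.log(np.exp(ll - m).sum(axis=1))
    post = np.exp(ll - log_f[:, None])                  # f_k(x) pi_k / f(x)
    return log_f, post


# Hypothetical three-component mixture; the third component has zero weight.
w = np.array([0.5, 0.5, 0.0])
means = np.array([0.0, 4.0, 10.0])
sds = np.array([1.0, 1.0, 1.0])
x = np.array([0.0, 4.0, 2.0])

log_f, post = log_density_and_posterior(x, w, means, sds)
```

Each row of `post` sums to 1, and a component with zero weight receives zero posterior probability for every observation. The observation at 2.0 sits midway between the two equally weighted components, so its posterior is split evenly between them.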
MixtureEstimator
- class dml.stats.mixture.MixtureEstimator(estimators, fixed_weights=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))
MixtureEstimator object used to estimate MixtureDistribution from aggregated sufficient statistics.
- estimators
Sequence of ParameterEstimator objects for the mixture components.
- Type:
Sequence[ParameterEstimator]
- fixed_weights
Treat mixture weights as fixed values. Must sum to 1.0.
- Type:
Optional[np.ndarray]
- suff_stat
Weights of the mixture. Must sum to 1.0.
- Type:
Optional[np.ndarray]
- pseudo_count
Used to re-weight the member variable sufficient statistics in estimation.
- Type:
Optional[float]
- name
Name for MixtureEstimator object.
- Type:
Optional[str]
- keys
Keys for the weights and component distributions.
- Type:
Tuple[Optional[str], Optional[str]]
- __init__(estimators, fixed_weights=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))
MixtureEstimator object.
- Parameters:
estimators (Sequence[ParameterEstimator]) – Sequence of ParameterEstimator objects for the mixture components.
fixed_weights (Optional[Union[Sequence[float], np.ndarray]]) – Set fixed values for mixture weights.
suff_stat (Optional[np.ndarray]) – Numpy array of floats with length equal to length of estimators.
pseudo_count (Optional[float]) – Used to re-weight the member variable sufficient statistics in estimation.
name (Optional[str]) – Set a name to the MixtureEstimator object.
keys (Tuple[Optional[str], Optional[str]]) – Set keys for the weights and component distributions.
- accumulator_factory()
Create SequenceEncodableStatisticAccumulator object.
- Return type:
MixtureAccumulatorFactory
- estimate(nobs, suff_stat)
Estimate SequenceEncodableProbabilityDistribution for sufficient statistics.
- Parameters:
nobs (Optional[float]) – Weighted number of observations.
suff_stat (Tuple[int, np.ndarray, np.ndarray, np.ndarray]) – Sufficient statistics for dirichlet distribution.
- Return type:
- Returns:
SequenceEncodableProbabilityDistribution
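To make the role of `fixed_weights` and `pseudo_count` more concrete, here is a minimal NumPy sketch of a mixture-weight update from accumulated posterior counts. The function `estimate_weights` is hypothetical, and the way `pseudo_count` is applied (spread evenly across components before normalizing) is an assumption; the source does not state how DMixLearn re-weights the sufficient statistics.

```python
import numpy as np


def estimate_weights(comp_counts, pseudo_count=None, fixed_weights=None):
    """Estimate mixture weights from per-component posterior counts."""
    if fixed_weights is not None:
        # Fixed weights bypass estimation entirely.
        return np.asarray(fixed_weights, dtype=float)
    counts = np.asarray(comp_counts, dtype=float)
    if pseudo_count is not None:
        # Assumption: spread the pseudo-count evenly over the K components.
        counts = counts + pseudo_count / len(counts)
    return counts / counts.sum()


# Posterior membership probabilities for four observations, two components.
posteriors = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.3, 0.7],
                       [0.4, 0.6]])
comp_counts = posteriors.sum(axis=0)       # expected count per component

w_mle = estimate_weights(comp_counts)
w_reg = estimate_weights(comp_counts, pseudo_count=2.0)
w_fix = estimate_weights(comp_counts, fixed_weights=[0.5, 0.5])
```

Under this assumption, a positive pseudo-count pulls the estimated weights toward the uniform distribution, which keeps a component from collapsing to zero weight when few observations are assigned to it.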
MixtureSampler
- class dml.stats.mixture.MixtureSampler(dist, seed=None)
MixtureSampler used to generate samples from instance of MixtureDistribution.
- dist
MixtureDistribution to draw samples from.
- Type:
- rng
Seeded RandomState for sampling.
- Type:
RandomState
- comp_samplers
List of DistributionSampler objects for each mixture component.
- Type:
Sequence[DistributionSampler]
- sample(size=None)
Draw iid samples from a mixture distribution.
The data type drawn from ‘comp_samplers’ is type T, corresponding to the data type of the mixture components.
If size is None, a single sample (of data type T) is drawn and returned. If size is not None, size iid mixture samples are drawn and returned as a Sequence with data type List[T].
- Parameters:
size (Optional[int]) – Number of iid samples to draw.
- Return type:
Union[Sequence[Any],Any]
- Returns:
Data type T or Sequence[T].
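The sampling behavior described for `sample` can be sketched as a two-stage draw: first pick a component label according to the mixture weights, then draw from that component. The sketch below is a stand-in, not the DMixLearn sampler: `sample_mixture` is a hypothetical function, and the component samplers are represented as plain callables taking a `RandomState`, which is an assumption about their interface.

```python
import numpy as np


def sample_mixture(w, comp_samplers, rng, size=None):
    """Draw one sample (size=None) or a list of `size` iid mixture samples."""
    n = 1 if size is None else size
    z = rng.choice(len(w), size=n, p=w)          # latent component labels
    draws = [comp_samplers[k](rng) for k in z]   # one draw per chosen component
    return draws[0] if size is None else draws


rng = np.random.RandomState(0)                   # seeded for reproducibility
w = np.array([0.3, 0.7])
# Hypothetical component samplers: two univariate Gaussians.
comp_samplers = [lambda r: r.normal(0.0, 1.0),
                 lambda r: r.normal(4.0, 1.0)]

one = sample_mixture(w, comp_samplers, rng)
many = sample_mixture(w, comp_samplers, rng, size=5)
```

Returning a bare value when `size` is None and a list otherwise mirrors the documented return type `Union[Sequence[Any], Any]`.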