Hierarchical Mixture Distribution
A Hierarchical Mixture Model is a statistical model that stacks multiple layers of mixture models to capture complex data distributions. It is particularly useful when data can be grouped into subpopulations, each of which may follow its own mixture distribution. This model is often referred to as a “mixture of mixtures.” The hierarchical mixture requires each observation to be a sequence of the form \(\boldsymbol{X}_i = (X_{i,1}, ..., X_{i, n_i})\). Each sequence can be thought of as independent draws from a topic, where the topic distribution is itself a mixture model.
| Feature | Symbol | Description |
|---|---|---|
| Outer-State Distribution | \(\boldsymbol{\pi}\) | Represents the distribution over the \(K_1\) outer states. |
| Inner-State Probabilities | \(\boldsymbol{\tau}_{k}\) | Distribution over the \(K_2\) inner-states. |
| Component Distributions | \(f_{k}(x)\) | The probability distribution of the observed data given a specific inner state. |
| Length Distribution | \(g(\cdot)\) | Distribution for the lengths of each observation. |
The generative process for data \(\boldsymbol{X}_i = (X_{i,1}, ..., X_{i, n_i})\) is as follows,
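The individual steps are not reproduced on this page. A sketch that is consistent with the symbols in the table above is given here. It assumes the inner weights are indexed by the outer state, written \(\boldsymbol{\tau}_{z}\) to keep the outer index \(z\) distinct from the inner index \(k\). The latent labels \(Z_i\) and \(Y_{i,j}\) are names introduced only for illustration.

```latex
\begin{aligned}
n_i &\sim g(\cdot), \\
Z_i &\sim \mathrm{Categorical}(\boldsymbol{\pi}), \\
Y_{i,j} \mid Z_i = z &\sim \mathrm{Categorical}(\boldsymbol{\tau}_{z}), \qquad j = 1, \dots, n_i, \\
X_{i,j} \mid Y_{i,j} = k &\sim f_{k}(\cdot).
\end{aligned}
```

Under these assumptions, marginalizing the latent labels gives the density of one observation:

```latex
p(\boldsymbol{X}_i)
  = g(n_i) \sum_{z=1}^{K_1} \pi_z \prod_{j=1}^{n_i} \sum_{k=1}^{K_2} \tau_{z,k}\, f_{k}(X_{i,j}).
```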
HierarchicalMixtureDistribution
- class dml.stats.hmixture.HierarchicalMixtureDistribution(topics, mixture_weights, topic_weights, len_dist=NullDistribution(name=None), name=None, keys=(None, None))
HierarchicalMixtureDistribution object defining a hierarchical mixture distribution.
- topics
Topic distributions shared in hierarchical mixture distribution.
- Type:
Sequence[SequenceEncodableProbabilityDistribution]
- num_topics
Number of topic distributions (i.e. sets number of inner-mixture weights).
- Type:
int
- num_mixtures
Number of weights in the outer mixture (i.e. sets the number of top-layer mixture weights).
- Type:
int
- w
1-d numpy array of outer-mixture weights. Should sum to 1.
- Type:
np.ndarray
- log_w
Numpy array of the log of w above.
- Type:
np.ndarray
- taus
2-d array of inner-mixture weights with dimension (num_mixtures by num_topics). Each row should sum to 1.
- Type:
np.ndarray
- log_taus
2-d array of the log of taus above.
- Type:
np.ndarray
- len_dist
Distribution for the sequence length on topics. Defaults to the NullDistribution if None is passed.
- name
Name for object instance.
- Type:
Optional[str]
- keys
Keys for the weights and topics.
- Type:
Tuple[Optional[str], Optional[str]]
- __init__(topics, mixture_weights, topic_weights, len_dist=NullDistribution(name=None), name=None, keys=(None, None))
HierarchicalMixtureDistribution object.
- Parameters:
topics (Sequence[SequenceEncodableProbabilityDistribution]) – Topic distributions shared in hierarchical mixture distribution.
mixture_weights (Union[List[float], np.ndarray]) – 1-d array of floats for the weights on the outer-mixture components. Should sum to 1.0.
topic_weights (Union[List[List[float]], np.ndarray]) – 2-d array whose rows contain the topic weights for each outer-mixture component. All rows should sum to 1.0.
len_dist (Optional[SequenceEncodableProbabilityDistribution]) – Distribution for the length of the sequences drawn from the component mixtures.
name (Optional[str]) – Set name for object instance.
keys (Optional[Tuple[Optional[str], Optional[str]]]) – Set keys for the weights and topics.
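The shape and normalization constraints on `mixture_weights` and `topic_weights` can be sketched with plain NumPy. This is an illustrative check only, not the constructor's actual validation logic. The helper name `check_weights` and the example values are hypothetical.

```python
import numpy as np

def check_weights(mixture_weights, topic_weights, num_topics):
    """Validate shapes and normalization; return (w, taus, log_w, log_taus)."""
    w = np.asarray(mixture_weights, dtype=float)
    taus = np.asarray(topic_weights, dtype=float)
    if w.ndim != 1:
        raise ValueError("mixture_weights must be 1-d")
    if taus.shape != (w.shape[0], num_topics):
        raise ValueError("topic_weights must be (num_mixtures, num_topics)")
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("mixture_weights must sum to 1")
    if not np.allclose(taus.sum(axis=1), 1.0):
        raise ValueError("each row of topic_weights must sum to 1")
    # log(0) = -inf is acceptable for weights that are exactly zero.
    with np.errstate(divide="ignore"):
        return w, taus, np.log(w), np.log(taus)

# Hypothetical example: 2 outer components sharing 3 topics.
mixture_weights = [0.6, 0.4]
topic_weights = [[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]]
w, taus, log_w, log_taus = check_weights(mixture_weights, topic_weights, num_topics=3)
```

The returned arrays mirror the `w`, `log_w`, `taus`, and `log_taus` attributes described above.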
- component_log_density(x)
Evaluate the component-wise log-density for an observation from a hierarchical mixture model.
- Parameters:
x (Sequence[T]) – An observation from a hierarchical mixture model.
- Returns:
Numpy array of length ‘num_mixtures’.
- Return type:
np.ndarray
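As an illustration of what a component-wise log-density means for this model, the sketch below evaluates, for each outer component, the log-probability of one sequence under that component's inner mixture. It assumes univariate Gaussian topics and omits the length term. The function names and parameter values are hypothetical, and the library's own implementation may differ.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis (finite inputs)."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a univariate normal distribution."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def component_log_density(x, log_taus, mus, sigmas):
    """Return an array of length num_mixtures for one observed sequence x."""
    x = np.asarray(x, dtype=float)
    # log f_k(x_t) for every item t and topic k: shape (n, num_topics)
    log_f = gaussian_logpdf(x[:, None], mus[None, :], sigmas[None, :])
    # log sum_k tau[j, k] * f_k(x_t): shape (n, num_mixtures)
    per_item = logsumexp(log_taus[None, :, :] + log_f[:, None, :], axis=2)
    # Items are independent given the outer component, so log-terms add up.
    return per_item.sum(axis=0)

# Hypothetical parameters: 2 outer components, 3 Gaussian topics.
taus = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
mus = np.array([0.0, 2.0, 4.0])
sigmas = np.array([1.0, 1.0, 1.0])
comp_ll = component_log_density([0.1, 2.0], np.log(taus), mus, sigmas)
```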
- density(x)
Evaluate the density of an observation from hierarchical mixture distribution.
- Parameters:
x (Sequence[T]) – A sequence of observations of data type T.
- Returns:
Density evaluated at x.
- Return type:
float
- dist_to_encoder()
Create DataSequenceEncoder object for SequenceEncodableProbabilityDistribution instance.
- Return type:
HierarchicalMixtureDataEncoder
- Returns:
DataSequenceEncoder
- estimator(pseudo_count=None)
Create a ParameterEstimator for corresponding SequenceEncodableProbabilityDistribution.
- Parameters:
pseudo_count (Optional[float]) – Regularize sufficient statistics in estimation step.
- Returns:
ParameterEstimator
- log_density(x)
Evaluate the log density of an observation from hierarchical mixture distribution.
Note: Observation is a sequence.
- Parameters:
x (Sequence[T]) – A sequence of observations of data type T.
- Returns:
Log-density evaluated at x.
- Return type:
float
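The relationship between the component-wise log-densities and the overall log-density can be sketched as a weighted log-sum-exp. This is a minimal illustration assuming the component log-densities are already available. The function name and the optional `log_len` term (the log-probability of the sequence length) are hypothetical.

```python
import numpy as np

def log_density_from_components(comp_ll, w, log_len=0.0):
    """log p(x) = log_len + log sum_j w[j] * exp(comp_ll[j])."""
    log_joint = np.log(w) + np.asarray(comp_ll, dtype=float)
    m = log_joint.max()  # subtract the max for numerical stability
    return float(log_len + m + np.log(np.exp(log_joint - m).sum()))

# Hypothetical component log-densities for 2 outer components.
comp_ll = np.array([-3.2, -5.0])
w = np.array([0.6, 0.4])
ll = log_density_from_components(comp_ll, w)
dens = np.exp(ll)  # density is the exponential of the log-density
```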
- posterior(x)
Compute the posterior over the mixture components for the outer-mixture at observed value x.
- Parameters:
x (Sequence[T]) – An observed sequence of data type T.
- Returns:
Numpy array of length ‘num_mixtures’.
- Return type:
np.ndarray
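The posterior over outer components follows from Bayes' rule applied to the component-wise log-densities. The sketch below assumes those log-densities are given. The function name and values are hypothetical and do not reproduce the library's code.

```python
import numpy as np

def posterior_from_components(comp_ll, w):
    """Posterior P(outer component j | x), proportional to w[j] * exp(comp_ll[j])."""
    log_joint = np.log(w) + np.asarray(comp_ll, dtype=float)
    log_joint = log_joint - log_joint.max()  # stabilise before exponentiating
    unnorm = np.exp(log_joint)
    return unnorm / unnorm.sum()

# Hypothetical component log-densities for 2 outer components.
comp_ll = np.array([-3.2, -5.0])
w = np.array([0.6, 0.4])
post = posterior_from_components(comp_ll, w)
```

Any length term is common to all outer components, so it cancels in the normalization and does not affect the posterior.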
- sampler(seed=None)
Create a DistributionSampler object for a given ProbabilityDistribution.
- Parameters:
seed (Optional[int]) – Set seed for drawing samples from distribution.
- seq_component_log_density(x)
Vectorized evaluation of the outer-mixture component-wise log-density for an encoded sequence x.
This returns a numpy array with shape (rv[0], ‘num_mixtures’).
Note
This density is a mixture of sequences of mixtures, so the per-item values must be bin-counted by sequence as the last step in the code.
- Parameters:
x (HierarchicalMixtureEncodedDataSequence) – EncodedDataSequence for Hierarchical mixture observations.
- Returns:
Numpy array of dimensions ‘rv[0]’ by ‘num_mixtures’, containing the log-density for each component of the outer mixture.
- Return type:
np.ndarray
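The bin-counting step mentioned in the note can be illustrated as follows: all sequences are flattened into one array, per-item log-terms are computed in a single vectorized pass, and the terms are then summed back per sequence with `np.bincount`. This sketch assumes univariate Gaussian topics and a simple flattening in place of the library's encoder. All names and values are hypothetical.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis (finite inputs)."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def seq_component_log_density(seqs, log_taus, mus, sigmas):
    """Return an array of shape (len(seqs), num_mixtures)."""
    lengths = np.array([len(s) for s in seqs])
    flat = np.concatenate([np.asarray(s, dtype=float) for s in seqs])
    # seq_idx[t] is the index of the sequence that flat item t came from.
    seq_idx = np.repeat(np.arange(len(seqs)), lengths)
    # Gaussian log f_k for every flat item: shape (N, num_topics)
    z = (flat[:, None] - mus[None, :]) / sigmas[None, :]
    log_f = -0.5 * np.log(2.0 * np.pi * sigmas[None, :] ** 2) - 0.5 * z ** 2
    # Inner mixture per item and outer component: shape (N, num_mixtures)
    per_item = logsumexp(log_taus[None, :, :] + log_f[:, None, :], axis=2)
    out = np.zeros((len(seqs), log_taus.shape[0]))
    for j in range(log_taus.shape[0]):
        # Bin-count: sum the per-item terms belonging to each sequence.
        out[:, j] = np.bincount(seq_idx, weights=per_item[:, j], minlength=len(seqs))
    return out

# Hypothetical parameters and data: 3 sequences of different lengths.
taus = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
mus = np.array([0.0, 2.0, 4.0])
sigmas = np.array([1.0, 1.0, 1.0])
seqs = [[0.1, 2.0], [1.5], [-0.3, 0.4, 3.1]]
seq_ll = seq_component_log_density(seqs, np.log(taus), mus, sigmas)
```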
- seq_log_density(x)
Vectorized evaluation of the log density.
- Parameters:
x (EncodedDataSequence) – EncodedDataSequence for corresponding SequenceEncodedProbabilityDistribution.
- Return type:
ndarray
- Returns:
np.ndarray
- seq_posterior(x)
Vectorized evaluation of the posterior over each outer-mixture component for an encoded sequence x.
- Parameters:
x (HierarchicalMixtureEncodedDataSequence) – EncodedDataSequence for Hierarchical mixture observations.
- Returns:
Numpy array of dimension (x[0], ‘num_mixtures’) containing posteriors for each observation.
- Return type:
np.ndarray
- to_mixture()
Returns a MixtureDistribution object created from object instance.
HierarchicalMixtureEstimator
- class dml.stats.hmixture.HierarchicalMixtureEstimator(estimators, num_mixtures, len_estimator=<dml.stats.null_dist.NullEstimator object>, len_dist=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))
HierarchicalMixtureEstimator object for estimating a hierarchical mixture distribution from aggregated sufficient statistics.
Note: If pseudo_count is passed, the mixture weights are re-weighted in estimation. If attribute suff_stat is set, it is re-weighted and combined with the new sufficient statistics in estimation.
- num_components
Number of topic distributions (inner-mixture).
- Type:
int
- num_mixtures
Number of outer-mixture components.
- Type:
int
- estimators
ParameterEstimator objects for the topics.
- Type:
Sequence[ParameterEstimator]
- pseudo_count
Re-weight ‘suff_stat’ in estimation.
- Type:
Optional[float]
- suff_stat
2-d numpy array of dimension (num_components, num_mixtures). Represents the inner-mixture weights.
- Type:
np.ndarray
- len_estimator
Estimator for the length of inner mixture sequences.
- Type:
Optional[ParameterEstimator]
- keys
Keys for weights and topics, passed to accumulator factory with call to ‘accumulator_factory()’.
- Type:
Optional[Tuple[Optional[str], Optional[str]]]
- len_dist
Fixed distribution for the length of inner-mixture sequences.
- Type:
Optional[SequenceEncodableProbabilityDistribution]
- name
Name for object instance.
- Type:
Optional[str]
- __init__(estimators, num_mixtures, len_estimator=<dml.stats.null_dist.NullEstimator object>, len_dist=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))
HierarchicalMixtureEstimator object.
- Parameters:
estimators (Sequence[ParameterEstimator]) – ParameterEstimator objects for the topics.
num_mixtures (int) – Number of outer-mixture components.
len_estimator (Optional[ParameterEstimator]) – Estimator for the length of inner mixture sequences.
len_dist (Optional[SequenceEncodableProbabilityDistribution]) – Fixed distribution for the length of inner-mixture sequences.
suff_stat (np.ndarray) – 2-d numpy array of dimension (num_components, num_mixtures). Represents the inner-mixture weights.
pseudo_count (Optional[float]) – Re-weight ‘suff_stat’ above in estimation.
name (Optional[str]) – Set a name to object instance.
keys (Optional[Tuple[Optional[str], Optional[str]]]) – Set keys for weights and topics.
- accumulator_factory()
Create SequenceEncodableStatisticAccumulator object.
- Return type:
HierarchicalMixtureEstimatorAccumulatorFactory
- estimate(nobs, suff_stat)
Estimate SequenceEncodableProbabilityDistribution for sufficient statistics.
- Parameters:
nobs (Optional[float]) – Weighted number of observations.
suff_stat (Tuple[int, np.ndarray, np.ndarray, np.ndarray]) – Sufficient statistics for the hierarchical mixture distribution.
- Returns:
SequenceEncodableProbabilityDistribution
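How accumulated counts turn into weight estimates can be sketched as a simple normalization. The layout of `suff_stat` is not spelled out here, so the inputs below (`outer_counts`, `inner_counts`) are hypothetical stand-ins. Spreading the pseudo-count evenly across entries is an assumption made for illustration, and the library's re-weighting may differ.

```python
import numpy as np

def estimate_weights(outer_counts, inner_counts, pseudo_count=None):
    """Normalize expected counts into outer weights w and inner weights taus.

    outer_counts: shape (num_mixtures,), expected count per outer component.
    inner_counts: shape (num_mixtures, num_topics), expected topic counts.
    """
    outer = np.asarray(outer_counts, dtype=float)
    inner = np.asarray(inner_counts, dtype=float)
    if pseudo_count is not None:
        # Assumption: the pseudo-count is spread evenly over all entries.
        outer = outer + pseudo_count / outer.shape[0]
        inner = inner + pseudo_count / inner.size
    w = outer / outer.sum()
    taus = inner / inner.sum(axis=1, keepdims=True)
    return w, taus

# Hypothetical counts for 2 outer components and 2 topics.
w_hat, taus_hat = estimate_weights([3.0, 1.0], [[2.0, 2.0], [0.0, 4.0]])
w_reg, taus_reg = estimate_weights([3.0, 1.0], [[2.0, 2.0], [0.0, 4.0]], pseudo_count=1.0)
```

With a positive pseudo-count, topics that received zero counts keep a small non-zero weight, which avoids `log(0)` in later density evaluations.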
HierarchicalMixtureSampler
- class dml.stats.hmixture.HierarchicalMixtureSampler(dist, seed=None)
HierarchicalMixtureSampler object for sampling from a hierarchical mixture model.
- rng
RandomState object, seeded if a seed is passed as an argument.
- Type:
RandomState
- dist
HierarchicalMixtureDistribution instance to sample from.
- sampler
Sampler for the MixtureDistribution obtained by converting ‘dist’.
- Type:
MixtureDistributionSampler
- sample(size=None)
Generate samples from distribution.
- Parameters:
size (Optional[int]) – Number of samples to generate.
- Return type:
Union[Sequence[Any],Any]
- Returns:
Samples from distribution.
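Sampling from a hierarchical mixture can be sketched as ancestral sampling: draw a length, draw an outer component, then draw a topic and a value for each item. The sketch below assumes Gaussian topics and a length of one plus a Poisson draw. Both choices, and the function name, are hypothetical and are not taken from the library.

```python
import numpy as np

def sample_hmixture(rng, w, taus, mus, sigmas, mean_extra_len, size):
    """Draw `size` sequences by ancestral sampling."""
    samples = []
    for _ in range(size):
        n = 1 + rng.poisson(mean_extra_len)               # assumed length distribution
        z = rng.choice(len(w), p=w)                       # outer component
        y = rng.choice(taus.shape[1], size=n, p=taus[z])  # one topic per item
        samples.append(rng.normal(mus[y], sigmas[y]))     # item values
    return samples

# Hypothetical parameters: 2 outer components, 3 Gaussian topics.
w = np.array([0.6, 0.4])
taus = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
mus = np.array([0.0, 2.0, 4.0])
sigmas = np.array([1.0, 1.0, 1.0])
draws = sample_hmixture(np.random.RandomState(0), w, taus, mus, sigmas, 2.0, size=5)
```

Passing a `RandomState` built from the same seed reproduces the same draws, which matches the role of the `seed` argument described above.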