Hierarchical Mixture Distribution

A Hierarchical Mixture Model is a statistical model that stacks mixture models in layers to capture complex data distributions. It is particularly useful when data can be grouped into subpopulations, each of which follows its own mixture distribution. This model is often referred to as a “mixture of mixtures.” The hierarchical mixture requires data given as a sequence of observations of the form \(\boldsymbol{X}_i = (X_{i,1}, ..., X_{i, n_i})\). Each observation can be thought of as independent draws from a single outer state, whose distribution is itself a mixture over shared component (topic) distributions.

Hierarchical Mixture Model Features
Feature	Symbol	Description
Outer-State Distribution	\(\boldsymbol{\pi}\)	Represents the distribution over the \(K_1\) outer states.
Inner-State Probabilities	\(\boldsymbol{\tau}_{k}\)	Distribution over the \(K_2\) inner states, given outer state \(k\).
Component Distributions	\(f_{k}(x)\)	The probability distribution of the observed data given a specific inner state.
Length Distribution	\(g(\cdot)\)	Distribution for the lengths of each observation.

The generative process for data \(\boldsymbol{X}_i = (X_{i,1}, ..., X_{i, n_i})\) is as follows:

\[\begin{split}\begin{array}{ll} n_i &\sim g(\cdot) \\ Z_i & \sim \boldsymbol{\pi} \\ U_{i, j} \vert Z_i & \sim \boldsymbol{\tau}_{Z_i} \\ X_{i, j} \vert U_{i, j} & \sim f_{U_{i, j}}(\cdot) \end{array}\end{split}\]
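The generative process above can be sketched in plain Python. This is a minimal illustration and not the library's sampler: the function name `sample_hierarchical_mixture`, the Gaussian components, and the uniform length sampler are all assumptions chosen for the example.

```python
import random


def sample_hierarchical_mixture(pi, taus, components, length_sampler, rng):
    """Draw one observation X_i = (X_{i,1}, ..., X_{i,n_i}).

    pi: outer-state weights (length K1).
    taus: K1 rows of inner-state weights, each of length K2.
    components: K2 callables, each taking rng and returning one draw from f_k.
    length_sampler: callable taking rng and returning n_i (stands in for g).
    """
    n = length_sampler(rng)                                # n_i ~ g
    z = rng.choices(range(len(pi)), weights=pi)[0]         # Z_i ~ pi
    x = []
    for _ in range(n):
        # U_{i,j} | Z_i ~ tau_{Z_i}
        u = rng.choices(range(len(taus[z])), weights=taus[z])[0]
        # X_{i,j} | U_{i,j} ~ f_{U_{i,j}}
        x.append(components[u](rng))
    return z, x


# Hypothetical parameters: K1 = 2 outer states, K2 = 3 inner states.
rng = random.Random(0)
pi = [0.6, 0.4]
taus = [[0.9, 0.1, 0.0],
        [0.0, 0.2, 0.8]]
components = [
    lambda r: r.gauss(-5.0, 1.0),
    lambda r: r.gauss(0.0, 1.0),
    lambda r: r.gauss(5.0, 1.0),
]
length_sampler = lambda r: r.randint(3, 6)  # assumed length distribution

samples = [
    sample_hierarchical_mixture(pi, taus, components, length_sampler, rng)
    for _ in range(5)
]
```

Note that the outer state is drawn once per observation, while the inner state is redrawn for every element of the sequence.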

HierarchicalMixtureDistribution

class pysp.stats.hmixture.HierarchicalMixtureDistribution(topics, mixture_weights, topic_weights, len_dist=NullDistribution(name=None), name=None, keys=(None, None))

HierarchicalMixtureDistribution object defining a hierarchical mixture distribution.

topics

Topic distributions shared in hierarchical mixture distribution.

Type:: Sequence[SequenceEncodableProbabilityDistribution]

num_topics

Number of topic distributions (i.e. sets number of inner-mixture weights).

Type:: int

num_mixtures

Number of weights in the outer mixture (i.e., sets the number of top-layer mixture weights).

Type:: int

w

1-d numpy array of outer-mixture weights. Should sum to 1.

Type:: np.ndarray

log_w

Numpy array of the log of w above.

Type:: np.ndarray

taus

2-d array of dimension (num_mixtures by num_topics).

Type:: np.ndarray

log_taus

2-d array of the log of taus above.

Type:: np.ndarray

len_dist

Distribution for the length of each observed sequence. Defaults to the NullDistribution if None is passed.

Type:: SequenceEncodableProbabilityDistribution

name

Name for object instance.

Type:: Optional[str]

keys

Keys for the weights and topics.

Type:: Tuple[Optional[str], Optional[str]]

__init__(topics, mixture_weights, topic_weights, len_dist=NullDistribution(name=None), name=None, keys=(None, None))

HierarchicalMixtureDistribution object.

Parameters:

topics (Sequence[SequenceEncodableProbabilityDistribution]) – Topic distributions shared in hierarchical mixture distribution.
mixture_weights (Union[List[float], np.ndarray]) – 1-d array of floats for the outer-mixture weights. Should sum to 1.0.
topic_weights (Union[List[List[float]], np.ndarray]) – 2-d array with rows containing weights for each component mixture distribution. All rows should sum to 1.0.
len_dist (Optional[SequenceEncodableProbabilityDistribution]) – Distribution for the length of the sequences drawn from the component mixtures.
name (Optional[str]) – Set name for object instance.
keys (Optional[Tuple[Optional[str], Optional[str]]]) – Set keys for the weights and topics.

component_log_density(x)

Evaluate the component-wise log-density for an observation from a hierarchical mixture model.

Parameters:: x (Sequence[T]) – An observation from a hierarchical mixture model.
Returns:: Numpy array of length ‘num_mixtures’.
Return type:: np.ndarray

density(x)

Evaluate the density of an observation from hierarchical mixture distribution.

Parameters:: x (Sequence[T]) – A sequence of observations of data type T.
Returns:: Density evaluated at x.
Return type:: float

dist_to_encoder()

Create DataSequenceEncoder object for SequenceEncodableProbabilityDistribution instance.

Return type:: HierarchicalMixtureDataEncoder
Returns:: DataSequenceEncoder

estimator(pseudo_count=None)

Create a ParameterEstimator for corresponding SequenceEncodableProbabilityDistribution.

Parameters:: pseudo_count (Optional[float]) – Regularize sufficient statistics in estimation step.
Return type:: HierarchicalMixtureEstimator
Returns:: ParameterEstimator

log_density(x)

Evaluate the log density of an observation from hierarchical mixture distribution.

Note: Observation is a sequence.

Parameters:: x (Sequence[T]) – A sequence of observations of data type T.
Returns:: Log-density evaluated at x.
Return type:: float
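Following the generative process, the log-density of a sequence marginalizes the inner state for each element and the outer state once per sequence. The sketch below shows that computation in plain Python with a log-sum-exp for stability. It is not the library's code: the helper names, the Gaussian topics, and the choice to leave out both the length term and the outer weights from the per-component values are assumptions.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def safe_log(p):
    """log(p) that maps a zero weight to -inf instead of raising."""
    return math.log(p) if p > 0.0 else -math.inf


def gaussian_log_pdf(x, mu, sigma):
    return -0.5 * math.log(2.0 * math.pi * sigma * sigma) - 0.5 * ((x - mu) / sigma) ** 2


def outer_component_log_density(x, taus, topic_log_pdfs):
    """For each outer state k: sum_j log sum_l tau[k][l] * f_l(x_j)."""
    out = []
    for tau_k in taus:
        total = 0.0
        for x_j in x:
            total += log_sum_exp(
                [safe_log(t) + f(x_j) for t, f in zip(tau_k, topic_log_pdfs)]
            )
        out.append(total)
    return out


def hierarchical_log_density(x, pi, taus, topic_log_pdfs):
    """log sum_k pi[k] * prod_j sum_l tau[k][l] * f_l(x_j) (length term omitted)."""
    comp = outer_component_log_density(x, taus, topic_log_pdfs)
    return log_sum_exp([safe_log(p) + c for p, c in zip(pi, comp)])


# Hypothetical parameters for illustration.
pi = [0.6, 0.4]
taus = [[0.9, 0.1], [0.2, 0.8]]
topic_log_pdfs = [
    lambda v: gaussian_log_pdf(v, -2.0, 1.0),
    lambda v: gaussian_log_pdf(v, 2.0, 1.0),
]
x = [-1.5, -2.2, 0.3]

comp_ll = outer_component_log_density(x, taus, topic_log_pdfs)
ll = hierarchical_log_density(x, pi, taus, topic_log_pdfs)
```

Working in log space avoids underflow when sequences are long, since the product over elements becomes a sum.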

posterior(x)

Compute the posterior over the mixture components for the outer-mixture at observed value x.

Parameters:: x (Sequence[T]) – An observed sequence of data type T.
Returns:: Numpy array of length ‘num_mixtures’.
Return type:: np.ndarray
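The posterior over outer-mixture components follows from Bayes' rule: it is proportional to the outer weight times the likelihood of the sequence under that component. A minimal sketch, assuming the per-component log-densities are already available and exclude the outer weights (the function name `outer_posterior` and the numbers are hypothetical):

```python
import math


def outer_posterior(log_w, comp_log_density):
    """Normalize exp(log_w[k] + comp_log_density[k]) over k in a stable way."""
    joint = [lw + c for lw, c in zip(log_w, comp_log_density)]
    m = max(joint)
    unnorm = [math.exp(v - m) for v in joint]  # subtract max before exponentiating
    total = sum(unnorm)
    return [u / total for u in unnorm]


log_w = [math.log(0.6), math.log(0.4)]  # log outer-mixture weights
comp = [-4.2, -9.7]                     # assumed per-component log-densities
post = outer_posterior(log_w, comp)
```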

sampler(seed=None)

Create a DistributionSampler object for a given ProbabilityDistribution.

Parameters:: seed (Optional[int]) – Set seed for drawing samples from distribution.
Return type:: HierarchicalMixtureSampler

seq_component_log_density(x)

Vectorized evaluation of the outer-mixture component-wise log-density for an encoded sequence x.

This returns a numpy array with shape (rv[0], ‘num_mixtures’).

Note

This density is a mixture of sequences of mixtures, so the data must be bin-counted as the last step in the code.

Parameters:: x (HierarchicalMixtureEncodedDataSequence) – EncodedDataSequence for Hierarchical mixture observations.
Returns:: Numpy array of dimensions ‘rv[0]’ by ‘num_mixtures’, containing the log-density for each component of the outer mixture.
Return type:: np.ndarray
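One reading of the bin-counting note, offered as an assumption and not as a description of the library's encoder, is that all sequences are flattened into one array with a parallel array of sequence indices, per-element terms are computed in one pass, and the results are summed back per sequence. A plain-Python sketch of that pattern follows; the helper names are hypothetical, and in NumPy the final step corresponds to `np.bincount` with its `weights` argument.

```python
def flatten_sequences(seqs):
    """Return (seq_index, values): one entry per element across all sequences."""
    seq_index, values = [], []
    for i, seq in enumerate(seqs):
        for v in seq:
            seq_index.append(i)
            values.append(v)
    return seq_index, values


def segment_sum(seq_index, weights, num_seqs):
    """Sum weights by sequence index, like a weighted bin count."""
    out = [0.0] * num_seqs
    for i, w in zip(seq_index, weights):
        out[i] += w
    return out


seqs = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
seq_index, values = flatten_sequences(seqs)

# Stand-in for per-element log terms under one outer component.
per_element = [-0.5 * v for v in values]
per_sequence = segment_sum(seq_index, per_element, len(seqs))
```

Repeating the last two lines once per outer component would give one column of the (number of sequences, ‘num_mixtures’) result.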

seq_log_density(x)

Vectorized evaluation of the log density.

Parameters:: x (EncodedDataSequence) – EncodedDataSequence for corresponding SequenceEncodableProbabilityDistribution.
Return type:: ndarray
Returns:: np.ndarray

seq_posterior(x)

Vectorized evaluation of the posterior over each outer-mixture component for an encoded sequence x.

Parameters:: x (HierarchicalMixtureEncodedDataSequence) – EncodedDataSequence for Hierarchical mixture observations.
Returns:: Numpy array of dimension (x[0], ‘num_mixtures’) containing posteriors for each observation.
Return type:: np.ndarray

to_mixture()

Returns a MixtureDistribution object created from object instance.

Return type:: MixtureDistribution

HierarchicalMixtureEstimator

class pysp.stats.hmixture.HierarchicalMixtureEstimator(estimators, num_mixtures, len_estimator=<pysp.stats.null_dist.NullEstimator object>, len_dist=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))

HierarchicalMixtureEstimator object for estimating a hierarchical mixture distribution from aggregated sufficient statistics.

Note: If pseudo_count is passed, the mixture weights are re-weighted in estimation. If attribute suff_stat is set, a suff_stat is re-weighted and combined with new sufficient statistics in estimation.
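The note above says only that the weights are re-weighted; the exact rule is not stated here. One common convention, assumed for this sketch and not confirmed as the library's rule, is additive smoothing in which the pseudo-count is spread evenly over the components before normalizing. The function name `smoothed_weights` is hypothetical.

```python
def smoothed_weights(counts, pseudo_count=None):
    """Turn expected component counts into mixture weights.

    With pseudo_count=None this is plain normalization. Otherwise the
    pseudo-count is split evenly across components (an assumed convention).
    """
    k = len(counts)
    if pseudo_count is None:
        total = sum(counts)
        return [c / total for c in counts]
    total = sum(counts) + pseudo_count
    return [(c + pseudo_count / k) / total for c in counts]


counts = [8.0, 2.0, 0.0]                 # assumed expected counts per component
plain = smoothed_weights(counts)         # a zero count gives a zero weight
smooth = smoothed_weights(counts, 3.0)   # every weight stays strictly positive
```

Under this convention a component with no assigned mass keeps a small positive weight, so it can still be revived in later iterations.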

num_components

Number of topic distributions (inner-mixture).

Type:: int

num_mixtures

Number of outer-mixture components.

Type:: int

estimators

ParameterEstimator objects for the topics.

Type:: Sequence[ParameterEstimator]

pseudo_count

Re-weight ‘suff_stat’ in estimation.

Type:: Optional[float]

suff_stat

2-d numpy array of dimension (num_components, num_mixtures). Represents the inner-mixture weights.

Type:: np.ndarray

len_estimator

Estimator for the length of inner mixture sequences.

Type:: Optional[ParameterEstimator]

keys

Keys for weights and topics, passed to accumulator factory with call to ‘accumulator_factory()’.

Type:: Optional[Tuple[Optional[str], Optional[str]]]

len_dist

Fixes the length distribution for the inner-mixture sequences.

Type:: Optional[SequenceEncodableProbabilityDistribution]

name

Name for object instance.

Type:: Optional[str]

__init__(estimators, num_mixtures, len_estimator=<pysp.stats.null_dist.NullEstimator object>, len_dist=None, suff_stat=None, pseudo_count=None, name=None, keys=(None, None))

HierarchicalMixtureEstimator object.

Parameters:

estimators (Sequence[ParameterEstimator]) – ParameterEstimator objects for the topics.
num_mixtures (int) – Number of outer-mixture components.
len_estimator (Optional[ParameterEstimator]) – Estimator for the length of inner mixture sequences.
len_dist (Optional[SequenceEncodableProbabilityDistribution]) – Fixes the length distribution for the inner-mixture sequences.
suff_stat (np.ndarray) – 2-d numpy array of dimension (num_components, num_mixtures). Represents the inner-mixture weights.
pseudo_count (Optional[float]) – Re-weight ‘suff_stat’ above in estimation.
name (Optional[str]) – Set name for object instance.
keys (Optional[Tuple[Optional[str], Optional[str]]]) – Set keys for weights and topics.

accumulator_factory()

Create SequenceEncodableStatisticAccumulator object.

Return type:: HierarchicalMixtureEstimatorAccumulatorFactory

estimate(nobs, suff_stat)

Estimate a SequenceEncodableProbabilityDistribution from sufficient statistics.

Parameters:

nobs (Optional[float]) – Weighted number of observations.
suff_stat (Tuple[int, np.ndarray, np.ndarray, np.ndarray]) – Aggregated sufficient statistics for the hierarchical mixture distribution.

Return type:: HierarchicalMixtureDistribution
Returns:: SequenceEncodableProbabilityDistribution

HierarchicalMixtureSampler

class pysp.stats.hmixture.HierarchicalMixtureSampler(dist, seed=None)

HierarchicalMixtureSampler object for sampling from a hierarchical mixture model.

rng

RandomState object, seeded if a seed is passed as an argument.

Type:: RandomState

dist

HierarchicalMixtureDistribution instance to sample from.

Type:: HierarchicalMixtureDistribution

sampler

Sampler obtained by converting ‘dist’ to a MixtureDistribution.

Type:: MixtureDistributionSampler

sample(size=None)

Generate samples from distribution.

Parameters:: size (Optional[int]) – Number of samples to generate.
Return type:: Union[Sequence[Any], Any]
Returns:: Samples from distribution.