# multinomineq: Multinomial Models with Inequality Constraints

#### 2019-01-30 The R package multinomineq facilitates the Bayesian analysis of discrete/categorical data using multinomial models with inequality constraints. Essentially, these types of models allow to analyze discrete choice frequencies while assuming that the corresponding choice probabilities $$\theta$$ are constrained by linear inequality constraints (e.g., by the simple order $$\theta_{1} \leq \theta_{2}\leq \theta_{3}\leq \theta_{4}$$). In the following, this vignette shows how to define and test such models in R. The second section below provides a concise formal definition of the model class.

## Example: Testing the Description-Experience Gap

The following example shows how inequality constraints can be used to test the description-experience (DE) gap in risky decision making (Hertwig et al., 2004). In the experiment, participants had to choose one of two risky gambles in each of 6 different paired comparisons (e.g., a lottery in which you win 10€ with $$p=.20$$ versus a lottery in which you can win 20€ with $$p=.45$$). The description-experience gap states that people underweigh small probabilities $$p$$ when making decisions based on experience. We will conduct a reanalysis of Hertwig’s data set using the inequality-constrained multinomial model in Regenwetter & Robinson (2017).

First, it is necessary to install the R package multinomineq from Github. Note that it is necessary to install the developer R tools first (see https://support.rstudio.com/hc/en-us/articles/200486498-Package-Development-Prerequisites). Afterwards, multinomineq can be installed and loaded via:

### uncomment to install packages:
# devtools::install_github("danheck/multinomineq")

library("multinomineq")
library("coda")
set.seed(1234)

### Data structure

We reanalyze the data by Hertwig et al. (2004) who collected choice frequencies of $$n=25$$ participants for 6 paired comparisons of risky gambles. The observed choice frequencies can simply be summarized by the frequencies of choosing “Option H” (a specific lottery, cf. original article) in each of the 6 paired comparisons:

# total number of observations for each paired comparison:
n <- 25
# Frequencies of choosing "Option H" in Description condition:
HER_desc <- c(9, 16, 16, 7, 12, 16)
# Frequencies of choosing "Option H" in Experience condition:
HER_exp  <- c(22, 11, 7, 14, 5, 3)

### Vertex ($$V$$-)representation

First, it is shown how to use the vertex ($$V$$-)representation for inequality-constrained models. Essentially, the rows of the matrix $$V$$ list all patterns that are predicted by a substantive theory of interest. Each prediction either states that “Option H” is chosen (value 1) or not (value 0). For example, if the theory makes the prediction that “Option H” is chosen only in the 4th paired comparison but otherwise not, the matrix $$V$$ needs to include the following pattern in one row: c(0,0,0,1,0,0). Similarly, all possible predictions need to be combined to form the matrix $$V$$. Inequality-constrained models assume a mixture distribution over all these possible predicted patterns (e.g., due to heterogeneity across participants or trials). Geometrically, the corresponding parameter space thus represents the convex hull of the predicted patterns/vertices (i.e., all possible linear combinations of the rows of $$V$$ when using nonnegative weights).

For the description-experience gap, Regenwetter & Robinson (2017) used a specific decision making model (Cumulative Prospect Theory) to generate all possible predictions. The matrix $$V$$ lists all of these predictions when assuming that participants underweigh small probabilities $$p$$ of winning money. The following matrix contains all predicted patterns for the 6 risky gambles:

# Underweighting of small probabilities CPT:
# (1 = choose "Option H"; 0 = do not choose "Option H")
V <- matrix(c(0, 0, 0, 0, 0, 0,  # first predicted pattern
0, 0, 0, 1, 0, 0,  # second predicted pattern
0, 0, 1, 1, 0, 0,  # ...
1, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0,
1, 0, 1, 1, 0, 0,
1, 1, 0, 0, 0, 0,
1, 1, 0, 0, 1, 0,
1, 1, 0, 1, 1, 0,
1, 1, 0, 0, 1, 1,
1, 1, 0, 1, 0, 0,
1, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 0,
1, 1, 1, 1, 1, 0,
1, 1, 1, 1, 1, 1),
ncol = 6, byrow = TRUE)

Given this representation of the theoretical predictions, it is not easy to check whether a given vector of choice probabilities for the 6 gambles is inside the convex hull of the predicted patterns. As a remedy, one can use the function inside() to test this, or the function find_inside() to find a probability vector that is inside the convex hull of the predictions:

# define a specific vector of choice probabilities:
p_observed <- c(.25, .48, .93, .10, .32, .50)

# check whether p is inside the convex hull defined by V:
inside(p_observed, V = V)
#>  FALSE

# to find a (random) probability that is inside the convex hull:
find_inside(V = V, random = TRUE)
#>  0.8396019 0.7133954 0.2367023 0.4547242 0.4952754 0.4340429

### Inequality ($$Ab$$-)representation

Instead of listing the vertices (or edges) of the inequality-constrained parameter space, we may instead define a set of linear inequality constraints on the parameters explicitly. For this purpose, we specify a matrix $$A$$ and a vector $$b$$. The model only allows parameter vectors $$\theta$$ that satisfy the inequalities defined by the matrix equation: $$A \cdot \theta \leq b$$.

For the description-experience gap, one can show that instead of listing the predictions by the matrix $$V$$ as shown above, one can equivalently describe the linear inequalities as follows:

A <- matrix(c( 0,   0,   -1,    0,    0,    0,  # first inequality
0,   0,    0,    0,    0,   -1,  # second inequality
-1,   1,    0,    0,    0,    0,  # ...
0,  -1,    0,    0,    1,    0,
0,   0,    0,    0,   -1,    1,
0,   0,    1,   -1,    0,    0,
0,   0,    0,    1,    0,    0,
1,   0,    0,    0,    0,    0),
ncol = 6, byrow = TRUE)

b <- c(0, 0, 0, 0, 0, 0, 1, 1)

For example, the third row of $$A$$ and the third value of $$b$$ specify that $$-1 \cdot \theta_1 + 1 \cdot \theta_2 + 0 \cdot \theta_3+ 0 \cdot \theta_4+ 0 \cdot \theta_5+ 0 \cdot \theta_6 \leq 0$$ (which is equivalent to $$\theta_2 \leq \theta_1$$).

Using the matrix $$A$$ and the vector $$b$$, one can use standard matrix multiplication in R (%*%) to check whether a given vector of choice probabilities satisfies all of the inequality constraints:

p_observed <- c(.25, .48, .93, .10, .32, .50)
all(A %*% p_observed <= b)
#>  FALSE

# corresponding function in multinomineq:
inside(p_observed, A, b)
#>  FALSE

# find a point that satisfies the constraints:
find_inside(A, b, random = TRUE)
#>  0.4849912 0.4236391 0.4196148 0.4196248 0.4236291 0.4236191

Note that it is sometimes possible to convert the vertex ($$V$$-)representation (convex hull) to the inequality ($$Ab$$-)representation using specific software. To do this in R, it is necessary to install the package rPorta from Github:

### (not run):
devtools::install_github("TasCL/rPorta")
Ab <- V_to_Ab(V)
Ab 

### Posterior sampling

In a Bayesian framework, inequality-constrained multinomial models can be fitted by drawing posterior samples for the parameter vector of choice probabilities $$\theta$$. If the model is defined in terms of a matrix $$V$$ with predicted patterns, we obtain $$M=2,000$$ posterior samples as follows:

# fit data from Description and Experience condition:
p_D <- sampling_binom(k = HER_desc, n = n, V = V, M = 2000, progress = FALSE)
p_E <- sampling_binom(k = HER_exp, n = n, V = V, M = 2000, progress = FALSE)

If the model is defined in terms of the inequality $$Ab$$-representation, we obtain posterior samples in a similar way by replacing the argument V=V by the two arguments A=A, b=b:

# fit data from Description and Experience condition:
p_D <- sampling_binom(k = HER_desc, n = n, A = A, b = b,
M = 2000, progress = FALSE)
p_E <- sampling_binom(k = HER_exp, n = n, A = A, b = b,
M = 2000, progress = FALSE)

If possible, the $$Ab$$-representation should be used because the sampling is much more efficient than for the $$V$$-representation. Irrespective of how the posterior samples have been obtained, we can use standard methods in Bayesian analysis to check the convergence of the MCMC sampler and obtain summary statistics of the posterior distribution:

# check convergence of MCMC sampler (should look like hairy caterpillars):
plot(p_D)  # summarize posterior samples:
summary(p_D)
#>
#> Iterations = 11:2000
#> Thinning interval = 1
#> Number of chains = 1
#> Sample size per chain = 1990
#>
#> 1. Empirical mean and standard deviation for each variable,
#>    plus standard error of the mean:
#>
#>        Mean      SD Naive SE Time-series SE
#> [1,] 0.5981 0.05581 0.001251       0.003777
#> [2,] 0.5607 0.05368 0.001203       0.004074
#> [3,] 0.4417 0.06958 0.001560       0.004190
#> [4,] 0.4865 0.07054 0.001581       0.004138
#> [5,] 0.5098 0.05610 0.001258       0.003800
#> [6,] 0.4633 0.06159 0.001381       0.003376
#>
#> 2. Quantiles for each variable:
#>
#>        2.5%    25%    50%    75%  97.5%
#> var1 0.4862 0.5614 0.5977 0.6350 0.7078
#> var2 0.4490 0.5267 0.5595 0.5956 0.6657
#> var3 0.3060 0.3924 0.4422 0.4881 0.5776
#> var4 0.3552 0.4399 0.4853 0.5329 0.6282
#> var5 0.3913 0.4731 0.5114 0.5459 0.6188
#> var6 0.3453 0.4214 0.4638 0.5069 0.5832

Similarly, we can use posterior-predictive checks to test whether an inequality-constrained model fits the data adequately (i.e., whether the constraints hold empirically). Essentially, this method computes Pearson’s $$X^2$$-statistic to measure the discrepancy both for the observed and the posterior-predicted frequencies. If the model does not fit the data, the posterior-predicted $$p$$-value ppp will become very small (and it will be close to 0.50 if the model fits). Hence, we usually apply a criterion such as ppp < .05 to reject models that do not fit the data, as in the following example:

# posterior-predictive p-value for Description condition:
ppp_binom(prob = p_D, k = HER_desc, n = n)
#>       X2_obs      X2_pred          ppp
#> 10.287736410  3.010374952  0.004020101
ppp_binom(prob = p_D, k = HER_exp, n = n)
#>    X2_obs   X2_pred       ppp
#> 17.688104  2.937646  0.000000

### Bayes factor

Often, one is interested in quantifying the evidence in favor of the inequality constraints. We can use Bayesian model selection to compare the inequality-constrained model $$M_0$$ against the unconstrained model $$M_u$$ that does not restrict the choice probabilities $$\theta$$ (also called encompassing model). The Bayes factor bf_0u quantifies the evidence in favor of the inequality-constrained model $$M_0$$ versus the unconstrained model $$M_u$$. Similarly, bf_u0 is the inverse of bf_0u and provides the evidence for the unconstrained model. Moreover, the Bayes factor bf_00' compares the inequality-constrained model $$M_0$$ against the model $$M_{0'}$$ which has a parameter space defined as the complement of the parameter space of $$M_0$$.

Bayes factors are computed as follows:

# compute Bayes factor using the V-representation
bf_D <- bf_binom(k = HER_desc, n = n, V = V, M = 10000, progress = FALSE)
bf_E <- bf_binom(k = HER_exp,  n = n, V = V, M = 10000, progress = FALSE)
# compute Bayes factor using the Ab-representation
bf_D <- bf_binom(HER_desc, n = n, A = A, b = b, M = 10000, progress = FALSE)
bf_E <- bf_binom(HER_exp,  n = n, A = A, b = b, M = 10000, progress = FALSE)

bf_D
#>         bf           se        ci.5%       ci.95%
#> bf_0u    0 3.608596e-03 1.200689e-05 9.795163e-03
#> bf_u0  Inf 3.608004e+07 1.020912e+02 8.328551e+04
#> bf_00'   0 3.539881e-03 1.179143e-05 9.617549e-03
bf_E
#>                 bf          se        ci.5%       ci.95%
#> bf_0u   36.1269036 2.598473041  32.18187465  40.67857138
#> bf_u0    0.0276802 0.001987101   0.02458297   0.03107339
#> bf_00' 122.8414969 9.369839417 108.56683879 139.11544600

Note that multinomineq also provides an error of approximation for the Bayes factor to account for random sampling error. The matrix that is provided as output shows the standard error se for the Bayes factor and the 90% credibility interval ci.5% and ci.95%.

##### Increased efficiency for Bayes factor computations

Within the function bf_binom, the encompassing Bayes factor is computed in two steps based on the unconstrained multinomial model $$M_u$$:

1. Draw prior parameter samples from $$M_u$$ and count the proportion of samples $$c$$ inside the inequality-constrained parameter space.
2. Draw posterior parameter samples from $$M_u$$ and count the proportion of samples $$f$$ inside the inequality-constrained parameter space.

Based on these two proportions, the Bayes factor in favor of the inequality constraints is approximated as $$B_{0u}=f/c$$. It might be convenient to perform these computations in separate steps in R for some models and datasets. For example, the prior constant $$c$$ is sometimes known analytically. Moreover, when analyzing multiple datasets or participants, it is more efficient to estimate the prior constant $$c$$ only once because $$c$$ is identical for all analyses:

# proportion of *prior* samples satisfying inequality constraints
cnt <- count_binom(k = 0, n = 0, A = A, b = b, M = 50000, progress = FALSE)
cnt
#> Number of samples satisfying the inequality constraints:
#>      count     M steps
#> [1,]  1037 50000     8
#>
#> To extract the proporiton of samples, use:
#>    attr(count, "proportion") = 0.02074 (SE = 0.00064279).

# proportion of *prior* samples satisfying inequality constraints
# (dataset: Experience)
cnt_E <- count_binom(k = HER_exp, n = n, A = A, b = b, M = 10000, progress = FALSE)
cnt_E
#> Number of samples satisfying the inequality constraints:
#>      count     M steps
#> [1,]  7191 10000     8
#>
#> To extract the proporiton of samples, use:
#>    attr(count, "proportion") = 0.7191 (SE = 0.004491353).

# obtain Bayes factor (including error of approximation)
count_to_bf(posterior = cnt_E, prior = cnt)
#>                  bf          se        ci.5%       ci.95%
#> bf_0u   34.67213115 1.105057000  32.93231334  36.51122429
#> bf_u0    0.02884161 0.000919178   0.02738884   0.03036531
#> bf_00' 120.87230740 4.688334259 113.52968049 128.88221949

Since Bayes factors can be hard to compute for some models and data sets, multinomineq provides a stepwise procedure that aims at increasing the efficiency. The user can specify a minimum number of samples cmin that must satisy a subset of the inequality constraints. The algorithm keeps sampling until this threshold is reached. The argument steps determines at which rows of the matrix $$A$$ the inequalities are split into subsets of increasing size. For example, steps = c(3, 5, 7) means that the first subset contains only the first 3 inequalities, the second subset the first 5 inequalities etc.:

# use a stepwise counting approach for the "Description" condition:
# ("cmin" = minimum of samples that satisfy constraints before sampling stops)
cnt_D <- count_binom(k = HER_desc, n = n, A = A, b = b,
M = 5000, cmin = 1, steps = c(3, 5, 7), progress = FALSE)
cnt_D
#> Number of samples satisfying the inequality constraints:
#>      count    M steps
#> [1,]   125 5000     3
#> [2,]    84 5000     5
#> [3,]    25 5000     7
#> [4,]  5000 5000     8
#>
#> To extract the proporiton of samples, use:
#>    attr(count, "proportion") = 2.1e-06 (SE = 5.337439e-07).

# obtain Bayes factor (including error of approximation)
count_to_bf(posterior = cnt_D, prior = cnt)
#>                  bf           se        ci.5%       ci.95%
#> bf_0u  1.012536e-04 2.530872e-05 6.689430e-05 1.489683e-04
#> bf_u0  9.876190e+03 2.589882e+03 6.712837e+03 1.494896e+04
#> bf_00' 9.915382e-05 2.479517e-05 6.549815e-05 1.458469e-04

### Data format for multinomial frequencies

In the present example, we only observed binomial data, which can easily be described by the frequency of choosing one of two options. If more than two choice options are provided for a comparison, a different data format has to be used for multinomial data. In the binomial case, we ommitted the frequencies of not choosing a specific option because we knew the total number $$n$$. For multinomial data, one needs to specify all choice frequencies $$k_{ij}$$. Moreover, one needs to specify a vector options that defines how many choice options are available for each item type (e.g., options = c(3,3,) means that there are two scenarios, each with three possible options).

Using this notation, the 6 binomial choice frequencies above can equivalently be specified by the multinomial data format as follows:

# binomial data format
n <- 25
HER_desc <- c(9, 16, 16, 7, 12, 16)

# multinomial data format  ("k" must be a vector!)
k_multinom <- c(9, 16,   # first binary gamble
16, 9,   # second binary gamble
16, 9,   # ...
7, 17,
12, 13,
16, 9)
options <- c(2, 2, 2, 2, 2, 2)  # 2 options for each type

The binomial format can also be translated to the multinomial format automatically:

mn <- binom_to_multinom(HER_desc, n)
mn
#> $k #> p1_1 p1_2 p2_1 p2_2 p3_1 p3_2 p4_1 p4_2 p5_1 p5_2 p6_1 p6_2 #> 9 16 16 9 16 9 7 18 12 13 16 9 #> #>$options
#>  2 2 2 2 2 2

The multinomial data format is used as input for the analysis as follows:

posterior <- sampling_multinom(k = mn$k, options = mn$options, A = A, b = b)
bayesfactor <- bf_multinom(k = k_multinom, options = options, A = A, b = b)
bayesfactor
#>         bf           se        ci.5%       ci.95%
#> bf_0u    0 5.662915e-03 1.542493e-05 1.555550e-02
#> bf_u0  Inf 2.935595e+08 6.428594e+01 6.483043e+04
#> bf_00'   0 5.527939e-03 1.503289e-05 1.515908e-02

## Formal definition of multinomial models with inequality constraints

In the following, we define inequality-constrained multinomial models for discrete data. We use the term “item type” to refer to a category system $$i$$ that is modeled by a multinomial distribution with a fixed total number of observations $$n_i$$. For an item type $$i$$, we label the observed frequencies as $$k_{ij}$$.

#### Parameters and Likelihood

The parameter vector of interest refers to the choice probabilities: $\theta = (\theta_{11},\dots,\theta_{1 (J_1-1)},\theta_{21},\dots, \theta_{I (J_I-1)})$ Note that the last probability within each item type is ommited because probabilities have to sum to one. Hence, for $$n_i=3$$ choice options there are only 2 free parameters.

Using this notation, we can define a product-multinomial likelihood function: $p(k \mid \theta) = \prod_{i=1}^I {n_i \choose k_{i1} ,\dots ,k_{i J_i} } \prod_{j=1}^{J_i - 1} \theta_{ij}^{k_{ij}} \left(1- \sum_{r=1}^{J_i-1} \theta_{ir}\right)^{k_{J_i}}$

#### Linear Inequality Constraints

Inequality constraints can be defined in two different, equivalent ways. First, one may define a list of inequality constraints that have to hold for the choice probabilities, for instance: \begin{align*} \theta_{11} \leq \theta_{21} \, \text{ and } \, \theta_{21} \leq \theta_{31} \end{align*} These inequalities can be expressed in vector notation as $$A \,\theta \leq b$$ by defining a matrix $A$ and a vector $b$ as follows: $A = \pmatrix{1 & -1 & 0 \\ 0 & 1 & -1} \,\, \text{ and }\,\, b = \pmatrix{0 \\ 0}$ To formalize this notation, we define the constrained parameter space $$\Omega_0$$ via a matrix $$A$$ and a vector $$b$$ that specify a list of inequalities that have to hold: $\Omega_0 = \left\{\theta \in \Omega \, \middle | \, A \, \theta \leq b \right\}$

Alternatively, one may list all the vertices $$v^{(s)}$$ defining the constrained parameter space in the rows of a matrix $$V$$. In the example above, we have $v^{(1)} = \pmatrix{0\\0\\0} \,,\, v^{(2)} = \pmatrix{0\\0\\1} \,,\, v^{(3)} = \pmatrix{0\\1\\1} \,,\, v^{(1)} = \pmatrix{1\\1\\1}$ These vectors can be convenietly summarizes in a matrix $V$ with one vertex per row: $V = \pmatrix{0 & 0 & 0\\ 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1}$ In general, the restricted parameter space $$\Omega_0$$ is defined as the convex hull of the set of vertices $$v^{(s)}$$: $\Omega_0 = \left\{ \theta = \sum_{s=1}^S \alpha_s v^{(s)} \,\middle |\, \alpha_s \geq 0 \text{ for all } s=1,\dots, S \text{ and }\sum_{s=1}^S \alpha_s=1 \right\}$

#### Prior & Posterior

As a prior distribution for the parameters $$\theta$$, we assume independent Dirichlet distributions for the different item types $$i$$ with shape parameters $$\alpha_{ij}$$. Note that we only allow parameters that are inside the constrained parameter space $$\Omega_0$$: $p(\theta) = \frac {1}{c} \, \mathbb I_{\Omega_0}(\theta) \, \prod_{i=1}^I \prod_{j=1}^{J_i} \theta_{ij}^{\alpha_{ij}-1}$ Here, $$\mathbb I_{\Omega_0}(\theta)$$ is the indicator function which is one if $$\theta\in\Omega_0$$ and zero otherwise. Moreover, the normalizing constant $$c$$ ensures that the prior density function integrates to one.

Finally, the analytical solution of the posterior distribution is also a product-Dirichlet distribution, truncated to the constrained parameter space: $p(\theta \mid k) = \frac {1}{f} \, \mathbb I_{\Omega_0}(\theta) \, \prod_{i=1}^I \prod_{j=1}^{J_i} \theta_{ij}^{k_{ij} + \alpha_{ij}-1}$ Similarly as before, $$f$$ is the normalizing constant of the posterior density function.

#### Reference

For a more detailed introduction with technical details, see:

• Heck, D. W., & Davis-Stober, C. P. (2019). Multinomial models with linear inequality constraints: Overview and improvements of computational methods for Bayesian inference. Manuscript under revision. https://arxiv.org/abs/1808.07140