# Identifiability notebook

## Goal

Get a sense of how well discounting models of intertemporal choice and compensatory models can mimic each other. Choose a representative model from each family and compare them using the parametric bootstrapping procedure of Wagenmakers, Ratcliff, Gomez, and Iverson (2004). If the models indeed seem able to mimic each other given realistic data, try exploring the parameter spaces of each model to see if there are particular regions where the models are in fact distinguishable.

## Models

A natural choice for discounting is the generalized hyperbolic discounting model, instantiated using Equation 15 of Loewenstein and Prelec (1992) and the logit-rho choice function I used for Builder:

$y = \operatorname{ilogit}(ρ [r_L (1 + αd_L)^{-β/α} - r_S (1 + αd_S)^{-β/α} ]) .$

Here, α ∈ (0, ∞) is the curvature (with small α tending towards an exponential function and large α tending towards a step function), and β ∈ [0, ∞) is the discount rate (with larger β s, for fixed α, meaning more discounting).

$κ(w(d_L) - w(d_S)) = v(r_L) - v(r_S) .$

Making a choice function out of that:

$y = \operatorname{ilogit}(ρ[v(r_L) - v(r_S) - κ(w(d_L) - w(d_S))]) .$

At this stage, we don't want to transform rewards or delays, so let's set v and w to the identity function:

$y = \operatorname{ilogit}(ρ[r_L - r_S - κ(d_L - d_S)]) .$

This is equivalent to the model.diff.rho of Builder. Works for me.

## Scoring rules

The problem of evaluating how well statements of probability match up with categorical outcomes is what motivates the use of scoring rules. Often, scoring rules are required to be proper, that is, give the best mean score when the probability-stating entity gives the correct probabilities.

A somewhat surprising fact is that, at least in the case of binary outcomes, squared error (the Brier rule) is proper, but absolute error is not. For example, the probability guess that minimizes mean squared error for an event with probability .4 is .4, but the probability guess that minimizes mean absolute error for such an event is 0. This suggests that you don't want to evaluate probabilistic predictions of a binary event with absolute error (provided you're aggregating scores with the mean, anyway).

The foregoing seems to imply that when evaluating probability guesses for multiple events, you want to aggregate Brier scores across events with the mean and not take the square root to get RMSE, because that would reduce absolute error in the case of a single event.

## Parametric boostrapping

```pb = cached(ddply(obsts, .(s), .progress = "text", function(slice)
paraboot(slice, model.ghlp, model.diff, n = 1000)))
```

First, per-subject density plots.

```ggplot(pb) +
geom_densitybw(aes(dMSE, fill = model), alpha = 1/2) +
facet_wrap(~ s, scales = "free_y") +
xlab("MSE (greater value means better fit of model.diff)") +
scale_fill_discrete(name = "Generating model", labels = c(
model.diff = "Diff",
model.ghlp = "GHM")) +
scale_y_continuous(breaks = NULL) +
scale_x_continuous(breaks = seq(-.02, .01, .01)) +
coord_cartesian(xlim = c(-.02, .02)) +
theme_bw(base_size = 12) + no.gridlines()
```

Here are all subject–model pairs for which the mean fit for the non-generating model was greater than the mean fit of the generating model.

```rd(d = 6, ss(ddply(pb, .(s, model), function(s) c(MdMSE = mean(s\$dMSE))),
model == "model.diff" & MdMSE <= 0 |
model == "model.ghlp" & MdMSE >= 0))
```
s model MdMSE
7 s50 model.diff -0.005421
14 s53 model.ghlp 0.000001
30 s61 model.ghlp 0.000217
31 s62 model.diff -0.000099
39 s67 model.diff -0.000389
47 s71 model.diff -0.000448
49 s72 model.diff -0.000173
55 s75 model.diff -0.000619

Now let's look at the conditional probabilities for each subject. In the below plot, the x-axis specifies the probability that model.ghlp fit better given that model.ghlp was the generating model, and similarly for the y-axis with model.diff.

```d = dcast(formula = s ~ model, value.var = "p",
ddply(pb, .(s, model), function(slice) data.frame(p =
mean((if (slice\$model == "model.diff") 1 else -1) *
slice\$dMSE > 0))))
ggplot(d) +
geom_point(aes(model.ghlp, model.diff)) +
coord_equal(xlim = c(.5, 1), ylim = c(.5, 1))
```

The points are clustered horizontally on the right, but are about uniform vertically. It looks like model.ghlp's data is less likely to be better fit by model.diff than the reverse.

The correlation isn't big:

```round(digits = 3,
cor(d\$model.ghlp, d\$model.diff, method = "kendall"))
```
value
0.204

Collapsing across subjects, the means of the conditional probabilities are:

```round(digits = 3, c(model.ghlp = mean(d\$model.ghlp),
model.diff = mean(d\$model.diff)))
```
value
model.ghlp 0.801
model.diff 0.680

This makes sense seeing as model.ghlp has one more parameter than model.diff.

## Parameter-space exploration

### With Diff as the generating model

Let's start by just fitting all the subjects with model.diff to get a sense of what reasonable parameter values are.

```fits.diff = ordf(mse, `_data` = ddply(obsts, .(s), function(slice)
{f = fit.model(slice, model.diff)
as.data.frame(t(c(f\$par, mse = f\$mse)))}))
row.names(fits.diff) = char(fits.diff\$s)
fits.diff = ss(fits.diff, select = -s)
rd(rbind(fits.diff,
mapcols(fits.diff, function(v) quantile(v, c(.025, .975)))))
```
kappa rho mse
s53 15.904 0.043 0.009
s62 1.467 0.309 0.024
s61 0.004 2.636 0.030
s48 24.448 0.007 0.037
s76 7.127 0.028 0.038
s77 724.508 0.001 0.040
s56 1.134 0.211 0.042
s64 2.346 0.092 0.052
s66 1.456 0.115 0.054
s71 0.216 0.659 0.061
s73 0.778 0.207 0.061
s59 0.562 0.306 0.061
s67 0.128 0.555 0.065
s68 11.608 0.018 0.067
s55 0.886 0.179 0.069
s74 0.740 0.262 0.070
s63 4.865 0.022 0.071
s69 0.900 0.164 0.075
s54 2.217 0.051 0.076
s83 1.039 0.147 0.080
s84 0.512 0.193 0.085
s70 0.895 0.119 0.089
s82 0.371 0.250 0.100
s60 0.875 0.082 0.107
s52 0.729 0.119 0.113
s57 1.093 0.081 0.114
s51 0.801 0.097 0.122
s81 0.361 0.172 0.126
s72 0.022 2.647 0.127
s75 0.073 0.244 0.128
s58 0.419 0.148 0.142
s49 0.697 0.066 0.165
s79 0.445 0.113 0.169
s50 0.000 0.079 0.190
s80 0.436 0.059 0.211
s30 0.642 0.024 0.230
2.5% 0.004 0.006 0.022
97.5% 111.955 2.637 0.214

All right, now, which portions of the parameter space are "interesting", in the sense of not resulting in degenerate choice patterns for the Builder test set?

```llsgrid.diff = cached(local(
{grid = expand.grid(
kappa = exp(seq(-20, 10, len = 150)),
rho = exp(seq(-15, 20, len = 150)))
ddply(grid, .(kappa, rho), .progress = "text", function(row)
{pd = preddata(builder.test.ts, model.diff, num(row))
data.frame(expected.lls = mean(pd\$pll), sdilogit = sd(ilogit(pd\$pll)))})}))
```

First, a graph of expected proportion of LLs chosen.

```ggplot(llsgrid.diff) +
geom_tile(aes(log(kappa), log(rho), fill = expected.lls)) +
scale_x_continuous(breaks = seq(-20, 10, 5), expand = c(0, 0)) +
scale_y_continuous(breaks = seq(-15, 20, 5), expand = c(0, 0))
```

Now a figure showing the standard deviation of the inverse logits of the choice probabilities across the test set. The higher this number, the more heterogeneity there is across the test set; when it's 0, the choice probability is constant.

```ggplot(llsgrid.diff) +
geom_tile(aes(log(kappa), log(rho), fill = sdilogit)) +
scale_fill_continuous(low = "black", high = "white", limits = c(0, .12), na.value = "red") +
scale_x_continuous(breaks = seq(-20, 10, 5), expand = c(0, 0)) +
scale_y_continuous(breaks = seq(-15, 20, 5), expand = c(0, 0))
```

So we have roughly three regions where the model is interesting enough that checking for mimicry makes sense, corresponding to the white bar and the two gray bars. Or we could just examine the intersection point and the space around it. Yeah, that's what I'll do.

```msegrid = function(fitting.model, n.replicates = 5)
{grid = expand.grid(
kappa = exp(seq(-7, 7, len = 150)),
rho = exp(seq(-8, 5, len = 150)))
cbind(grid, mse = aaply(1 : nrow(grid), 1, .progress = "text", function(i)
median(replicate(n.replicates,
# I take the median of MSEs from several replicates,
# instead of just using one, in order to reduce
# noise in the plots.
fit.model(model = fitting.model, ts =
gendata(builder.test.ts, model.diff, num(grid[i,])))\$mse))))}

plot.msegrid = function(g) ggplot(g) +
geom_tile(aes(log(kappa), log(rho), fill = mse)) +
scale_x_continuous(breaks = -7:7, expand = c(0, 0)) +
scale_y_continuous(breaks = -8:5, expand = c(0, 0)) +
inset.legend(c(1, 1))

msegrid.diff = cached(msegrid(model.diff))
msegrid.ghlp = cached(msegrid(model.ghlp))
```
```plot.msegrid(msegrid.diff) +
scale_fill_continuous(
high = "black", low = "white", na.value = "red",
limits = c(0, .6))
```

This seems to generally make sense. The upper part of the plot is white because rho is high and therefore there's little noise in the data; the lower part is dark gray because rho is low and therefore the data is mostly noise. I'm not sure what to make of the very pale gray spike.

```plot.msegrid(msegrid.ghlp) +
scale_fill_continuous(
high = "black", low = "white", na.value = "red",
limits = c(0, .6))
```

This looks weirder, which makes sense considering that the generating and fitting models no longer match. In general, ghlp has trouble with that central column.

Now for the difference grid.

```plot.msegrid(transform(msegrid.ghlp, mse = mse - msegrid.diff\$mse)) +
geom_point(data = data.frame(fits.diff),
aes(log(kappa), log(rho)), shape = "x", size = 5) +
coord_cartesian(xlim = c(-7, 7))
```

model.diff generally does better, particularly in the central column. It's nice that the parameter vectors estimated from the subjects are in regions of reasonable favor towards diff. (Note that one subject, whose estimated kappa was very small, is omitted.)

### With GH as the generating model

```fits.ghlp = ordf(mse, `_data` = ddply(obsts, .(s), function(slice)
{f = fit.model(slice, model.ghlp)
data.frame(
log.curve = log(f\$par["curve"]),
log.k = log(f\$par["k"]),
log.rho = log(f\$par["rho"]),
mse = f\$mse)}))
row.names(fits.ghlp) = char(fits.ghlp\$s)
fits.ghlp = ss(fits.ghlp, select = -s)
rd(d = 2, rbind(fits.ghlp,
mapcols(fits.ghlp, function(v) quantile(v, c(.025, .975))),
mapcols(fits.ghlp, range)))
```
log.curve log.k log.rho mse
s53 -10.06 -1.15 -1.08 0.01
s62 -16.85 -2.73 -0.82 0.02
s61 -13.00 -8.32 0.93 0.03
s56 -18.94 -3.04 -0.77 0.04
s77 -9.31 -1.01 -1.49 0.04
s76 -14.93 -1.66 -1.45 0.04
s48 -13.63 -1.54 -1.55 0.04
s66 -10.66 -2.85 -1.29 0.05
s64 -2.86 -1.93 -1.39 0.05
s59 -18.47 -3.75 -0.33 0.05
s73 -13.25 -3.39 -1.01 0.06
s71 -4.83 -4.38 0.14 0.06
s55 -17.31 -3.28 -1.06 0.06
s68 -15.94 -1.51 -1.65 0.07
s74 -20.68 -3.47 -0.81 0.07
s67 -2.41 -4.11 2.61 0.07
s83 -15.77 -3.10 -1.31 0.07
s69 -4.85 -3.13 -1.14 0.07
s84 -18.56 -3.86 -0.85 0.08
s54 -2.05 -1.76 -1.58 0.08
s63 -16.73 -2.04 -1.79 0.08
s82 -20.39 -4.18 -0.53 0.08
s70 -20.00 -3.27 -1.34 0.09
s51 2.16 0.37 -0.52 0.09
s60 -16.41 -3.22 -1.60 0.10
s52 -15.59 -3.38 -1.49 0.11
s57 -19.30 -2.95 -1.71 0.11
s72 -18.92 -5.36 -1.18 0.11
s75 -17.60 -5.37 -1.20 0.12
s81 -3.09 -3.45 -0.96 0.13
s58 -6.22 -3.96 -1.33 0.14
s49 -16.49 -3.40 -1.91 0.16
s79 -19.02 -3.95 -1.50 0.16
s50 0.56 -18.15 -2.53 0.19
s80 -18.57 -3.89 -2.09 0.20
s30 -1.76 -2.34 -2.45 0.23
2.5% -20.42 -9.55 -2.46 0.02
97.5% 0.76 -0.83 1.14 0.20
39 -20.68 -18.15 -2.53 0.01
40 2.16 0.37 2.61 0.23

Now, I can't very well use heatmaps and so on with a three-dimensional parameter space, so we have to more or less drop a parameter. Christian suggested dropping rho, since it is theoretically the least important parameter; makes sense to me. Let's freeze it at the median (which is the same thing as the antilog of the median of the logs).

```rho.med = exp(median(fits.ghlp\$log.rho))
c(
"median of logs" = median(fits.ghlp\$log.rho),
"median" = rho.med)
```
value
median of logs -1.2955832
median 0.2737382

Now let's get point estimates again with rho fixed at rho.med:

```model.ghlp.mr = model.ghlp.frho(rho.med)
fits.ghlp.mr = ordf(mse, `_data` = ddply(obsts, .(s), function(slice)
{f = fit.model(slice, model.ghlp.mr)
data.frame(
log.curve = log(f\$par["curve"]),
log.k = log(f\$par["k"]),
mse = f\$mse)}))
row.names(fits.ghlp.mr) = char(fits.ghlp.mr\$s)
fits.ghlp.mr = ss(fits.ghlp.mr, select = -s)
rd(d = 2, rbind(fits.ghlp.mr,
mapcols(fits.ghlp.mr, function(v) quantile(v, c(.025, .975))),
mapcols(fits.ghlp.mr, range)))
```
log.curve log.k mse
s53 -7.59 -1.00 0.01
s62 -16.89 -2.61 0.02
s56 -20.44 -3.00 0.04
s77 -8.82 -1.14 0.04
s76 -11.17 -1.67 0.04
s48 -11.17 -1.57 0.04
s61 0.07 -Inf 0.05
s66 -19.31 -2.85 0.05
s64 -2.25 -1.83 0.05
s73 -20.19 -3.36 0.06
s59 -19.38 -3.64 0.06
s55 -18.83 -3.25 0.06
s68 -1.15 -1.07 0.07
s74 -21.40 -3.40 0.07
s83 -17.47 -3.10 0.07
s69 -13.84 -3.19 0.07
s67 -20.41 -5.04 0.08
s54 -0.92 -1.39 0.08
s84 -19.75 -3.80 0.08
s70 -18.18 -3.28 0.09
s63 -11.76 -2.01 0.09
s71 -8.38 -4.62 0.09
s82 -20.71 -4.12 0.09
s51 0.82 -0.59 0.10
s60 -6.02 -3.24 0.11
s52 -5.41 -3.36 0.11
s72 -19.55 -5.38 0.11
s57 -14.58 -3.12 0.11
s75 -19.41 -5.38 0.12
s81 -3.74 -3.66 0.13
s58 -5.86 -3.94 0.14
s79 -19.43 -3.97 0.16
s49 -3.00 -3.01 0.16
s50 137.35 129.03 0.21
s80 -5.53 -3.93 0.21
s30 1.18 -0.52 0.25
2.5% -20.80 -Inf 0.02
97.5% 18.20 15.67 0.22
39 -21.40 -Inf 0.01
40 137.35 129.03 0.25

Removing outliers:

```d = ss(fits.ghlp.mr, (log.curve < 100 | log.k < 100) & is.finite(log.k))
rd(d = 3, rbind(
mapcols(d, function(v) quantile(v, c(.025, .975))),
mapcols(d, range)))
```
log.curve log.k mse
2.5% -20.834 -5.382 0.022
97.5% 0.883 -0.582 0.220
-21.397 -5.384 0.009
1.175 -0.524 0.246
```llsgrid.ghlp.mr = cached(local(
{grid = expand.grid(
curve = exp(seq(-25, 10, len = 150)),
k = exp(seq(-8, 4, len = 150)))
ddply(grid, .(curve, k), .progress = "text", function(row)
{pd = preddata(builder.test.ts, model.ghlp.mr, num(row))
data.frame(expected.lls = mean(pd\$pll), sdilogit = sd(ilogit(pd\$pll)))})}))
```
```ggplot(llsgrid.ghlp.mr) +
geom_tile(aes(log(curve), log(k), fill = sdilogit)) +
scale_fill_continuous(low = "black", high = "white", limits = c(0, .12), na.value = "red") +
scale_x_continuous(breaks = seq(-25, 10, 5), expand = c(0, 0)) +
scale_y_continuous(breaks = seq(-8, 4, 2), expand = c(0, 0))
```
```msegrid.mr = function(fitting.model, n.replicates = 5)
{grid = expand.grid(
curve = exp(seq(-25, 10, len = 150)),
k = exp(seq(-8, 4, len = 150)))
cbind(grid, mse = aaply(1 : nrow(grid), 1, .progress = "text", function(i)
median(replicate(n.replicates,
# I take the median of MSEs from several replicates,
# instead of just using one, in order to reduce
# noise in the plots.
fit.model(model = fitting.model, ts =
gendata(builder.test.ts, model.ghlp.mr, num(grid[i,])))\$mse))))}

plot.msegrid.mr = function(g) ggplot(g) +
geom_tile(aes(log(curve), log(k), fill = mse)) +
scale_x_continuous(breaks = seq(-25, 10, 5), expand = c(0, 0)) +
scale_y_continuous(breaks = seq(-8, 4, 2), expand = c(0, 0)) +
inset.legend(c(1, 0))

msegrid.mr.diff = cached(msegrid.mr(model.diff))
msegrid.mr.ghlp = cached(msegrid.mr(model.ghlp))
```
```plot.msegrid.mr(msegrid.mr.ghlp) +
scale_fill_continuous(
high = "black", low = "white", na.value = "red",
limits = c(0, .3), breaks = seq(0, .3, .1))
```
```plot.msegrid.mr(msegrid.mr.diff) +
scale_fill_continuous(
high = "black", low = "white", na.value = "red",
limits = c(0, .6))
```

The difference grid:

```plot.msegrid.mr(transform(msegrid.mr.diff, mse = mse - msegrid.mr.ghlp\$mse)) +
geom_point(data = data.frame(fits.ghlp.mr),
aes(log.curve, log.k), shape = "x", size = 5) +
coord_cartesian(xlim = c(-25, 5), ylim = c(-8, 2))
```

## Writeup

### Intro

Each of the previous studies compared two discounting models to each other. In this study, we compared the most general of the discounting models, GH, with an attribute-based model, Diff. One might expect that even if discounting models are highly capable of mimicking each other, a model that uses an entirely different approach to intertemporal choice may not be so capable of mimicking a discounting model, nor vice versa.

### Method

We parametrized GH with the log-odds of choosing LL set to

$ρ [r_L (1 + αd_L)^{-β/α} - r_S (1 + αd_S)^{-β/α} ]$

with α ≥ 0, β > 0, and ρ > 0. We parametrized Diff with the log-odds of choosing LL set to

$ρ [r_L - r_S - κ(d_L - d_S)]$

with κ > 0 and ρ > 0. We estimated model parameters by minimizing MSE with GNU R's function `optim` using its default `Nelder-Mead` method, along with some explicit logic to ensure that parameters were chosen to obtain MSE 0 when the input data was all-SS or all-LL. As our set of quartets for parameter-space exploration, we used Arfer and Luhmann's (2015) standard set of 100.

### Results and discussion

fig--g/pb-densities shows density plots of parametrically bootstrapped model fit (MSE) for each subject and each generating model. For most subjects and models, the mean fit favors the generating model. (The exceptions are subjects s53 and s61, for whom Diff achieves better mean fit when GH is the generating model, and subjects s50, s62, s67, s71, s72, and s75, for whom the reverse holds.) Still, one can see substantial visible overlap between bootstrap distributions. Collapsing across subjects, we find that when GH is the generating model, it achieves better fit than Diff in 80% of replicates, whereas when Diff is the generating model, it achieves better fit than GH in 68% of replicates. Thus, discrimination on the basis of model fit usually decides correctly, but has a non-negligible error rate, and is biased towards choosing GH. This bias is consistent with the models' different numbers of parameters: GH has 3 parameters whereas Diff has 2.

To explore mimicry with Diff as the generating model across its parameter space, we first needed to choose the rectangular section of the parameter space to inspect. Per-subject point estimates for original data had $\log ρ$ ranging from −8 to 1 and $\log κ$ ranging from −6 to 7, excepting one subject (s50) with an estimated $\log κ$ of −16. fig--g/diff-sd is a heatmap of the SD of inverse logits of choice probabilities across the Arfer and Luhmann (2015) quartets given Diff with the given parameter pair as the generating model. The darker each pixel of the heatmap, the closer the data is to being degenerate, with all choice probabilities being equal. Thus, the nontrivial portions of the parameter space (relative to the Arfer and Luhmann quartets) are given by the lighter regions. On this basis, we decided to inspect $\log ρ$ from −8 to 5 and $\log κ$ from −7 to 7. fig--g/dmsegrid is a heatmap of the difference of model fits of GH and Diff for data generated by Diff with each parameter pair, where positive MSE indicates greater fit by Diff. The original estimate for each subject is marked by a cross (all subjects are visible except s50). Observe that in nontrivial portions of the parameter space, Diff indeed achieves better fit, and subjects tend to cluster in areas where Diff achieves better fit.

To explore mimicry with GH as the generating model across its own parameter space, we again needed to choose a rectangular section of the parameter space to inspect. Since GH has 3 parameters, this required dropping a dimension. We opted to drop ρ because it is the least theoretically important parameter. We did this by making per-subject point estimates with all parameters free to vary, and calculating the median of the estimated ρ values, which was 0.274. We then redid these estimates with ρ fixed at the median, yielding $\log α$ ranging from −22 to 2 and $\log β$ ranging from −6 to 0, excepting one subject (s50) with estimates of 137 and 129 (respectively) and another subject (s61) with a $\log β$ estimate of −∞ (i.e., with β = 0). fig--g/ghlp-mr-sd is a heatmap similar to fig--g/diff-sd, and fig--g/dmsegrid-mr is similar to fig--g/dmsegrid but with positive values meaning greater fit for GH. (Note that although all data was generated with a single value of ρ, ρ was free to vary when GH was fitted for fig--g/dmsegrid-mr.) The overall impression is similar to that given in the parameter-space exploration of Diff, but stronger. The only large advantages in terms of fit are in GH's favor, and subjects' estimates tend to be in regions of such advantages.

In summary, the parametric-bootstrapping analyses as well as the two parameter-space explorations all consistently suggest that Diff and GH are not always easy to distinguish, but on average, model fit strongly favors the generating model.

## References

Arfer, K. B., & Luhmann, C. C. (2015). The predictive accuracy of intertemporal-choice models. British Journal of Mathematical and Statistical Psychology. Advance online publication. doi:10.1111/bmsp.12049. Retrieved from http://arfer.net/projects/builder/paper

Loewenstein, G., & Prelec, D. (1992). Anomalies in intertemporal choice: Evidence and an interpretation. Quarterly Journal of Economics, 107(2), 573–597. doi:10.2307/2118482

Scholten, M., & Read, D. (2013). Time and outcome framing in intertemporal tradeoffs. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(4), 1192–1212. doi:10.1037/a0031171

Wagenmakers, E.-J., Ratcliff, R., Gomez, P., & Iverson, G. J. (2004). Assessing model mimicry using the parametric bootstrap. Journal of Mathematical Psychology, 48(1), 28–50. doi:10.1016/j.jmp.2003.11.004