How do people choose between a smaller reward available sooner and a larger reward available later? Past research has evaluated models of intertemporal choice by measuring goodness of fit or identifying which decision-making anomalies they can accommodate. An alternative criterion for model quality, which is partly antithetical to these standard criteria, is predictive accuracy. In the current study, we compared four representative models of intertemporal choice. In a preliminary analysis, parameter estimates and goodness of fit measures were derived from a single dataset. The results favored more complex models. Our second analysis, by contrast, used separate datasets for training and testing models, allowing predictive accuracy to be assessed. Training sets were adaptively constructed to precisely estimate model parameters for each subject. Results indicated that a highly general discounting model performed no better than a traditional exponential discounting model, and that attribute-based models performed substantially worse than discounting models. We suggest that exponential discounting is a happy medium between models that are too simple and models that are too complex. Our results support the enduring utility of discounting models, and exemplify the value of predictive accuracy as a criterion for model comparison.
People frequently need to choose between one outcome available soon (the smaller sooner, or SS, option) and a more desirable outcome available later (the larger later, or LL, option). Deciding whether to indulge in a dessert or stick to a diet, splurge on an impulse buy or save up for a more desirable item, or relax or study for an upcoming exam can all be characterized as intertemporal choices. Accordingly, laboratory measures of individual differences in intertemporal preferences have been associated with variables such as body mass index (Sutter, Kocher, Rützler, & Trautmann, 2010), credit-card debt (Meier & Sprenger, 2010), heroin addiction (Madden, Petry, Badger, & Bickel, 1997), and diagnoses of attention-deficit disorder (Demurie, Roeyers, Baeyens, & Sonuga-Barke, 2012).
Extant models of intertemporal choice
Despite the real-world relevance of laboratory measures of intertemporal choice, the question of how to quantitatively model such choices, even in simple laboratory tasks, is far from settled (see Doyle, 2013 for a broad survey). Samuelson (1937) influentially proposed that people discount rewards based on the delay until their receipt. According to this proposal, $100 to be received a month in the future is treated as equivalent to some smaller amount of money available now, its immediacy equivalent. A discounting function f maps delays to numbers in [0, 1] called discount factors, which ultimately determine the immediacy equivalent of any delayed reward. For example, $100 \cdot f(1\ \text{month})$ would yield the immediacy equivalent of $100 delayed by 1 month. In a binary forced-choice task, it is assumed that people will prefer the option with the greater discounted value.
The most thoroughly studied discounting function, from both descriptive and normative perspectives, is an exponential function. An exponential discounting function is of the form $f(t) = e^{-kt}$, where t is the delay and the discount rate, k ∈ [0, ∞), is a free parameter typically assumed to vary across subjects. Among the normatively appealing features of exponential discounting is dynamic consistency: given a pair of delayed rewards, an exponential discounter's preference will never reverse merely from the passage of time (Koopmans, 1960). However, humans have been found to be dynamically inconsistent: someone who prefers $110 in a year and a month over $100 in a year may also prefer $100 today over $110 in a month, suggesting a reversal of preferences over the course of a year (e.g., Green, Fristoe, & Myerson, 1994; Kirby & Herrnstein, 1995). The importance of dynamic inconsistency is suggested by its real-world consequences. On New Year's Eve, one might prefer starting an exercise routine to slacking off on the following Monday, but then, when Monday arrives, one's mind may change. Indeed, dynamic inconsistency has been identified as the econometric manifestation of failures of self-control (Rachlin, 1995; Luhmann & Trimber, 2013).
An alternative discounting function is a hyperbolic function of the form $f(t) = 1/(1 + kt)$, where the discount rate, k ∈ [0, ∞), is a free parameter (Chung & Herrnstein, 1967; Mazur, 1987). Unlike exponential discounting, hyperbolic discounting allows for dynamically inconsistent preferences. Perhaps for this reason, hyperbolic discounting functions have consistently achieved better fits to laboratory data than exponential discounting functions (e.g., Kirby & Maraković, 1995; Myerson & Green, 1995; Madden, Bickel, & Jacobs, 1999; McKerchar et al., 2009). However, a longstanding emphasis on falsifying exponential discounting has led to insufficiently critical appraisal of hyperbolic discounting. Simply because the hyperbolic model outperforms the exponential model does not imply that the former is satisfactory in general. Indeed, there is evidence that people do not discount hyperbolically, either. For example, hyperbolic discounting implies effectively increasing patience as the front-end delay between two options increases, but Luhmann (2013) found that observed increases in patience were systematically smaller than predicted under a hyperbolic model. Myerson and Green (1995), on the other hand, found their data were better described by a model with increases in patience larger than predicted under a hyperbolic model. Clearly, a good model of self-control needs to characterize precisely how people are dynamically inconsistent.
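To make the contrast concrete, the preference-reversal example above can be checked directly. The sketch below uses an arbitrary illustrative discount rate of k = .01 per day and a 30-day month; only the hyperbolic discounter reverses as the front-end delay grows.

```python
import math

def exp_factor(t, k=.01):
    """Exponential discount factor at a delay of t days."""
    return math.exp(-k * t)

def hyp_factor(t, k=.01):
    """Hyperbolic discount factor at a delay of t days."""
    return 1 / (1 + k * t)

for f in (exp_factor, hyp_factor):
    prefers_ss_now = 100 * f(0) > 110 * f(30)       # $100 today vs. $110 in a month
    prefers_ss_later = 100 * f(365) > 110 * f(395)  # the same pair, one year out
    print(f.__name__, prefers_ss_now, prefers_ss_later)
# exp_factor True True   (dynamically consistent)
# hyp_factor True False  (preference reverses)
```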
If people are dynamically inconsistent, but not as dictated by hyperbolic discounting, a natural response would be a model in which the degree of dynamic inconsistency is controlled by a separate free parameter. In the generalized hyperbolic model of Loewenstein and Prelec (1992) (see also Benhabib, Bisin, & Schotter, 2010), which is of the form $f(t) = (1 + \alpha t)^{-\beta/\alpha}$, dynamic inconsistency is controlled by α ∈ (0, ∞). (The other parameter, β ∈ [0, ∞), is analogous to the ks above.) Observe that the generalized hyperbolic model includes both exponential and hyperbolic discount functions as special cases. As α tends to 0, generalized hyperbolic discounting approaches exponential discounting (i.e., $\lim_{\alpha \to 0^{+}} (1 + \alpha t)^{-\beta/\alpha} = e^{-\beta t}$), and an ordinary hyperbolic discounter with fixed k is equivalent to a generalized hyperbolic discounter with α = k and β = k. Other values of α and β correspond to dynamic inconsistency greater than that of hyperbolic discounting, or to degrees intermediate between exponential and hyperbolic.
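The exponential limit follows in one line from the fact that ln(1 + x)/x → 1 as x → 0:

```latex
(1+\alpha t)^{-\beta/\alpha}
  = \exp\!\left[ -\beta t \cdot \frac{\ln(1+\alpha t)}{\alpha t} \right]
  \longrightarrow e^{-\beta t}
  \quad \text{as } \alpha \to 0^{+}.
```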
Recently, it has been questioned whether people use discount functions at all. That is, are intertemporal decisions in fact made by computing and comparing immediacy equivalents? Scholten and Read (e.g., Scholten & Read, 2010; Scholten & Read, 2013) have investigated attribute-based models as alternatives to discounting. In an attribute-based model, alternatives are compared along various dimensions and the decision is then made by aggregating over these comparisons. Perhaps the simplest credible attribute-based model is what we will call the difference model. Representing preference for LL as positive and preference for SS as negative, the difference model is specified as $a(r_L - r_S) - b(t_L - t_S)$. The difference model uses weighting parameters a, b ∈ [0, ∞) to judge whether the improvement in reward offered by LL, $r_L - r_S$, compensates for the extra delay, $t_L - t_S$. The difference model may be regarded as the "arithmetic discounting" model of Doyle and Chen (2012) generalized to the case of two delayed rewards. A more complex attribute-based model is the basic tradeoff model of Scholten and Read (2010), which (in the case of gains only, i.e., positive rewards) can be instantiated as $\frac{1}{\gamma}\ln\!\left(\frac{1 + \gamma r_L}{1 + \gamma r_S}\right) - \frac{1}{\tau}\ln\!\left(\frac{1 + \tau t_L}{1 + \tau t_S}\right)$. The free parameters γ and τ weight rewards and delays, respectively, analogously to a and b in the difference model. The basic tradeoff model accommodates anomalies addressed by some discounting models, such as dynamic inconsistency. It also accommodates inseparability of rewards and delays, which can be observed in humans but which no discounting model permits (Scholten & Read, 2010).
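Both attribute-based preferences can be written down directly. In the sketch below, the weights are arbitrary illustrative values, and the tradeoff instantiation uses the logarithmic forms given above; positive output means LL is preferred.

```python
import math

def diff_preference(t_s, r_s, t_l, r_l, a=1.0, b=0.5):
    """Difference model: weigh the reward improvement against the extra delay."""
    return a * (r_l - r_s) - b * (t_l - t_s)

def tradeoff_preference(t_s, r_s, t_l, r_l, gamma=.1, tau=.1):
    """Basic tradeoff model (gains only): log-compressed attribute comparison."""
    reward_advantage = math.log((1 + gamma * r_l) / (1 + gamma * r_s)) / gamma
    delay_cost = math.log((1 + tau * t_l) / (1 + tau * t_s)) / tau
    return reward_advantage - delay_cost

# $40 in 10 days versus $60 in 40 days:
print(diff_preference(10, 40, 40, 60))      # 5.0 -> prefer LL
print(tradeoff_preference(10, 40, 40, 60))  # about -5.8 -> prefer SS
```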
Limitations of past work
As discussed in the foregoing, prior work has focused on identifying anomalous choice patterns and designing models that can accommodate such anomalies better than other models (e.g., Rachlin & Green, 1972; Rodriguez & Logue, 1988; Scholten & Read, 2010). This anomaly-centered approach is analogous to the heuristics-and-biases program of Kahneman and Tversky (Tversky & Kahneman, 1973; Tversky & Kahneman, 1981). The trend is towards ever more complex and inclusive models. The goal of maximally inclusive models, however, is at odds with the goal of predictive accuracy. The question of predictive accuracy (or, in more psychometric terms, predictive validity) is: how well can a given model predict people's intertemporal choices? More inclusive models are more vulnerable to overfitting, since they are more likely to mistake noise for signal. They are also less efficient (in the sense that they require more data to estimate parameters with comparable precision), because they attempt to learn a more complex data-generating process. Both overfitting and lack of efficiency threaten predictive accuracy.
Why should we care about predictive accuracy? For one thing, predictive accuracy is important for practical reasons. A model must be accurate to be useful in applications: if a model cannot correctly predict people's behavior, then it is uninformative. And the more accurate a model, the more useful it is.
Predictive accuracy is also valuable for basic research because it represents a measure of model quality that penalizes overfitting. Consider, by contrast, the usual procedure in studies that compare models of intertemporal choice, in which model parameters are estimated and model performance is compared with the same data (e.g., Kirby & Maraković, 1995; Myerson & Green, 1995; Madden et al., 1999; Doyle & Chen, 2012; contrast with Scholten & Read, 2013 and Toubia, Johnson, Evgeniou, & Delquié, 2013). Under this procedure, models will not be penalized for mistaking noise in the data as reliable patterns. In fact, accounting for noise in the dataset will be rewarded, such that more inclusive (e.g., more complex) models will be favored. Such an arrangement is contrary to parsimony and makes comparisons between models of differing complexity difficult.
A common approach to sidestepping these problems when comparing models is to use a measure of performance that includes some penalty for complexity. Two prominent examples are the Akaike information criterion (AIC; Akaike, 1973) and the Bayesian information criterion (BIC; Schwarz, 1978). Minimum description length (Rissanen, 1978) is a more sophisticated alternative. These techniques have two limitations. First, they are limited by the particular measure of complexity employed by the chosen procedure. For example, the usual instantiation of AIC measures the complexity of a model by its count of parameters, ignoring other ways in which models differ. Second, the asymptotic guarantees that justify these techniques as countermeasures to overfitting (such as the equivalence of AIC to cross-validation under certain conditions; Fang, 2011) do not apply to finite samples.
The only direct way to assess predictive accuracy is to use separate training data and testing data. That is, for a given model, parameters should be estimated using one dataset (the training set) and performance should be assessed using a separate dataset (the test set). This procedure corresponds to how science generally values theories that make new predictions over those that only explain what is already known.
Even putting aside the issue of overfitting, there are a number of smaller ways in which the stimuli used in most intertemporal-choice studies are less than ideal for assessing model accuracy. For example, participants are typically only offered choices between immediate and delayed rewards (e.g., Kirby & Maraković, 1995; Myerson & Green, 1995; Peters, Miedl, & Büchel, 2012; Scholten & Read, 2013), never having to choose between two delayed rewards (cf. Rachlin & Green, 1972; Read & Read, 2004; McClure, Laibson, Loewenstein, & Cohen, 2004; Luhmann & Trimber, 2013). While it is not clear whether choices between an immediate reward and a delayed reward or between two delayed rewards are more common or more important in nature, excluding delayed–delayed choices entirely precludes testing models' predictions about them. Delays are also typically limited to a handful of unique values across many trials, and they are frequently non-round quantities, such as 17 days, which may be susceptible to mental rounding. These features of stimuli make it harder to argue that models are being tested with appropriately representative data.
Fixed datasets are also problematic for model comparison. A given dataset may be more informative for estimating the parameters of some models than the parameters of others. As a simplistic example, suppose the models to be compared are a "reward model", which uses only reward amounts and ignores delay lengths, and a "delay model", which uses only delay lengths and ignores reward amounts. A dataset with a wide variety of amounts and only a few distinct delays would be more helpful for estimating the parameters of the reward model than the delay model. More subtly, a given dataset may be more informative for some subjects than others. A dataset with a variety of delays clustered near 0 could estimate patience more precisely for impatient subjects than patient subjects. Given substantial individual differences in intertemporal choice (e.g., Beck & Triplett, 2009), this issue is hard to dismiss. In summary, observed differences in model performance may reflect the choice of stimuli more than the overall quality of the models being compared.
The present study
The goal of the present study was to compare four representative models of intertemporal choice. Subjects completed a standard test set of 100 binary forced choices that covered a variety of delays and rewards. Then each subject was randomly assigned one of the four models, and completed a training set of 50 trials that were constructed in real time to adaptively estimate the model's parameters based on that subject's choices.
We present two model-comparison analyses. In the first, as is conventional in the intertemporal-choice literature, we estimate model parameters and assess model performance using the same dataset, namely, the test set. In the second analysis, we estimate parameters with the training set and assess performance with the test set. In both analyses, we quantify accuracy in two ways, as the number of correctly predicted choices and as the error between predicted choice probabilities and observed choices.
We estimate model parameters and predict choices using Bayesian methods. In Bayesian statistical inference (for an introduction, see Kruschke, 2013 and Kruschke, 2011), model parameters are treated as random variables, allowing one to make probability statements such as "θ is 70% likely to be greater than 3". Bayesian methods are especially appropriate for our study because they provide a general, theoretically straightforward means of predictive inference, in which a posterior probability distribution for future observations is automatically induced by the observed data, the model, and the initial assumed ("prior") distributions of the model parameters. In fact, all but one chapter of Geisser's (1993) monograph on prediction are devoted to Bayesian methods.
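In symbols, for observed data y and a future observation ỹ, the posterior predictive distribution averages the likelihood over the posterior:

```latex
p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta
```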
One difficulty with Bayesian methods is the need to choose priors. Gelman, Carlin, Stern, and Rubin (2004) recommend that in any data-analysis problem, one should try different priors, as well as experiment with the rest of the model structure, and observe the effects on the inferences of interest. This strategy was not available to us because the current study required us to make inferences automatically in real time. Instead, we experimented with our models using informal cross-validation analyses of data from a previous, unpublished study. The final models we present here have independent uniform prior distributions for all parameters, but we shaped the prior distribution of choice probabilities indirectly by how we parametrized the models (e.g., by taking the logarithm of a parameter as part of computing the likelihood).
We do not use hierarchical models. That is, the modeling for each subject is carried out independently, rather than using information from other subjects. This is because we were interested in the performance of our models for inference using only individual participants' data.
Our models are regression models and therefore assume that, conditional on all parameters and predictors, choices are iid. Note, however, that our conclusions in this study do not entirely depend on this assumption: if this assumption were incorrect, the predictive performance of our models would be impaired accordingly. On the other hand, the between-subject comparisons we present as our final data analysis do of course assume independent sampling of subjects.
All our models have a similar overall form, which is a type of logistic regression. Information about the two choices is input to the model through $t_L$ and $t_S$, the delays in days, and $r_L$ and $r_S$, the rewards in dollars. The model uses these four predictors and some model-specific parameters to produce a preference on the scale of (−∞, +∞), with positive numbers representing preference for LL and negative numbers representing preference for SS. Then the logistic function, $\operatorname{logistic}(x) = 1/(1 + e^{-x})$, is used to map this preference value to [0, 1], and the output represents the probability that the subject chooses LL. We use the notation $Y \sim \operatorname{Bern}(\theta)$ to mean that Y is a Bernoulli-distributed random variable with parameter θ; that is, Y is realized as 1 with probability θ and 0 with probability 1 − θ. We use the notation $\theta \sim \operatorname{Unif}(a, b)$ to mean that the parameter θ has a uniform prior density ranging from a to b.
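A minimal sketch of this shared scaffold follows; the model-specific preference computations are described next, and the function names are ours.

```python
import math
import random

def logistic(x):
    """Map a preference in (-inf, +inf) to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

def simulate_choice(preference, rng=random.Random(0)):
    """Bernoulli draw: 1 = chose LL, 0 = chose SS."""
    return int(rng.random() < logistic(preference))

print(logistic(0), logistic(3))  # 0.5 and about 0.95
```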
We evaluate four models of intertemporal choice: exponential discounting (Samuelson, 1937), generalized hyperbolic discounting (Loewenstein & Prelec, 1992; Benhabib et al., 2010), the difference model (as described above), and a variant of the basic tradeoff model (Scholten & Read, 2013).
The exponential discounting model, hereafter Exp, has likelihood
$B \sim \operatorname{Bern}\!\left(\operatorname{logistic}\!\left\{-10\ln(1-\rho)\left[r_L\, v_{30}^{\,t_L/30} - r_S\, v_{30}^{\,t_S/30}\right]\right\}\right)$,
where B denotes the choice (1 for LL, 0 for SS), and priors $v_{30} \sim \operatorname{Unif}(0, 1)$ and $\rho \sim \operatorname{Unif}(0, 1)$. By construction, the parameter $v_{30}$ determines the discount factor at 30 days (e.g., $100 delayed by 30 days would be equivalent to $100 \cdot v_{30}$ dollars available immediately). The benchmark value of 30 days is a compromise between very short and very long delays, at which the uniform prior distribution of discount rates would be less plausible. The other parameter, ρ, is a noise parameter. When ρ is near 1, the subject is effectively deterministic, and almost always chooses the option with the higher discounted value. When ρ is near 0, the subject effectively ignores amounts and delays, and is just as likely to choose each option. The constant of 10 is to allow for a variety of different amounts of noise while keeping ρ itself in the round range of (0, 1).
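As a concrete illustration, here is Exp's choice probability in Python, following the likelihood as reconstructed above (in particular, the noise scaling −10 ln(1 − ρ) is our reading of the description):

```python
import math

def p_choose_ll_exp(t_s, r_s, t_l, r_l, v30, rho):
    """Exp: P(choose LL) for one quartet. v30 is the 30-day discount factor
    and rho is the noise parameter, both in (0, 1)."""
    noise = -10 * math.log(1 - rho)  # near 0: random; grows without bound: deterministic
    dv = r_l * v30 ** (t_l / 30) - r_s * v30 ** (t_s / 30)  # difference in discounted values
    return 1 / (1 + math.exp(-noise * dv))

# $50 today versus $70 in 60 days, for a subject with v30 = .8 and rho = .5:
print(p_choose_ll_exp(0, 50, 60, 70, v30=.8, rho=.5))  # nearly 0: SS strongly preferred
```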
The generalized hyperbolic discounting model, hereafter GH, has likelihood
$B \sim \operatorname{Bern}\!\left(\operatorname{logistic}\!\left\{-10\ln(1-\rho)\left[r_L\left(1 + t_L/b\right)^{-1/\alpha} - r_S\left(1 + t_S/b\right)^{-1/\alpha}\right]\right\}\right)$,
where $\alpha = 10/(1 - a) - 10$ and $b = 30 v_{30}^{\alpha} / \left(1 - v_{30}^{\alpha}\right)$, and priors $\rho \sim \operatorname{Unif}(0, 1)$, $v_{30} \sim \operatorname{Unif}(\epsilon, 1 - \epsilon)$, and $a \sim \operatorname{Unif}(\epsilon, 1 - \epsilon)$ for a small ε. The prior ranges of a and $v_{30}$ would be (0, 1), but they have been narrowed slightly to avoid floating-point underflow of the expression $v_{30}^{\alpha}$. The parameters ρ and $v_{30}$ are interpreted as before. The only new parameter, a, controls the curvature of the discount function, which determines the degree of dynamic inconsistency; see Figure 1.
The difference model, hereafter Diff, has likelihood
$B \sim \operatorname{Bern}\!\left(\operatorname{logistic}\!\left\{-10\ln(1-\rho)\left[(r_L - r_S) - e^{10d}(t_L - t_S)\right]\right\}\right)$
with priors $\rho \sim \operatorname{Unif}(0, 1)$ and $d \sim \operatorname{Unif}(-1, 1)$. Thus, this model is essentially a generalized linear model (with logistic link) that uses the difference between rewards and the difference between delays as predictors. ρ again functions as a noise parameter. d determines the tradeoff between dollars of reward and days of delay. When d = 0, a $1 increase in the difference of rewards is treated as equivalent to a 1-day decrease in the difference of delays. When d is near 1, the subject weights delay differences so heavily that she always chooses SS, and when d is near −1, the subject weights reward differences so heavily that she always chooses LL.
The variant of the basic tradeoff model, hereafter BT, has likelihood
$B \sim \operatorname{Bern}\!\left(\operatorname{logistic}\!\left[\frac{1}{\gamma}\ln\!\left(\frac{1 + \gamma r_L}{1 + \gamma r_S}\right) - \frac{1}{\tau}\ln\!\left(\frac{1 + \tau t_L}{1 + \tau t_S}\right)\right]\right)$,
where $\gamma = \left[-10\ln(1-\rho)\right]^{-1}$ and $\tau = \gamma e^{10d}$, and priors $\rho \sim \operatorname{Unif}(0, 1)$ and $d \sim \operatorname{Unif}(-1, 1)$. ρ and d play roughly analogous roles as in Diff. The model equation is derived from Equation 5 of Scholten and Read (2010) for the case of gains only, with the undefined functions $Q_{T|X}$ and $Q_{X|T}$ set to the identity function, and the suggested values of w and v from Equations 9 and 10 of Scholten and Read (2010).
Notice that Exp, Diff, and BT have two parameters each, whereas GH has three.
Here we describe how we fit our models to the subjects' data. We fit models both during the experiment, as part of the adaptive procedure, and after the data were collected to produce our final results. For simplicity, we fit models in the same way in both cases although the time constraints of the former case did not apply to the latter.
Three of the models (Exp, GH, and Diff) were fit with the Markov chain Monte Carlo (MCMC) sampler Stan (http://mc-stan.org). MCMC is a numerical method that approximately provides random samples from the posterior distribution of the parameters of a Bayesian model. We ran seven chains at a time, each initialized with a random draw from the model's prior. For each chain, 150 adaptive burn-in iterations (250 for GH) were discarded and 50 non-adaptive sampling iterations were kept. Convergence was declared when all parameters' Gelman-Rubin diagnostics (Gelman & Rubin, 1992) fell below 1.1. When the chains failed to converge, fitting was attempted again with twice as many burn-in iterations and twice as many sampling iterations. This was repeated as necessary, but only up to a preset maximum number of rounds to keep computation time low.1 If the convergence criterion still was not achieved, we used the posterior samples from the last run anyway. In any case, we were left with 50 samples from each chain (for a total of 350) for each subject (if fitting had progressed past round 1, the 50 samples we used were evenly spaced among the samples collected in the last round).
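The retry logic can be sketched as follows. Here, fit_once is a hypothetical stand-in for a call to Stan that returns posterior draws and the Gelman-Rubin diagnostics; the thinning to 50 evenly spaced samples per chain is omitted.

```python
def fit_with_restarts(fit_once, burnin=150, samples=50, max_rounds=9):
    """Fit by MCMC, doubling the iteration counts upon non-convergence,
    up to a preset maximum number of rounds."""
    for _ in range(max_rounds):
        draws, rhats = fit_once(burnin, samples)
        if all(r < 1.1 for r in rhats):  # Gelman-Rubin convergence criterion
            return draws
        burnin, samples = 2 * burnin, 2 * samples
    return draws  # criterion never met: use the samples from the last run anyway
```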
The BT model experienced convergence difficulties with MCMC, so we fit it with grid approximation instead. The grid had 40,000 points, with each of the two dimensions (d and ρ) taking 200 evenly spaced values. We drew 150 samples from the grid per subject.
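A sketch of the grid approximation for BT follows; details such as the treatment of the endpoints of the prior ranges are our assumptions.

```python
import numpy as np

def grid_posterior(loglik, n=200):
    """Posterior over BT's two parameters on an n-by-n grid.
    loglik(d, rho) -> log likelihood of the subject's observed choices."""
    d_vals = np.linspace(-1, 1, n)             # assumed prior range for d
    rho_vals = np.linspace(0, 1, n + 2)[1:-1]  # interior points only, avoiding 0 and 1
    logp = np.array([[loglik(d, r) for r in rho_vals] for d in d_vals])
    p = np.exp(logp - logp.max())              # flat priors: posterior proportional to likelihood
    return d_vals, rho_vals, p / p.sum()

def sample_grid(d_vals, rho_vals, p, size=150, seed=0):
    """Draw posterior samples, weighting grid cells by their posterior mass."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(p.size, size=size, p=p.ravel())
    i, j = np.unravel_index(idx, p.shape)
    return d_vals[i], rho_vals[j]
```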
The test set comprised 100 quartets that were randomly drawn from a standard pool of 33,600 unique quartets. All subjects were given the same test set, although the order of presentation was randomized per subject. The standard pool was constructed as a Cartesian product of eligible SS rewards, SS delays, LL rewards, and LL delays (a code sketch follows the list below).
- Delays took the following values: 1, 2, 3, 4, 5, and 10 days; 1, 2, 3, and 6 weeks; and 1, 2, 3, and 4 months.2 A week was modeled as 7 days and a month was modeled as 30.4375 days, although all delays were displayed to subjects in the units shown here. For SS delays only, the value of 0 (displayed as "today") was also used. LL delays were constrained to be strictly greater than SS delays.
- SS rewards varied from $5 to $100 in $5 increments.
- LL rewards were defined relative to the corresponding SS reward. They could be $1 greater or between $5 and $80 greater in $5 increments. For the purpose of constructing the test set, $5 and $10 were represented twice as frequently as each of the other amounts to increase the number of nontrivial decisions, in which the difference between rewards was not so great as to make LL clearly more desirable.
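The pool construction can be sketched as below. The exact bookkeeping behind the stated count of 33,600 unique quartets is not fully specified above, so the sketch treats the doubled representation of the $5 and $10 premiums as a sampling weight rather than as duplicate pool entries.

```python
from itertools import product

WEEK, MONTH = 7, 30.4375
DELAYS = [1, 2, 3, 4, 5, 10,
          1 * WEEK, 2 * WEEK, 3 * WEEK, 6 * WEEK,
          1 * MONTH, 2 * MONTH, 3 * MONTH, 4 * MONTH]
SS_DELAYS = [0] + DELAYS                   # 0 is displayed as "today"
SS_REWARDS = range(5, 101, 5)              # $5 to $100 in $5 increments
LL_PREMIUMS = [1] + list(range(5, 81, 5))  # $1, or $5 to $80 in $5 increments

pool = [(r_s, t_s, r_s + premium, t_l)
        for r_s, premium, t_s, t_l in product(SS_REWARDS, LL_PREMIUMS,
                                              SS_DELAYS, DELAYS)
        if t_l > t_s]                      # LL strictly later than SS
```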
After subjects completed the test set, they were randomly assigned one of the four models and completed 50 additional quartets to adaptively estimate the parameters of the selected model. We now describe the algorithm we used to decide which quartet to present to a subject given that subject's choices in the adaptive procedure so far. Conceptually, each iteration of the algorithm begins by identifying a region of the parameter space that is minimally credible given the data already observed (step 1 below), then finds the two "most different" parameter vectors in this region (steps 2 and 3) and selects the quartet that best differentiates these vectors (steps 4 and 5). Here is a complete description of each iteration, with a code sketch following the list:3
- Sample from the posterior distribution of the parameter vector θ. We do this using MCMC or grid approximation as described above. Because this is a finite sample in which regions of θ are represented in proportion to their credibility, very improbable values of θ are unlikely to be included.
- Scale each parameter linearly so the maximum in the sample is 1 and the minimum is 0. The purpose of this operation is to weight all the parameters equally for the next step.
- Find the two values of θ in the sample with the greatest Euclidean distance. Call them $t_1$ and $t_2$.
- Loop over each of the 33,600 quartets in the standard pool (described in the section "Test set" above), excluding the previous 10 quartets presented to this subject in the adaptive procedure. For each quartet q, compute $d_q = |p(L \mid \theta = t_1) - p(L \mid \theta = t_2)|$, where p(L) is the probability of choosing LL; that is, the difference between the probability of choosing LL assuming that $\theta = t_1$ and the probability of choosing LL assuming that $\theta = t_2$.
- Select the quartet q that maximizes $d_q$ and present this quartet to the subject.
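The whole iteration can be sketched compactly. Below, p_ll is a hypothetical stand-in for the assigned model's choice-probability function, posterior_draws comes from the fitting procedures described earlier (step 1), and recent holds the previous 10 quartets.

```python
import numpy as np

def next_quartet(posterior_draws, pool, p_ll, recent):
    """One iteration of the adaptive procedure (steps 2-5 above)."""
    draws = np.asarray(posterior_draws, dtype=float)
    # Step 2: rescale each parameter to [0, 1] so all parameters weigh equally.
    lo, hi = draws.min(axis=0), draws.max(axis=0)
    scaled = (draws - lo) / np.where(hi > lo, hi - lo, 1)
    # Step 3: find the two most mutually distant draws in the scaled space.
    dists = np.linalg.norm(scaled[:, None, :] - scaled[None, :, :], axis=-1)
    i, j = np.unravel_index(dists.argmax(), dists.shape)
    t1, t2 = draws[i], draws[j]
    # Steps 4 and 5: pick the eligible quartet that best separates t1 from t2.
    candidates = [q for q in pool if q not in recent]
    return max(candidates, key=lambda q: abs(p_ll(q, t1) - p_ll(q, t2)))
```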
Informal exploration suggests that, given a simulated decision-maker operating according to any of our four models and realistic parameter values, our procedure indeed identifies the true values with reasonable posterior precision within 50 trials. We have not proved our procedure is optimal, but it is similar in spirit to that of Toubia et al. (2013). One major difference from Toubia et al. is that we performed our adaptive computations in real time instead of a priori, since we used 50 adaptive trials per subject (compared to 20 in Toubia et al.), and a lookup table with $2^{49}$ (about 600 trillion) entries would have been infeasible.
Of 207 subjects, most (159) were users of Amazon Mechanical Turk who lived in the United States, and the remainder (48) were students at Stony Brook University run in the laboratory. Slightly over half (106, or 51%) of subjects were female. The median age was 27 (95% equal-tailed interval 18–65). Students were compensated with course credit, and Mechanical Turk users were compensated with $0.50 (65 subjects) or $0.25 (94 subjects).
After providing informed consent, subjects completed the test set with the 100 items presented in shuffled order. Interspersed within the 100 items were three catch trials in fixed positions:
- Trial 4 was $25 in 5 days versus $20 in 1 week.
- Trial 51 was $60 today versus $60 in 3 days.
- The last trial was $55 in 2 weeks versus $40 in 3 weeks.
These catch trials were not used for fitting or evaluating models.
After the test set, each subject was randomly assigned one of the four models—47 to Exp, 52 to GH, 43 to Diff, and 45 to BT—and completed the adaptive training set. Subjects were not notified of the transition from testing to training. The adaptive procedure was always administered after the test set to prevent carryover effects from the adaptive procedure (particularly, the subject's assigned model) into the test set.
Every 40 trials, the task program said "Feel free to take a break before continuing". Subjects could continue by clicking a button.
To quantify the accuracy of our models, we use two complementary metrics: the number of correctly predicted choices and root mean square error (RMSE). The number of correctly predicted choices is coarse—it is insensitive to, for example, the models' choice functions, that is, the way they translate choice preferences to choice probabilities—and therefore robust and easy to interpret. RMSE, on the other hand, provides a stronger test of the models' exact predictions by favoring predicted choice probabilities that are closer to the observed choices. For example, when the observed choice is LL, a predicted choice probability of .9 is treated as more accurate than a predicted choice probability of .8. RMSE can be interpreted as the average distance between the observed choices and the predicted choice probabilities.
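Both metrics are simple to compute. In this sketch, a prediction counts as correct when the predicted probability falls on the same side of .5 as the observed choice; that thresholding rule is our assumption.

```python
import numpy as np

def accuracy_scores(choices, p_ll):
    """choices: observed 0/1 choices (1 = LL); p_ll: predicted P(LL) per trial."""
    choices = np.asarray(choices, dtype=float)
    p_ll = np.asarray(p_ll, dtype=float)
    n_correct = int(np.sum((p_ll > .5) == (choices == 1)))
    rmse = float(np.sqrt(np.mean((p_ll - choices) ** 2)))
    return n_correct, rmse

print(accuracy_scores([1, 0, 1], [.9, .2, .4]))  # (2, about .37)
```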
Once each subject's "accuracy scores" (correct-prediction count and RMSE) are computed, there is the question of how to actually compare the models. We use confidence intervals rather than significance tests to allow for continuous rather than discrete judgments of relative model performance. To avoid making strong assumptions about population distributions, we use bootstrapping, a frequentist nonparametric method (see Efron & Tibshirani, 1993 for an introduction). We bootstrap by producing, for each model, an accuracy distribution, which is the resampling distribution of that model's set of per-subject accuracy scores. We report 95% confidence intervals for each model's mean accuracy score. Each confidence interval is computed by taking the quantiles of the accuracy distribution. We also report the confidence that each model's mean accuracy score exceeds the mean accuracy score of each of the other models. Each of these confidence figures is computed by first calculating the difference distribution of the accuracy distributions. Then we compute the proportion of the difference distribution that is greater than 0, and this proportion is taken as the confidence.
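A sketch of this bootstrap, with each model's per-subject scores resampled independently (the names and the number of resamples are ours):

```python
import numpy as np

def bootstrap_comparison(scores_a, scores_b, n_boot=10_000, seed=0):
    """95% percentile CIs for two models' mean accuracy scores, plus the
    confidence that model A's mean exceeds model B's."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    mean_a = rng.choice(a, (n_boot, a.size)).mean(axis=1)  # accuracy distribution of A
    mean_b = rng.choice(b, (n_boot, b.size)).mean(axis=1)  # accuracy distribution of B
    ci_a = np.percentile(mean_a, [2.5, 97.5])
    ci_b = np.percentile(mean_b, [2.5, 97.5])
    confidence = float((mean_a - mean_b > 0).mean())       # difference distribution > 0
    return ci_a, ci_b, confidence
```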
We conduct two analyses. In the first, we estimate model parameters and assess model performance using the same dataset, namely, the test set. In the second, we estimate parameters with the training set and assess performance with the test set. It would be possible in the first, but not the second, to perform within-subject comparisons of the models. We avoid this for the sake of comparability of the analyses.
Subjects were excluded if they chose the dominated option for at least two of the catch trials (n = 1), completed the test set in less than 4 minutes (n = 11), or chose LL for every test trial (n = 9, including one subject who also took less than 4 minutes). Thus, 187 subjects were included in the following analyses.
During the adaptive procedure, we limited the total number of MCMC iterations per trial, as described in "Model-fitting" above. Across 6,850 adaptive trials, 205 (3%) were aborted before convergence. Most subjects had fewer than 5 aborts (out of 50 adaptive trials), but one Diff subject had 5, four Diff subjects had 6, and three GH subjects had 6, 9, and 16, respectively.
For the first, "test-to-test" analysis, we estimate model parameters and evaluate models using the same dataset, namely, the test set. This analysis allows using the full sample of subjects with every model, and it is representative of model evaluation methods employed in past research, but, by our earlier arguments, it is not valid as a test of predictive accuracy.
Figure 4 shows the mean number of correct predictions for each model. Table 1 shows the confidence for each comparison of means. Observe that every model's mean achieves approximately 85 correct predictions out of the 100 total trials. The overall ranking has GH the most accurate, followed by BT, Exp, and Diff. However, the confidences associated with each of these comparisons are not high; only the comparison between GH and Diff achieves a confidence of 95%.
Figure 5 displays root mean square error (RMSE) between predicted probabilities and observed choices. All models have mean RMSE around .3. Table 2 shows the confidence associated with each comparison. The ranking of models is the same, albeit this time with somewhat higher confidences.
The foregoing analyses suggest that model performance was, roughly speaking, proportional to model complexity. The sophisticated BT model and the three-parameter GH model outperformed the simpler Exp and Diff models. As mentioned above, a conventional approach to comparing models of differing complexity is to use a measure of model performance that includes a complexity penalty, such as the AIC (Akaike, 1973). We calculated the AIC for each subject under each model as the maximal log likelihood minus the number of parameters. Even with this penalty for complexity, the results are similar, with the most complex model (viz., GH) still performing best (see Figure 6 and Table 3).
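Under the textbook definition AIC = 2k − 2 ln L̂, the per-subject score used here equals −AIC/2, so higher values indicate better penalized fit:

```python
def aic_score(max_log_lik, n_params):
    """Per-subject score: maximal log likelihood minus the parameter count.
    Equivalent to -AIC/2 under AIC = 2k - 2 ln(L-hat)."""
    return max_log_lik - n_params

print(aic_score(-55.2, 3))  # -58.2 (hypothetical values)
```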
Overall, this analysis suggests the four models are of roughly equivalent quality. If the models are to be ranked at all, the correct ranking is GH, BT, Exp, Diff, with GH achieving the best performance. However, confidence is generally low, and the magnitudes of the differences between models are not large.
For the second, "train-to-test" analysis, we estimate model parameters using data from the adaptive training set, then evaluate predictions for choices in the test set.
Before comparing the models' performance to each other, it is worth checking that the models' overall performance was minimally acceptable. Without this initial evaluation, the winning model might be better characterized as the least inadequate of four poor models rather than the best of four reasonable models. To this end, we computed for each subject d = c − m, where c is the number of correctly predicted trials, and m, being the count of that subject's modal choice in the test set, is a measure of baseline performance.4 The requirement that d > 0 is a stronger condition than above-chance performance (i.e., that c > 50). We indeed obtained d > 0 for 84% of subjects, while d was exactly 0 for 11% and negative for 4%. Improvement over m was not trivial: the median of d was 10 trials (compared to a theoretical maximum of 50). These results suggest that our overall method, including our choice of models and the adaptive procedure itself, resulted in models, training data, and parameter estimates capable of predicting unseen data with reasonable accuracy.
Figure 7 shows the mean number of correct predictions for each model, and Table 4 shows the confidence associated with each comparison of means. Compared to the test-to-test analysis, the confidence intervals for each mean are wider, not least because the sample size for each mean is a quarter as large. Critically, the ranking of the four models has changed from that observed in the test-to-test analysis, and differences between models are more dramatic, because the models are now being penalized for mistaking noise as signal. The discounting models, Exp and GH, are most accurate, with 86% and 84% (respectively) of choices predicted correctly, while BT is at 81% and Diff is at 74%. While we cannot state unequivocally that Exp outperformed GH, we can safely conclude that GH did not outperform Exp. This finding highlights a strength of separating test sets and training sets, which is that nested models can be compared properly. In the test-to-test analysis, by contrast, GH, being strictly more inclusive than Exp, was certain to perform at least as well, no matter how unhelpful its extra parameter may be for predicting new data.
Our findings for RMSE are similar to those for numbers of correct predictions. As can be seen in Figure 8 and Table 5, Exp and GH had the least error (with Exp slightly and not confidently ahead of GH), BT performed worse, and Diff performed worst.
It is instructive to compare the model performance obtained from the current analysis (train-to-test) to that obtained from the previous one (test-to-test). In general, performance is worse, which is to be expected considering that under the train-to-test analysis, the models are predicting new, unobserved data and are therefore penalized for mistaking noise in the training data as information. However, not all models were affected equally. In particular, Exp, which (along with GH) performed best under train-to-test, experienced the smallest drop in performance. In contrast, Diff, which was the worst performer in both analyses, experienced the greatest drop in performance. BT also saw a substantial drop. These observations suggest that Diff, BT, and GH are more vulnerable to overfitting than Exp.
Estimates of discount-function curvature obtained from GH are shown in Figure 9. Note, however, that GH's lack of improvement over Exp implies these estimates are not necessarily meaningful.
We first compared four models of intertemporal choice with a conventional approach: parameters were estimated and performance was assessed with the same data. A highly general discounting model (Loewenstein & Prelec, 1992) and a sophisticated non-discounting model (Scholten & Read, 2013) outperformed two simpler models, although differences in performance were generally small. In our second analysis, we evaluated the predictive accuracy of the same set of models by estimating parameters and assessing performance with separate datasets. Furthermore, training datasets were adaptively constructed to estimate each model's parameters precisely. We found that the generalized hyperbolic model was no more accurate at predicting choices than the classic exponential discounting model, and both of these models were substantially more accurate than the non-discounting models.
The nature of prediction
If inclusiveness and predictive accuracy are at odds with each other, which should researchers pursue? More generally, should the goal of science be explanation or prediction? We suggest there are good reasons to prefer prediction. The value of, for example, construing the goal of psychology as "the prediction and control of behavior" (Watson, 1913) is that the scientific merit of a psychological theory becomes a measurable quality of practical value. It is desirable for a theory to explain what is already known, but only insofar as these explanations allow behavior to be predicted. The danger of pursuing post hoc explanation for its own sake is that such a practice may end up with theories that can explain anything and predict nothing. A philosophy of pursuing prediction is consistent with both falsifiability (since predictions are what make theories falsifiable) and Occam's razor (since superfluous complexity decreases predictive power).
Considering the theoretical and practical appeal of prediction, one might ask why it has been largely neglected by research in intertemporal choice. It turns out that the preference for parameter estimation over prediction is widespread across many scientific disciplines.
Prediction was the earliest and most prevalent form of statistical inference. This emphasis changed during the beginning of this century when the mathematical foundations of modern statistics emerged. Important issues such as sampling from well-defined statistical models and clarification between statistics and parameters began to dominate the attention of statisticians. This resulted in a major shift of emphasis to parametric estimation and testing. (Geisser, 1993, p. xi)
We speculate that the mathematical and computational difficulty of prediction is responsible for its lack of popularity. Confidence intervals are difficult to understand, prediction intervals for a single future observation are harder, and tolerance intervals, probably the most powerful predictive tool in the frequentist toolbox, are harder still. Bayesian methods have their own issues of this kind, not the least of which is that MCMC has only become computationally feasible in recent decades.
Champions of prediction include Geisser (quoted above), Milton Friedman, de Finetti (1974), and, outside research, Silver (2012). Friedman (1953) wrote "…theory is to be judged by its predictive power for the class of phenomena which it is intended to 'explain'", but also "Truly important and significant hypotheses will be found to have 'assumptions' that are wildly inaccurate descriptive representations of reality, and, in general, the more significant the theory, the more unrealistic the assumptions (in this sense)." This latter point relates to a surprising feature of prediction, which is that it is possible for models with false assumptions to be more accurate than more realistic models. Domingos and Pazzani (1997) show that, even with simulated data generated from a known, complex model, a simpler model can perform better than the true model with realistic amounts of training data. It follows that although our study has vindicated exponential discounting as a tool for predicting choices, it has not shown that exponential discounting is, in fact, how people make decisions. On the contrary, the "wildly inaccurate" assumptions of exponential discounting may have been a large part of its success. The poor performance of the difference model, on the other hand—which predicts people will treat the difference between today and tomorrow the same as twenty years from now and twenty years and a day from now—underscores the obvious point that inaccurate assumptions do not suffice for accurate prediction. We suggest that exponential discounting represents a successful compromise between models that are too simple-minded to characterize decision-making, and models whose sophistication leads to overfitting and inefficiency.
This conflict between realistic and accurate models is also relevant to the growing popularity of decision neuroscience. Glimcher (2011) motivates the use of neuroscience to study decision-making by pointing out that a theory that makes neural, as opposed to merely behavioral, predictions can be tested more richly, "with both neurobiological and behavioral tools" (p. 132). Glimcher makes a strong, straightforward case that neuroscience must be a part of any research program seeking a comprehensive description of how people make decisions. He also suggests that neurally grounded theories will be helpful for predicting behavior. However, neurally realistic models are likely to be more complex than purely behavioral models, which may make them inferior for prediction of behavior given realistic amounts of training data. Thus, if the goal is to predict behavior, enthusiasm for decision neuroscience should be tempered with awareness that its progress may not help this goal.
We hope future research continues to pursue the question of how to predict intertemporal choices, if only to help settle longstanding controversy as to which model is best. A focus on predictive accuracy could also help the study of choice under uncertainty, which, being another prominent domain of behavioral economics, is also subject to modeling controversies (e.g., Birnbaum, 2008; van Gelder, de Vries, & van der Pligt, 2009; for a recent study examining predictive accuracy, see Glöckner & Pachur, 2012).
Limitations and future directions
Our study had several limitations. First, our decision-making scenario was not incentive-compatible. That is, we did not give subjects financial motivation to answer our questions honestly. However, past studies have found little difference in subjects' intertemporal decision-making between tasks with real rewards and tasks with hypothetical rewards (e.g., Johnson & Bickel, 2002; Madden, Begotka, Raiff, & Kastern, 2003; Madden et al., 2004; Lagorio & Madden, 2005). Therefore, we believe that using real rewards would not have affected the results of our study. Furthermore, the good predictive performance of some of our models implies that subjects' responses were not purely noise.
All of our models in this study used the same choice function (i.e., mapping from preferences to choice probabilities), namely the logistic function. Other choice functions have been proposed, and it is not clear which is best. Another commonly used choice function, the quantile function for the standard normal distribution, also known as the probit function, would presumably have yielded similar results to ours, since "in practice, the probit and logit models are quite similar" (Gelman et al., 2004, p. 417). It is more difficult to predict the results of using more exotic choice functions, such as one positing a per-subject constant probability of choosing at random. The same goes for abandoning choice functions per se and positing variability in preferences rather than variability in choices (Regenwetter & Davis-Stober, 2012; see also Loomes & Sugden, 1995). The concordance of our results for correct-choice counts and RMSE, which differ in their sensitivity to the choice function, is some evidence against our results being entirely dependent on the logistic function. Future studies comparing the predictive validity of a more homogeneous set of models should address the issue of choice rules more directly.
Past research on intertemporal choice has often focused on the distinction between exponential and hyperbolic discounting. It may then seem strange that we did not include the classic hyperbolic model (Mazur, 1987) in our analyses. Traditionally, the hyperbolic model has been endorsed on the grounds of being a correct description of how people discount, but more recent studies have falsified it (Myerson & Green, 1995; Read, Frederick, & Airoldi, 2012; Luhmann, 2013). Even the most prominent advocates of hyperbolic models now endorse more complex generalizations thereof (Myerson & Green, 1995; McKerchar et al., 2009; McKerchar, Green, & Myerson, 2010). Partly for this reason, we included the generalized hyperbolic model. The similar performance of the generalized hyperbolic model and the exponential model under the train-to-test analysis suggests that a hyperbolic model would have performed similarly to both, since a hyperbolic model is nested within the generalized hyperbolic model.
Our philosophy in designing our test set was to be theoretically agnostic: we aimed for diversity over the representation of any theoretically relevant stimuli. Still, it is reasonable to ask how our results would have differed with different test sets. For example, only 21 of the 100 quartets in our test set had an immediate option (i.e., had an SS delay of 0). If immediate options had been represented more heavily, perhaps we would have seen a different ranking of models. Immediate outcomes are thought to have special importance (Mischel, Shoda, & Rodriguez, 1989); for example, the quasi-hyperbolic model (Laibson, 1997) treats delays of 0 differently from positive delays. Researchers may disagree as to which test set is best, and it would not be clear how to reconcile different results obtained from studies with different test sets. The only real solution is to move out of the laboratory and examine external validity. After all, predicting behavior in laboratory tasks is only interesting insofar as laboratory behavior relates to real-life behavior. Ultimately, research on decisions about eating, spending, and drug use should use the model that can best predict the decisions of interest.
Another fruitful avenue for future research is to examine additional models of intertemporal choice. Even if one infers from the present study that discounting models are better than non-discounting models, there remains the question of which discounting model is best. Investigators could examine other, less prominent models (Doyle, 2013) or try to rehabilitate the generalized hyperbolic model (whose performance apparently suffers from overfitting or inefficiency), perhaps by using more informative priors.
Finally, although we have emphasized the ways in which explanatory and predictive goals can be at odds with each other, we hope that future research considers both. The differing, almost antagonistic perspectives of these goals mean they may actually serve complementary purposes in the comparison and refinement of models and theories.
Akaike, H. (1973). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. doi:10.1109/TAC.1974.1100705
Beck, R. C., & Triplett, M. F. (2009). Test–retest reliability of a group-administered paper–pencil measure of delay discounting. Experimental and Clinical Psychopharmacology, 17, 345–355. doi:10.1037/a0017078
Benhabib, J., Bisin, A., & Schotter, A. (2010). Present-bias, quasi-hyperbolic discounting, and fixed costs. Games and Economic Behavior, 69(2), 205–223. doi:10.1016/j.geb.2009.11.003
Birnbaum, M. H. (2008). Evaluation of the priority heuristic as a descriptive model of risky decision making: Comment on Brandstätter, Gigerenzer, and Hertwig (2006). Psychological Review, 115(1), 253–260. doi:10.1037/0033-295X.115.1.253
Chung, S.-H., & Herrnstein, R. J. (1967). Choice and delay of reinforcement. Journal of the Experimental Analysis of Behavior, 10(1), 67–74. doi:10.1901/jeab.1967.10-67
de Finetti, B. (1974). Theory of probability. New York, NY: Wiley. ISBN 0-471-20141-3.
Demurie, E., Roeyers, H., Baeyens, D., & Sonuga‐Barke, E. (2012). Temporal discounting of monetary rewards in children and adolescents with ADHD and autism spectrum disorders. Developmental Science, 15(6), 791–800. doi:10.1111/j.1467-7687.2012.01178.x
Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3), 103–130. doi:10.1023/A:1007413511361
Doyle, J. R. (2013). Survey of time preference, delay discounting models. Judgment and Decision Making, 8(2), 116–135. Retrieved from http://journal.sjdm.org/12/12309/jdm12309.html
Doyle, J. R., & Chen, C. H. (2012). The wages of waiting and simple models of delay discounting. doi:10.2139/ssrn.2008283
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall. ISBN 978-0-412-04231-7.
Fang, Y. (2011). Asymptotic equivalence between cross-validations and Akaike information criteria in mixed-effects models. Journal of Data Science, 9, 15–21.
Friedman, M. (1953). The methodology of positive economics. In Essays in positive economics (pp. 3–43). Chicago: University of Chicago Press.
Geisser, S. (1993). Predictive inference: An introduction. New York, NY: Chapman & Hall. ISBN 978-0-412-03471-8.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472. doi:10.1214/ss/1177011136
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC. ISBN 1-58488-388-X.
Glimcher, P. W. (2011). Because, not as if. In Foundations of neuroeconomic analysis (pp. 125–139). Oxford, England: Oxford University Press. ISBN 978-0-19-974425-1.
Glöckner, A., & Pachur, T. (2012). Cognitive models of risky choice: Parameter stability and predictive accuracy of prospect theory. Cognition, 123(1), 21–32. doi:10.1016/j.cognition.2011.12.002
Green, L., Fristoe, N., & Myerson, J. (1994). Temporal discounting and preference reversals in choice between delayed outcomes. Psychonomic Bulletin and Review, 1(3), 383–389. doi:10.3758/BF03213979
Johnson, M. W., & Bickel, W. K. (2002). Within-subject comparison of real and hypothetical money rewards in delay discounting. Journal of the Experimental Analysis of Behavior, 77(2), 129–146. doi:10.1901/jeab.2002.77-129
Kirby, K. N., & Herrnstein, R. J. (1995). Preference reversals due to myopic discounting of delayed reward. Psychological Science, 6(2), 83–89. doi:10.1111/j.1467-9280.1995.tb00311.x
Kirby, K. N., & Maraković, N. N. (1995). Modeling myopic decisions: Evidence for hyperbolic delay-discounting within subjects and amounts. Organizational Behavior and Human Decision Processes, 64(1), 22–30. doi:10.1006/obhd.1995.1086
Koopmans, T. C. (1960). Stationary ordinal utility and impatience. Econometrica, 28(2), 287–309.
Kruschke, J. K. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. San Diego, CA: Elsevier Academic Press. ISBN 978-0-12-381485-2.
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. doi:10.1037/a0029146
Lagorio, C. H., & Madden, G. J. (2005). Delay discounting of real and hypothetical rewards III: Steady-state assessments, forced-choice trials, and all real rewards. Behavioural Processes, 69(2), 173–187. doi:10.1016/j.beproc.2005.02.003
Laibson, D. (1997). Golden eggs and hyperbolic discounting. Quarterly Journal of Economics, 112(2), 443–477. doi:10.1162/003355397555253
Loewenstein, G., & Prelec, D. (1992). Anomalies in intertemporal choice: Evidence and an interpretation. Quarterly Journal of Economics, 107(2), 573–597. doi:10.2307/2118482
Loomes, G., & Sugden, R. (1995). Incorporating a stochastic element into decision theories. European Economic Review, 39(3–4), 641–648. doi:10.1016/0014-2921(94)00071-7
Luhmann, C. C. (2013). Discounting of delayed rewards is not hyperbolic. Journal of Experimental Psychology: Learning, Memory, and Cognition. Advance online publication. doi:10.1037/a0031170
Luhmann, C. C., & Trimber, E. M. (2013). Fighting temptation: The relationship between executive control and self control. Submitted.
Madden, G. J., Begotka, A. M., Raiff, B. R., & Kastern, L. L. (2003). Delay discounting of real and hypothetical rewards. Experimental and Clinical Psychopharmacology, 11(2), 139–145. doi:10.1037/1064-1297.11.2.139
Madden, G. J., Bickel, W. K., & Jacobs, E. A. (1999). Discounting of delayed rewards in opioid-dependent outpatients: Exponential or hyperbolic discounting functions? Experimental and Clinical Psychopharmacology, 7(3), 284–293. doi:10.1037/1064-1297.7.3.284
Madden, G. J., Petry, N. M., Badger, G. J., & Bickel, W. K. (1997). Impulsive and self-control choices in opioid-dependent patients and non-drug-using control participants: Drug and monetary rewards. Experimental and Clinical Psychopharmacology, 5(3), 256–262. doi:10.1037/1064-1297.5.3.256
Madden, G. J., Raiff, B. R., Lagorio, C. H., Begotka, A. M., Mueller, A. M., Hehli, D. J., & Wegener, A. A. (2004). Delay discounting of potentially real and hypothetical rewards II: Between- and within-subject comparisons. Experimental and Clinical Psychopharmacology, 12(4), 251–261. doi:10.1037/1064-1297.12.4.251
Mazur, J. E. (1987). An adjusting procedure for studying delayed reinforcement. In M. L. Commons, J. E. Mazur, J. A. Nevin, & H. Rachlin (Eds.), The effect of delay and of intervening events on reinforcement value (pp. 55–73). Hillsdale, NJ: Lawrence Erlbaum. ISBN 0-89859-800-1.
McClure, S. M., Laibson, D. I., Loewenstein, G., & Cohen, J. D. (2004). Separate neural systems value immediate and delayed monetary rewards. Science, 306(5695), 503–507. doi:10.1126/science.1100907
McKerchar, T. L., Green, L., & Myerson, J. (2010). On the scaling interpretation of exponents in hyperboloid models of delay and probability discounting. Behavioural Processes, 84(1), 440–444. doi:10.1016/j.beproc.2010.01.003
McKerchar, T. L., Green, L., Myerson, J., Pickford, T. S., Hill, J. C., & Stout, S. C. (2009). A comparison of four models of delay discounting in humans. Behavioural Processes, 81(2), 256–259. doi:10.1016/j.beproc.2008.12.017
Meier, S., & Sprenger, C. (2010). Present-biased preferences and credit card borrowing. American Economic Journal: Applied Economics, 2(1), 193–210. doi:10.1257/app.2.1.193
Mischel, W., Shoda, Y., & Rodriguez, M. L. (1989). Delay of gratification in children. Science, 244(4907), 933–938. doi:10.1126/science.2658056
Myerson, J., & Green, L. (1995). Discounting of delayed rewards: Models of individual choice. Journal of the Experimental Analysis of Behavior, 64(3), 263–276. doi:10.1901/jeab.1995.64-263
Peters, J., Miedl, S. F., & Büchel, C. (2012). Formal comparison of dual-parameter temporal discounting models in controls and pathological gamblers. PLOS ONE. doi:10.1371/journal.pone.0047225
Rachlin, H. (1995). Self-control: Beyond commitment. Behavioral and Brain Sciences, 18(1), 109–159. doi:10.1017/S0140525X00037602
Rachlin, H., & Green, L. (1972). Commitment, choice and self-control. Journal of the Experimental Analysis of Behavior, 17(1), 15–22. doi:10.1901/jeab.1972.17-15
Read, D., & Read, N. L. (2004). Time discounting over the lifespan. Organizational Behavior and Human Decision Processes, 94(1), 22–32. doi:10.1016/j.obhdp.2004.01.002
Read, D., Frederick, S., & Airoldi, M. (2012). Four days later in Cincinnati: Longitudinal tests of hyperbolic discounting. Acta Psychologica, 140(2), 177–185. doi:10.1016/j.actpsy.2012.02.010
Regenwetter, M., & Davis-Stober, C. P. (2012). Behavioral variability of choices versus structural inconsistency of preferences. Psychological Review, 119(2), 408–416. doi:10.1037/a0027372
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5), 465–471. doi:10.1016/0005-1098(78)90005-5
Rodriguez, M. L., & Logue, A. W. (1988). Adjusting delay to reinforcement: Comparing choice in pigeons and humans. Journal of Experimental Psychology: Animal Behavior Processes, 14(1), 105–117. doi:10.1037/0097-7403.14.1.105
Samuelson, P. A. (1937). A note on measurement of utility. Review of Economic Studies, 4(2), 155–161. doi:10.2307/2967612
Scholten, M., & Read, D. (2010). The psychology of intertemporal tradeoffs. Psychological Review, 117(3), 925–944. doi:10.1037/a0019619. Retrieved from http://repositorio.ispa.pt/bitstream/10400.12/580/1/rev-117-3-925.pdf
Scholten, M., & Read, D. (2013). Time and outcome framing in intertemporal tradeoffs. Journal of Experimental Psychology: Learning, Memory, and Cognition. Advance online publication. doi:10.1037/a0031171
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464. doi:10.1214/aos/1176344136
Silver, N. (2012). The signal and the noise: Why so many predictions fail—but some don't. New York, NY: Penguin Press. ISBN 978-1-59420-411-1.
Sutter, M., Kocher, M. G., Rützler, D., & Trautmann, S. T. (2010). Impatience and uncertainty: Experimental decisions predict adolescents' field behavior (Discussion Paper No. 5404). Institute for the Study of Labor. Retrieved from http://ftp.iza.org/dp5404.pdf
Toubia, O., Johnson, E., Evgeniou, T., & Delquié, P. (2013). Dynamic experiments for estimating preferences: An adaptive method of eliciting time and risk parameters. Management Science, 59(3), 613–640. doi:10.1287/mnsc.1120.1570
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5, 207–232. doi:10.1016/0010-0285(73)90033-9
Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science, 211, 453–458. doi:10.1126/science.7455683
van Gelder, J.-L., de Vries, R. E., & van der Pligt, J. (2009). Evaluating a dual-process model of risk: Affect and cognition as determinants of risky choice. Journal of Behavioral Decision Making, 22(1), 45–61. doi:10.1002/bdm.610
Watson, J. B. (1913). Psychology as the behaviorist views it. Psychological Review, 20(2), 158–177. doi:10.1037/h0074428
The maximum was 9 rounds for Exp and Diff and 6 for GH. The difference was meant to roughly equate computation time across models, since models needed to be fitted in real time for the adaptive procedure. In the analyses reported in our results sections, the maximum was reached only once, under train-to-test for a subject who was assigned GH.
A programming oversight included an additional delay value, 8 weeks, in the test set but not in the pool used by the adaptive procedure.
In the case of the first training trial, we deviated from this description slightly: t1 and t2 were simply drawn at random from the prior distribution of θ, and the algorithm proceeded from step 4. We used this special base case only for convenience of implementation.
It is easy to prove that m is equivalent to the mean performance of a naive model that always predicts the choice it has seen most often under leave-one-out cross-validation with the test set, except in the edge case when m = 50.