Donkey notebook

Created 25 Sep 2018 • Last modified 4 Oct 2021

Task planning

I want a bare-bones trivial decision-making task where people have a repeated choice between two meaningless and maximally equivalent options (a little like Buridan's ass, which, curiously, never seems to have gotten a psychology experiment in humans). By studying the prediction of decisions in this task, I hope to learn more about how to predict variation in choices that arises from something other than content. Then perhaps I can e.g. break the ceiling of 85% accuracy I observed in Arfer and Luhmann (2015).

It seems like there's a lot of analytic approaches for predicting binary sequences. Let's get at least a little data to start with. Let's run the task on Mechanical Turk for convenience.

I was thinking the task could use the mouse, and I could record mouse movements and use mouse movements before making previous choices as a rich predictor. The trouble is that you can't move the cursor in JavaScript, so I can't force the cursor to a neutral position after each choice. Even if I could, the physical mouse would still be a bit to the side, of course. Asking subjects to perfectly recenter the mouse, and not starting the next trial till they did, would probably drive subjects a little crazy.

Instead, let's use the keyboard, and ask subjects to position their fingers over both of the two allowed keys. "F" and "J" seem like good choices.

Suppose we get 200 choices. For each subject, the first 100 choices will be purely training data, while I'll evaluate predictions on the second 100. I may use online learning methods that effectively treat every trial the same, but I'll be most interested in performance after the method has already had 100 trials to train with. During the task, I'll show the subject how many choices are left to make, but I won't give feedback on which choices they made.

Previous studies

There have been many previous experiments on human random-number generation. They're not too useful for my purposes because there's no easy way to get the data, and most tasks explicitly ask for random choices instead of letting subjects choose arbitrarily.


(len (.unique ($ trials sn)))
(.mean ($ trials chose-key1))

First tries

Since the base rates are near 50%, a trivial model will do poorly.

(setv N 200)
(setv n-train (// N 2))
(setv n-test (- N n-train))
(setv d (.copy trials))
(setv ($ d train) (< ($ d trial) n-train))

The first few subjects look like this (O means pressing "f", x means pressing "j"):

(lfor [X x] (.groupby (ss d (.isin $sn [0 1 2])) "sn")
  [(pretty-choice-seq (ss x $train "chose-key1"))])

The quantiles of base rates in the test halves are:

(setx qs-base (.quantile :q [.25 .5 .75] (wcby
  (ss d (~ $train)) $sn
  (max [(.sum $chose-key1) (.sum (~ $chose-key1))]))))
I value
0.25 51
0.50 56
0.75 62


Let's try a model that assumes that a given choice is a probabilistic function of the previous n trials. The trouble with this is of course when predicting more than n trials of the test set, it will need to use its own past predictions as predictors. When we didn't get to train on the n-tuple in question, we'll predict the mode.

(defn supermarkov-train [lookbehind train]
  (setv mode (>= (.mean train) .5))
  (setv train (tuple train))
  (setv memory {})
  (for [i (range lookbehind (len train))]
    (setv present (get train i))
    (setv past (cut train (- i lookbehind) i))
    (when (not-in past memory)
      (setv (get memory past) [0 0]))
    (+= (get memory past present) 1))
  {"lookbehind" lookbehind
    "mode" mode  "memory" memory  "train" train})

(defn supermarkov-predict [model test-n]
  (setv preds (,))
  (do-n test-n
    (setv past (cut
      (+ (get model "train") preds)
      (- (get model "lookbehind"))))
    (+= preds (, (if (in past (get model "memory"))
        (setv [f t] (get model "memory" past))
          [(> t f) True]
          [(> f t) False]
          [True (get model "mode")]))
      (get model "mode")))))
(setx qs-supermarkov (.quantile :q [.25 .5 .75] (wcby d $sn
  (max (lfor n [2 3 4 5]
    (.sum (=
      (get $chose-key1 (~ $train))
        (supermarkov-train n (get $chose-key1 $train))
I value
0.25 50.0
0.50 54.5
0.75 64.0

Whoops, that's a bit worse than the trivial model.

Regression on trial numbers

Let's try regression models using the trial number and transformations thereof as predictors.

(defn go []
  (.reset-index (wcby (ss d $train) $sn
    (print (first $sn))
    (setv i (np.array (list (range N))))
    (pd.DataFrame (dict
      :trial (geta i (: n-train N))
      :pred (if (= (len (.unique $chose-key1)) 1)
         (* [(first $chose-key1)] n-test)
           (setv X (sklearn.preprocessing.scale (np.column-stack [
             i (** i 2) (** i 3)
             (% i 2) (% i 3) (% i 5)
             (np.sin i) (np.cos i)])))
           (setv model (sklearn.linear-model.LogisticRegressionCV
             :Cs 100 :cv 10 :penalty "l2" :scoring "accuracy"))
           (.fit model (geta X (: n-train)) $chose-key1)
           (.predict model (geta X (: n-train N))))))))))

At least sometimes, this model produces some nontrivial predictions, thanks to its modular and trigonometric terms:

(setv ooo (go))
(lfor [X x] (.groupby (ss ooo (.isin $sn [0 1 3])) "sn")
  [(pretty-choice-seq ($ x pred))])

How accurate is it?

(setx qs-trial-nums (.quantile :q [.25 .5 .75] (wcby (.merge d ooo) $sn
  (.sum (= $pred $chose-key1)))))
I value
0.25 47.75
0.50 52.00
0.75 59.00

Worse still.

Regression on run length

A Stack Exchange answer suggests a model for the probability of an option that only depends on how many times in a row that option has been selected previously.

(defn rlm-encode [values]
  (setv accums [0 0])
  (setv hist (+ [[0 0]] (lfor
    x values
    :do (setv (get accums (not x)) 0)
    :do (+= (get accums x) 1)
    (.copy accums))))
  (setv hist (np.array (cut hist 0 -1)))
  (np.column-stack [
    (** hist 2)
    (> (geta hist : 0) 0)]))

(defn go []
  (.reset-index (wcby (ss d $train) $sn (pd.DataFrame (dict
    :trial (range n-train N)
    :pred (if (= (len (.unique $chose-key1)) 1)
      (* [(first $chose-key1)] n-test)
        (print (first $sn))
        (setv scaler (sklearn.preprocessing.StandardScaler :with-mean F))
        (setv model (sklearn.linear-model.LogisticRegressionCV
          :Cs 100 :cv 10 :penalty "l2" :scoring "accuracy"))
        (.fit model
          (.fit-transform scaler (rlm-encode $chose-key1))
        (setv preds (.tolist $chose-key1))
        (do-n n-test
          (.append preds (first (.predict model
            (geta (.transform scaler (rlm-encode preds)) None -1 :)))))
        (cut preds n-train))))))))
(setv ooo (go))
(lfor [X x] (.groupby (ss ooo (.isin $sn [0 1 2])) "sn")
  [(pretty-choice-seq ($ x pred))])
(setx qs-run-lens (.quantile :q [.25 .5 .75] (wcby (.merge d ooo) $sn
  (.sum (= $pred $chose-key1)))))
I value
0.25 48
0.50 50
0.75 54

Again, not a success.

Switch to interpolation

When you think about it, it's not surprising that autoregressive-like models, that is, models where the output depends on the previous output, would do badly in this situation. I'm trying to predict trials 100 through 199, so if I mispredict trial 110, that will substantially hurt my ability to predict trial 111, and hence trial 112, and so on. One mistake is enough to throw me off the whole scent.

It's also not surprising that regression doesn't work well, considering the dangers of extrapolation. Each model, in training, only sees trial numbers 0 through 99. It gets no instructions or feedback to control its behavior when the input trial number is 100 or more, which is also the only kind of trial number tried in testing. So, for example, a tree-based model might use a rule like `trial > 97` just to refine its predictions for trials 98 and 99, but that rule will end up applying to all the testing data. I should only expect a hierarchical-style regression model, which is allowed to see all trial numbers for some subjects, to succeed on these later trial numbers.

To make things easier, and also more like Builder, let's try traditional random cross-validation, within subjects. If I do well here, I can use my winning model to inspire the construction of hierarchical models for the original problem.

Data setup

(setv n-folds 10)

(defmacro try-model [#* args]
  (setv args (dict (partition args)))
  (setv X-expr (get args :X)
    estimator-expr (get args :model)
    diagnostic-expr (.get args ':diagnostic))
  `((fn []
    (setv diagnostic {})
    (setv cv-out (pd.concat (gfor [sn chunk] (.groupby trials "sn")   (wc chunk
      (print "sn" (first $sn))
      (setv X ~X-expr)
      (setv estimator ~estimator-expr)
      (setv pred (np.empty-like $chose-key1))
      (for [[train test] (.split :X X (sklearn.model-selection.KFold n-folds
          :shuffle T :random-state (+ 100 sn)))]
        (setv y-train (geta (.to-numpy $chose-key1) train))
        (setv (geta pred test) (if (= (len (np.unique y-train)) 1)
          (* [(first y-train)] (len test))
            (setv model (.fit estimator (geta X train) y-train))
            ~(when diagnostic-expr
              `(setv (get diagnostic sn) ~diagnostic-expr))
            (.predict model (geta X test))))))
      (pd.DataFrame (dict
        :sn sn
        :trial $trial
        :chose-key1 $chose-key1
        :pred pred))))))
    [cv-out diagnostic])))

(defn show-accuracies [cv-out]
  (.quantile :q [.25 .5 .75] (wcby cv-out $sn
    (/ (.sum (= $chose_key1 $pred)) 2))))

Base rates

(setv [cv-out --] (try-model 
  :X (np.empty [(len $trial) 1])
  :model (sklearn.dummy.DummyClassifier "most_frequent")))
(setx qs-interp-trivial (show-accuracies cv-out))
I value
0.25 51.5
0.50 55.0
0.75 58.5

Logistic regression

(setv [cv-out diagnostic] (try-model 
  :X (sklearn.preprocessing.scale (.astype (np.column-stack [
    $trial (** $trial 2) (** $trial 3)]) np.float64))
  :model (sklearn.linear-model.LogisticRegression
    :C np.inf :solver "lbfgs")
  :diagnostic (np.append model.intercept_ model.coef_)))
(setx qs-interp-cubic (show-accuracies cv-out))
I value
0.25 47.500
0.50 54.000
0.75 59.625
(lfor [-- x] (.groupby (ss cv-out (.isin $sn [0 1 2])) "sn")
  [(pretty-choice-seq ($ x pred))])

The long runs of Trues and Falses in the predictions suggest an underfit model. A mere cubic can't capture lots of switching.

Regularized, with more terms

(setv [cv-out diagnostic] (try-model 
  :X (sklearn.preprocessing.scale (.astype :dtype np.float64 (np.column-stack [
    $trial (** $trial 2) (** $trial 3)
    (% $trial 2) (% $trial 3)])))
  :model (sklearn.linear-model.LogisticRegressionCV
    :Cs 10 :penalty "l2"
    :cv (sklearn.model-selection.KFold 10
      :shuffle T :random-state 133)
    :scoring "accuracy")
  :diagnostic model.C_))
(setx qs-interp-plusmod (show-accuracies cv-out))
I value
0.25 53.000
0.50 57.000
0.75 65.375

Nearest neighbors

(setv [cv-out diagnostic] (try-model
  :X (np.vstack (.to-numpy $trial))
  :model (sklearn.model-selection.GridSearchCV
    :estimator (sklearn.neighbors.KNeighborsClassifier)
    :param-grid {"n_neighbors" (seq 1 10)}
    :cv (sklearn.model-selection.KFold 10 :shuffle T :random-state 133)
    :scoring "accuracy")
  :diagnostic model.best_params_))
(setx qs-nn (show-accuracies cv-out))
I value
0.25 48.500
0.50 58.500
0.75 70.125
(lfor [-- x] (.groupby (ss cv-out (.isin $sn [0 1 2])) "sn")
  [(pretty-choice-seq ($ x pred))])

It no longer looks like we're underfit, but performance is still not fabulous.

(valcounts (gfor  d (.values diagnostic)  (get d "n_neighbors")))
I value
1 12
2 9
3 12
4 33
5 8
6 1
7 2
8 7
9 5
10 9

These are the counts of the selected ks.


This takes about an hour to run.

(import xgboost)
(setv [cv-out diagnostic]
  ;(with [o (open "/home/hippo/Temporary/donkey-xgb.pkl" "rb")]
  ;  (import pickle) (pickle.load o)))
   :X (np.vstack (.to-numpy $trial))
   :model (sklearn.model-selection.RandomizedSearchCV
     :estimator (xgboost.XGBRegressor
       :objective "binary:hinge"
       :random-state 1337
       :n-estimators 250)
     :param-distributions (dict
       :learning-rate [.01 .02 .05 .1 .2 .25 .5]
       :gamma (lfor  x (seq -7 6)  (** 2 x))
       :reg_lambda (lfor  x (seq -10 10)  (** 2 x))
       :reg_alpha (lfor  x (seq -10 10)  (** 2 x))
       :max_depth [3 6 9])
     :n-iter 50
     :random-state 123
     :cv (sklearn.model-selection.KFold 10
       :shuffle T :random-state 133)
     :scoring "accuracy")))
(setx qs-xgboost (show-accuracies cv-out))
I value
0.25 53.500
0.50 58.500
0.75 68.625


This form of the problem, too, is too hard. I'm barely improving upon the base rate. Either I have to find a way to get something out of response times (which I've ignored so far), or I have to move to hierarchical methods and hope that I can use what I learn from some subjects to help predict others subjects' choices. Response times sound unlikely to be very helpful, so let's try hierarchical methods.


If only for computational reasons, let's try predicting one test trial at a time.

For the code in this section, tuning parameters have been hand-adjusted, so they may be overfit.

(setv n-folds 10)

(defmacro try-model [#* args]
  (setv args (dict (partition args)))
    estimator-expr (.get args ':model)
    fit-fn (.get args ':fit)
    diagnostic-expr (.get args ':diagnostic)
    prob-for-pred (hy.eval (.get args ':prob-for-pred 'False))
    use-rts (.get args ':rt 'False))
  `((fn []

    (setv train-trials (list (range 100)))
    (setv d-choices (.pivot trials
      :index "sn" :columns "trial" :values "chose-key1"))
    (setv X (.to-numpy (geti d-choices : train-trials)))
    (when ~use-rts
      ; Add the standardized reaction times to X.
      (setv rts (.to-numpy (geti
        (.pivot trials :index "sn" :columns "trial" :values "rt")
        : train-trials)))
      (setv X (np.hstack [X (/
        (- rts (np.mean rts :axis 0))
        (* 2 (np.std rts :axis 0)))])))
    ;(setv X (.fit-transform :X X (sklearn.preprocessing.PolynomialFeatures
    ;  :degree 2 :interaction-only T :include-bias F)))

    (setv out (pd.concat :axis 1 (lfor test-trial [100 101 150] (do

      (setv y (.to-numpy (geti d-choices : test-trial)))
      (setv y-prob (np.full y.shape NaN))
      (setv y-pred (np.zeros-like y))
      (setv diagnostic [])

      (for [[fold-i [train test]] (enumerate (.split :X X
          (sklearn.model-selection.KFold n-folds
            :shuffle T :random-state 1234)))]
        (print (.format "Test trial {}, fold {}" test-trial fold-i))
        (np.random.seed (+ 123 fold-i))
        (setv model ~(if fit-fn
          `(~fit-fn (geta X train) (geta y train) fold-i)
          `(.fit ~estimator-expr (geta X train) (geta y train))))
        ~(when diagnostic-expr
          `(.append diagnostic ~diagnostic-expr))
        (setv probs (geta (.predict-proba model X) : -1))
        (setv (geta y-prob test) (geta probs test))
        (setv (geta y-pred test) ~(if prob-for-pred
          '(>= (geta y-prob test) .5)
          '(.predict model (geta X test))))
        (print (.format "Training accuracy: {:.02}"
          (np.mean (= (> (geta probs train) .5) (geta y train))))))

      (pd.Series :name test-trial (dict
        :base-MSE (round (.mean (** (- y (np.mean y)) 2)) 3)
        :MSE (round (.mean (** (- y y-prob) 2)) 3)
        :base-rate (max (.mean y) (- 1 (.mean y)))
        :accuracy (.mean (= y y-pred))))))))

    (setv "test_trial")
(setx result-between-lr (try-model
  :model (sklearn.linear-model.LogisticRegression
    :penalty "l1" :solver "saga" :max-iter 2,000
    :C (get {100 0.25  101 0.5  150 1e-6} test-trial))
  :diagnostic (+ (list model.intercept_) (list (.flatten model.coef_)))))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.222 0.179 0.250
base_rate 0.550 0.500 0.550
accuracy 0.660 0.730 0.530
(setx result-between-lr-rt (try-model
  :rt True
  :model (sklearn.linear-model.LogisticRegression
    :penalty "l1" :solver "saga" :max-iter 2,000
    :C (get {100 0.25  101 0.75  150 1e-6} test-trial))
  :diagnostic (+ (list model.intercept_) (list (.flatten model.coef_)))))
(setx result-between-nn (try-model
  :model (sklearn.neighbors.KNeighborsClassifier :n-neighbors 3)))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.291 0.313 0.296
base_rate 0.550 0.500 0.550
accuracy 0.610 0.570 0.560
(setx result-between-nn-rt (try-model
  :rt True
  :model (sklearn.neighbors.KNeighborsClassifier :n-neighbors 3)))
test_trial 100 101 150
base_MSE 0.248 0.25 0.248
MSE 0.301 0.27 0.306
base_rate 0.550 0.50 0.550
accuracy 0.570 0.60 0.550
(defmacro try-neural-net [#* kwargs]
  (setv kwargs (dict (partition kwargs)))
    :rt ~(.get kwargs ':rt)
    :fit (fn [X y fold-i]
      (import os) (setv (get os.environ "TF_CPP_MIN_LOG_LEVEL") "3")
        tensorflow.keras.models [Sequential]
        tensorflow.keras.layers [Dense Activation]
        tensorflow.keras [regularizers])
      (setv model (Sequential [
        (Dense ~(get kwargs :hidden-nodes) :activation "tanh"
          :kernel-regularizer (regularizers.l1 ~(get kwargs :penalty)))
        (Dense 1 :activation "sigmoid")]))
      (.compile model
        :optimizer "rmsprop"
        :loss "binary_crossentropy")
      (tensorflow.random.set-seed (+ 456 fold-i))
      (.fit model X y
        :verbose False
        :epochs ~(get kwargs :epochs)
        :batch-size 20)
    :prob-for-pred True))
(setx result-between-neural (try-neural-net
  :hidden-nodes 100 :epochs 100
  :penalty (get {100 .02  101 .01  150 1e6} test-trial)))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.242 0.224 0.255
base_rate 0.550 0.500 0.550
accuracy 0.600 0.680 0.550
(setx result-between-neural-rt (try-neural-net
  :hidden-nodes 200 :epochs 100 :rt True
  :penalty (get {100 .02  101 .02  150 1e6} test-trial)))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.238 0.227 0.254
base_rate 0.550 0.500 0.550
accuracy 0.610 0.700 0.550
(defclass OneColPredictor []
  (defn __init__ [self #** kw]
    (for [[k v] (.items kw)]
      (setattr self k v)))
  (defn predict-proba [self X]
      (np.where (geta X : self.column) self.p-true self.p-false)
      (, (get X.shape 0) 1))))

(setx result-between-onecol (try-model
  :fit (fn [X y fold-i]
    (setv column (max (range (get X.shape 1)) :key (λ
      (setv p (np.mean (= (geta X : it) y)))
      (max p (- 1 p)))))
    (OneColPredictor :column column
      :p-true (np.mean (get y (geta X : column)))
      :p-false (np.mean (get y (~ (geta X : column))))))
  :prob-for-pred True))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.231 0.187 0.274
base_rate 0.550 0.500 0.550
accuracy 0.680 0.770 0.520

Even in this, the easiest form of the problem, with or without reaction times, I'm not getting great accuracy. I also modified the above code to change the training set from the first 100 trials to trial 80 through 100, and that improved performance only a bit.

Now what?

The sample is not huge, especially when framing the problem hierarchically as predicting one test trail with all 100 training trials. However, considering the gap between the kinds of performance I'm getting (77% accuracy in trial 101; 55% accuracy in trial 150) and the kind I'm looking for (at least 95%), it's not realistic to think that a sample that's twice or ten times this size will suffice. An enormous sample, like over 10,000 subjects, could change things, but that's not something I can readily afford.

I think I made a mistake by not explicitly asking subjects not to use a randomization device. There's no telling how many of them did that. However, I can reasonably assume it was a minority, and that keeping subjects from doing this wouldn't suffice for good accuracy.

Decision prefixes

Is it true that if two people make the same binary choice on a large number of contiguous trials, then they're highly likely to make the same choice on the trial immediately after?

(require kodhy.macros [geti λ cbind])
(import kodhy.util [rd])

  (.groupby (pd.DataFrame retail-order-seqs) [0 1 2 3 4])
  (λ (cbind :n (len it)  :p (rd (.mean (geti it : 5))))))
i0 1 2 3 4 i5 n p
False False False False False 0 88 0.091
False False False False True 0 12 0.333
False False False True False 0 8 0.250
False False False True True 0 12 0.250
False False True False False 0 13 0.077
False False True False True 0 4 0.250
False False True True False 0 8 0.250
False False True True True 0 10 0.600
False True False False False 0 17 0.235
False True False False True 0 5 0.600
False True False True False 0 11 0.727
False True False True True 0 7 0.714
False True True False False 0 14 0.286
False True True False True 0 17 0.647
False True True True False 0 6 0.667
False True True True True 0 22 0.773
True False False False False 0 34 0.324
True False False False True 0 12 0.500
True False False True False 0 18 0.222
True False False True True 0 22 0.682
True False True False False 0 16 0.312
True False True False True 0 20 0.700
True False True True False 0 23 0.435
True False True True True 0 61 0.721
True True False False False 0 25 0.200
True True False False True 0 12 0.750
True True False True False 0 34 0.471
True True False True True 0 61 0.656
True True True False False 0 42 0.286
True True True False True 0 52 0.635
True True True True False 0 77 0.325
True True True True True 0 324 0.781

Maybe not—but maybe 5 choices isn't long enough of a sequence, and my notion of a binary choice here is pretty artificial: it's whether an order in the University of California, Irvine's "Online Retail" dataset included 10 or more distinct items. But most problematic, I guess, is that this idea of prefixes being predictive might work when all the decision problems being considered are the same for all trials in the prefix, as well as the $n$th trial, but not when different subjects are considering different problems—the problem characteristics are too influential on choices for these prefixes to predict the response with high accuracy.

The moral

Probably, people's decisions really are noisy. Whether the noise is deterministic (via sensitive dependence on initial conditions) or truly random, people's choices can include variation that is effectively unpredictable from previous choices of the same kind. One can imagine a continuum of predictability to decision problems with no features or constraints, as in this task, to problems where the subject is basically forced to make a certain choice. In between are the typical problems of psychology experiments and real life, where people have some amount of freedom to choose, but the features of the options, such as forseeable consequences, tend to push them towards one option or another.


Arfer, K. B., & Luhmann, C. C. (2015). The predictive accuracy of intertemporal-choice models. British Journal of Mathematical and Statistical Psychology, 68(2), 326–341. doi:10.1111/bmsp.12049. Retrieved from

Hagelbarger, D. W. (1956). SEER, a SEquence Extrapolating Robot. IRE Transactions on Electronic Computers, EC-5(1), 1–7. doi:10.1109/TEC.1956.5219783

Shannon, C. E. (1953). A mind-reading (?) machine. Bell Laboratories.