Donkey notebook

Created 25 Sep 2018 • Last modified 6 Mar 2020

Task planning

I want a bare-bones trivial decision-making task where people have a repeated choice between two meaningless and maximally equivalent options (a little like Buridan's ass, which, curiously, never seems to have gotten a psychology experiment in humans). By studying the prediction of decisions in this task, I hope to learn more about how to predict variation in choices that arises from something other than content. Then perhaps I can e.g. break the ceiling of 85% accuracy I observed in Arfer and Luhmann (2015).

It seems like there are a lot of analytic approaches for predicting binary sequences. Let's get at least a little data to start with. For convenience, let's run the task on Mechanical Turk.

I was thinking the task could use the mouse, and I could record mouse movements and use the movements leading up to previous choices as a rich predictor. The trouble is that you can't move the cursor in JavaScript, so I can't force the cursor to a neutral position after each choice. Even if I could, the physical mouse would still be a bit to the side, of course. Asking subjects to perfectly recenter the mouse, and not starting the next trial until they did, would probably drive them a little crazy.

Instead, let's use the keyboard, and ask subjects to position their fingers over both of the two allowed keys. "F" and "J" seem like good choices.

Suppose we get 200 choices. For each subject, the first 100 choices will be purely training data, while I'll evaluate predictions on the second 100. I may use online learning methods that effectively treat every trial the same, but I'll be most interested in performance after the method has already had 100 trials to train with. During the task, I'll show the subject how many choices are left to make, but I won't give feedback on which choices they made.
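
To make the evaluation concrete, here's a minimal Python sketch of this scheme, with a hypothetical `predictor` object (and its `fit` and `predict` methods) standing in for whatever learning method I end up using:

import numpy as np

def evaluate_subject(choices, predictor, n_train=100):
    # Train on the first n_train choices and score accuracy on the rest.
    choices = np.asarray(choices, dtype=bool)
    train, test = choices[:n_train], choices[n_train:]
    predictor.fit(train)                  # learn from the first 100 choices
    preds = predictor.predict(len(test))  # predict the remaining 100
    return np.mean(preds == test)         # proportion of correct predictions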

Previous studies

There have been many previous experiments on human random-number generation. They're not too useful for my purposes because there's no easy way to get the data, and most tasks explicitly ask for random choices instead of letting subjects choose arbitrarily.

Descriptives

(len (.unique ($ trials sn)))
100
(.mean ($ trials chose-key1))
0.512
(setv v (wcby trials $sn (.sum $chose-key1)))
(rectplot v)
(plt.xticks (+ [(.min v)]
  (lfor  n (seq 0 200 10)  :if (> n (.min v))  n)))
base-rates.png

Some first subjects

Let's try some models with the first few real subjects. Their first 100 trials look like this (O means pressing "f", x means pressing "k"):

(setv N 200)
(setv n-train (// N 2))
(setv n-test (- N n-train))
(setv d (ss trials (.isin $sn [0 1 2])))
(setv ($ d train) (< ($ d trial) n-train))
(lfor [X x] (.groupby d "sn")
  [(pretty-choice-seq (ss x $train "chose-key1"))])
OOxOOxOxOOOOOOOxxxxOOOxxxOOOxOxxxxxxOOOOOOOOOOxxxxOxOxOxOOOOOOxxxOOxxOOOOOxxxOxOxOxOxOxOxOxOOxxxxxxx
OOOxOxOOOOOOOOxOxOOxOxOOxOOOxOOOOOxxOxOOOOOxxOOxxxxxxxxxxxxxxxxxxxxxxxxOxOxOOOOxOxOOOxxOOOxOOOOOOxxO
xOOOxxxxOOxxxxOOxOOOOOxOOOxxxOOOxxxxxxxOOOOOxxxxxxOOOOOOOOOOOOOOOOOOOxxxOOxxxOOOOOxxxxOOOOOxxxxOOOOO
(wcby d [$sn $train] (.mean $chose-key1))
sn train value
0 False 0.50
0 True 0.54
1 False 0.53
1 True 0.54
2 False 0.51
2 True 0.59

Since the base rates are near 50%, a trivial model will do poorly.

Let's try a model that assumes that a given choice is a probabilistic function of the previous n choices. The trouble with this, of course, is that when predicting more than n trials into the test set, it has to use its own past predictions as predictors. When the n-tuple in question never appeared in training, we'll predict the mode.

(defn supermarkov-train [lookbehind train]
  (setv mode (>= (.mean train) .5))
  (setv train (tuple train))
  (setv memory {})
  (for [i (range lookbehind (len train))]
    (setv present (get train i))
    (setv past (cut train (- i lookbehind) i))
    (when (not-in past memory)
      (setv (get memory past) [0 0]))
    (+= (get memory past present) 1))
  {"lookbehind" lookbehind
    "mode" mode  "memory" memory  "train" train})

(defn supermarkov-predict [model test-n]
  (setv preds (,))
  (for [X (range test-n)]
    (setv past (cut
      (+ (get model "train") preds)
      (- (get model "lookbehind"))))
    (+= preds (, (if (in past (get model "memory"))
      (do
        (setv [f t] (get model "memory" past))
        (cond
          [(> t f) True]
          [(> f t) False]
          [True (get model "mode")]))
      (get model "mode")))))
  preds)
(wcby (ss d $train) $sn
  (pd.concat (lfor n [2 3 4 5]
    (cbind :n n
      :pred_sum (sum (supermarkov-predict
        (supermarkov-train n $chose-key1)
        n-test))))))
sn i1 n pred_sum
0 0 2 0
0 0 3 0
0 0 4 0
0 0 5 0
1 0 2 100
1 0 3 100
1 0 4 100
1 0 5 98
2 0 2 100
2 0 3 100
2 0 4 100
2 0 5 100

Hilariously, for various n, this model predicts False or True every time or nearly every time. So it behaves similarly to the trivial model.

Let's try regression models using the trial number and transformations thereof as predictors.

(setv ooo (wcby (ss d $train) $sn
  (setv i (np.array (list (range N))))
  (setv X (sklearn.preprocessing.scale (np.column-stack [
    i (** i 2) (** i 3)
    (% i 2) (% i 3) (% i 5)
    (np.sin i) (np.cos i)])))
  (setv model (sklearn.linear-model.LogisticRegressionCV
    :Cs 100 :cv 10 :penalty "l2" :scoring "accuracy"))
  (.fit model (geta X (: n-train)) $chose-key1)
  (pd.DataFrame (dict
    :trial (geta i (: n-train N))
    :pred (.predict model (geta X (: n-train N)))))))

At least sometimes, this model produces some nontrivial predictions, thanks to its modular and trigonometric terms:

(lfor [X x] (.groupby ooo "sn")
  [(pretty-choice-seq ($ x pred))])
xOOOxOxOOOxOxOOOxOxOOOxOxOOOxOxxxOxOxxxOOOxxxOOOxxxOxOxxxOxOxxxxxOxxxxxOxxxxxOxxxxxOxxxxxxxxxxxxxxxx
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

How accurate is it for that one subject?

(.mean (=
  (getl ooo 0 "pred")
  (.reset-index :drop T (ss d (& (= $sn 0) (~ $train)) "chose-key1"))))
0.42

Worse than trivial; oops.

Regression on run length

A Stack Exchange answer suggests a model for the probability of an option that only depends on how many times in a row that option has been selected previously.

(defn rlm-encode [values]
  (setv accums [0 0])
  (setv hist (np.array (+ [[0 0]] (lfor
    x values
    :do (setv (get accums (not x)) 0)
    :do (+= (get accums x) 1)
    (.copy accums)))))
  (np.column-stack [
    hist
    (** hist 2)
    (> (geta hist : 0) 0)]))

(setv ooo (wcby (ss d $train) $sn
  (setv scaler (sklearn.preprocessing.StandardScaler :with-mean F))
  (setv model (sklearn.linear-model.LogisticRegressionCV
    :Cs 100 :cv 10 :penalty "l2" :scoring "accuracy"))
  (setv X (.fit-transform scaler (rlm-encode $chose-key1)))
  (.fit model (geta X (: 0 n-train)) $chose-key1)
  (setv preds (.tolist $chose-key1))
  (for  [… (range n-test)]
    (.append preds (first (.predict model
      (geta (.transform scaler (rlm-encode preds)) None -1 :)))))
  (pd.DataFrame (dict
    :trial (range n-train N)
    :pred (cut preds n-train)))))

(lfor [X x] (.groupby ooo "sn")
  [(pretty-choice-seq ($ x pred))])
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
OOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOO
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
(.mean (=
  (getl ooo 1 "pred")
  (.reset-index :drop T (ss d (& (= $sn 1) (~ $train)) "chose-key1"))))
0.48

Again, not a success.

Switch to interpolation

When you think about it, it's not surprising that autoregressive-like models, that is, models in which each output depends on previous outputs, do badly in this situation. I'm trying to predict trials 100 through 199, so if I mispredict trial 110, that will substantially hurt my ability to predict trial 111, and hence trial 112, and so on. One mistake is enough to throw me off the scent for the rest of the sequence.

It's also not surprising that regression doesn't work well, considering the dangers of extrapolation. In training, each model sees only trial numbers 0 through 99. It gets no instruction or feedback about how to behave when the input trial number is 100 or more, yet that's the only kind of trial number it's tested on. So, for example, a tree-based model might use a rule like `trial > 97` just to refine its predictions for trials 98 and 99, but that rule will end up applying to all the testing data. I should only expect a hierarchical-style regression model, which is allowed to see all trial numbers for some subjects, to succeed on these later trial numbers.
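
As a quick illustration of this point (not part of the original analysis), fit a decision tree to trial number alone: every split threshold it can learn lies below 100, so all the test trials follow the same path through the tree and get the same prediction.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
trial = np.arange(200).reshape(-1, 1)
y = rng.integers(0, 2, 200).astype(bool)  # stand-in choice sequence

tree = DecisionTreeClassifier().fit(trial[:100], y[:100])
test_preds = tree.predict(trial[100:])
# Every learned threshold is less than 100, so trials 100-199 all land in
# one leaf and receive an identical prediction:
print(len(set(test_preds)))  # 1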

To make things easier, and also more like Builder, let's try traditional random cross-validation, within subjects. If I do well here, I can use my winning model to inspire the construction of hierarchical models for the original problem.

Data setup

(setv n-folds 10)

(defmacro try-model [&rest args]
  (setv args (dict (partition args)))
  (setv X-expr (get args :X)
    estimator-expr (get args :model)
    diagnostic-expr (.get args ':diagnostic))
  `((fn []
    (setv diagnostic {})
    (setv cv-out (pd.concat (gfor [sn chunk] (.groupby trials "sn")   (wc chunk
      (print "sn" (first $sn))
      (setv X ~X-expr)
      (setv estimator ~estimator-expr)
      (setv pred (np.empty-like $chose-key1))
      (for [[train test] (.split :X X (sklearn.model-selection.KFold n-folds
          :shuffle T :random-state (+ 100 sn)))]
        (setv model (.fit estimator (geta X train)
          (geta (.to-numpy $chose-key1) train)))
        ~(when diagnostic-expr
          `(setv (get diagnostic sn) ~diagnostic-expr))
        (setv (geta pred test) (.predict model (geta X test))))
      (pd.DataFrame (dict
        :sn sn
        :trial $trial
        :chose-key1 $chose-key1
        :pred pred))))))
    [cv-out diagnostic])))

(defn show-accuracies [cv-out]
  (setv accuracies (wcby cv-out $sn
    (.mean (= $chose_key1 $pred))))
  (rd [(.min accuracies) (.mean accuracies) (.max accuracies)]))
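
For anyone not fluent in Hy, here's a rough Python sketch of what `try-model` does for a single subject. The names `make_X` and `make_estimator` are stand-ins for the macro's `:X` and `:model` arguments, and I'm assuming the choice column is called `chose_key1`; the real macro loops this over subjects and concatenates the results.

import numpy as np
from sklearn.model_selection import KFold

def cv_predict_one_subject(chunk, make_X, make_estimator, sn, n_folds=10):
    # 10-fold cross-validated predictions of one subject's choices.
    X = make_X(chunk)
    y = chunk["chose_key1"].to_numpy()
    pred = np.empty_like(y)
    folds = KFold(n_folds, shuffle=True, random_state=100 + sn)
    for train, test in folds.split(X):
        model = make_estimator().fit(X[train], y[train])
        pred[test] = model.predict(X[test])
    return pred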

Base rates

(setv [cv-out --] (try-model 
  :X (np.empty [(len $trial) 1])
  :model (sklearn.dummy.DummyClassifier "most_frequent")))
(show-accuracies cv-out)
0.395 0.567 0.95

Logistic regression

(setv [cv-out diagnostic] (try-model 
  :X (sklearn.preprocessing.scale (.astype (np.column-stack [
    $trial (** $trial 2) (** $trial 3)]) np.float64))
  :model (sklearn.linear-model.LogisticRegression
    :C np.inf :solver "lbfgs")
  :diagnostic (np.append model.intercept_ model.coef_)))
(show-accuracies cv-out)
0.37 0.572 0.95
(lfor [-- x] (.groupby (ss cv-out (.isin $sn [0 1 2])) "sn")
  [(pretty-choice-seq ($ x pred))])
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxOOxxxxxxxxOxxxxxxxxxxxxxxxxxxOxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxOxOxxxxxxxOOxxxxOOOOOxOOOOOOxOOOOOOOOOxOxOxOOOOOO
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxOOxOxOxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxOxxOxxxOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxxxxxxxxxxxxxxxxx
OxxxOxOOxxOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxOOxOOOxOOOOOxxOxOxOOx

The long runs of Trues and Falses in the predictions suggest an underfit model. A mere cubic can't capture lots of switching.

Regularized, with more terms

(setv [cv-out diagnostic] (try-model 
  :X (sklearn.preprocessing.scale (.astype :dtype np.float64 (np.column-stack [
    $trial (** $trial 2) (** $trial 3)
    (% $trial 2) (% $trial 3)])))
  :model (sklearn.linear-model.LogisticRegressionCV
    :Cs 10 :penalty "l2"
    :cv (sklearn.model-selection.KFold 10 :random-state 133)
    :scoring "accuracy")
  :diagnostic model.C_))
(show-accuracies cv-out)
0.42 0.61 0.95

Nearest neighbors

(setv [cv-out diagnostic] (try-model
  :X (np.vstack (.to-numpy $trial))
  :model (sklearn.model-selection.GridSearchCV
    :estimator (sklearn.neighbors.KNeighborsClassifier)
    :param-grid {"n_neighbors" (seq 1 10)}
    :cv (sklearn.model-selection.KFold 10 :shuffle T :random-state 133)
    :scoring "accuracy"
    :iid T)
  :diagnostic model.best_params_))
(show-accuracies cv-out)
0.36 0.653 0.97
(lfor [-- x] (.groupby (ss cv-out (.isin $sn [0 1 2])) "sn")
  [(pretty-choice-seq ($ x pred))])
OOOOxOxOOOOOOOOOxOOxOOOOxxOxOxxxxxxOxOOOOOOOOOxOxxxOxOxOOOOOOOOOxxxOOxOOOOOxxxOxOxxxOxOxOOOOxxxxxxxxxxxxxxxxxxOOOOOxOOOOOOOOOxOxxxxOxOxOOOOOxxxxOOOOxOxxxxxxxOxOOOOOOOOOxxxxxxxxxxxxxxxxOOxOOOOOOOOOOxOx
OOOOxOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxOOOOOOOOxxxxxxxxxxxxxxxxxxxxxxxxxxxxxOOOOOOxOOOOOOOOOOOOOOOOOOOOOOOOOxxxxxOxxxOxxxxxxxOOOOOOOOOOxOOOOOOOOOOOOOOOxOxxxxxxxxxOOOOOxxxxxxxxxxxxxxOOOxOOOOOOOOxOOOxxxxx
OxOOOxxxxOOxxxxOOxOOOOOxOOOxxxOOOxxxxxxOOOOxOxxxxxxOOOOOOOOOOOOOOOOOxOxxxxOxxOOOOOOxxxxOOOOOxxxxOOOOOxxxxxOOOxxxOOOxxxxxxxxxxxOOOOxOOxOOxOOOxxxxxOOOxOOxOOOOOxxxxOOOOOOOOOOOxxxxOOxOOxOOxOOxOOxxOOxOxxxx

It no longer looks like we're underfit, but performance is still not fabulous.

(valcounts (gfor  d (.values diagnostic)  (get d "n_neighbors")))
I value
1 5
2 3
3 4
4 9
5 3
8 2
9 2
10 2

These are the counts of the selected ks.

XGBoost

This takes about an hour to run.

(import xgboost)
(setv [cv-out diagnostic] (with [o (open "/home/hippo/Temporary/donkey-xgb.pkl" "rb")]
  (import pickle) (pickle.load o)))
; (try-model
;  :X (np.vstack (.to-numpy $trial))
;  :model (sklearn.model-selection.RandomizedSearchCV
;    :estimator (xgboost.XGBRegressor
;      :objective "binary:hinge"
;      :random-state 1337
;      :n-estimators 250)
;    :param-distributions (dict
;      :learning-rate [.01 .02 .05 .1 .2 .25 .5]
;      :gamma (lfor  x (seq -7 6)  (** 2 x))
;      :reg_lambda (lfor  x (seq -10 10)  (** 2 x))
;      :reg_alpha (lfor  x (seq -10 10)  (** 2 x))
;      :max_depth [3 6 9])
;    :n-iter 50
;    :random-state 123
;    :cv (sklearn.model-selection.KFold 10 :shuffle T :random-state 133)
;    :scoring "accuracy"
;    :iid T)))
(show-accuracies cv-out)
0.39 0.66 0.965

Conclusion

This form of the problem, too, is too hard. I'm barely improving upon the base rate. Either I have to find a way to get something out of response times (which I've ignored so far), or I have to move to hierarchical methods and hope that I can use what I learn from some subjects to help predict other subjects' choices. Response times sound unlikely to be very helpful, so let's try hierarchical methods.

Hierarchy

If only for computational reasons, let's try predicting one test trial at a time.

For the code in this section, tuning parameters have been hand-adjusted, so they may be overfit.
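
Before the Hy implementation, here's a rough Python sketch of the setup for a single test trial, assuming the `trials` DataFrame has columns `sn`, `trial`, and `chose_key1`: each subject's first 100 choices form a feature row, the choice on the test trial is the target, and cross-validation is over subjects rather than trials.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def hierarchical_accuracy(trials, test_trial=100, n_folds=10):
    # Predict each subject's choice on `test_trial` from their first 100
    # choices, fitting on the other subjects via cross-validation.
    wide = trials.pivot(index="sn", columns="trial", values="chose_key1")
    X = wide.loc[:, list(range(100))].to_numpy()  # subjects x first 100 choices
    y = wide[test_trial].to_numpy()               # each subject's later choice
    model = LogisticRegression(penalty="l1", solver="saga", C=0.25, max_iter=2000)
    pred = cross_val_predict(model, X, y, cv=n_folds)
    return np.mean(pred == y)

The real code below also tracks predicted probabilities (for MSE), optionally appends standardized reaction times to the features, and swaps in other estimators.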

(setv n-folds 10)

(defmacro try-model [&rest args]
  (setv args (dict (partition args)))
  (setv
    estimator-expr (.get args ':model)
    fit-fn (.get args ':fit)
    diagnostic-expr (.get args ':diagnostic)
    prob-for-pred (eval (.get args ':prob-for-pred 'False))
    use-rts (.get args ':rt 'False))
  `((fn []

    (setv train-trials (list (range 100)))
    (setv d-choices (.pivot trials
      :index "sn" :columns "trial" :values "chose-key1"))
    (setv X (.to-numpy (geti d-choices : train-trials)))
    (when ~use-rts
      ; Add the standardized reaction times to X.
      (setv rts (.to-numpy (geti
        (.pivot trials :index "sn" :columns "trial" :values "rt")
        : train-trials)))
      (setv X (np.hstack [X (/
        (- rts (np.mean rts :axis 0))
        (* 2 (np.std rts :axis 0)))])))
    ;(setv X (.fit-transform :X X (sklearn.preprocessing.PolynomialFeatures
    ;  :degree 2 :interaction-only T :include-bias F)))

    (setv out (pd.concat :axis 1 (lfor test-trial [100 101 150] (do

      (setv y (.to-numpy (geti d-choices : test-trial)))
      (setv y-prob (np.full y.shape NaN))
      (setv y-pred (np.zeros-like y))
      (setv diagnostic [])

      (for [[fold-i [train test]] (enumerate (.split :X X
          (sklearn.model-selection.KFold n-folds
            :shuffle T :random-state 1234)))]
        (print (.format "Test trial {}, fold {}" test-trial fold-i))
        (np.random.seed (+ 123 fold-i))
        (setv model ~(if fit-fn
          `(~fit-fn (geta X train) (geta y train) fold-i)
          `(.fit ~estimator-expr (geta X train) (geta y train))))
        ~(when diagnostic-expr
          `(.append diagnostic ~diagnostic-expr))
        (setv probs (geta (.predict-proba model X) : -1))
        (setv (geta y-prob test) (geta probs test))
        (setv (geta y-pred test) ~(if prob-for-pred
          '(>= (geta y-prob test) .5)
          '(.predict model (geta X test))))
        (print (.format "Training accuracy: {:.02}"
          (np.mean (= (> (geta probs train) .5) (geta y train))))))

      (pd.Series :name test-trial (dict
        :base-MSE (round (.mean (** (- y (np.mean y)) 2)) 3)
        :MSE (round (.mean (** (- y y-prob) 2)) 3)
        :base-rate (max (.mean y) (- 1 (.mean y)))
        :accuracy (.mean (= y y-pred))))))))

    (setv out.index.name "test_trial")
    out)))
(try-model
  :model (sklearn.linear-model.LogisticRegression
    :penalty "l1" :solver "saga" :max-iter 2,000
    :C (get {100 0.25  101 0.5  150 1e-6} test-trial))
  :diagnostic (+ (list model.intercept_) (list (.flatten model.coef_))))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.222 0.172 0.250
base_rate 0.550 0.500 0.550
accuracy 0.660 0.740 0.530
(try-model
  :rt True
  :model (sklearn.linear-model.LogisticRegression
    :penalty "l1" :solver "saga" :max-iter 2,000
    :C (get {100 0.25  101 0.75  150 1e-6} test-trial))
  :diagnostic (+ (list model.intercept_) (list (.flatten model.coef_))))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.221 0.185 0.250
base_rate 0.550 0.500 0.550
accuracy 0.660 0.730 0.530
(try-model
  :model (sklearn.neighbors.KNeighborsClassifier :n-neighbors 3))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.291 0.301 0.291
base_rate 0.550 0.500 0.550
accuracy 0.610 0.590 0.580
(try-model
  :rt True
  :model (sklearn.neighbors.KNeighborsClassifier :n-neighbors 3))
test_trial 100 101 150
base_MSE 0.248 0.25 0.248
MSE 0.301 0.27 0.306
base_rate 0.550 0.50 0.550
accuracy 0.570 0.60 0.550
(defmacro try-neural-net [&rest kwargs]
  (setv kwargs (dict (partition kwargs)))
  `(try-model
    :rt ~(.get kwargs ':rt)
    :fit (fn [X y fold-i]
      (import os) (setv (get os.environ "TF_CPP_MIN_LOG_LEVEL") "3")
      (import
        tensorflow
        [tensorflow.keras.models [Sequential]]
        [tensorflow.keras.layers [Dense Activation]]
        [tensorflow.keras [regularizers]])
      (setv model (Sequential [
        (Dense ~(get kwargs :hidden-nodes) :activation "tanh"
          :kernel-regularizer (regularizers.l1 ~(get kwargs :penalty)))
        (Dense 1 :activation "sigmoid")]))
      (.compile model
        :optimizer "rmsprop"
        :loss "binary_crossentropy")
      (tensorflow.random.set-seed (+ 456 fold-i))
      (.fit model X y
        :verbose False
        :epochs ~(get kwargs :epochs)
        :batch-size 20)
      model)
    :prob-for-pred True))
(try-neural-net
  :hidden-nodes 100 :epochs 100
  :penalty (get {100 .02  101 .01  150 1e6} test-trial))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.225 0.213 0.255
base_rate 0.550 0.500 0.550
accuracy 0.660 0.690 0.550
(try-neural-net
  :hidden-nodes 200 :epochs 100 :rt True
  :penalty (get {100 .02  101 .02  150 1e6} test-trial))
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.224 0.218 0.255
base_rate 0.550 0.500 0.550
accuracy 0.650 0.690 0.550
(defclass OneColPredictor []
  (defn __init__ [self &kwargs kw]
    (for [[k v] (.items kw)]
      (setattr self k v)))
  (defn predict-proba [self X]
    (np.reshape
      (np.where (geta X : self.column) self.p-true self.p-false)
      (, (get X.shape 0) 1))))

(try-model
  :fit (fn [X y fold-i]
    (setv column (max (range (get X.shape 1)) :key (λ
      (setv p (np.mean (= (geta X : it) y)))
      (max p (- 1 p)))))
    (OneColPredictor :column column
      :p-true (np.mean (get y (geta X : column)))
      :p-false (np.mean (get y (~ (geta X : column))))))
  :prob-for-pred True)
test_trial 100 101 150
base_MSE 0.248 0.250 0.248
MSE 0.231 0.187 0.274
base_rate 0.550 0.500 0.550
accuracy 0.680 0.770 0.520

Even in this, the easiest form of the problem, with or without reaction times, I'm not getting great accuracy. I also modified the above code to change the training set from the first 100 trials to trials 80 through 99, and that improved performance only a bit.

Now what?

The sample is not huge, especially when framing the problem hierarchically as predicting one test trial from all 100 training trials. However, considering the gap between the kinds of performance I'm getting (77% accuracy for trial 101; 55% for trial 150) and the kind I'm looking for (at least 95%), it's not realistic to think that a sample twice or even ten times this size will suffice. An enormous sample, like over 10,000 subjects, could change things, but that's not something I can readily afford.

I think I made a mistake by not explicitly asking subjects not to use a randomization device. There's no telling how many of them did that. However, I can reasonably assume it was a minority, and that keeping subjects from doing this wouldn't suffice for good accuracy.

Decision prefixes

Is it true that if two people make the same binary choice on a large number of contiguous trials, then they're highly likely to make the same choice on the trial immediately after?

(require [kodhy.macros [geti λ cbind]])
(import [kodhy.util [rd]])

(.apply
  (.groupby (pd.DataFrame retail-order-seqs) [0 1 2 3 4])
  (λ (cbind :n (len it)  :p (rd (.mean (geti it : 5))))))
i0 1 2 3 4 i5 n p
False False False False False 0 88 0.091
False False False False True 0 12 0.333
False False False True False 0 8 0.250
False False False True True 0 12 0.250
False False True False False 0 13 0.077
False False True False True 0 4 0.250
False False True True False 0 8 0.250
False False True True True 0 10 0.600
False True False False False 0 17 0.235
False True False False True 0 5 0.600
False True False True False 0 11 0.727
False True False True True 0 7 0.714
False True True False False 0 14 0.286
False True True False True 0 17 0.647
False True True True False 0 6 0.667
False True True True True 0 22 0.773
True False False False False 0 34 0.324
True False False False True 0 12 0.500
True False False True False 0 18 0.222
True False False True True 0 22 0.682
True False True False False 0 16 0.312
True False True False True 0 20 0.700
True False True True False 0 23 0.435
True False True True True 0 61 0.721
True True False False False 0 25 0.200
True True False False True 0 12 0.750
True True False True False 0 34 0.471
True True False True True 0 61 0.656
True True True False False 0 42 0.286
True True True False True 0 52 0.635
True True True True False 0 77 0.325
True True True True True 0 324 0.781

Maybe not. But maybe 5 choices isn't a long enough sequence, and my notion of a binary choice here is pretty artificial: it's whether an order in the University of California, Irvine's "Online Retail" dataset included 10 or more distinct items. Most problematic, I guess, is that prefixes might be predictive when every trial in the prefix, as well as the nth trial, poses the same decision problem, but not when different subjects are considering different problems: the problem characteristics influence choices too strongly for bare prefixes to predict the response with high accuracy.
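
For reference, `retail-order-seqs` isn't constructed in this notebook. Here's a Python sketch of how such sequences could be built; the column names are those of the UCI "Online Retail" dataset, but the file path, the grouping by customer, and the choice of each customer's first six orders are my assumptions, not necessarily what the original code did.

import pandas as pd

def retail_order_seqs(path="online_retail.csv", seq_len=6):
    # One length-6 boolean sequence per customer: was each of their first
    # six orders (invoices) "big", i.e. did it include >= 10 distinct items?
    d = pd.read_csv(path).dropna(subset=["CustomerID"])
    orders = (d.groupby(["CustomerID", "InvoiceNo"])
        .agg(date=("InvoiceDate", "min"), items=("StockCode", "nunique"))
        .reset_index())
    orders["big"] = orders["items"] >= 10
    seqs = []
    for _, g in orders.sort_values("date").groupby("CustomerID"):
        if len(g) >= seq_len:
            seqs.append(list(g["big"].iloc[:seq_len]))
    return seqs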

The moral

Probably, people's decisions really are noisy. Whether the noise is deterministic (via sensitive dependence on initial conditions) or truly random, people's choices can include variation that is effectively unpredictable from previous choices of the same kind. One can imagine a continuum of predictability, from decision problems with no features or constraints, as in this task, to problems where the subject is basically forced to make a certain choice. In between are the typical problems of psychology experiments and real life, where people have some amount of freedom to choose, but the features of the options, such as foreseeable consequences, tend to push them towards one option or another.

References

Arfer, K. B., & Luhmann, C. C. (2015). The predictive accuracy of intertemporal-choice models. British Journal of Mathematical and Statistical Psychology, 68(2), 326–341. doi:10.1111/bmsp.12049. Retrieved from http://arfer.net/projects/builder/paper

Hagelbarger, D. W. (1956). SEER, a SEquence Extrapolating Robot. IRE Transactions on Electronic Computers, EC-5(1), 1–7. doi:10.1109/TEC.1956.5219783

Shannon, C. E. (1953). A mind-reading (?) machine. Bell Laboratories.