Donkey notebook

Created 25 Sep 2018 • Last modified 18 Jun 2019

Task planning

I want a bare-bones trivial decision-making task where people have a repeated choice between two meaningless and maximally equivalent options (a little like Buridan's ass, which, curiously, never seems to have gotten a psychology experiment in humans). By studying the prediction of decisions in this task, I hope to learn more about how to predict variation in choices that arises from something other than content. Then perhaps I can e.g. break the ceiling of 85% accuracy I observed in Arfer and Luhmann (2015).

It seems like there are a lot of analytic approaches for predicting binary sequences. Let's get at least a little data to start with; for convenience, I'll run the task on Mechanical Turk.

I was thinking the task could use the mouse, and I could record mouse movements and use mouse movements before making previous choices as a rich predictor. The trouble is that you can't move the cursor in JavaScript, so I can't force the cursor to a neutral position after each choice. Even if I could, the physical mouse would still be a bit to the side, of course. Asking subjects to perfectly recenter the mouse, and not starting the next trial till they did, would probably drive subjects a little crazy.

Instead, let's use the keyboard, and ask subjects to position their fingers over both of the two allowed keys. "F" and "J" seem like good choices.

Suppose we get 200 choices. For each subject, the first 100 choices will be purely training data, while I'll evaluate predictions on the second 100. I may use online learning methods that effectively treat every trial the same, but I'll be most interested in performance after the method has already had 100 trials to train with. During the task, I'll show the subject how many choices are left to make, but I won't give feedback on which choices they made.
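
Schematically, the per-subject evaluation will look something like the following. (This is a rough Python sketch, not task code; `make_predictions` is just a placeholder for whatever model I end up using, and it gets no feedback about the choices it's trying to predict.)

import numpy as np

N = 200
n_train = N // 2
n_test = N - n_train

def evaluate(choices, make_predictions):
    # Train on the first half of a 0/1 choice sequence and score
    # predictions of the second half.
    choices = np.asarray(choices)
    preds = np.asarray(make_predictions(choices[:n_train], n_test))
    return (preds == choices[n_train:]).mean()

# A trivial baseline: always guess whichever choice was more common
# during training.
def majority(train, n):
    return np.repeat(train.mean() >= .5, n)

rng = np.random.default_rng(0)
print(evaluate(rng.integers(0, 2, N), majority))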

Some first subjects

Let's try some models with the first few real subjects. Their first 100 trials look like this (O means pressing "f", x means pressing "j"):

(setv N 200)
(setv n-train (// N 2))
(setv n-test (- N n-train))
(setv d (.copy trials))
(setv ($ d train) (< ($ d trial) n-train))
(lfor [_ x] (.groupby d "sn")
  [(pretty-choice-seq (ss x $train "chose-key1"))])
OOxOOxOxOOOOOOOxxxxOOOxxxOOOxOxxxxxxOOOOOOOOOOxxxxOxOxOxOOOOOOxxxOOxxOOOOOxxxOxOxOxOxOxOxOxOOxxxxxxx
OOOxOxOOOOOOOOxOxOOxOxOOxOOOxOOOOOxxOxOOOOOxxOOxxxxxxxxxxxxxxxxxxxxxxxxOxOxOOOOxOxOOOxxOOOxOOOOOOxxO
xOOOxxxxOOxxxxOOxOOOOOxOOOxxxOOOxxxxxxxOOOOOxxxxxxOOOOOOOOOOOOOOOOOOOxxxOOxxxOOOOOxxxxOOOOOxxxxOOOOO
(wcby d [$sn $train] (.mean $chose-key1))
sn train value
0 False 0.50
0 True 0.54
1 False 0.53
1 True 0.54
2 False 0.51
2 True 0.59

Since the base rates are near 50%, a trivial model that always predicts a subject's more common choice will do poorly.

Let's try a model that assumes that a given choice is a probabilistic function of the previous n trials. The trouble with this, of course, is that when predicting more than n trials into the test set, it has to use its own past predictions as predictors. When we didn't get to train on the n-tuple in question, we'll predict the mode.

;; For each n-tuple (n = lookbehind) of consecutive choices seen in
;; training, count how often the next choice was False or True.
(defn supermarkov-train [lookbehind train]
  (setv mode (>= (.mean train) .5))
  (setv train (tuple train))
  (setv memory {})
  (for [i (range lookbehind (len train))]
    (setv present (get train i))
    (setv past (cut train (- i lookbehind) i))
    (when (not-in past memory)
      (setv (get memory past) [0 0]))
    (+= (get memory past present) 1))
  {"lookbehind" lookbehind
    "mode" mode  "memory" memory  "train" train})

;; Predict test-n choices one at a time. Each prediction is fed back in
;; as history for the next, since no real choices are observed.
(defn supermarkov-predict [model test-n]
  (setv preds (,))
  (for [_ (range test-n)]
    (setv past (cut
      (+ (get model "train") preds)
      (- (get model "lookbehind"))))
    (+= preds (, (if (in past (get model "memory"))
      (do
        (setv [f t] (get model "memory" past))
        ; Predict whichever choice more often followed this n-tuple,
        ; falling back on the mode in case of a tie.
        (cond
          [(> t f) True]
          [(> f t) False]
          [True (get model "mode")]))
      ; An n-tuple never seen in training also gets the mode.
      (get model "mode")))))
  preds)
(wcby (ss d $train) $sn
  (pd.concat (lfor n [2 3 4 5]
    (cbind :n n
      :pred_sum (sum (supermarkov-predict
        (supermarkov-train n $chose-key1)
        n-test))))))
sn i1 n pred_sum
0 0 2 0
0 0 3 0
0 0 4 0
0 0 5 0
1 0 2 100
1 0 3 100
1 0 4 100
1 0 5 98
2 0 2 100
2 0 3 100
2 0 4 100
2 0 5 100

Hilariously, for various n, this model predicts False or True every time or nearly every time. So it behaves similarly to the trivial model.

Let's try regression models using the trial number and transformations thereof as predictors.

(setv ooo (wcby (ss d $train) $sn
  (setv i (np.array (list (range N))))
  (setv X (sklearn.preprocessing.scale (np.column-stack [
    i (** i 2) (** i 3)
    (% i 2) (% i 3) (% i 5)
    (np.sin i) (np.cos i)])))
  (setv model (sklearn.linear-model.LogisticRegressionCV
    :Cs 100 :cv 10 :penalty "l2" :scoring "accuracy"))
  (.fit model (geta X (: n-train)) $chose-key1)
  (pd.DataFrame (dict
    :trial (geta i (: n-train N))
    :pred (.predict model (geta X (: n-train N)))))))

At least sometimes, this model produces some nontrivial predictions, thanks to its modular and trigonometric terms:

(lfor [_ x] (.groupby ooo "sn")
  [(pretty-choice-seq ($ x pred))])
xOOOxOxOOOxOxOOOxOxOOOxOxOOOxOxxxOxOxxxOOOxxxOOOxxxOxOxxxOxOxxxxxOxxxxxOxxxxxOxxxxxOxxxxxxxxxxxxxxxx
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

How accurate is it for that one subject?

(.mean (=
  (getl ooo 0 "pred")
  (.reset-index :drop T (ss d (& (= $sn 0) (~ $train)) "chose-key1"))))
0.42

Worse than trivial; oops.

Regression on run length

A Stack Exchange answer suggests a model in which the probability of an option depends only on how many times in a row that option has been selected previously.

(defn rlm-encode [values]
  (setv accums [0 0])
  (setv hist (np.array (+ [[0 0]] (lfor
    x values
    :do (setv (get accums (not x)) 0)
    :do (+= (get accums x) 1)
    (.copy accums)))))
  (np.column-stack [
    hist
    (** hist 2)
    (> (geta hist : 0) 0)]))
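
To see what this produces, here's a rough Python restatement of rlm-encode (my own paraphrase of the code above) applied to a toy sequence. Each row is the state going into a trial: the current run lengths of False and of True choices, their squares, and a flag for whether the most recent choice was False.

import numpy as np

def rlm_encode(values):
    runs = [0, 0]      # [current run of False choices, current run of True]
    hist = [[0, 0]]    # the state before the first trial
    for x in values:
        runs[int(not x)] = 0   # choosing one option breaks the other's run
        runs[int(x)] += 1
        hist.append(list(runs))
    hist = np.array(hist)
    # Features: run lengths, squared run lengths, was the last choice False.
    return np.column_stack([hist, hist ** 2, hist[:, 0] > 0])

print(rlm_encode([True, True, False]))
# [[0 0 0 0 0]
#  [0 1 0 1 0]
#  [0 2 0 4 0]
#  [1 0 1 0 1]]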

(setv ooo (wcby (ss d $train) $sn
  (setv scaler (sklearn.preprocessing.StandardScaler :with-mean F))
  (setv model (sklearn.linear-model.LogisticRegressionCV
    :Cs 100 :cv 10 :penalty "l2" :scoring "accuracy"))
  (setv X (.fit-transform scaler (rlm-encode $chose-key1)))
  (.fit model (geta X (: 0 n-train)) $chose-key1)
  (setv preds (.tolist $chose-key1))
  (for [_ (range n-test)]
    (.append preds (first (.predict model
      (geta (.transform scaler (rlm-encode preds)) None -1 :)))))
  (pd.DataFrame (dict
    :trial (range n-train N)
    :pred (cut preds n-train)))))

(lfor [_ x] (.groupby ooo "sn")
  [(pretty-choice-seq ($ x pred))])
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
OOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOOOOxOOOOO
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
(.mean (=
  (getl ooo 1 "pred")
  (.reset-index :drop T (ss d (& (= $sn 1) (~ $train)) "chose-key1"))))
0.48

Again, not a success.

Switch to interpolation

When you think about it, it's not surprising that autoregressive-like models, that is, models where the output depends on the previous output, would do badly in this situation. I'm trying to predict trials 100 through 199, so if I mispredict trial 110, that will substantially hurt my ability to predict trial 111, and hence trial 112, and so on. One mistake is enough to throw me off the whole scent.
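
To make the error-propagation point concrete, here's a toy Python example (not anything from this analysis). Suppose a subject strictly alternates and the model has perfectly learned the rule "predict the opposite of the previous choice." Rolled out over 50 trials on its own predictions, a single wrong start makes every subsequent prediction wrong.

import numpy as np

# The subject strictly alternates between the two options.
choices = np.arange(100) % 2
truth = choices[50:]

def rollout(first_prediction, n):
    # Predict n trials ahead, feeding each prediction back in as the
    # "previous choice" for the next one.
    preds = [first_prediction]
    for _ in range(n - 1):
        preds.append(1 - preds[-1])
    return np.array(preds)

print((rollout(truth[0], 50) == truth).mean())      # 1.0: start right, stay right
print((rollout(1 - truth[0], 50) == truth).mean())  # 0.0: one mistake ruins the rest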

It's also not surprising that regression doesn't work well, considering the dangers of extrapolation. Each model, in training, sees only trial numbers 0 through 99. It gets no instructions or feedback to control its behavior when the input trial number is 100 or more, yet those are the only trial numbers it's tested on. So, for example, a tree-based model might use a rule like `trial > 97` just to refine its predictions for trials 98 and 99, but that rule will end up applying to all the testing data. I should only expect a hierarchical-style regression model, which is allowed to see all trial numbers for some subjects, to succeed on these later trial numbers.
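
Here's a similarly toy Python illustration of the extrapolation problem: a decision tree fit to trial numbers 0 through 99 can memorize its training data, but every later trial number falls into the same rightmost leaf, so the tree predicts one constant for the entire test half.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
trial = np.arange(200).reshape(-1, 1)
choices = rng.integers(0, 2, 200)   # stand-in for one subject's choices

tree = DecisionTreeClassifier(random_state=0)
tree.fit(trial[:100], choices[:100])

print(tree.score(trial[:100], choices[:100]))  # 1.0: the tree memorizes training
print(np.unique(tree.predict(trial[100:])))    # a single value for all test trials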

To make things easier, and also more like Builder, let's try traditional random cross-validation, within subjects. If I do well here, I can use my winning model to inspire the construction of hierarchical models for the original problem.

Data setup

(setv n-folds 10)

;; try-model fits a given model separately for each subject, with
;; 10-fold random cross-validation over that subject's trials (shuffled,
;; so training and test trials are interleaved). :X is an expression for
;; the design matrix, :model for the estimator, and :diagnostic
;; (optional) for a per-subject value to save from the fitted model.
(defmacro try-model [&rest args]
  (setv args (dict (partition args)))
  (setv X-expr (get args :X)
    estimator-expr (get args :model)
    diagnostic-expr (.get args ':diagnostic))
  `((fn []
    (setv diagnostic {})
    (setv cv-out (pd.concat (gfor [sn chunk] (.groupby trials "sn")   (wc chunk
      (setv X ~X-expr)
      (setv estimator ~estimator-expr)
      (setv pred (np.empty-like $chose-key1))
      (for [[train test] (.split :X X (sklearn.model-selection.KFold n-folds
          :shuffle T :random-state (+ 100 sn)))]
        (print sn test)
        (setv model (.fit estimator (geta X train)
          (geta (.to-numpy $chose-key1) train)))
        ~(when diagnostic-expr
          `(setv (get diagnostic sn) ~diagnostic-expr))
        (setv (geta pred test) (.predict model (geta X test))))
      (pd.DataFrame (dict
        :sn sn
        :trial $trial
        :chose-key1 $chose-key1
        :pred pred))))))
    [cv-out diagnostic])))

;; Summarize per-subject accuracies as [minimum mean maximum].
(defn show-accuracies [cv-out]
  (setv accuracies (wcby cv-out $sn
    (.mean (= $chose_key1 $pred))))
  (rd [(.min accuracies) (.mean accuracies) (.max accuracies)]))

Base rates

(setv [cv-out _] (try-model
  :X (np.empty [(len $trial) 1])
  :model (sklearn.dummy.DummyClassifier "most_frequent")))
(show-accuracies cv-out)
0.395 0.567 0.95

Logistic regression

(setv [cv-out diagnostic] (try-model 
  :X (sklearn.preprocessing.scale (.astype (np.column-stack [
    $trial (** $trial 2) (** $trial 3)]) np.float64))
  :model (sklearn.linear-model.LogisticRegression
    :C np.inf :solver "lbfgs")
  :diagnostic (np.append model.intercept_ model.coef_)))
(show-accuracies cv-out)
0.37 0.572 0.95
(lfor [_ x] (.groupby (ss cv-out (.isin $sn [0 1 2])) "sn")
  [(pretty-choice-seq ($ x pred))])
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxOOxxxxxxxxOxxxxxxxxxxxxxxxxxxOxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxOxOxxxxxxxOOxxxxOOOOOxOOOOOOxOOOOOOOOOxOxOxOOOOOO
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxOOxOxOxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxOxxOxxxOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxxxxxxxxxxxxxxxxx
OxxxOxOOxxOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxOOxOOOxOOOOOxxOxOxOOx

The long runs of Trues and Falses in the predictions suggest an underfit model. A mere cubic can't capture lots of switching: within any one fold's fitted model, the log-odds is a cubic in the trial number, so the predicted choice can flip at most three times across the whole range of trials.

Regularized, with more terms

(setv [cv-out diagnostic] (try-model 
  :X (sklearn.preprocessing.scale (.astype :dtype np.float64 (np.column-stack [
    $trial (** $trial 2) (** $trial 3)
    (% $trial 2) (% $trial 3)])))
  :model (sklearn.linear-model.LogisticRegressionCV
    :Cs 10 :penalty "l2"
    :cv (sklearn.model-selection.KFold 10 :random-state 133)
    :scoring "accuracy")
  :diagnostic model.C_))
(show-accuracies cv-out)
0.42 0.61 0.95

Nearest neighbors

(setv [cv-out diagnostic] (try-model
  :X (np.vstack (.to-numpy $trial))
  :model (sklearn.model-selection.GridSearchCV
    :estimator (sklearn.neighbors.KNeighborsClassifier)
    :param-grid {"n_neighbors" (seq 1 10)}
    :cv (sklearn.model-selection.KFold 10 :shuffle T :random-state 133)
    :scoring "accuracy")
  :diagnostic model.best_params_))
(show-accuracies cv-out)
0.36 0.653 0.97
(lfor [_ x] (.groupby (ss cv-out (.isin $sn [0 1 2])) "sn")
  [(pretty-choice-seq ($ x pred))])
OOOOxOxOOOOOOOOOxOOxOOOOxxOxOxxxxxxOxOOOOOOOOOxOxxxOxOxOOOOOOOOOxxxOOxOOOOOxxxOxOxxxOxOxOOOOxxxxxxxxxxxxxxxxxxOOOOOxOOOOOOOOOxOxxxxOxOxOOOOOxxxxOOOOxOxxxxxxxOxOOOOOOOOOxxxxxxxxxxxxxxxxOOxOOOOOOOOOOxOx
OOOOxOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOxOOOOOOOOxxxxxxxxxxxxxxxxxxxxxxxxxxxxxOOOOOOxOOOOOOOOOOOOOOOOOOOOOOOOOxxxxxOxxxOxxxxxxxOOOOOOOOOOxOOOOOOOOOOOOOOOxOxxxxxxxxxOOOOOxxxxxxxxxxxxxxOOOxOOOOOOOOxOOOxxxxx
OxOOOxxxxOOxxxxOOxOOOOOxOOOxxxOOOxxxxxxOOOOxOxxxxxxOOOOOOOOOOOOOOOOOxOxxxxOxxOOOOOOxxxxOOOOOxxxxOOOOOxxxxxOOOxxxOOOxxxxxxxxxxxOOOOxOOxOOxOOOxxxxxOOOxOOxOOOOOxxxxOOOOOOOOOOOxxxxOOxOOxOOxOOxOOxxOOxOxxxx

It no longer looks like we're underfit, but performance is still not much better than that of the trivial model.

(valcounts (gfor  d (.values diagnostic)  (get d "n_neighbors")))
I value
1 5
2 3
3 4
4 9
5 3
8 2
9 2
10 2

These are the counts, across subjects, of the selected values of k (n_neighbors).

References

Arfer, K. B., & Luhmann, C. C. (2015). The predictive accuracy of intertemporal-choice models. British Journal of Mathematical and Statistical Psychology, 68(2), 326–341. doi:10.1111/bmsp.12049. Retrieved from http://arfer.net/projects/builder/paper