Artiruno notebook

Created 5 Jul 2021 • Last modified 7 Jan 2024

Notes on decision aids

I'm particularly interested in software and other means for making decisions in situations that aren't already fully quantitatively specified.

Multiattribute utility theory (MAUT)

Generally, multiattribute utility methods compute a utility for each item as the weighted sum of attribute utilities.

Simple multiattribute rating theory (SMART)

Edwards (1977) - Each item's utility is a weighted sum of the attributes. The weights are chosen by making the least important attribute 10 and choosing larger integers for the second least important, the third least important, etc. Preference functions are made for attributes by assuming linear preference from the lowest plausible value to the highest plausible value (when there aren't real units to use, the attribute is estimated on an abstract 0-to-100 scale).
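A minimal sketch of the weighted-sum idea with linear preference functions, under stated assumptions: the laptop attributes, weights, and plausible ranges below are made up for illustration, and the function names are hypothetical rather than taken from Edwards (1977).

```python
def linear_utility(value, worst, best):
    """Map a value onto 0..1, linearly from the worst plausible value to the best.
    Works whether higher or lower raw values are preferred."""
    return (value - worst) / (best - worst)

def smart_utility(item, weights, ranges):
    """Weighted sum of per-attribute utilities, with weights normalized to sum to 1."""
    total = sum(weights.values())
    return sum(
        (weights[a] / total) * linear_utility(item[a], *ranges[a])
        for a in weights)

# Hypothetical example: the least important attribute gets weight 10.
weights = {"price": 10, "battery_hours": 30}
# (worst plausible, best plausible) for each attribute; lower price is better.
ranges = {"price": (2000, 500), "battery_hours": (4, 12)}
laptops = {
    "A": {"price": 1250, "battery_hours": 10},
    "B": {"price": 800, "battery_hours": 6}}
scores = {name: smart_utility(item, weights, ranges)
          for name, item in laptops.items()}
```

Here laptop A scores higher because battery life carries three times the weight of price.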

Multi-attribute range evaluation (MARE)

Hodgett, Martin, Montague, and Talford (2014) - A variation on a simple weighted-sum method. The weighted-sum method normalizes attributes (by dividing each by its maximum), multiplies by attribute weights, and sums them to get a per-item score. In MARE, you specify up to three values per attribute (the minimum possible value, the most likely value, and the maximum possible value), and then compute the range of possible weighted sums for each item.
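The range computation might look roughly like the sketch below. It assumes that higher values are better on every attribute and that each attribute is normalized by the largest upper value any item has; the paper's exact procedure may differ, and the item data are invented for illustration.

```python
def mare_scores(items, weights):
    """items: {name: {attribute: (minimum, most_likely, maximum)}}.
    Returns {name: (lowest, most_likely, highest)} weighted sums.
    Assumes higher attribute values are better."""
    # Normalize each attribute by the largest value any item could take on it.
    maxima = {a: max(item[a][2] for item in items.values()) for a in weights}
    return {
        name: tuple(
            sum(weights[a] * item[a][i] / maxima[a] for a in weights)
            for i in range(3))
        for name, item in items.items()}

# Hypothetical example with two attributes of equal weight.
weights = {"yield": 0.5, "purity": 0.5}
items = {
    "X": {"yield": (60, 80, 100), "purity": (8, 9, 10)},
    "Y": {"yield": (70, 75, 80), "purity": (5, 9, 10)}}
mare = mare_scores(items, weights)
```

In this example the ranges overlap (X spans about 0.7 to 1.0, Y about 0.6 to 0.9), so neither item is clearly better without narrowing the estimates.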

Swing weighting

A way to decide on attribute weighting for a multi-attribute utility method. You find the best and worst values of each of the K attributes among the available items, and then construct K + 1 hypothetical items: one with the worst value on each attribute, and one for each attribute k that has the best value on k and the worst value on all other attributes. Then you rank these items. Then you rate each item 0 to 100. These ratings, normalized, are used as the weights of the corresponding attributes.
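A small sketch of the construction and the normalization step. The attribute names and ratings are hypothetical, and it assumes the all-worst item is rated 0 and so contributes no weight.

```python
def swing_items(best, worst):
    """Build the K + 1 hypothetical items: the all-worst item, plus one item
    per attribute that is best on that attribute and worst on all others."""
    items = {"all_worst": dict(worst)}
    for k in best:
        items[k] = {**worst, k: best[k]}
    return items

def swing_weights(ratings):
    """Normalize the 0-to-100 ratings of the single-swing items into weights."""
    total = sum(ratings.values())
    return {k: r / total for k, r in ratings.items()}

# Hypothetical example with K = 3 attributes.
best = {"price": 500, "battery": 12, "weight": 1.0}
worst = {"price": 2000, "battery": 4, "weight": 3.0}
hypothetical = swing_items(best, worst)

# Ratings the decision-maker might give to each single-swing item.
ratings = {"price": 100, "battery": 60, "weight": 40}
swing = swing_weights(ratings)
```

With these ratings, the weights come out to 0.5, 0.3, and 0.2.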

Analytic Hierarchy Process (AHP)

The decision-maker compares each pair of alternatives on each attribute on a 1-to-9 scale. To weight the attributes, they compare each pair of attributes on a 1-to-9 scale. Then, they can compute the priority for each item.

It's called the AHP because attributes can be optionally combined into categories of attributes, which can themselves be combined into higher-order categories, and so on.
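As a rough illustration of how priorities can be computed from the pairwise judgments, here is a sketch that uses normalized row geometric means. This is a common approximation, swapped in for simplicity; the AHP is usually described in terms of the principal eigenvector of the comparison matrix. The matrices below are invented, and there is no hierarchy of categories in this example.

```python
import math

def ahp_priorities(matrix):
    """Approximate priorities from a reciprocal pairwise-comparison matrix
    (matrix[i][j] is how strongly i is preferred to j, and
    matrix[j][i] == 1 / matrix[i][j]) via normalized row geometric means."""
    gms = [math.prod(row) ** (1 / len(row)) for row in matrix]
    total = sum(gms)
    return [g / total for g in gms]

def ahp_overall(attribute_matrix, alternative_matrices):
    """Combine per-attribute priorities of the alternatives, weighted by the
    attribute priorities. alternative_matrices has one matrix per attribute."""
    w = ahp_priorities(attribute_matrix)
    local = [ahp_priorities(m) for m in alternative_matrices]
    n_alts = len(alternative_matrices[0])
    return [sum(w[c] * local[c][a] for c in range(len(w)))
            for a in range(n_alts)]

# Hypothetical example: attribute 0 is moderately more important (3) than
# attribute 1. Alternative 1 is extremely preferred (9) on attribute 0, and
# the two alternatives are equal on attribute 1.
attribute_matrix = [[1, 3], [1 / 3, 1]]
alternative_matrices = [
    [[1, 1 / 9], [9, 1]],
    [[1, 1], [1, 1]]]
overall = ahp_overall(attribute_matrix, alternative_matrices)
```

For 2 × 2 matrices the geometric-mean and eigenvector methods agree; they can differ slightly for larger, inconsistent matrices.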

Outranking methods

PROMETHEE

The criteria on which all the options are evaluated are either ordinal (like a rating scale) or continuous (like a price in dollars).

For each continuous criterion, the decision-maker chooses a maximum difference on the attribute about which he would be indifferent (e.g., $10 of price) and a minimum difference that saturates importance on that criterion (e.g., $1,000 of price). Between these points, the degree of preference on the criterion increases linearly.

For ordinal criteria, one traditionally uses the 0–1 "usual" preference function, in which any difference at all is regarded as saturating for that criterion (so a difference of 2 steps is no more important than a difference of 1 step).

Each criterion gets a weight. Visual PROMETHEE encourages you to see how decisions change as you change the weights, rather than using a single fixed set of weights.

The PROMETHEE I method produces two metrics for each option O, one that averages the preference for O over all others, and another that averages the preference for all others over O. PROMETHEE II combines these by taking the difference.
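The pieces above can be sketched as follows. This is an illustration under assumptions, not Visual PROMETHEE's implementation: every criterion is coded so that higher is better (hence "savings" rather than price), the weights sum to 1, and the options and thresholds are made up.

```python
def linear_preference(d, indifference, saturation):
    """Preference degree for a difference d: 0 up to the indifference
    threshold, 1 from the saturation threshold on, linear in between."""
    if d <= indifference:
        return 0.0
    if d >= saturation:
        return 1.0
    return (d - indifference) / (saturation - indifference)

def usual_preference(d):
    """The 0-1 "usual" preference function: any positive difference saturates."""
    return 1.0 if d > 0 else 0.0

def promethee_flows(options, criteria):
    """options: {name: {criterion: value}}.
    criteria: {criterion: (weight, preference_function)}.
    Returns {name: (positive_flow, negative_flow, net_flow)}."""
    def pi(a, b):
        # Weighted preference for option a over option b.
        return sum(w * f(options[a][c] - options[b][c])
                   for c, (w, f) in criteria.items())
    n = len(options)
    flows = {}
    for a in options:
        others = [b for b in options if b != a]
        plus = sum(pi(a, b) for b in others) / (n - 1)
        minus = sum(pi(b, a) for b in others) / (n - 1)
        flows[a] = (plus, minus, plus - minus)
    return flows

# Hypothetical example: one continuous and one ordinal criterion.
criteria = {
    "savings": (0.5, lambda d: linear_preference(d, 10, 1000)),
    "rating": (0.5, usual_preference)}
options = {
    "A": {"savings": 600, "rating": 3},
    "B": {"savings": 100, "rating": 4},
    "C": {"savings": 95, "rating": 3}}
flows = promethee_flows(options, criteria)
```

The first two elements of each tuple are the PROMETHEE I metrics, and the third is the PROMETHEE II net flow. In this example B ranks first on net flow: its $5 savings advantage over C falls under the indifference threshold, but its one-step rating advantage counts fully.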

Verbal decision analysis (VDA)

I've yet to find any VDA paper that specifies an algorithm for which comparisons to ask the decision-maker to make.

ZAPROS

ZAPROS III (Larichev, 2001): All attributes are ordinal. The decision-maker is asked questions like "Would you rather change from an item that's best on attribute A to an item with the second-best value on A, or from an item best on B to an item second-best on B?". Then the decision-maker is confronted with any inconsistent judgments they've made and asked to correct them. This done, a partial order can be constructed on all possible items. No weights or other numeric judgments ever need to be provided.

ZAPROS-LM (Moshkovich, Mechitov, & Olson, 2016) is a variation that, I think, asks fewer questions.

Even with ZAPROS III-i (Tamanini & Pinheiro, 2011), it looks like the full preference scale is constructed, even when that isn't necessary to choose the best alternatives.

UniComBOS

Ashikhmin and Furems (2005)

Continue a procedure until a best alternative according to U-dominance is selected.

Start with pairwise comparisons of one-criterion units. The subject can choose: A is better, B is better, they're equally good, or "don't know". If that suffices, stop; otherwise, continue to two-criterion units. Also stop if the decision-maker is inconsistent in their choices between several presentations of the same comparison.

There's another kind of inconsistency analysis, too (section 5), based on transitivity.

Other VDA software

Available
- Moshkovich and Mechitov (2018) - https://verbaldecisionanalysis.wordpress.com/ - ORCON_Z (ZAPROS-LM) - still around
Unavailable
- Shevchenko, Ustinovichius, and Walasek (2019) - CLARA - no URL provided; the corresponding author didn't reply to email I sent 12 Aug 2020
- Barbosa, Pinheiro, Silveira, and Filho (2019) - ORCLASSWEB - http://www2.unifor.br/OrclassWeb - dead link
- A 2018 conference paper - Aranau - no URL provided; corresponding author didn't reply to email I sent on 28 Jun 2020
- Ashikhmin and Furems (2005) - IDSS UniComBOS - http://iva.isa.ru/DSS - dead link

Possible extensions to Artiruno core

Web interface
- A button to prematurely abort
- Allow setting allowed_pairs_callback
- Show a graph of preferences
For either interactive mode
- Transcripts of choices and inferences
- Saving of choices partway through the procedure, and later replaying them
- Proper error messages instead of assertion failures
- More validation of the interactive input
A nicer interactive interface
Allow the agent to reply "don't know", a la UniComBOS. (UniComBOS takes this to mean that a preference for the given situation should never be inferrable from other preferences, but I think it makes more sense to understand it as totally uninformative; i.e., "skip this question and do what you can with everything else or other questions".)
More sophisticated alt levels
- Allow an alternative to have a missing value (None) on a criterion
- Allow an alt to have multiple values for a criterion
- Allow an alt to have a fuzzy set of values for a criterion
Print explanations of the final ranking or choice, a la ZAPROS
Consistency checks
- Check whether the agent's preferences are consistent using an all-worst-attributes reference instead of an all-best-attributes reference, a la ZAPROS
- Check transitively deduced preferences multiple times, a la ZAPROS
Support group decision-making

Basic idea for an Internet study

Collect basic demographic data (after the main part of the study)

Ask subjects about a weighty decision they have to make soon which they're not already sure about and whose outcome they'll be able to assess soon (let's say, 1 month from now).

Have them briefly describe the decision to be made and the options in prose.

Conditions:

Artiruno condition: Then they construct the options and do VDA.
Comparison: Nothing else. If this study works, then maybe there could be a follow-up to investigate the effect by comparing the Artiruno condition to a condition where subjects construct options in the same style, but don't actually use VDA.

Follow up a month later.

Remind them of what they wrote about the decision problem (but not their options or Artiruno's suggestion).
Have them briefly describe in prose what decision they made, what the outcome was, and how happy they are with the outcome.
Have them rate the outcome on numeric scales.
Have them rate the decision-making process on numeric scales.

Pilot

In the pilot, just try having people set up the decision problem, so you can see that people are making reasonable choices of criteria, levels, and decision problems.

Piloting on MTurk was a failure, apparently because of the poor English skills of the subjects I got. Let's try some pre-screened subjects from Reddit instead.

I ran four screened subjects from Reddit and am reasonably happy with the results. The changes to the instructions that these findings suggest are:

Mention that you should only list alternatives if they're options you really have, not options you would ideally have.
Provide an example of a yes-or-no decision.

Plans for the real study

To start with, aim for 40 subjects, 20 per condition, who return; this means you should probably try to recruit 80 subjects and offer more money for session 2 than session 1. You probably won't be able to get this many from /r/samplesize, but you can check other places where studies are posted, or maybe even hire a service that connects social scientists to respondents.

Set expiration times to a week and a day after the invitation.

Session 1 (internally, visits 1 and 2)
- Warn about performance. ("can be particularly slow on phones and tablets")
- Allow for session 2 in the consent form. Be clear that the subject is expected to be in both sessions, but the second will be quite short.
- [Offer how much money?]
- Ask for a decision description and an expected decision resolution date, as in the pilot.
- Randomly assign the subject to the VDA condition or the control condition. (If the subject number, among subjects who've gotten this far in this version of the study, is even, counting the first as 1, then use the opposite condition of the previous subject.) Write the study state to disk here, so the subject can't refresh and enter different values depending on his assigned condition.
- If the subject is in the control condition, skip to the end. Otherwise, continue.
- Solicit criteria and alternatives, as in the pilot.
- Conduct VDA.
  - Display a reminder of the subject's chosen criteria and alternatives on this page. Explain that the subject doesn't have to make the choice suggested by Artiruno, but it could be a good idea.
  - Show results as text. (The graph would probably not be very helpful in this situation, with find_best = 1 and non-quantitatively minded users.)
  - Allow the subject to restart the procedure, or to return to the previous page and edit criteria and alternatives; trying to record everything is probably futile by the nature of JavaScript.
  - Don't provide an abort button. (It wouldn't be useful because in the case of find_best = 1 and not showing a graph, Artiruno can provide no useful information before it's done.)
  - Record the questions asked, the subject's choices, the subject's response times, and Artiruno's conclusion. If the subject restarted VDA, use the values from the final round.
- Ask for any comments.
Session 2 (internally, visit 3)
- Send an invitation 1 month after the subject completed session 1. Say that session 2 assumes the subject made the choice and got to see at least a little of the outcome; if that won't be true for a while, the subject should reply saying when would be a good time to do session 2.
- [Offer how much money?]
- Re-display the decision description and expected resolution date.
- Ask if they've made the choice and gotten to see some of the outcome. (In theory, this should always be "yes", because of the instructions earlier. It's a sanity check.)
- Have the subject briefly describe in prose:
  - which choice they made
  - what the outcome was
  - how happy they are with the choice and outcome
- Use 1-to-5 rating scales for the below.
- How pleased are they with the outcome of their choice?
- How well-chosen was their choice, given what they knew at the time they made it?
- How difficult did it feel to make the decision?
- In the VDA condition:
  - Show the criteria, alternatives, and results text from VDA.
  - How consistent did they feel their choice was with Artiruno's suggestion?
  - How difficult was the procedure (including writing up the criteria and alternatives) to do?
  - How helpful did the procedure feel for making the decision?
- Debrief
- Another comments box
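
The condition-assignment rule described under session 1 can be sketched as below. The note only specifies what happens for even-numbered subjects, so drawing the condition uniformly at random for odd-numbered subjects is an assumption here, and the function name is hypothetical.

```python
import random

CONDITIONS = ("vda", "control")

def assign_condition(previous_conditions, rng=random):
    """previous_conditions: the conditions of earlier subjects who reached
    this point in this version of the study, in order of arrival."""
    n = len(previous_conditions) + 1  # This subject's number, counting from 1.
    if n % 2 == 0:
        # Even-numbered subject: take the opposite of the previous subject.
        prev = previous_conditions[-1]
        return CONDITIONS[1 - CONDITIONS.index(prev)]
    # Odd-numbered subject: assumed to be a uniform random draw.
    return rng.choice(CONDITIONS)

# Simulate ten subjects arriving in sequence.
rng = random.Random(0)
assigned = []
for _ in range(10):
    assigned.append(assign_condition(assigned, rng))
```

Each consecutive pair of subjects then contains one subject per condition, which keeps the condition sizes within one of each other among subjects who reach this point.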

More piloting

On Prolific.

TV 7

A lot of people who seem to otherwise mostly take well to the task are messing up basic things in the VDA problem setup, particularly, putting criterion levels in the reverse order:

(.sum (getl (get ratings "pilot") (sorted (&
  (set (ssi subjects (<= $tv 7)))
  (set (. (get ratings "pilot") index))))))

I	value
reversed_criterion	10
opaque	5
desc_mismatch	2

I should probably take some time to add more checks (e.g., an extra screen where subjects have to confirm that they put criterion levels in the right order) and rerun from scratch rather than have this much compromise of the VDA sample.

TV 8

I've made a lot of changes, including putting the alts input before the criteria input, checking for unchanged placeholder alt or criteria names, checking for dominance and criterion-level order, and adding a puzzle at the start of the scenario phase, so that dropout that would otherwise happen only after assignment to the VDA condition occurs earlier. (I removed display of hypothetical best and worst items because I figured it would be overkill now.) Still, of the 4 new subjects assigned to the VDA condition who completed the task (120, 121, 127, 132), 2 got the levels backwards for at least one criterion, and one of those, subject 132, seemed otherwise quite thoughtful. Ridiculous.

TVs 9, 10

Let's try labeling criteria levels with "(best)" and "(worst)" while subjects are writing them. (TV 10 adds these labels to the example problems, too.) I've removed the manual level-order checks. Since level order is officially A Problem, let's try VDA mode only to see if things are okay before returning to random assignment. I've also added a question to the demographics questionnaire about education.

The 8 subjects who I've gotten to finish VDA ((ss subjects (.isin $tv [9 10]))) all gave reasonable-looking level orders, so I think things are good now.

Screening and scenario

In TV 11, I restored the random assignment of conditions, and I slightly increased the pay and the time estimate for the scenario phase. In TV 12, I fixed a typo.

I wanted to run subjects until I had at least 40 in each condition who completed the scenario phase, so I ceased recruiting new subjects once that criterion was satisfied after my last batch. I picked this number because I wanted at least 20 subjects to have completed all phases, and I estimated a 50% follow-up rate.

(comments-dt (ssi subjects (.isin $tv [11 12])))

sn	visit	date	comments
158	0	2023-02-23	My partner and I are currently making this decision.
158	3	2023-04-05	Thank you for the study. I hope you retrieve some useless results!
164	0	2023-02-25	Hope i am qualified. Thank you
172	2	2023-02-25	I think it's an excellent decision support system!
172	3	2023-04-05	Impressive work!
177	3	2023-04-05	I just wanted to say that although I did not follow the suggestion offered to me, I did find the process of doing this really helpful. It allowed me to give good consideration to the outcomes and possibilities in front of me at the time.
184	0	2023-02-28	N/A
190	3	2023-04-13	It is a good decision making system.
199	3	2023-04-07	Very helpful. Thankyou.
202	3	2023-04-06	interesting task!
212	0	2023-03-02	thanks
217	0	2023-03-02	Thank you for inviting me to take part in this study, it sounds very interesting.
217	2	2023-03-02	Looks like I'm buying a nice new Porsche!!
217	3	2023-04-07	Thank you for inviting me to take part, and wish me luck with my new silly chariot!
220	0	2023-03-04	submit responses
222	3	2023-04-25	This was an interesting study, thank you for inviting me.
227	2	2023-03-04	An interesting study and tool, thank you!
231	2	2023-03-05	Hi • I really enjoyed the study, specially when it started to altering between my choices. I also never wrote down a pros and cons of my decision which this task made me do it and I actually took a picture of that and shared it with my partner, so thank you very much for that. • Good luck with your research. I liked the web page design too, they were quite smart and yet very simple. • All the best.
233	3	2023-04-10	thank you
243	2	2023-03-05	Thanks
243	3	2023-05-01	Thank you for letting me take part.
245	0	2023-03-05	Thank you
245	2	2023-03-05	Thank you
245	3	2023-05-09	Interesting study. I especially appreciate the explanation on the debriefing page. Thank you
254	0	2023-03-05	Submit Responses
261	0	2023-03-06	Thank you for the opportunity to participate! There were lots of great choices.
273	3	2023-04-12	great study!

(wc
  (ss subjects
    (& (.isin $tv [11 12]) (pd.notnull $n_puzzle_attempts)))
  (cbind
    $began
    :res_len (. (- $expected_resolution_date $began) dt days)
    :puz_t (.round $time_puzzle_minutes 1)
    :puz_n $n_puzzle_attempts
    $cond
    :v2 (.round $time_visit2_minutes 1)
    :cy $country
    :edu $education_years))

sn	began	res_len	puz_t	puz_n	cond	v2	cy	edu
149	2023-02-23	98	1.2	1	control	0.2	us	18
150	2023-02-23	57	2.3	1	vda	8.0	us	14
154	2023-02-23	20	11.1	4	vda
155	2023-02-23	312	2.5	4	control	0.2	gb	14
156	2023-02-23	1	3.0	2	control	0.4	gb	25
158	2023-02-23	97	7.7	8	control	0.3	gb	14
159	2023-02-23	334	6.1	1	vda	18.3	gb	15
160	2023-02-25	-7	2.3	1	vda	33.9	gb	17
161	2023-02-25		1.0	1	control
162	2023-02-25	34	3.0	1	vda	9.5	gb	12
163	2023-02-25	278	1.8	1	control	0.3	gb	20
165	2023-02-25	69	1.5	1	vda
166	2023-02-25	-13	1.2	1	control	0.3	gb	12
167	2023-02-25	35	1.9	1	vda	9.1	gb	18
168	2023-02-25	29	3.8	4	vda
169	2023-02-25	33	2.6	1	control	0.4	gb	13
171	2023-02-25	35	2.5	1	vda	13.6	gb	17
172	2023-02-25	19	4.1	1	vda	34.4	gb	18
173	2023-02-25	705	1.7	1	control	0.6	gb	16
174	2023-02-25	11	3.9	3	control	0.3	gb	14
175	2023-02-25	34	10.8	1	control	0.4	gb	13
177	2023-02-25	16	3.2	1	vda	9.3	gb	21
178	2023-02-25	10	7.1	12	control	0.6	gb	8
179	2023-02-25	14	15.4	12	vda	11.3	gb	10
180	2023-02-28		2.3	3	vda	4.1	us	16
181	2023-02-28	62	2.5	1	control	0.3	us	17
182	2023-02-28	31	2.4	1	control	0.8	us	18
186	2023-02-28	31	3.0	1	control	0.6	gb	15
188	2023-02-28	24	2.0	1	vda	3.3	us	17
189	2023-02-28	31	2.9	3	control	1.3	us	16
190	2023-02-28	92	2.6	2	vda	8.5	gb	17
192	2023-02-28	18	3.7	1	vda	11.4	gb	17
193	2023-02-28	27	4.2	1	control	0.5	gb	18
194	2023-02-28	132	1.6	1	control	0.4	us	18
195	2023-02-28	20	2.1	1	control	0.5	us	13
196	2023-02-28	15	1.2	1	vda	19.4	gb	14
197	2023-02-28	31	4.4	1	control	0.4	gb	16
198	2023-02-28	16	6.6	1	vda	7.1	gb	15
199	2023-02-28	17	2.0	1	vda	16.4	gb	11
200	2023-03-02	29	4.7	4	control	0.2	us	16
201	2023-03-02	13	3.9	1	vda	4.3	gb	16
202	2023-03-02	29	3.3	1	control	0.4	gb	17
204	2023-03-02	29	2.5	1	control	0.3	gb	14
205	2023-03-02	20	3.9	3	control	0.3	us	11
206	2023-03-02	29	1.8	1	control	0.5	gb	18
207	2023-03-02	29	5.2	1	control	0.5	gb	17
208	2023-03-02	5	2.7	1	vda	17.1	us	15
209	2023-03-02	29	8.1	1	control	0.5	us	16
210	2023-03-02	13	9.4	4	vda	21.1	us	14
211	2023-03-02	23	6.0	1	vda	21.1	us	14
215	2023-03-02	105	11.9	10	vda	6.8	pl	15
216	2023-03-02	29	1.8	1	control	0.4	gb	18
217	2023-03-02	3	2.6	1	vda	17.5	gb	17
218	2023-03-02	31	4.6	1	vda
219	2023-03-02	20	2.8	1	vda	20.2	za	17
220	2023-03-04	90	6.7	1	vda
221	2023-03-04	3	1.6	1	control	0.3	gb	18
222	2023-03-04	2	4.1	4	control	1.5	gb	16
223	2023-03-04	27	2.7	1	control	0.3	us	14
224	2023-03-04	28	5.0	6	control	0.2	us	16
225	2023-03-04	27	1.3	1	control	0.3	gb	16
226	2023-03-04	37	5.4	1	vda	5.4	gb	9
227	2023-03-04	15	2.6	1	vda	11.3	gb	16
228	2023-03-04	34	1.7	2	vda	13.1	gb	21
230	2023-03-04	22	2.6	1	control	0.4	gb	20
231	2023-03-04	72	5.1	1	vda	25.9	gb	20
232	2023-03-04	7	6.2	3	control	0.4	gb	18
233	2023-03-04	21	4.2	1	control	0.5	gb	12
235	2023-03-04	24	10.8	3	vda
236	2023-03-04	28	2.2	1	control	0.5	gb	15
237	2023-03-04	6	1.2	1	vda	3.5	gb	14
238	2023-03-04	23	6.8	1	vda
239	2023-03-04	303	16.0	2	vda	22.2	gb	12
241	2023-03-05	40	5.5	1	control	0.4	gb	13
242	2023-03-05	118	2.0	1	control	0.3	gb	16
243	2023-03-05	25	3.9	2	control	0.3	gb	15
244	2023-03-05	15	5.2	1	control	0.4	gb	12
245	2023-03-05	14	1.4	1	control	0.4	us	16
246	2023-03-05	49	12.0	5	vda	10.1	gb	18
247	2023-03-05	14	1.7	1	vda	11.0	gb	10
248	2023-03-05	88	3.5	1	vda	7.4	gb	17
249	2023-03-05	10	2.7	1	control	0.2	us	13
250	2023-03-05	26	5.4	1	vda	36.4	gb	19
252	2023-03-05	26	1.6	1	vda	5.2	gb	17
254	2023-03-05	14	28.8	15	vda
255	2023-03-05	26	5.2	1	vda
257	2023-03-05	19	5.6	4	vda
258	2023-03-05	26	4.7	5	control	0.3	nl	12
259	2023-03-05	46	3.1	1	control	2.3	gb	12
260	2023-03-06	300	2.3	2	control	0.1	us	13
261	2023-03-06	25	5.5	1	vda	12.1	us	12
262	2023-03-06	25	2.7	4	control	0.8	ie	16
263	2023-03-06	25	1.7	1	vda	9.4	us	13
264	2023-03-06	25	5.1	4	control	0.2	us	16
265	2023-03-06	26	4.3	1	vda	14.0	us	16
266	2023-03-06	14	42.4	1	control	0.6	gb	16
267	2023-03-06	56	24.9	3	vda	7.7	us	23
268	2023-03-06	31	0.8	2	control	0.2	us	16
270	2023-03-06	28	1.7	1	control	0.8	gb	16
271	2023-03-06	4	2.7	2	vda	4.7	gb	18
272	2023-03-06	25	3.2	1	control	0.3	gb	13
273	2023-03-06	14	4.2	1	vda	15.4	us	16
274	2023-03-06	25	3.0	1	control	0.3	gb	21
275	2023-03-06	25	3.6	1	vda
276	2023-03-06	1	4.5	2	vda	7.3	gb	19
277	2023-03-06	31	6.7	1	control	0.5	gb	11
278	2023-03-06	-34	2.4	1	vda	8.6	gb	13

(.sort-index (.value-counts (wc
  (ss subjects (& (.isin $tv [11 12]) (pd.notnull $time_puzzle_minutes)))
  (cbind $cond :did_v2 (pd.notnull $time_visit2_minutes)))))

cond	did_v2	value
control	False	1
control	True	53
vda	False	11
vda	True	42

VDA properties

(setv sns (ssi subjects (&
  $round1
  (= $cond "vda"))))
(.sum (pd.concat :axis 1 [
  (getl (get ratings "round1") sns)
  (.drop (getl vda-props sns) :axis 1 ["n_questions" "n_criteria" "result_quality"])]))

I	value
reversed_criterion	6
bad_alt_level	3
other_issue	2
dominator	13
dominated	19
unused_level	18
constant_criterion	12
vda_varied_all_cs	6
vda_redundant	8

The ratings categories are:

reversed_criterion: One of the criteria looks to have its levels in the wrong order.
bad_alt_level: One or more levels of the alts look to be set incorrectly, at least when taking the description into account. Possibly the subject changed the criteria after setting alt levels, triggering the alt levels to reset, but didn't notice and correct for it.

Idiosyncratic problems (other_issue):

Subject 219: Two alts appear to represent the same choice by the subject, and are distinguished only by possible outcomes.
Subject 248: Some of the criteria don't make sense for some of the alts.

Bad VDA subjects

Which subjects used the tool correctly enough to be included in the main analysis? I'm going to exclude subjects who had any one of the three rated problems: reversed_criterion, bad_alt_level, or other_issue. The various automatically detected problems, like unused_level or vda_varied_all_cs, represent either inherent limitations of Artiruno (which ought to be reflected in my results) or less-than-optimal usage that shouldn't compromise results too much. The resulting sample in the VDA condition is:

(valcounts (np.where (ss subjects $round1 "bad_vda") "exclude" "include"))

I	value
include	84
exclude	11

The specific subjects excluded are:

(pd.Series (ssi subjects (& $round1 $bad_vda)))

I	value
0	162
1	171
2	172
3	198
4	215
5	219
6	226
7	246
8	248
9	265
10	273

I've archived this at https://web.archive.org/web/20230321/https://arfer.net/projects/artiruno/notebook#sec--bad-vda-subjects to show that I made this decision before rerunning the follow-up, and therefore before seeing the outcomes.

Planning for the follow-up

Let's send out the first invitations on April 3rd, which is 4 weeks after I ran the last subject in the analytic sample.

I need to ask each subject if they've made the decision and gotten to see some outcome before reinviting them. Let's do it by sending them a message on Prolific.

Send out only one message first. Then you can send them out in waves of 10 or 20 in subject-number order, ideally in the morning so you don't get questions while you're asleep.

Follow-up results

(rd (. (wcby
  (ss subjects (& $followed_up (bnot $bad_vda))
    ["cond" "eval_rate_easiness" "eval_rate_quality" "eval_rate_satisfaction" "eval_rate_vda_consistency" "eval_rate_vda_easiness" "eval_rate_vda_helpfulness"])
  $cond
  (pd.concat [
    (pd.Series [(. $ shape [0])] :index ["n"])
    (.mean $ :numeric-only T)])) T))

I	control	vda
n	32.000	23.000
eval_rate_easiness	2.656	2.522
eval_rate_quality	4.344	4.304
eval_rate_satisfaction	4.125	4.217
eval_rate_vda_consistency		4.043
eval_rate_vda_easiness		3.130
eval_rate_vda_helpfulness		3.739

(rd 2 (lfor
  vname ["eval_rate_easiness" "eval_rate_quality" "eval_rate_satisfaction"]
  :setv [lo hi] (scikits.bootstrap.ci
    (tuple (gfor
      cond ["control" "vda"]
      (ss subjects
        (& $followed_up (= $cond cond) (bnot $bad_vda))
        vname)))
    (fn [control vda] (- (np.mean vda) (np.mean control)))
    :multi "independent"
    :seed (int.from-bytes (.encode vname "ASCII") "big")
    :alpha .05 :n-samples 1,000,000)
  [vname lo hi]))

eval_rate_easiness	-0.57	0.33
eval_rate_quality	-0.45	0.32
eval_rate_satisfaction	-0.43	0.58
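
For readers unfamiliar with the technique, here is a standard-library sketch of a bootstrap confidence interval for a difference in means between two independent groups. It uses the simple percentile method, which may not be the interval construction that scikits.bootstrap applies in the call above, and the ratings below are made up for illustration; they are not the study data.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_diff_ci(control, vda, n_samples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(vda) - mean(control), resampling
    each group independently with replacement."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(vda, k=len(vda)))
        - mean(rng.choices(control, k=len(control)))
        for _ in range(n_samples))
    lo = diffs[int((alpha / 2) * n_samples)]
    hi = diffs[int((1 - alpha / 2) * n_samples) - 1]
    return lo, hi

# Made-up 1-to-5 ratings, for illustration only.
control_ratings = [4, 5, 3, 4, 5, 4, 2, 5]
vda_ratings = [5, 4, 4, 3, 5, 5, 4, 4]
lo, hi = bootstrap_diff_ci(control_ratings, vda_ratings)
```

An interval that includes 0, as all three intervals in the table above do, is compatible with no difference between conditions.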

Reminders for analysis

If subjects refresh the page and redo parts of the task, the timing data you get will only reflect the final attempt.

References

Ashikhmin, I., & Furems, E. (2005). UniComBOS—Intelligent decision support system for multi-criteria comparison and choice. Journal of Multi-Criteria Decision Analysis, 13(2-3), 147–157. doi:10.1002/mcda.380

Barbosa, P. A. M., Pinheiro, P. R., Silveira, F. R. V., & Filho, M. S. (2019). Selection and prioritization of software requirements applying verbal decision analysis. Complexity. doi:10.1155/2019/2306213

Edwards, W. (1977). How to use multiattribute utility measurement for social decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics, 7(5), 326–340. doi:10.1109/TSMC.1977.4309720

Hodgett, R. E., Martin, E. B., Montague, G., & Talford, M. (2014). Handling uncertain decisions in whole process design. Production Planning and Control, 25(12), 1028–1038. doi:10.1080/09537287.2013.798706

Larichev, O. I. (2001). Ranking multicriteria alternatives: The method ZAPROS III. European Journal of Operational Research, 131(3), 550–558. doi:10.1016/S0377-2217(00)00096-5

Moshkovich, H. M., & Mechitov, A. I. (2018). Selection of a faculty member in academia: A case for verbal decision analysis. International Journal of Business and Systems Research, 12(3), 343–363. doi:10.1504/IJBSR.2018.10011350

Moshkovich, H., Mechitov, A., & Olson, D. (2016). Verbal decision analysis. In S. Greco, M. Ehrgott, & J. R. Figueira (Eds.), Multiple criteria decision analysis (2nd ed., pp. 605–636). New York, NY: Springer. ISBN 978-0-387-23081-8. doi:10.1007/978-1-4939-3094-4_15

Shevchenko, G., Ustinovichius, L., & Walasek, D. (2019). The evaluation of the contractor's risk in implementing the investment projects in construction by using the verbal analysis methods. Sustainability, 11(9). doi:10.3390/su11092660

Tamanini, I., & Pinheiro, P. R. (2011). Reducing incomparability in multicriteria decision analysis: An extension of the ZAPROS method. Pesquisa Operacional, 31, 251–270. doi:10.1590/S0101-74382011000200004