Notes on decision aids
I'm particularly interested in software and other means of making decisions in situations that aren't already fully specified quantitatively.
Multiattribute utility theory (MAUT)
Generally, multiattribute utility methods compute a utility for each item as the weighted sum of attribute utilities.
Simple multiattribute rating theory (SMART)
Edwards (1977) - Each item's utility is a weighted sum of the attributes. The weights are chosen by making the least important attribute 10 and choosing larger integers for the second least important, the third least important, etc. Preference functions are made for attributes by assuming linear preference from the lowest plausible value to the highest plausible value (when there aren't real units to use, the attribute is estimated on an abstract 0-to-100 scale).
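A minimal sketch of the SMART computation described above. The attribute names, ranges, and raw weights are made up for illustration, and I'm assuming the raw integer weights are normalized to sum to 1 before use.

```python
def linear_utility(value, worst, best):
    """Map a value linearly onto 0..1, with worst -> 0 and best -> 1.

    Also works for "lower is better" attributes, where worst > best.
    """
    return (value - worst) / (best - worst)


def smart_score(item, ranges, raw_weights):
    """Weighted sum of linear attribute utilities (weights normalized here)."""
    total = sum(raw_weights.values())
    return sum(
        (raw_weights[a] / total) * linear_utility(item[a], *ranges[a])
        for a in raw_weights
    )


# Hypothetical example: choosing a laptop.
# Each range is (lowest plausible utility end, highest plausible utility end),
# i.e., (worst, best).
ranges = {"price": (2000, 500), "battery_hours": (4, 12)}
# The least important attribute gets 10; more important ones get larger integers.
raw_weights = {"battery_hours": 10, "price": 30}

laptop_a = {"price": 1250, "battery_hours": 12}
laptop_b = {"price": 500, "battery_hours": 4}

score_a = smart_score(laptop_a, ranges, raw_weights)
score_b = smart_score(laptop_b, ranges, raw_weights)
print(score_a, score_b)
```

Here price carries three quarters of the total weight, so the cheap laptop with the poor battery still comes out ahead.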
Multi-attribute range evaluation (MARE)
Hodgett, Martin, Montague, and Talford (2014) - A variation on a simple weighted-sum method. The weighted-sum method normalizes attributes (by dividing each by its maximum), multiplies by attribute weights, and sums them to get a per-item score. In MARE, you specify up to three values per attribute (the minimum possible value, the most likely value, and the maximum possible value), and then compute the range of possible weighted sums for each item.
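A rough sketch of the range computation, under some simplifying assumptions that are mine rather than the paper's: every attribute is "higher is better", all values and weights are positive, all three values are given for every attribute, and each attribute is normalized by the largest maximum value among the items. The item and attribute names are hypothetical.

```python
def mare_ranges(items, weights):
    """Return {item: (low, likely, high)} weighted-sum scores.

    items maps item name -> {attribute: (minimum, most_likely, maximum)}.
    """
    # Normalize each attribute by the largest value any item could take on it.
    max_by_attr = {
        a: max(values[a][2] for values in items.values()) for a in weights
    }
    result = {}
    for name, values in items.items():
        # With positive weights and "higher is better" attributes, the lowest
        # possible sum uses every minimum and the highest uses every maximum.
        result[name] = tuple(
            sum(weights[a] * values[a][i] / max_by_attr[a] for a in weights)
            for i in range(3)
        )
    return result


items = {
    "process_x": {"yield": (60, 80, 100), "purity": (5, 8, 10)},
    "process_y": {"yield": (50, 50, 50), "purity": (10, 10, 10)},
}
weights = {"yield": 0.5, "purity": 0.5}

scores = mare_ranges(items, weights)
for name, (low, likely, high) in scores.items():
    print(name, round(low, 3), round(likely, 3), round(high, 3))
```

In this example process_y has a single certain score that falls inside process_x's range, which is the kind of overlap the method is meant to expose.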
A way to decide on attribute weighting for a multi-attribute utility method. You find the best and worst values of each of the K attributes among the available items, and then construct K + 1 hypothetical items: one with the worst value on each attribute, and one for each attribute k that has the best value on k and the worst value on all other attributes. Then you rank these items and rate each one from 0 to 100. These ratings, normalized, are used as the weights of the corresponding attributes.
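The construction of the hypothetical items and the normalization step can be sketched like this. The attribute names and ratings are invented, and I'm assuming the all-worst item only anchors the bottom of the rating scale and so contributes no weight of its own.

```python
def hypothetical_items(best, worst):
    """Build the K + 1 items: all-worst, plus one best-on-k item per attribute."""
    items = {"all_worst": dict(worst)}
    for k in best:
        item = dict(worst)
        item[k] = best[k]  # best on attribute k, worst on everything else
        items[k] = item
    return items


def weights_from_ratings(ratings):
    """Normalize the 0-to-100 ratings of the best-on-k items into weights."""
    total = sum(ratings.values())
    return {k: r / total for k, r in ratings.items()}


best = {"price": 500, "battery_hours": 12}
worst = {"price": 2000, "battery_hours": 4}
items = hypothetical_items(best, worst)

# Hypothetical ratings a decision-maker might give the best-on-k items.
ratings = {"price": 100, "battery_hours": 25}
weights = weights_from_ratings(ratings)
print(items)
print(weights)
```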
Analytic Hierarchy Process (AHP)
The decision-maker compares each pair of alternatives on each attribute on a 1-to-9 scale. To weight the attributes, they compare each pair of attributes on a 1-to-9 scale. Then, they can compute the priority for each item.
It's called the AHP because attributes can be optionally combined into categories of attributes, which can themselves be combined into higher-order categories, and so on.
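A small sketch of how priorities might be computed from the pairwise comparisons, for a flat (one-level) hierarchy. The notes above don't say how the priorities are derived; the standard AHP uses the principal eigenvector of each comparison matrix, and here I substitute the row geometric mean, a common approximation that needs no linear-algebra library. The criteria, alternatives, and judgments are all made up.

```python
import math


def priorities(matrix):
    """Approximate AHP priorities: row geometric means, normalized to sum to 1.

    matrix[i][j] says how strongly i is preferred to j on the 1-to-9 scale,
    with matrix[j][i] == 1 / matrix[i][j].
    """
    geo_means = [math.prod(row) ** (1 / len(row)) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Two hypothetical criteria: cost is judged 3 times as important as quality.
criteria = [[1, 3],
            [1 / 3, 1]]
# Two hypothetical alternatives, A and B, compared on each criterion.
on_cost = [[1, 1 / 5],
           [5, 1]]      # B is strongly preferred on cost
on_quality = [[1, 7],
              [1 / 7, 1]]  # A is very strongly preferred on quality

criterion_weights = priorities(criteria)
per_criterion = [priorities(on_cost), priorities(on_quality)]

# Overall priority of each alternative: weighted sum over criteria.
overall = [
    sum(criterion_weights[c] * per_criterion[c][i] for c in range(2))
    for i in range(2)
]
print(overall)
```

For a hierarchy with categories of attributes, the same weighted-sum step would be applied recursively from the leaves upward.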
The criteria on which all the options are evaluated are either ordinal (like a rating scale) or continuous (like a price in dollars).
For each continuous criterion, the decision-maker chooses a maximum difference on the attribute about which he would be indifferent (e.g., $10 of price) and a minimum difference that saturates importance on that criterion (e.g., $1,000 of price). Between these points, the criterion is weighted linearly.
For ordinal criteria, one traditionally uses the 0–1 "usual" preference function, in which any difference at all is regarded as saturating for that criterion (so a difference of 2 steps is no more important than a difference of 1 step).
Each criterion gets a weight. Visual PROMETHEE encourages you to see how decisions change as you change the weights, rather than using a single fixed set of weights.
The PROMETHEE I method produces two metrics for each option O: one that averages the preference for O over all others, and another that averages the preference for all others over O. PROMETHEE II combines these by taking the difference.
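The pieces above (thresholded linear preference, the "usual" function, weights, and the two averaged metrics) fit together roughly as follows. This is a sketch under my own assumptions: every criterion is coded so that higher is better (price is recast as "savings"), weights sum to 1, and the option names and values are invented.

```python
def linear_preference(diff, indifference, saturation):
    """Preference in 0..1 for a difference in favor of the first option.

    Setting both thresholds to 0 gives the "usual" function: any positive
    difference counts as full preference.
    """
    if diff <= indifference:
        return 0.0
    if diff >= saturation:
        return 1.0
    return (diff - indifference) / (saturation - indifference)


def flows(options, weights, thresholds):
    """Return {option: (positive_flow, negative_flow, net_flow)}."""
    def preference(a, b):
        # Weighted preference for option a over option b.
        return sum(
            weights[c]
            * linear_preference(options[a][c] - options[b][c], *thresholds[c])
            for c in weights
        )

    result = {}
    for a in options:
        others = [b for b in options if b != a]
        plus = sum(preference(a, b) for b in others) / len(others)
        minus = sum(preference(b, a) for b in others) / len(others)
        result[a] = (plus, minus, plus - minus)  # net flow as in PROMETHEE II
    return result


options = {
    "A": {"savings": 0, "quality": 3},
    "B": {"savings": 505, "quality": 2},
}
weights = {"savings": 0.5, "quality": 0.5}
# savings: indifferent up to $10, saturated at $1,000; quality: "usual" function.
thresholds = {"savings": (10, 1000), "quality": (0, 0)}

result = flows(options, weights, thresholds)
print(result)
```

Rerunning `flows` with different `weights` is a crude way to do the kind of sensitivity exploration that Visual PROMETHEE encourages.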
Verbal decision analysis (VDA)
I've yet to find any VDA paper that specifies an algorithm for which comparisons to ask the decision-maker to make.
ZAPROS III (Larichev, 2001): All attributes are ordinal. The decision-maker is asked questions like "Would you rather change from an item that's best on attribute A to an item with the second-best value on A, or from an item best on B to an item second-best on B?". Then the decision-maker is confronted with any inconsistent judgments they've made and asked to correct them. This done, a partial order can be constructed on all possible items. No weights or other numeric judgments ever need to be provided.
ZAPROS-LM (Moshkovich, Mechitov, & Olson, 2016) is a variation that asks fewer questions. I think.
Even with ZAPROS III-i (Tamanini & Pinheiro, 2011), it looks like the full preference scale is constructed, even if not necessary to choose the best alternatives.
Continue a procedure until a best alternative according to U-dominance is selected.
Start with pairwise comparisons of one-criterion units. The subject can choose: A is better, B is better, they're equally good, or "don't know". If that suffices, stop; otherwise, continue to two-criterion units. Also stop if the decision-maker is inconsistent in their choices between several presentations of the same comparison.
There's another kind of inconsistency analysis, too (section 5), based on transitivity.
Other VDA software
- Shevchenko, Ustinovichius, and Walasek (2019) - CLARA - no URL provided; the corresponding author didn't reply to email I sent 12 Aug 2020
- Barbosa, Pinheiro, Silveira, and Filho (2019) - ORCLASSWEB - http://www2.unifor.br/OrclassWeb - dead link
- A 2018 conference paper - Aranau - no URL provided; corresponding author didn't reply to email I sent on 28 Jun 2020
- Ashikhmin and Furems (2005) - IDSS UniComBOS - http://iva.isa.ru/DSS - dead link
Possible extensions to Artiruno core
- Web interface
- A button to prematurely abort
- Allow setting
- Show a graph of preferences
- For either interactive mode
- Transcripts of choices and inferences
- Saving of choices partway through the procedure, and later replaying them
- Proper error messages instead of assertion failures
- More validation of the interactive input
- A nicer interactive interface
- Allow the agent to reply "don't know", a la UniComBOS. (UniComBOS takes this to mean that a preference for the given situation should never be inferrable from other preferences, but I think it makes more sense to understand it as totally uninformative; i.e., "skip this question and do what you can with everything else or other questions".)
- More sophisticated alt levels
- Allow an alternative to have a missing value (None) on a criterion
- Allow an alt to have multiple values for a criterion
- Allow an alt to have a fuzzy set of values for a criterion
- Print explanations of the final ranking or choice, a la ZAPROS
- Consistency checks
- Check whether the agent's preferences are consistent using an all-worst-attributes reference instead of an all-best-attributes reference, a la ZAPROS
- Check transitively deduced preferences multiple times, a la ZAPROS
- Support group decision-making
Basic idea for an Internet study
Collect basic demographic data (after the main part of the study)
Ask subjects about a weighty decision they have to make soon which they're not already sure about and whose outcome they'll be able to assess soon (let's say, 1 month from now).
Have them briefly describe the decision to be made and the options in prose.
- Artiruno condition: Then they construct the options and do VDA.
- Comparison: Nothing else. If this study works, then maybe there could be a follow-up to investigate the effect by comparing the Artiruno condition to a condition where subjects construct options in the same style, but don't actually use VDA.
Follow up a month later.
- Remind them of what they wrote about the decision problem (but not their options or Artiruno's suggestion).
- Have them briefly describe in prose what decision they made, what the outcome was, and how happy they are with the outcome.
- Have them rate the outcome on numeric scales.
- Have them rate the decision-making process on numeric scales.
In the pilot, just try having people set up the decision problem, so you can see that people are making reasonable choices of criteria, levels, and decision problems.
Piloting on MTurk was a failure, apparently because of the poor English skills of the subjects I got. Let's try some pre-screened subjects from Reddit instead.
I ran four screened subjects from Reddit and am reasonably happy with the results. The changes to the instructions that these findings suggest are:
- Mention that you should only list alternatives if they're options you really have, not options you would ideally have.
- Provide an example of a yes-or-no decision.
Plans for the real study
To start with, aim for 40 subjects, 20 per condition, who return; this means you should probably try to recruit 80 subjects and offer more money for session 2 than session 1. You probably won't be able to get this many from /r/samplesize, but you can check other places where studies are posted, or maybe even hire a service that connects social scientists to respondents.
Set expiration times to a week and a day since your invitation.
- Session 1 (internally, visits 1 and 2)
- Warn about performance. ("can be particularly slow on phones and tablets")
- Allow for session 2 in the consent form. Be clear that the subject is expected to be in both sessions, but the second will be quite short.
- [Offer how much money?]
- Ask for a decision description and an expected decision resolution date, as in the pilot.
- Randomly assign the subject to the VDA condition or the control condition. (If the subject number, among subjects who've gotten this far in this version of the study, is even, counting the first as 1, then use the opposite condition of the previous subject.) Write the study state to disk here, so the subject can't refresh and enter different values depending on his assigned condition.
- If the subject is in the control condition, skip to the end. Otherwise, continue.
- Solicit criteria and alternatives, as in the pilot.
- Conduct VDA.
- Display a reminder of the subject's chosen criteria and alternatives on this page. Explain that the subject doesn't have to make the choice suggested by Artiruno, but it could be a good idea.
- Show results as text. (The graph would probably not be very helpful in this situation, with find_best = 1 and non-quantitatively minded users.)
- Don't provide an abort button. (It wouldn't be useful because in the case of find_best = 1 and not showing a graph, Artiruno can provide no useful information before it's done.)
- Record the questions asked, the subject's choices, the subject's response times, and Artiruno's conclusion. If the subject restarted VDA, use the values from the final round.
- Ask for any comments.
- Session 2 (internally, visit 3)
- Send an invitation 1 month after the subject completed session 1. Say that session 2 assumes the subject made the choice and got to see at least a little of the outcome; if that won't be true for a while, the subject should reply saying when would be a good time to do session 2.
- [Offer how much money?]
- Re-display the decision description and expected resolution date.
- Ask if they've made the choice and gotten to see some of the outcome. (In theory, this should always be "yes", because of the instructions earlier. It's a sanity check.)
- Have the subject briefly describe in prose:
- which choice they made
- what the outcome was
- how happy they are with the choice and outcome
- Use 1-to-5 rating scales for the below.
- How pleased are they with the outcome of their choice?
- How well-chosen was their choice, given what they knew at the time they made it?
- How difficult did it feel to make the decision?
- In the VDA condition:
- Show the criteria, alternatives, and results text from VDA.
- How consistent did they feel their choice was with Artiruno's suggestion?
- How difficult was the procedure (including writing up the criteria and alternatives) to do?
- How helpful did the procedure feel for making the decision?
- Another comments box
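The alternating assignment rule in the session 1 plan above (odd-numbered subjects are assigned at random, even-numbered subjects get the opposite of the previous subject) can be sketched as below. The function name and the way history is stored are hypothetical, not Artiruno's actual code, and the step of writing the study state to disk is left out.

```python
import random

CONDITIONS = ("vda", "control")


def assign_condition(previous_conditions, rng=random):
    """Pick a condition for the next subject to reach the assignment step.

    previous_conditions lists, in order, the conditions of earlier subjects
    who got this far in this version of the study.
    """
    subject_number = len(previous_conditions) + 1  # counting the first as 1
    if subject_number % 2 == 0:
        # Even-numbered subject: opposite condition of the previous subject.
        previous = previous_conditions[-1]
        return CONDITIONS[1 - CONDITIONS.index(previous)]
    # Odd-numbered subject: assign at random.
    return rng.choice(CONDITIONS)


# Simulate ten subjects with a fixed seed.
rng = random.Random(0)
history = []
for _ in range(10):
    history.append(assign_condition(history, rng))
print(history)
```

The point of the pairing is that the two conditions can never differ in size by more than one subject, while each pair's order stays unpredictable.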
A lot of people who seem to otherwise mostly take well to the task are messing up basic things in the VDA problem setup, particularly, putting criterion levels in the reverse order:
(.sum (getl (get ratings "pilot") (sorted (& (set (ssi subjects (<= $tv 7))) (set (. (get ratings "pilot") index))))))
I should probably take some time to add more checks (e.g., an extra screen where subjects have to confirm that they put criterion levels in the right order) and rerun from scratch rather than have this much compromise of the VDA sample.
I've made a lot of changes, including putting the alts input before the criteria input, checking for unchanged placeholder alt or criteria names, checking for dominance and criterion-level order, and adding a puzzle at the start of the scenario phase, so that dropout that would have happened had the subject been assigned to the VDA condition occurs earlier. (I removed display of hypothetical best and worst items because I figured it would be overkill now.) Still, of 4 new subjects assigned to the VDA condition who completed the task (120 121 127 132), 2 got the levels backwards for at least one criterion, one of whom, subject 132, seemed otherwise quite thoughtful. Ridiculous.
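For reference, a dominance check of the kind mentioned above could look roughly like this. It's a sketch with a made-up data layout (each criterion's levels listed from worst to best), not the check the study software actually runs.

```python
def dominates(a, b, levels):
    """True if alt a is at least as good as alt b on every criterion
    and strictly better on at least one.

    levels maps each criterion to its levels, ordered from worst to best.
    """
    rank = {
        c: {level: i for i, level in enumerate(ordered)}
        for c, ordered in levels.items()
    }
    diffs = [rank[c][a[c]] - rank[c][b[c]] for c in levels]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)


# Hypothetical decision problem: choosing an apartment.
levels = {
    "cost": ["high", "medium", "low"],  # worst to best
    "commute": ["long", "short"],
}
flat_a = {"cost": "low", "commute": "short"}
flat_b = {"cost": "medium", "commute": "short"}
flat_c = {"cost": "low", "commute": "long"}

print(dominates(flat_a, flat_b, levels))  # a is cheaper with the same commute
print(dominates(flat_b, flat_c, levels))  # neither of b and c dominates the other
```

If a subject enters a criterion's levels in reverse order, a check like this can flag an alt as dominated when the subject clearly prefers it, which is one way a reversed criterion might surface.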
TVs 9, 10
Let's try labeling criteria levels with "(best)" and "(worst)" while subjects are writing them. (TV 10 adds these labels to the example problems, too.) I've removed the manual level-order checks. Since level order is officially A Problem, let's try VDA mode only to see if things are okay before returning to random assignment. I've also added a question to the demographics questionnaire about education.
The 8 subjects who I've gotten to finish VDA ((ss subjects (.isin $tv [9 10]))) all gave reasonable-looking level orders, so I think things are good now.
Screening and scenario
In TV 11, I restored the random assignment of conditions, and I slightly increased the pay and the time estimate for the scenario phase. In TV 12, I fixed a typo.
I wanted to run subjects until I had at least 40 in each condition who completed the scenario phase, so I ceased recruiting new subjects once that criterion was satisfied after my last batch. I picked this number because I wanted at least 20 subjects to have completed all phases, and I estimated a 50% follow-up rate.
(comments-dt (ssi subjects (.isin $tv [11 12])))
|158||0||2023-02-23||My partner and I are currently making this decision.|
|158||3||2023-04-05||Thank you for the study. I hope you retrieve some useless results!|
|164||0||2023-02-25||Hope i am qualified. Thank you|
|172||2||2023-02-25||I think it's an excellent decision support system!|
|177||3||2023-04-05||I just wanted to say that although I did not follow the suggestion offered to me, I did find the process of doing this really helpful. It allowed me to give good consideration to the outcomes and possibilities in front of me at the time.|
|190||3||2023-04-13||It is a good decision making system.|
|199||3||2023-04-07||Very helpful. Thankyou.|
|217||0||2023-03-02||Thank you for inviting me to take part in this study, it sounds very interesting.|
|217||2||2023-03-02||Looks like I'm buying a nice new Porsche!!|
|217||3||2023-04-07||Thank you for inviting me to take part, and wish me luck with my new silly chariot!|
|222||3||2023-04-25||This was an interesting study, thank you for inviting me.|
|227||2||2023-03-04||An interesting study and tool, thank you!|
|231||2||2023-03-05||Hi • I really enjoyed the study, specially when it started to altering between my choices. I also never wrote down a pros and cons of my decision which this task made me do it and I actually took a picture of that and shared it with my partner, so thank you very much for that. • Good luck with your research. I liked the web page design too, they were quite smart and yet very simple. • All the best.|
|243||3||2023-05-01||Thank you for letting me take part.|
|245||3||2023-05-09||Interesting study. I especially appreciate the explanation on the debriefing page. Thank you|
|261||0||2023-03-06||Thank you for the opportunity to participate! There were lots of great choices.|
(wc (ss subjects (& (.isin $tv [11 12]) (pd.notnull $n_puzzle_attempts))) (cbind $began :res_len (. (- $expected_resolution_date $began) dt days) :puz_t (.round $time_puzzle_minutes 1) :puz_n $n_puzzle_attempts $cond :v2 (.round $time_visit2_minutes 1) :cy $country :edu $education_years))
(.sort-index (.value-counts (wc (ss subjects (& (.isin $tv [11 12]) (pd.notnull $time_puzzle_minutes))) (cbind $cond :did_v2 (pd.notnull $time_visit2_minutes)))))
(setv sns (ssi subjects (& $round1 (= $cond "vda")))) (.sum (pd.concat :axis 1 [ (getl (get ratings "round1") sns) (.drop (getl vda-props sns) :axis 1 ["n_questions" "n_criteria" "result_quality"])]))
The ratings categories are:
reversed_criterion: One of the criteria looks to have its levels in the wrong order.
bad_alt_level: One or more levels of the alts look to be set incorrectly, at least when taking the description into account. Possibly the subject changed the criteria after setting alt levels, triggering the alt levels to reset, but didn't notice and correct for it.
Idiosyncratic problems (other_issue):
- Subject 219: Two alts appear to represent the same choice by the subject, and are distinguished only by possible outcomes.
- Subject 248: Some of the criteria don't make sense for some of the alts.
Bad VDA subjects
Which subjects should be considered good enough for a main analysis of people who used the tool correctly enough? I'm going to exclude subjects who had any one of the three rated problems: reversed_criterion, bad_alt_level, or other_issue. The various automatically detected problems, like varied_all_cs, represent either inherent limitations of Artiruno (which ought to be reflected in my results) or less-than-optimal usage that shouldn't compromise results too much. The resulting sample in the VDA condition is:
(valcounts (np.where (ss subjects $round1 "bad_vda") "exclude" "include"))
The specific subjects excluded are:
(pd.Series (ssi subjects (& $round1 $bad_vda)))
I've archived this at https://web.archive.org/web/20230321/https://arfer.net/projects/artiruno/notebook#sec--bad-vda-subjects to show that I made this decision before rerunning the follow-up, and therefore before seeing the outcomes.
Planning for the follow-up
Let's send out the first invitations on April 3rd, which is 4 weeks after I ran the last subject in the analytic sample.
I need to ask each subject if they've made the decision and gotten to see some outcome before reinviting them. Let's do it by sending them a message on Prolific.
Send out only one message first. Then you can send them out in waves of 10 or 20 in subject-number order, ideally in the morning so you don't get questions while you're asleep.
(rd (. (wcby (ss subjects (& $followed_up (bnot $bad_vda)) ["cond" "eval_rate_easiness" "eval_rate_quality" "eval_rate_satisfaction" "eval_rate_vda_consistency" "eval_rate_vda_easiness" "eval_rate_vda_helpfulness"]) $cond (pd.concat [ (pd.Series [(. $ shape )] :index ["n"]) (.mean $ :numeric-only T)])) T))
(rd 2 (lfor vname ["eval_rate_easiness" "eval_rate_quality" "eval_rate_satisfaction"] :setv [lo hi] (scikits.bootstrap.ci (tuple (gfor cond ["control" "vda"] (ss subjects (& $followed_up (= $cond cond)) vname))) (fn [control vda] (- (np.mean vda) (np.mean control))) :multi "independent" :seed (int.from-bytes (.encode vname "ASCII") "big") :alpha .05 :n-samples 1,000,000) [vname lo hi]))
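To make the idea behind the interval computation above concrete, here is a standard-library sketch of a bootstrap confidence interval for the difference in mean ratings between two independent groups. It uses the plain percentile method, which may differ from the interval method that scikits.bootstrap applies in the call above, and the ratings are invented, not study data.

```python
import random


def bootstrap_diff_ci(control, vda, n_samples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(vda) - mean(control).

    Each group is resampled with replacement independently of the other.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_samples):
        c = rng.choices(control, k=len(control))
        v = rng.choices(vda, k=len(vda))
        diffs.append(sum(v) / len(v) - sum(c) / len(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_samples)]
    hi = diffs[int((1 - alpha / 2) * n_samples) - 1]
    return lo, hi


# Made-up 1-to-5 ratings for illustration only.
control = [3, 4, 2, 5, 3, 4, 3, 2]
vda = [4, 5, 3, 5, 4, 4, 5, 3]

observed = sum(vda) / len(vda) - sum(control) / len(control)
lo, hi = bootstrap_diff_ci(control, vda, n_samples=2000)
print(round(observed, 3), round(lo, 3), round(hi, 3))
```

An interval that includes 0 would mean the data are compatible with no difference between conditions on that rating.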
Reminders for analysis
- If subjects refresh the page and redo parts of the task, the timing data you get will only reflect the final attempt.
References
Ashikhmin, I., & Furems, E. (2005). UniComBOS—Intelligent decision support system for multi-criteria comparison and choice. Journal of Multi-Criteria Decision Analysis, 13(2-3), 147–157. doi:10.1002/mcda.380
Barbosa, P. A. M., Pinheiro, P. R., Silveira, F. R. V., & Filho, M. S. (2019). Selection and prioritization of software requirements applying verbal decision analysis. Complexity. doi:10.1155/2019/2306213
Edwards, W. (1977). How to use multiattribute utility measurement for social decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics, 7(5), 326–340. doi:10.1109/TSMC.1977.4309720
Hodgett, R. E., Martin, E. B., Montague, G., & Talford, M. (2014). Handling uncertain decisions in whole process design. Production Planning and Control, 25(12), 1028–1038. doi:10.1080/09537287.2013.798706
Larichev, O. I. (2001). Ranking multicriteria alternatives: The method ZAPROS III. European Journal of Operational Research, 131(3), 550–558. doi:10.1016/S0377-2217(00)00096-5
Moshkovich, H. M., & Mechitov, A. I. (2018). Selection of a faculty member in academia: A case for verbal decision analysis. International Journal of Business and Systems Research, 12(3), 343–363. doi:10.1504/IJBSR.2018.10011350
Moshkovich, H., Mechitov, A., & Olson, D. (2016). Verbal decision analysis. In S. Greco, M. Ehrgott, & J. R. Figueira (Eds.), Multiple criteria decision analysis (2nd ed., pp. 605–636). New York, NY: Springer. ISBN 978-0-387-23081-8. doi:10.1007/978-1-4939-3094-4_15
Shevchenko, G., Ustinovichius, L., & Walasek, D. (2019). The evaluation of the contractor's risk in implementing the investment projects in construction by using the verbal analysis methods. Sustainability, 11(9). doi:10.3390/su11092660
Tamanini, I., & Pinheiro, P. R. (2011). Reducing incomparability in multicriteria decision analysis: An extension of the ZAPROS method. Pesquisa Operacional, 31, 251–270. doi:10.1590/S0101-74382011000200004