Cheat notebook
Created 14 Nov 2015 • Last modified 26 Jun 2018
Birthdates in voter data
- CA
- 1931-01-01: 17,195
- 1900-01-02 to 1901-01-01: each has 50 or more, compared to 1 to 1901-01-03
- FL
- 1942-10-10: 674, compared to 573 at most in the rest of the month
- KS
- 1901-01-01: 32, compared to 1s and 0s in the surrounding months
- NY
- 1901-01-01, 1950-01-01: 3,024 and 1,695, compared to less than 1,000 elsewhere
- 1921-01-01: 529, compared to at most 204 in the surrounding months
- OK
- There are 44 unique impossible dates: 19080000 19100000 19110000 19130000 19140000 19160000 19170000 19180000 19190000 19200000 19210000 19220000 19230000 19240000 19250000 19260000 19270000 19280000 19290000 19300000 19310000 19320000 19330000 19340000 19340931 19350000 19360000 19370000 19380000 19390000 19400000 19410000 19420000 19430000 19450000 19460100 19470000 19480000 19480014 19490000 19510000 19530000 19550000 19640000
AM and voter data
Descriptive
["Total voters" (format (.sum ($ counts voters)) ",")]
Total voters | 48,852,975 |
Match rate:
(setv statecounts (.sum (.groupby (getl counts : (qw state voters am_users)) "state"))) (setv state-summary (apply cbind (rmap [state (filt (in it am-users-in-states) states)] (setv matched (int (getl statecounts state "am_users"))) (setv total-am (get am-users-in-states state)) (setv voters (int (getl statecounts state "voters"))) (setv pop (get state-pop-2012 state)) (setv party-counts (do (setv d (ss counts (= $state state))) (setv [pv vv] (map second (.iteritems (getl d : (qw party voters))))) (setv pv (.set-categories pv.cat (+ keep-parties ["rest"]))) (setv pv (.fillna pv "rest")) (setv d (.sum (.groupby (cbind pv vv) 0))) (list (.iteritems (geti d : 0))))) (pds-from-pairs :name (get state-names state) (+ party-counts (pairs "Total registered voters" voters "Population" pop "Percent covered" (round (* 100 (/ voters pop))) "AM users matched with voters" matched "Total AM users" total-am "Percent matched" (round (* 100 (/ matched total-am))))))))) (setv state-summary.index.name "State") (setv state-summary (.fillna state-summary 0)) (setv state-summary (.applymap state-summary (λ (format (int it) ",")))) (.rename state-summary :inplace T :index { "np" "Unaffiliated voters" "DEM" "Democrats" "REP" "Republicans" "GRE" "Greens" "LBT" "Libertarians" "rest" "Voters in other party"}) state-summary
State | California | Florida | Kansas | New York | Oklahoma |
---|---|---|---|---|---|
Unaffiliated voters | 2,710,719 | 2,700,397 | 522,094 | 3,016,645 | 229,494 |
Democrats | 7,936,325 | 4,988,433 | 432,858 | 7,134,846 | 879,343 |
Republicans | 5,297,443 | 4,378,861 | 773,346 | 3,484,717 | 844,305 |
Greens | 114,126 | 6,143 | 0 | 32,811 | 0 |
Libertarians | 110,808 | 12,016 | 11,690 | 4,710 | 0 |
Voters in other party | 2,004,915 | 429,399 | 0 | 796,528 | 3 |
Total registered voters | 18,174,336 | 12,515,249 | 1,739,988 | 14,470,257 | 1,953,145 |
Population | 38,041,430 | 19,317,568 | 2,885,905 | 19,570,261 | 3,814,820 |
Percent covered | 48 | 65 | 60 | 74 | 51 |
AM users matched with voters | 32,457 | 17,928 | 2,642 | 21,776 | 2,072 |
Total AM users | 92,058 | 43,987 | 6,282 | 54,203 | 6,603 |
Percent matched | 35 | 41 | 42 | 40 | 31 |
Match rates range from 31% to 42%.
Here's a breakdown of gender and gender imputation.
(setv d (.reset-index (.applymap (.dropna (.sum (.groupby (.replace (getl counts : (qw state gender_known gender voters)) np.nan "Unknown") (qw state gender_known gender)))) (λ (format (int it) ","))))) (.sort-values d (qw state gender_known gender) :ascending [True False True])
I | state | gender_known | gender | voters |
---|---|---|---|---|
3 | CA | True | Female | 6,279,902 |
4 | CA | True | Male | 5,453,470 |
0 | CA | False | Female | 2,991,313 |
1 | CA | False | Male | 2,617,152 |
2 | CA | False | Unknown | 832,499 |
8 | FL | True | Female | 6,641,276 |
9 | FL | True | Male | 5,645,059 |
5 | FL | False | Female | 97,195 |
6 | FL | False | Male | 94,095 |
7 | FL | False | Unknown | 37,624 |
13 | KS | True | Female | 921,923 |
14 | KS | True | Male | 817,735 |
10 | KS | False | Female | 132 |
11 | KS | False | Male | 122 |
12 | KS | False | Unknown | 76 |
18 | NY | True | Female | 7,909,124 |
19 | NY | True | Male | 6,560,341 |
15 | NY | False | Female | 298 |
16 | NY | False | Male | 285 |
17 | NY | False | Unknown | 209 |
20 | OK | False | Female | 938,158 |
21 | OK | False | Male | 777,237 |
22 | OK | False | Unknown | 237,750 |
The OK voter data has no gender information, so we had to impute every case there.
(setv n-total (.sum ($ counts voters))) (setv n-not-given (.sum ($ (ss counts (~ $gender_known)) voters))) (setv n-imputed (.sum ($ (ss counts (& (.notnull $gender) (~ $gender_known))) voters))) (rd [ ["Proportion of genders missing" (/ n-not-given n-total)] ["Proprtion of missing genders imputed" (/ n-imputed n-not-given)] ["Proportion of genders still missing" (/ (- n-not-given n-imputed) n-total)]])
Proportion of genders missing | 0.177 |
Proprtion of missing genders imputed | 0.872 |
Proportion of genders still missing | 0.023 |
(setv n-bad-age (.sum ($ (ss counts (.isnull $age)) voters))) ["Proportion of ages invalid" (format (/ n-bad-age n-total) ".03f")]
Proportion of ages invalid | 0.010 |
Here is the number of registered voters per party and state (ignoring the AM data).
(setv x (.sort-values (.reset-index (.sum ($ (.set-index counts ["state" "party"]) voters) :level ["state" "party"])) ["state" "voters"] :ascending [True False])) (setv ($ x voters) (amap (.format "{:,}" it) ($ x voters))) (setv x (.reset-index :drop T x)) x
I | state | party | voters |
---|---|---|---|
0 | CA | DEM | 7,936,325 |
1 | CA | REP | 5,297,443 |
2 | CA | np | 2,710,719 |
3 | CA | DS | 1,114,635 |
4 | CA | AI | 479,378 |
5 | CA | OTH | 271,806 |
6 | CA | GRE | 114,126 |
7 | CA | LBT | 110,808 |
8 | CA | PF | 63,014 |
9 | CA | MIS | 61,684 |
10 | CA | REF | 7,786 |
11 | CA | AME | 3,379 |
12 | CA | NAT | 2,271 |
13 | CA | WWP | 313 |
14 | CA | JP | 146 |
15 | CA | CTP | 113 |
16 | CA | CPC | 96 |
17 | CA | WP | 50 |
18 | CA | PIR | 31 |
19 | CA | CP | 31 |
20 | CA | CMP | 31 |
21 | CA | HUM | 27 |
22 | CA | HPC | 14 |
23 | CA | MMW | 11 |
24 | CA | OP | 9 |
25 | CA | PPC | 9 |
26 | CA | NRP | 8 |
27 | CA | TVP | 7 |
28 | CA | ACP | 7 |
29 | CA | UCA | 6 |
30 | CA | MCP | 4 |
31 | CA | FED | 4 |
32 | CA | NMB | 4 |
33 | CA | DP | 4 |
34 | CA | AMC | 4 |
35 | CA | SEU | 4 |
36 | CA | WFP | 4 |
37 | CA | SAP | 3 |
38 | CA | LRU | 3 |
39 | CA | CPP | 3 |
40 | CA | EJP | 3 |
41 | CA | POT | 2 |
42 | CA | GSP | 2 |
43 | CA | EGA | 2 |
44 | CA | UMP | 2 |
45 | CA | APP | 1 |
46 | CA | UCB | 1 |
47 | CA | NSP | 1 |
48 | CA | U08 | 1 |
49 | CA | ATP | 1 |
50 | FL | DEM | 4,988,433 |
51 | FL | REP | 4,378,861 |
52 | FL | np | 2,700,397 |
53 | FL | INT | 280,485 |
54 | FL | UNK | 59,740 |
55 | FL | IDP | 58,517 |
56 | FL | NRS | 25,928 |
57 | FL | LBT | 12,016 |
58 | FL | GRE | 6,143 |
59 | FL | REF | 2,037 |
60 | FL | CPF | 1,119 |
61 | FL | AIP | 452 |
62 | FL | TPF | 391 |
63 | FL | FPP | 336 |
64 | FL | ECO | 166 |
65 | FL | PSL | 122 |
66 | FL | PFP | 45 |
67 | FL | FSW | 29 |
68 | FL | JPF | 12 |
69 | FL | OBJ | 12 |
70 | FL | AEL | 4 |
71 | FL | FWP | 3 |
72 | FL | SOC | 1 |
73 | KS | REP | 773,346 |
74 | KS | np | 522,094 |
75 | KS | DEM | 432,858 |
76 | KS | LBT | 11,690 |
77 | NY | DEM | 7,134,846 |
78 | NY | REP | 3,484,717 |
79 | NY | np | 3,016,645 |
80 | NY | IND | 557,885 |
81 | NY | CON | 183,041 |
82 | NY | WOR | 55,420 |
83 | NY | GRE | 32,811 |
84 | NY | LBT | 4,710 |
85 | NY | FDM | 110 |
86 | NY | RTH | 58 |
87 | NY | TXP | 6 |
88 | NY | APP | 5 |
89 | NY | SWP | 3 |
90 | OK | DEM | 879,343 |
91 | OK | REP | 844,305 |
92 | OK | np | 229,494 |
93 | OK | AE | 3 |
Florida and California have quite a few voters with undocumented party codes.
The web page for New York's Independence Party states that "The Party's leadership recognizes that individuals do sometimes unwittingly register as members of the Independence Party when their intent was to register to vote as a 'blank'." That may explain why CA, FL, and NY all have so many people in an "Independence Party" or "Independent Party" when I've never even heard of such a party before. Here's the number of people in each such party:
(ss x (.isin $party (qw IND IDP INT AI)))
I | state | party | voters |
---|---|---|---|
4 | CA | AI | 479,378 |
53 | FL | INT | 280,485 |
55 | FL | IDP | 58,517 |
80 | NY | IND | 557,885 |
Below, voters
shows the number of registered voters for each state and party, and AMr
shows the reciprocal of the Ashley Madison match rate (e.g., 500 means that 1 in 500 such voters were matched to an Ashley Madison user).
(setv x (.dropna (.sum (.groupby (getl (drop-unused-cats (ss counts (.isin $party keep-parties))) : (qw state party voters am_users)) (qw state party))))) (setv ($ x AMr) (wc x (/ $voters $am_users))) (setv x (.drop x "am_users" 1)) (setv xR (.reset-index x)) (setv ($ xR state) (.map ($ xR state) state-names)) (.to-csv xR "/tmp/amr.csv") (.applymap x (λ (if (numeric? it) (format (int (round it)) ",") it)))
state | party | voters | AMr |
---|---|---|---|
CA | np | 2,710,719 | 527 |
CA | DEM | 7,936,325 | 703 |
CA | REP | 5,297,443 | 476 |
CA | GRE | 114,126 | 399 |
CA | LBT | 110,808 | 260 |
FL | np | 2,700,397 | 629 |
FL | DEM | 4,988,433 | 1,057 |
FL | REP | 4,378,861 | 542 |
FL | GRE | 6,143 | 410 |
FL | LBT | 12,016 | 300 |
KS | np | 522,094 | 587 |
KS | DEM | 432,858 | 1,007 |
KS | REP | 773,346 | 604 |
KS | LBT | 11,690 | 285 |
NY | np | 3,016,645 | 627 |
NY | DEM | 7,134,846 | 901 |
NY | REP | 3,484,717 | 485 |
NY | GRE | 32,811 | 566 |
NY | LBT | 4,710 | 236 |
OK | np | 229,494 | 702 |
OK | DEM | 879,343 | 1,573 |
OK | REP | 844,305 | 712 |
Here's a plot. (The y-axis is upside-down so that higher points mean more Ashley Madison users.)
library(ggplot2) d = read.csv("/tmp/amr.csv") d = transform(d, party = factor(party, levels = c("np", "DEM", "REP", "GRE", "LBT"))) breaks = c(1, seq(250, 1750, 250)) amr.plot = ggplot(d) + geom_point(aes(state, AMr)) + geom_text(aes( x = as.integer(state) + ifelse(party == "REP", -.05, .05), y = AMr, label = c( np = "none", DEM = "Dem", REP = "Rep", GRE = "Green", LBT = "Lib")[as.character(party)], hjust = ifelse(party == "REP", 1, 0))) + xlab("State") + scale_y_reverse(name = "Ashley Madison users", expand = c(0, 0), breaks = breaks, labels = paste("1 in", formatC(breaks, big.mark = ","))) + coord_cartesian(ylim = range(breaks)) + theme( panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), axis.text.x = element_text(color = "black"), axis.text.y = element_text(color = "black"), axis.line = element_line(color = "black")) amr.plot
In all five states, Dems cheat less than other party categories (including the more liberal Greens). In all four states with a libertarian party, libertarians cheat more than other party categories.
Here are the AM rates by state alone:
(setv x (getl (.sum (.groupby counts "state")) : (qw voters am_users))) (setv ($ x AMr) (wc x (/ $voters $am_users))) ;(setv x (.drop x "am_users" 1)) ;(setv xR (.reset-index x)) ;(setv ($ xR state) (.map ($ xR state) state-names)) ;(.to-csv xR "/tmp/amr.csv") (ordf (.applymap x (λ (if (numeric? it) (format (int (round it)) ",") it))) $AMr)
state | voters | am_users | AMr |
---|---|---|---|
CA | 18,174,336 | 32,457 | 560 |
KS | 1,739,988 | 2,642 | 659 |
NY | 14,470,257 | 21,776 | 665 |
FL | 12,515,249 | 17,928 | 698 |
OK | 1,953,145 | 2,072 | 943 |
Party membership differs by gender and age. If we consider only males, and we control for age, do we still see these AM usage differences by party? Below is a crude graph where we group men by year of birth.
(setv cs (getl (ss counts (& (= $gender "Male") (.notnull $born) (.isin $party ["np" "DEM" "REP"]))) : (qw state born party am n))) (setv out (.dropna (.apply (.groupby cs ["state" (// ($ cs born) 10000) "party"]) (λ (if (.any ($ it am)) (/ (.sum ($ (ss it [[var:a]][[var:m]]) [[var:n]])) (.[[var:s]][[var:u]][[var:m]] ( it n))) 0))))) (setv out (kwc pd.DataFrame out :columns ["am_proportion"])) ;(pd.Series [(.sum ($ it n)) (.sum ($ (ss it $am) n))] ["n" "a"]))))) (.to-csv out "/tmp/am-year-party.csv")
d = read.csv("/tmp/am-year-party.csv") library(ggplot2) ggplot(subset(d, born >= 1915)) + geom_line(aes(born, am_proportion, color = party)) + scale_color_manual(values = c("blue", "darkgreen", "red")) + facet_grid(state ~ ., scales = "free")
It looks like it, pretty much.
Modeling
Basic information about the data we'll use for modeling:
(setv d (am-usage-filter counts)) (setv base-rate (/ (.sum ($ d am_users)) (.sum ($ d voters)))) [ ["Sample size" (format (.sum ($ d voters)) ",")] ["AM users" (format (.sum ($ d am_users)) ",")] ["Base rate" (+ "1 in " (str (int (round (/ 1 base-rate)))))] ["Base MSE" (round (* base-rate (- 1 base-rate)) 8)]]
Sample size | 44,172,769 |
AM users | 69,023 |
Base rate | 1 in 640 |
Base MSE | 0.00156013 |
(setv folds (do (setv old-rng-state (np.random.get-state)) (np.random.seed 55) (setv out (gen-counts-cv-folds-ts ($ d voters) ($ d am_users) 10)) (np.random.set-state old-rng-state) out)) (import [collections [OrderedDict]]) (setv models (OrderedDict (, [:Trivial {:desc "No predictors" :formula None}] [:Parties {:desc "Predictors: party only" :formula "party"}] [:Demo {:desc "Predictors: state, gender, age" :formula "state + female + age + age2"}] [:DemoParties {:desc "Predictors: state, gender, age, party" :formula "state + female + age + age2 + party"}] [:IntDemo {:desc "Demo, plus all first-order interactions" :formula (.join " " [ "state + female + age + age2 +" "female:(state + age + age2) +" "state:(age + age2)"])}] [:IntDemoParties {:desc "DemoParties, plus all first-order interactions" :formula (.join " " [ "state + female + age + age2 + party +" "female:(state + age + age2 + party) +" "state:(age + age2 + party) +" "party:(age + age2)"])}]))) (for [[k v] (.items models)] (setv (get v :name) (keyword->str k))) (for [model (.values models)] (.update model (cached-eval (+ "am-usage-cv " (get model :name)) :cache-dir cache-dir (fn [] (if (= (get model :name) "Trivial") (am-usage "cv" counts "state" folds :trivial-model T) (am-usage "cv" counts (get model :formula) folds)))))) (setv cv-results (df-from-pairs (rmap [model (.values models)] (pairs "Model" (get model :name) "Description" (get model :desc) "Terms" (if (= (get model :name) "Trivial") 1 (+ 1 (len (get model :x-column-names)))) "MSE" (round (get model :cv-mse) 9) "[[var:p]]_{0}" (.format "^{}" (int (round (/ 1 (get model :cv-p-cond 0))))) "[[var:p]]_{1}" (.format "^{}" (int (round (/ 1 (get model :cv-p-cond 1))))))))) (setv cv-results (.set-index cv-results "Model")) cv-results
Model | Description | Terms | MSE | p0 | p1 |
---|---|---|---|---|---|
Trivial | No predictors | 1 | 0.001560127 | ^640 | ^640 |
Parties | Predictors: party only | 5 | 0.001559982 | ^640 | ^604 |
Demo | Predictors: state, gender, age | 8 | 0.001555791 | ^642 | ^232 |
DemoParties | Predictors: state, gender, age, party | 12 | 0.001555600 | ^642 | ^226 |
IntDemo | Demo, plus all first-order interactions | 22 | 0.001555783 | ^642 | ^231 |
IntDemoParties | DemoParties, plus all first-order interactions | 51 | 0.001555565 | ^642 | ^224 |
Here p0 is the reciprocal of the mean predicted probability among all voters who didn't actually use AM, and p1 is the same thing for voters who actually did use AM.
None of the models can improve much on the base rate for p0. parties_only
somewhat improves p1, while all the other nontrivial models substantially improve p1. Parties help, and interaction terms help a tiny bit more.
Below are the coefficients of the parties
and ints&parties
models.
(setv noint-coefs (cache (am-usage "fit_all" counts :formula (get models :DemoParties :formula)))) (setv int-coefs (cache (am-usage "fit_all" counts :formula (get models :IntDemoParties :formula)))) (defn pretty-term-name [s] (cond [(in ":" s) (do (setv [s1 s2] (.split s ":")) (+ (pretty-term-name s1) " × " (pretty-term-name s2)))] [(= s "age") "Age"] [(= s "age2") "Age²"] [(= s "female[T.True]") "Female"] [(.startswith s "state[") (.format "State: {}" (get state-names (cut (.rstrip s "]") -2)))] [(.startswith s "party[") (.format "Party: {}" (get long-party-names (cut (.rstrip s "]") -3)))] [True s])) (setv pretty-coefs (.applymap (cbind :DemoParties noint-coefs :IntDemoParties int-coefs) (λ (if (np.isnan it) "" (.replace (format it ".02f") "-" "−"))))) (setv pretty-coefs.index (amap (pretty-term-name it) pretty-coefs.index)) (setv pretty-coefs.index.name "Term") pretty-coefs
Term | DemoParties | IntDemoParties |
---|---|---|
Intercept | −6.20 | −6.22 |
Age | −1.07 | −1.10 |
Age² | −2.23 | −2.14 |
Female | −3.31 | −3.13 |
Female × Age | 0.52 | |
Female × Age² | 0.43 | |
Female × Party: Democratic | 0.18 | |
Female × Party: Green | −0.08 | |
Female × Party: Libertarian | 0.03 | |
Female × Party: Republican | −0.01 | |
Female × State: California | −0.14 | |
Female × State: Florida | −0.14 | |
Female × State: Kansas | −0.23 | |
Female × State: Oklahoma | −0.23 | |
Party: Democratic | −0.18 | −0.17 |
Party: Democratic × Age | −0.13 | |
Party: Democratic × Age² | 0.08 | |
Party: Green | 0.02 | −0.19 |
Party: Green × Age | 0.22 | |
Party: Green × Age² | −0.29 | |
Party: Libertarian | 0.46 | 0.47 |
Party: Libertarian × Age | −0.34 | |
Party: Libertarian × Age² | −0.34 | |
Party: Republican | 0.19 | 0.28 |
Party: Republican × Age | −0.21 | |
Party: Republican × Age² | 0.05 | |
State: California | 0.14 | 0.16 |
State: California × Age | 0.22 | |
State: California × Age² | −0.19 | |
State: California × Party: Democratic | 0.06 | |
State: California × Party: Green | 0.23 | |
State: California × Party: Libertarian | −0.15 | |
State: California × Party: Republican | −0.16 | |
State: Florida | −0.04 | 0.01 |
State: Florida × Age | 0.19 | |
State: Florida × Age² | −0.22 | |
State: Florida × Party: Democratic | −0.13 | |
State: Florida × Party: Green | 0.42 | |
State: Florida × Party: Libertarian | −0.25 | |
State: Florida × Party: Republican | −0.10 | |
State: Kansas | −0.10 | −0.16 |
State: Kansas × Age | −0.33 | |
State: Kansas × Age² | −0.59 | |
State: Kansas × Party: Democratic | −0.07 | |
State: Kansas × Party: Libertarian | −0.08 | |
State: Kansas × Party: Republican | −0.25 | |
State: Oklahoma | −0.31 | −0.15 |
State: Oklahoma × Age | 0.23 | |
State: Oklahoma × Age² | −0.14 | |
State: Oklahoma × Party: Democratic | −0.26 | |
State: Oklahoma × Party: Republican | −0.20 |
Let's make predictions for a 40-year-old male New Yorker.
Using the parties
model:
(setv [age-t age2-t] (am-usage-transform-ages (am-usage-filter counts) 40)) (defn f [coefs] (defn c [k] (.get coefs k 0)) (setv l (lc [[name v] (+ [["party[T.np]" 0]] (list (.iteritems coefs)))] (re.match r"party\[T\.[A-Za-z]+\]\Z" name) (, (.group (re.search "T.([A-Za-z]+)" name) 1) (int (round (/ 1 (ilogit (+ (get noint-coefs "Intercept") (* age-t (c "age")) (* age2-t (c "age2")) (* age-t (c (+ name ":age"))) (* age2-t (c (+ name ":age2"))) v)))))))) (geti (.set-index (pd.DataFrame l) 0) : 0)) (.sort-values (f noint-coefs))
I | 1 |
---|---|
LBT | 117 |
REP | 152 |
GRE | 180 |
np | 184 |
DEM | 219 |
Using the ints_and_parties
model:
(setv predcompare (cbind :DemoParties (f noint-coefs) :IntDemoParties (f int-coefs))) (setv predcompare (.applymap (.sort-values predcompare "DemoParties") (λ (.format "^{}" it)))) (setv predcompare.index (amap (get long-party-names it) predcompare.index)) (setv predcompare.index.name "Party") predcompare
Party | DemoParties | IntDemoParties |
---|---|---|
Libertarian | ^117 | ^98 |
Republican | ^152 | ^138 |
Green | ^180 | ^219 |
unaffiliated | ^184 | ^189 |
Democratic | ^219 | ^223 |
Old modeling (with SGD)
Would it make things substantially easier if all IVs were discrete? Potentially, yes, since when all IVs of a logistic-regression model are discrete, and all interaction terms are included (and there's no regularization or anything else special), the best-fitting logistic-regression model just reproduces the contingency table. However, the cells of the IV contingency table here vary widely in size. Some are 0, because, e.g., I don't have data for Greens in Kansas.
SGDClassifier
can do probabilistic prediction not only with logistic loss (which is what I've done) but also with modified Huber loss. However, I bet it would be less effective.
(setv [n-modeling-subjects n-modeling-cheated] (cache (get-modeling-ns))) (setv base-rate (/ n-modeling-cheated n-modeling-subjects)) [ ["Sample size" (format n-modeling-subjects ",")] ["Cheated" (format n-modeling-cheated ",")] ["Base rate" (+ "1 in " (str (int (/ 1 base-rate))))] ["Base MSE" (round (* base-rate (- 1 base-rate)) 8)]]
Sample size | 45,201,799 |
Cheated | 68,602 |
Base rate | 1 in 658 |
Base MSE | 0.00151538 |
(setv cv-results (dict (cache (lc [ii [False True] ipp [False True]] (, (, ii ipp) (kwc try-model "cv" n-modeling-subjects n-modeling-cheated :include-interactions ii :include-party-params ipp)))))) (kwc .set-index :keys "model" (kwc pd.DataFrame :columns ["model" "MSE" "P when 0" "P when 1"] (+ [["trivial" (round (* base-rate (- 1 base-rate)) 8) (int (/ 1 base-rate)) (int (/ 1 base-rate))]] (lc [ii [False True] ipp [False True]] [ (.format "{}, {}" (if ii "interacts" "no interacts") (if ipp "parties" "no parties")) (round (get cv-results (, ii ipp) "mse") 8) (int (/ 1 (get cv-results (, ii ipp) "mean pred prob when 0"))) (int (/ 1 (get cv-results (, ii ipp) "mean pred prob when 1")))]))))
model | MSE | P when 0 | P when 1 |
---|---|---|---|
trivial | 0.00151538 | 658 | 658 |
no interacts, no parties | 0.00139625 | 709 | 333 |
no interacts, parties | 0.00139611 | 724 | 330 |
interacts, no parties | 0.00139643 | 706 | 375 |
interacts, parties | 0.00139615 | 719 | 351 |
Here we compared the fivefold cross-validate performance of four models, and also include base rates for comparison. The rightmost two columns give the mean predicted probability of cheating (expressed as a reciprocal) among subjects who were not (0) or were (1) actually cheating. We see that all models improve substantially upon the base rate, and adding party terms helps, but interactions don't help. So let's use the model with parties but no interactions.
Here are its parameters with fit to the whole dataset:
(setv params (OrderedDict (cache (kwc try-model "fit" n-modeling-subjects n-modeling-cheated :!include-interactions :+include-party-params)))) (lc [[name value] (.items params)] [name (rd value)])
Intercept | -5.106 |
CA | 0.064 |
FL | -0.196 |
KS | -0.247 |
OK | -0.430 |
known_female | -3.226 |
unknown_gender | -0.934 |
age | -1.059 |
age_squared | -0.978 |
DEM | -0.198 |
REP | 0.191 |
GRE | -0.044 |
LBT | 0.085 |
We can use them to create a little table of the predicted probability of a 25-year-old male New Yorker cheating:
(defn p [x] (get params x)) (kwc sorted :key second (lc [party keep-parties] (, party (int (/ 1 (ilogit (+ (p "Intercept") ;(p "NY") (* (p "age") (first (transformed-age 25))) (* (p "age_squared") (second (transformed-age 25))) (if (= party "np") 0 (p party)))))))))
REP | 182 |
LBT | 202 |
np | 220 |
GRE | 230 |
DEM | 268 |
We see that Democrats have the lowest probability of cheating, but differently from the simple graph above, Greens and libertarians are between Republicans and Democrats rather than cheating more than Republicans. The differences here are generally much smaller than those in the graph; the biggest difference is between Republicans and Democrats, which has Democrats cheat about a third less than Republicans.
Matching voter records across years
First attempt
As a test run, we consider NY 2010 and 2012. I compute how many records each dataset has that the other doesn't have by just subtracting the number of shared records.
- Matching by
rn
- 2010: 13,026,768 records
- 2012: 14,500,804 records
- Shared: 13,017,759
- 2010-only records: 9,009
- 2012-only records: 1,483,045
- Matching by
name
,zip
,addrnum1
,addrnum2
- 2010: 12,852,502 distinct records
- 2012: 14,283,937 distinct records
- Shared: 10,854,796
- 2010-only records: 1,997,706
- 2012-only records: 3,429,141
Matching by rn
, 1,385,648 of the pairs of records have a different addrnum1
or addrnum2
.
Matching by rn
, 12,999,937 of the 13,017,473 (99.87%) pairs of records have matching birthdates.
2010 voters matched to AM: 20,191
- By
rn
, 20,171 of these are also present in 2012. - And 1,803 have a new
addrnum1
oraddrnum2
.sqlite3 voter.sqlite 'attach "voter-2010.sqlite" as V10; select count(*) from (select rn, addrnum1, addrnum2 from V10.T where am_user notnull) as T10 join T as T12 using (rn) where T10.addrnum1 != T12.addrnum1 or T10.addrnum2 != T12.addrnum2
- 1,803 / 20,171 = .09
2010 voters not matched to AM:
- By
rn
, 12,997,588 are present in 2012. - And 1,383,845 of those have a new
addrnum1
oraddrnum2
. - 1,383,845 / 12,997,588 = .11
Notice that this difference is in the wrong direction: 2010 New York voters were slightly less likely to have a different address in 2012 if they were on AM.
We might do this better by handling time more explicitly on both sides, using transaction dates to date AM usage and record modification dates (when they exist) to date voter records.
Using transaction dates
I hoped to use registration dates as an indicator of when each voter record was last updated. However, this doesn't seem to be correct, at least for New York, because sqlite3 voter.sqlite 'attach "voter-2010.sqlite" as V10; select V10.T.registered < V12T.registered as updated, V10.T.addrnum1 != V12T.addrnum1 as moved, count(*) from (T as V12T inner join V10.T using (rn)) group by updated, moved
returns:
0 | 0 | 11639406 |
0 | 1 | 1076701 |
1 | 0 | 11582 |
1 | 1 | 289784 |
which shows us that quite a few seemingly non-updated 2012 records have new addresses. So instead, let's just use January 1st, 2012, as the threshold date. (Or rather, January 3rd, to provide some wiggle room for time-zone shenanigans.)
attach "voter-2010.sqlite" as V10; attach "amcc.sqlite" as AMCC; select V12T.am_user notnull and early_enough as am_before, V12T.addrnum1 != V10.T.addrnum1 as moved, count(*) from (T as V12T inner join V10.T using (rn)) left outer join (select usernum, min(time) < cast(strftime("%s", "20120103") as integer) as early_enough -- cast(strftime("%Y%m%d", min(time) - 2*24*60*60, "unixepoch", "utc") as integer) -- as first_am from AMCC.T group by usernum) on V12T.am_user = usernum where V12T.am_user isnull or V12T.am_user != -1 group by am_before, moved;
0 | 0 | 11632713 |
0 | 1 | 1363900 |
1 | 0 | 16329 |
1 | 1 | 2337 |
- Among moved, 2337 / 1363900 = 1 / 584 used AM before 2012
- Among non-moved, 16329 / 11632713 = 1 / 712 used AM before 2012
- Among AM users before 2012, 2337 / 16329 = 14% moved
- Among non-AM users before 2012, 1363900 / 11632713 = 12% moved
That's in the right direction, at least.
I'm still matching up AM users with voters using 2012 voter addresses. Should I be using the 2010 voter addresses instead? Should I use both somehow?
Diversity of names in voter data
One odd explanation I thought of for the finding that Republicans use AM more than Democrats is that Republicans may have less variety in full names (i.e., combination of first name and last name), inflating the number of false positives. In particular, Republicans being whiter than Democrats suggests there should be a smaller pool of names.
The 2012 voter records have 21,371,805 Democrats and 14,778,672 Republicans.
> sqlite3 ../data/voter.sqlite 'select count(distinct printf("%s %s", first_name, last_name)) from T where party = 1' # DEM 11457810 > sqlite3 ../data/voter.sqlite 'select count(distinct printf("%s %s", first_name, last_name)) from T where party = 2' # REP 8287618
The number of voters per name is thus
{ "Dem" (rd (/ 21,371,805 11,457,810)) "Rep" (rd (/ 14,778,672 8,287,618))}
K | value |
---|---|
Dem | 1.865 |
Rep | 1.783 |
The difference is in the direction I expected, but it's small, probably too small to make a substantial difference in terms of false positives.
am_am
These analyses consider only user numbers we found in the credit-card data.
AMLIB_SelectOptions.class.php
From srcdump/ashleymadison
, repository ashley.git
, revision HEAD
, file common/pinflib/amlib/AMLIB/AMLIB_SelectOptions.class.php
.
This defines the numeric codes for "preferences" (opento
in am_am), "tastes" (turnsmeon
), and "desires" (lookingfor
); gender; the various seeking types; etc.
For each code type, the codes are pretty much equivalent between the four gender-seeking classes, although not every class has access to every code, and there are some minor differences in wording. am_am
has some codes that aren't listed, but these generally don't appear in profiles created before 2009, suggesting that the ability to indicate them on the website was removed.
Frequency tables
(thousep (pd.read-sql :con amam "select female, count(*) as n from Users group by female"))
I | female | n |
---|---|---|
0 | nan | 12 |
1 | 0.0 | 720,634 |
2 | 1.0 | 18,022 |
(thousep (user-var-freq "ethnicity"))
ethnicity | n |
---|---|
Caucasian (white) | 596,751 |
Hispanic | 45,340 |
African American (black) | 40,826 |
Other | 23,026 |
Rather Not Say | 12,747 |
Asian | 12,020 |
East Indian | 4,666 |
Middle Eastern | 2,144 |
First Nations | 1,134 |
N/A | 14 |
(setv ages (do (setv v (geti (pd.read-sql :con amam "select born from Users") : 0)) (setv (get v (= v 0)) NaN) (setv v (pd.to-datetime v :format "%Y%m%d")) (/ (. (- (pd.Timestamp "2012-01-01") v) dt days) 365.25))) (rd 0 [ ["missing (count)" (.sum (pd.isnull ages))] ["<1 (count)" (.sum (& (pd.notnull ages) (<= ages 1)))] [".025 quantile (y)" (.quantile ages .025)] ["median (y)" (.median ages)] [".975 quantile (y)" (.quantile ages .975)] ["max (y)" (.max ages)]])
missing (count) | 3 |
<1 (count) | 47 |
.025 quantile (y) | 23 |
median (y) | 41 |
.975 quantile (y) | 62 |
max (y) | 91 |
(thousep (user-var-freq "seeking"))
seeking | n |
---|---|
attached male seeking females | 536,882 |
single male seeking females | 182,460 |
attached female seeking males | 10,861 |
female seeking females | 4,528 |
single female seeking males | 2,633 |
male seeking males | 1,292 |
N/A | 12 |
(thousep (user-var-freq "smoking"))
smoking | n |
---|---|
Not specified | 403,585 |
Never | 298,460 |
Occasionally | 27,106 |
Regularly | 9,026 |
N/A | 491 |
(thousep (user-var-freq "drinking"))
drinking | n |
---|---|
N/A | 737,081 |
Never | 1,485 |
Socially | 62 |
Occasionally | 36 |
Regularly | 4 |
(msf-prefcode-freq "OpenTo")
code | n | p |
---|---|---|
Conventional Sex | 408,457 | 0.568 |
Likes to Give Oral Sex | 405,127 | 0.563 |
Likes to Receive Oral Sex | 386,918 | 0.538 |
Kissing | 339,375 | 0.472 |
Good With Your Hands | 307,319 | 0.427 |
Sensual Massage | 302,026 | 0.420 |
Light Kinky Fun | 301,648 | 0.419 |
Sex Talk | 299,921 | 0.417 |
Extended Foreplay/Teasing | 292,603 | 0.407 |
Open to Experimentation | 276,365 | 0.384 |
Sharing Fantasies | 243,226 | 0.338 |
Cuddling & Hugging | 241,644 | 0.336 |
One-Night Stands | 233,780 | 0.325 |
Experimenting with Sex Toys | 224,530 | 0.312 |
Gentleness | 211,383 | 0.294 |
Lots of Stamina | 207,447 | 0.288 |
Erotic Movies | 191,189 | 0.266 |
Role Playing | 182,222 | 0.253 |
Bubble Bath for 2 | 179,206 | 0.249 |
Threesome | 178,851 | 0.249 |
Likes to Go Slow | 164,568 | 0.229 |
Someone Who Can Teach Me | 161,172 | 0.224 |
Someone I Can Teach | 157,231 | 0.219 |
Aggressiveness | 128,866 | 0.179 |
Blindfolding | 115,807 | 0.161 |
Experimenting with Tantric Sex | 113,165 | 0.157 |
Spanking | 112,192 | 0.156 |
Erotic Tickling | 92,255 | 0.128 |
Fetishes | 91,510 | 0.127 |
Likes to be Watched/Exhibitionism | 81,645 | 0.113 |
Dressing Up/Lingerie | 80,545 | 0.112 |
Being Dominant/Master | 78,000 | 0.108 |
N/A | 72,520 | 0.101 |
Curious - Domination | 62,970 | 0.088 |
Curious - Submission | 59,080 | 0.082 |
Bondage | 49,549 | 0.069 |
Being Submissive/Slave | 41,613 | 0.058 |
Nothing Kinky | 10,672 | 0.015 |
I Like to Cross Dress | 1,668 | 0.002 |
You Like to Cross Dress | 1,486 | 0.002 |
Transvestitism | 153 | 0.000 |
(msf-prefcode-freq "TurnsMeOn")
code | n | p |
---|---|---|
N/A | 525,202 | 0.730 |
Sense of Humor | 360,543 | 0.501 |
Good Personal Hygiene | 333,933 | 0.464 |
Discretion/Secrecy | 326,886 | 0.454 |
High Sex Drive | 309,451 | 0.430 |
Naughty Girl | 293,261 | 0.408 |
Relaxed and Easy Going | 285,479 | 0.397 |
Casual Jeans/T-shirt Type | 283,472 | 0.394 |
Imagination | 282,552 | 0.393 |
Creative and Adventurous | 279,220 | 0.388 |
Girl Next Door | 267,147 | 0.371 |
Stylish/Classy | 261,757 | 0.364 |
A Professional/Well Groomed | 261,029 | 0.363 |
Confidence | 255,735 | 0.356 |
Drug Free | 231,210 | 0.321 |
Not Possessive | 212,026 | 0.295 |
Good Communicator | 206,881 | 0.288 |
Casual/Social Drinker | 199,718 | 0.278 |
Petite Figure | 198,221 | 0.276 |
Long Hair | 195,490 | 0.272 |
Muscular/Fit Body | 162,050 | 0.225 |
Short Height | 136,838 | 0.190 |
A Good Listener | 132,376 | 0.184 |
Short Hair | 129,734 | 0.180 |
Aggressive/Take Charge Nature | 126,987 | 0.177 |
Has a Secret Love Nest | 121,362 | 0.169 |
Tall Height | 117,615 | 0.164 |
Tattoos | 117,150 | 0.163 |
Dislikes Routine | 108,025 | 0.150 |
Natural Breasts | 105,325 | 0.146 |
Hopeless Romantic | 102,383 | 0.142 |
Body Piercing | 93,689 | 0.130 |
Average Sex Drive | 85,712 | 0.119 |
BBW | 38,142 | 0.053 |
Seeking a Sugar Baby | 24,032 | 0.033 |
Likes Routine | 22,985 | 0.032 |
A Don Juan | 122 | 0.000 |
Slim to Average Body | 70 | 0.000 |
Full Size Body | 52 | 0.000 |
Bad Boy | 33 | 0.000 |
Boy Next Door | 21 | 0.000 |
Tall, Dark and Handsome | 2 | 0.000 |
Seeking a Sugar Daddy | 1 | 0.000 |
(msf-prefcode-freq "LookingFor")
code | n | p |
---|---|---|
N/A | 324,141 | 0.451 |
I Am a Social Drinker | 251,773 | 0.350 |
Music Lover | 224,666 | 0.312 |
Travel | 213,776 | 0.297 |
The Outdoors/Nature | 212,929 | 0.296 |
Daring Rendezvous | 198,557 | 0.276 |
Physical Fitness | 195,492 | 0.272 |
Fine Dining/Candle Lit Dinners | 188,344 | 0.262 |
Cooking/Barbequing | 169,270 | 0.235 |
Romantic Walks | 165,913 | 0.231 |
Watching Sports | 165,029 | 0.229 |
Long Drives | 155,997 | 0.217 |
Skinny Dipping | 155,110 | 0.216 |
Playing Sports | 152,602 | 0.212 |
Wine Tasting | 150,099 | 0.209 |
Strip Poker/Adult Games | 115,963 | 0.161 |
Dancing | 105,301 | 0.146 |
Boating | 104,142 | 0.145 |
Picnics | 100,305 | 0.139 |
Theatre | 98,428 | 0.137 |
Photography | 96,024 | 0.133 |
Shopping for Sexy Clothes/Lingerie | 95,847 | 0.133 |
Erotic Literature | 79,711 | 0.111 |
Motorcycles | 71,445 | 0.099 |
Politics | 62,349 | 0.087 |
Cards | 52,595 | 0.073 |
Board Games | 49,910 | 0.069 |
Cottage Country | 46,598 | 0.065 |
Visiting Adult Swing Clubs | 41,912 | 0.058 |
On-line Games | 28,016 | 0.039 |
Karaoke | 26,561 | 0.037 |
Opera | 22,451 | 0.031 |
I Do Not Drink | 17,088 | 0.024 |