Cheat notebook

Created 14 Nov 2015 • Last modified 15 May 2016

am_am

This table shows the columns of am_am.sqlite and the number of distinct values (including NULL).

id 750710
createdon 749085
createdby 2
updatedon 711923
updatedby 11
admin 2
status 2
account_type 2
membership_status 4
ad_source 133
profile_number 138
nickname 192914
first_name 35
last_name 38
street1 18
street2 4
city 21589
zip 27443
state 204
latitude 63321
longitude 49170
country 32
phone 34
work_phone 11
mobile_phone 7
gender 3
dob 20770
profile_caption 421308
profile_ethnicity 11
profile_weight 108
profile_height 56
profile_bodytype 9
profile_smoke 6
profile_drink 6
profile_initially_seeking 8
profile_relationship 7
pref_opento 436177
pref_opento_other 29856
pref_opento_abstract 359075
pref_turnsmeon 438752
pref_turnsmeon_other 25739
pref_turnsmeon_abstract 264546
pref_lookingfor 348202
pref_lookingfor_other 26530
pref_lookingfor_abstract 241815
main_photo 5
security_question 6
security_answer 108905

Birthdates in voter data

  • CA
    • 1931-01-01: 17,195
    • 1900-01-02 to 1901-01-01: each has 50 or more, compared to 1 to 1901-01-03
  • FL
    • 1942-10-10: 674, compared to 573 at most in the rest of the month
  • KS
    • 1901-01-01: 32, compared to 1s and 0s in the surrounding months
  • NY
    • 1901-01-01, 1950-01-01: 3,024 and 1,695, compared to less than 1,000 elsewhere
    • 1921-01-01: 529, compared to at most 204 in the surrounding months
  • OK
    • There are 44 unique impossible dates: 19080000 19100000 19110000 19130000 19140000 19160000 19170000 19180000 19190000 19200000 19210000 19220000 19230000 19240000 19250000 19260000 19270000 19280000 19290000 19300000 19310000 19320000 19330000 19340000 19340931 19350000 19360000 19370000 19380000 19390000 19400000 19410000 19420000 19430000 19450000 19460100 19470000 19480000 19480014 19490000 19510000 19530000 19550000 19640000

AM and voter data

Descriptive

["Total voters" (format (.sum ($ counts voters)) ",")]
Total voters 48,956,303

Match rate:

(setv statecounts (.sum (.groupby
  (getl counts : (qw state voters am_users)) "state")))
(setv state-summary (apply cbind (rmap [state (filt (in it am-users-in-states) states)]
  (setv matched (int (getl statecounts state "am_users")))
  (setv total-am (get am-users-in-states state))
  (setv voters (int (getl statecounts state "voters")))
  (setv pop (get state-pop-2012 state))
  (setv party-counts (do
    (setv d (ss counts (= $state state)))
    (setv [pv vv] (map second (.iteritems (getl d : (qw party voters)))))
    (setv pv (.set-categories pv.cat (+ keep-parties ["rest"])))
    (setv pv (.fillna pv "rest"))
    (setv d (.sum (.groupby (cbind pv vv) 0)))
    (list (.iteritems (geti d : 0)))))
  (kwc pds-from-pairs :name (get state-names state) (+ party-counts (pairs
      "Total registered voters" voters
      "Population" pop
      "Percent covered" (round (* 100 (/ voters pop)))
      "AM users matched with voters" matched
      "Total AM users" total-am
      "Percent matched" (round (* 100 (/ matched total-am)))))))))
(setv state-summary.index.name "State")
(setv state-summary (.fillna state-summary 0))
(setv state-summary (.applymap state-summary (λ (format (int it) ","))))
(kwc .rename state-summary :+inplace :index {
  "np" "Unaffiliated voters"
  "DEM" "Democrats"
  "REP" "Republicans"
  "GRE" "Greens"
  "LBT" "Libertarians"
  "rest" "Voters in other party"})
state-summary
State California Florida Kansas New York Oklahoma
Unaffiliated voters 2,714,148 2,704,253 523,128 3,018,473 229,834
Democrats 7,948,406 4,998,315 434,455 7,156,033 884,263
Republicans 5,309,194 4,392,202 777,149 3,482,361 847,590
Greens 114,309 6,149 0 32,148 0
Libertarians 110,997 12,033 11,712 4,953 0
Voters in other party 2,007,204 430,303 0 806,688 3
Total registered voters 18,204,258 12,543,255 1,746,444 14,500,656 1,961,690
Population 38,041,430 19,317,568 2,885,905 19,570,261 3,814,820
Percent covered 48 65 61 74 51
AM users matched with voters 31,651 17,299 2,581 21,967 2,014
Total AM users 93,134 44,492 6,341 54,828 6,690
Percent matched 34 39 41 40 30

Match rates range from 30% to 41%.

Here's a breakdown of gender and gender imputation.

(setv d (.reset-index (.applymap
  (.dropna (.sum (.groupby
    (.replace (getl counts : (qw state gender_known gender voters)) np.nan "Unknown")
    (qw state gender_known gender))))
  (λ (format (int it) ",")))))
(kwc .sort-values d (qw state gender_known gender)
  :ascending [True False True])
I state gender_known gender voters
3 CA True Female 6,286,406
4 CA True Male 5,463,739
0 CA False Female 2,988,956
1 CA False Male 2,616,676
2 CA False Unknown 848,481
8 FL True Female 6,650,580
9 FL True Male 5,663,435
5 FL False Female 97,125
6 FL False Male 94,047
7 FL False Unknown 38,068
13 KS True Female 924,631
14 KS True Male 821,481
10 KS False Female 131
11 KS False Male 122
12 KS False Unknown 79
18 NY True Female 7,923,344
19 NY True Male 6,576,412
15 NY False Female 340
16 NY False Male 327
17 NY False Unknown 233
20 OK False Female 937,129
21 OK False Male 777,200
22 OK False Unknown 247,361

The OK voter data has no gender information, so we had to impute every case there.

(setv n-total (.sum ($ counts voters)))
(setv n-not-given (.sum ($ (ss counts (~ $gender_known)) voters)))
(setv n-imputed (.sum ($ (ss counts (& (.notnull $gender) (~ $gender_known))) voters)))
[
  ["Proportion of genders missing" (rd (/ n-not-given n-total))]
  ["Proprtion of missing genders imputed" (rd (/ n-imputed n-not-given))]
  ["Proportion of genders still missing" (rd (/ (- n-not-given n-imputed) n-total))]]
Proportion of genders missing 0.177
Proprtion of missing genders imputed 0.869
Proportion of genders still missing 0.023
(setv n-bad-age (.sum ($ (ss counts (.isnull $age)) voters)))
["Proportion of ages invalid" (rd (/ n-bad-age n-total))]
Proportion of ages invalid 0.011

Here is the number of registered voters per party and state (ignoring the AM data).

(setv x (kwc .sort-values
  (.reset-index
    (kwc .sum ($ (.set-index counts ["state" "party"]) voters)
      :level ["state" "party"]))
  ["state" "voters"]
  :ascending [True False]))
(setv ($ x voters) (amap (.format "{:,}" it) ($ x voters)))
(kwc .reset-index :+drop x)
I state party voters
0 CA DEM 7,948,406
1 CA REP 5,309,194
2 CA np 2,714,148
3 CA DS 1,115,991
4 CA AI 479,868
5 CA OTH 272,050
6 CA GRE 114,309
7 CA LBT 110,997
8 CA PF 63,088
9 CA MIS 61,768
10 CA REF 7,818
11 CA AME 3,380
12 CA NAT 2,276
13 CA WWP 314
14 CA JP 147
15 CA CTP 113
16 CA CPC 96
17 CA WP 50
18 CA PIR 31
19 CA CP 31
20 CA CMP 31
21 CA HUM 27
22 CA HPC 14
23 CA MMW 11
24 CA OP 9
25 CA PPC 9
26 CA NRP 8
27 CA TVP 8
28 CA ACP 7
29 CA UCA 6
30 CA MCP 4
31 CA FED 4
32 CA NMB 4
33 CA DP 4
34 CA AMC 4
35 CA SEU 4
36 CA WFP 4
37 CA SAP 3
38 CA LRU 3
39 CA CPP 3
40 CA EJP 3
41 CA POT 2
42 CA GSP 2
43 CA EGA 2
44 CA UMP 2
45 CA APP 1
46 CA UCB 1
47 CA NSP 1
48 CA U08 1
49 CA ATP 1
50 FL DEM 4,998,315
51 FL REP 4,392,202
52 FL np 2,704,253
53 FL INT 281,132
54 FL UNK 59,813
55 FL IDP 58,618
56 FL NRS 25,999
57 FL LBT 12,033
58 FL GRE 6,149
59 FL REF 2,044
60 FL CPF 1,122
61 FL AIP 453
62 FL TPF 391
63 FL FPP 336
64 FL ECO 167
65 FL PSL 122
66 FL PFP 45
67 FL FSW 29
68 FL JPF 12
69 FL OBJ 12
70 FL AEL 4
71 FL FWP 3
72 FL SOC 1
73 KS REP 777,149
74 KS np 523,128
75 KS DEM 434,455
76 KS LBT 11,712
77 NY DEM 7,156,033
78 NY REP 3,482,361
79 NY np 3,018,473
80 NY IND 565,652
81 NY CON 184,613
82 NY WOR 56,242
83 NY GRE 32,148
84 NY LBT 4,953
85 NY FDM 109
86 NY RTH 57
87 NY TXP 6
88 NY APP 5
89 NY SWP 4
90 OK DEM 884,263
91 OK REP 847,590
92 OK np 229,834
93 OK AE 3

Florida and California have quite a few voters with undocumented party codes.

The web page for New York's Independence Party states that "The Party's leadership recognizes that individuals do sometimes unwittingly register as members of the Independence Party when their intent was to register to vote as a 'blank'." That may explain why CA, FL, and NY all have so many people in an "Independence Party" or "Independent Party" when I've never even heard of such a party before.

Below, voters shows the number of registered voters for each state and party, and AMr shows the reciprocal of the Ashley Madison match rate (e.g., 500 means that 1 in 500 such voters were matched to an Ashley Madison user).

(setv x (.dropna (.sum (.groupby
  (getl (drop-unused-cats (ss counts (.isin $party keep-parties)))
    : (qw state party voters am_users))
  (qw state party)))))
(setv ($ x AMr) (wc x (/ $voters $am_users)))
(setv x (.drop x "am_users" 1))
(setv xR (.reset-index x))
(setv ($ xR state) (.map ($ xR state) state-names))
(.to-csv xR "/tmp/amr.csv")
(.applymap x (λ (if (numeric? it) (format (int (round it)) ",") it)))
state party voters AMr
CA np 2,714,148 538
CA DEM 7,948,406 726
CA REP 5,309,194 489
CA GRE 114,309 405
CA LBT 110,997 271
FL np 2,704,253 648
FL DEM 4,998,315 1,098
FL REP 4,392,202 566
FL GRE 6,149 439
FL LBT 12,033 325
KS np 523,128 603
KS DEM 434,455 1,039
KS REP 777,149 618
KS LBT 11,712 300
NY np 3,018,473 621
NY DEM 7,156,033 894
NY REP 3,482,361 485
NY GRE 32,148 502
NY LBT 4,953 225
OK np 229,834 739
OK DEM 884,263 1,641
OK REP 847,590 728

Here's a plot. (The y-axis is upside-down so that higher points mean more Ashley Madison users.)

library(ggplot2)

d = read.csv("/tmp/amr.csv")
d = transform(d, party = factor(party, levels =
    c("np", "DEM", "REP", "GRE", "LBT")))
breaks = c(1, seq(250, 1750, 250))
amr.plot = ggplot(d) +
    geom_point(aes(state, AMr)) +
    geom_text(aes(
        x = as.integer(state) + ifelse(party == "REP", -.05, .05),
        y = AMr,
        label = c(
            np = "none", DEM = "Dem", REP = "Rep",
            GRE = "Green", LBT = "Lib")[as.character(party)],
        hjust = ifelse(party == "REP", 1, 0))) +
    xlab("State") +
    scale_y_reverse(name = "Ashley Madison users",
      expand = c(0, 0),
      breaks = breaks,
      labels = paste("1 in", breaks)) +
    coord_cartesian(ylim = range(breaks)) +
    theme(
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "black"))
amr.plot

amr.png

In all five states, Dems cheat less than other party categories (including the more liberal Greens). In all four states with a libertarian party, libertarians cheat more than other party categories.

Party membership differs by gender and age. If we consider only males, and we control for age, do we still see these AM usage differences by party? Below is a crude graph where we group men by year of birth.

(setv cs (getl
  (ss counts
    (& (= $gender "Male") (.notnull $born)
      (.isin $party ["np" "DEM" "REP"])))
  :
  (qw state born party am n)))
(setv out (.dropna (.apply
  (.groupby cs ["state" (// ($ cs born) 10000) "party"])
  (λ (if (.any ($ it am))
    (/ (.sum ($ (ss it [[var:a]][[var:m]]) [[var:n]])) (.[[var:s]][[var:u]][[var:m]] ( it n)))
    0)))))
(setv out (kwc pd.DataFrame out :columns ["am_proportion"]))
;(pd.Series [(.sum ($ it n)) (.sum ($ (ss it $am) n))] ["n" "a"])))))
(.to-csv out "/tmp/am-year-party.csv")
d = read.csv("/tmp/am-year-party.csv")
library(ggplot2)
ggplot(subset(d, born >= 1915)) +
   geom_line(aes(born, am_proportion, color = party)) +
   scale_color_manual(values = c("blue", "darkgreen", "red")) +
   facet_grid(state ~ ., scales = "free")

am-year-party.png

It looks like it, pretty much.

Modeling

Basic information about the data we'll use for modeling:

(setv d (am-usage-filter counts))
(setv base-rate (/ (.sum ($ d am_users)) (.sum ($ d voters))))
[
  ["Sample size" (format (.sum ($ d voters)) ",")]
  ["AM users"    (format (.sum ($ d am_users)) ",")]
  ["Base rate" (+ "1 in " (str (int (round (/ 1 base-rate)))))]
  ["Base MSE" (round (* base-rate (- 1 base-rate)) 8)]]
Sample size 44,238,025
AM users 67,738
Base rate 1 in 653
Base MSE 0.00152887
(setv folds (do
  (setv old-rng-state (np.random.get-state))
  (np.random.seed 55)
  (setv out (gen-counts-cv-folds-ts ($ d voters) ($ d am_users) 10))
  (np.random.set-state old-rng-state)
  out))

(import [collections [OrderedDict]])

(setv models (OrderedDict (,
  [:Trivial {:desc "No predictors" :formula
    None}]
  [:Parties {:desc "Predictors: party only" :formula
    "party"}]
  [:Demo {:desc "Predictors: state, gender, age" :formula
    "state + female + age + age2"}]
  [:DemoParties {:desc "Predictors: state, gender, age, party" :formula
    "state + female + age + age2 + party"}]
  [:IntDemo {:desc "Demo, plus all first-order interactions" :formula (.join " " [
    "state + female + age + age2 +"
    "female:(state + age + age2) +"
    "state:(age + age2)"])}]
  [:IntDemoParties {:desc "DemoParties, plus all first-order interactions" :formula (.join " " [
    "state + female + age + age2 + party +"
    "female:(state + age + age2 + party) +"
    "state:(age + age2 + party) +"
    "party:(age + age2)"])}])))
(for [[k v] (.items models)]
  (setv (get v :name) (keyword->str k)))

(for [model (.values models)]
  (.update model (kwc cached-eval
    (+ "am-usage-cv " (get model :name))
    :cache-dir cache-dir
    (fn []
      (if (= (get model :name) "Trivial")
        (kwc am-usage "cv" counts "state" folds :+trivial-model)
        (am-usage "cv" counts (get model :formula) folds))))))

(setv cv-results (df-from-pairs (rmap [model (.values models)] (pairs
  "Model" (get model :name)
  "Description" (get model :desc)
  "Terms" (if (= (get model :name) "Trivial") 1
    (+ 1 (len (get model :x-column-names))))
  "Mean square error" (round (get model :cv-mse) 9)
  "[[var:p]]_{0}" (int (round (/ 1 (get model :cv-p-cond 0))))
  "[[var:p]]_{1}" (int (round (/ 1 (get model :cv-p-cond 1))))))))
(setv cv-results (.set-index cv-results "Model"))
cv-results
Model Description Terms Mean square error p0 p1
Trivial No predictors 1 0.001528872 653 653
Parties Predictors: party only 5 0.001528734 653 617
Demo Predictors: state, gender, age 8 0.001524705 655 237
DemoParties Predictors: state, gender, age, party 12 0.001524521 655 230
IntDemo Demo, plus all first-order interactions 22 0.001524695 655 236
IntDemoParties DemoParties, plus all first-order interactions 51 0.001524486 655 229

Here p0 is the reciprocal of the mean predicted probability among all voters who didn't actually use AM, and p1 is the same thing for voters who actually did use AM.

None of the models can improve much on the base rate for p0. parties_only somewhat improves p1, while all the other nontrivial models substantially improve p1. Parties help, and interaction terms help a tiny bit more.

Below are the coefficients of the parties and ints&parties models.

(setv noint-coefs (cache
  (kwc am-usage "fit_all" counts :formula
    (get models :DemoParties :formula))))
(setv int-coefs (cache
  (kwc am-usage "fit_all" counts :formula
    (get models :IntDemoParties :formula))))

(defn pretty-term-name [s] (cond
  [(in ":" s) (do
    (setv [s1 s2] (.split s ":"))
    (+ (pretty-term-name s1) " × " (pretty-term-name s2)))]
  [(= s "age")
    "Age"]
  [(= s "age2")
    "Age²"]
  [(= s "female[T.True]")
    "Female"]
  [(.startswith s "state[")
    (.format "State: {}" (get state-names (slice (.rstrip s "]") -2)))]
  [(.startswith s "party[")
    (.format "Party: {}" (get long-party-names (slice (.rstrip s "]") -3)))]
  [True
    s]))

(setv pretty-coefs (.applymap
  (cbind :DemoParties noint-coefs :IntDemoParties int-coefs)
  (λ (if (np.isnan it) "" (.replace (format it ".02f") "-" "−")))))
(setv pretty-coefs.index (amap (pretty-term-name it) pretty-coefs.index))
(setv pretty-coefs.index.name "Term")
pretty-coefs
Term DemoParties IntDemoParties
Intercept −6.20 −6.23
Age −1.12 −1.16
Age² −2.25 −2.15
Female −3.29 −3.13
Female × Age   0.54
Female × Age²   0.43
Female × Party: Democratic   0.19
Female × Party: Green   −0.09
Female × Party: Libertarian   0.07
Female × Party: Republican   0.02
Female × State: California   −0.14
Female × State: Florida   −0.13
Female × State: Kansas   −0.22
Female × State: Oklahoma   −0.21
Party: Democratic −0.18 −0.17
Party: Democratic × Age   −0.13
Party: Democratic × Age²   0.08
Party: Green 0.04 −0.06
Party: Green × Age   0.27
Party: Green × Age²   −0.28
Party: Libertarian 0.44 0.47
Party: Libertarian × Age   −0.41
Party: Libertarian × Age²   −0.43
Party: Republican 0.19 0.28
Party: Republican × Age   −0.23
Party: Republican × Age²   0.05
State: California 0.11 0.15
State: California × Age   0.27
State: California × Age²   −0.18
State: California × Party: Democratic   0.04
State: California × Party: Green   0.12
State: California × Party: Libertarian   −0.20
State: California × Party: Republican   −0.16
State: Florida −0.09 −0.02
State: Florida × Age   0.22
State: Florida × Age²   −0.22
State: Florida × Party: Democratic   −0.14
State: Florida × Party: Green   0.25
State: Florida × Party: Libertarian   −0.33
State: Florida × Party: Republican   −0.11
State: Kansas −0.14 −0.19
State: Kansas × Age   −0.29
State: Kansas × Age²   −0.59
State: Kansas × Party: Democratic   −0.08
State: Kansas × Party: Libertarian   −0.14
State: Kansas × Party: Republican   −0.24
State: Oklahoma −0.35 −0.21
State: Oklahoma × Age   0.26
State: Oklahoma × Age²   −0.19
State: Oklahoma × Party: Democratic   −0.25
State: Oklahoma × Party: Republican   −0.17

Let's make predictions for a 25-year-old known-to-be-male New Yorker.

Using the parties model:

(setv [age-t age2-t] (am-usage-transform-ages
  (am-usage-filter counts) 40))
(defn f [coefs]
  (defn c [k] (.get coefs k 0))
  (setv l (lc [[name v] (+ [["party[T.np]" 0]] (list (.iteritems coefs)))]
    (re.match r"party\[T\.[A-Za-z]+\]\Z" name)
    (, (.group (re.search "T.([A-Za-z]+)" name) 1) (int (round (/ 1 (ilogit (+
      (get noint-coefs "Intercept")
      (* age-t (c "age"))
      (* age2-t (c "age2"))
      (* age-t (c (+ name ":age")))
      (* age2-t (c (+ name ":age2")))
      v))))))))
  (geti (.set-index (pd.DataFrame l) 0) : 0))
(.sort-values (f noint-coefs))
I 1
LBT 117
REP 151
GRE 174
np 182
DEM 217

Using the ints_and_parties model:

(setv predcompare (cbind :DemoParties (f noint-coefs) :IntDemoParties (f int-coefs)))
(setv predcompare (.sort-values predcompare "DemoParties"))
(setv predcompare.index (amap (get long-party-names it) predcompare.index))
(setv predcompare.index.name "Party")
predcompare
Party DemoParties IntDemoParties
Libertarian 117 91
Republican 151 135
Green 174 193
unaffiliated 182 185
Democratic 217 218

Old modeling (with SGD)

Would it make things substantially easier if all IVs were discrete? Potentially, yes, since when all IVs of a logistic-regression model are discrete, and all interaction terms are included (and there's no regularization or anything else special), the best-fitting logistic-regression model just reproduces the contingency table. However, the cells of the IV contingency table here vary widely in size. Some are 0, because, e.g., I don't have data for Greens in Kansas.

SGDClassifier can do probabilistic prediction not only with logistic loss (which is what I've done) but also with modified Huber loss. However, I bet it would be less effective.

(setv [n-modeling-subjects n-modeling-cheated] (cache (get-modeling-ns)))
(setv base-rate (/ n-modeling-cheated n-modeling-subjects))
[
  ["Sample size" (format n-modeling-subjects ",")]
  ["Cheated" (format n-modeling-cheated ",")]
  ["Base rate" (+ "1 in " (str (int (/ 1 base-rate))))]
  ["Base MSE" (round (* base-rate (- 1 base-rate)) 8)]]
Sample size 45,201,799
Cheated 68,602
Base rate 1 in 658
Base MSE 0.00151538
(setv cv-results (dict (cache (lc [ii [False True] ipp [False True]]
  (, (, ii ipp) (kwc try-model "cv" n-modeling-subjects n-modeling-cheated
    :include-interactions ii
    :include-party-params ipp))))))
(kwc .set-index :keys "model" (kwc pd.DataFrame :columns ["model" "MSE" "P when 0" "P when 1"] (+
  [["trivial" (round (* base-rate (- 1 base-rate)) 8)
    (int (/ 1 base-rate)) (int (/ 1 base-rate))]]
  (lc [ii [False True] ipp [False True]] [
      (.format "{}, {}"
        (if ii "interacts" "no interacts")
        (if ipp "parties" "no parties"))
      (round (get cv-results (, ii ipp) "mse") 8)
      (int (/ 1 (get cv-results (, ii ipp) "mean pred prob when 0")))
      (int (/ 1 (get cv-results (, ii ipp) "mean pred prob when 1")))]))))
model MSE P when 0 P when 1
trivial 0.00151538 658 658
no interacts, no parties 0.00139625 709 333
no interacts, parties 0.00139611 724 330
interacts, no parties 0.00139643 706 375
interacts, parties 0.00139615 719 351

Here we compared the fivefold cross-validate performance of four models, and also include base rates for comparison. The rightmost two columns give the mean predicted probability of cheating (expressed as a reciprocal) among subjects who were not (0) or were (1) actually cheating. We see that all models improve substantially upon the base rate, and adding party terms helps, but interactions don't help. So let's use the model with parties but no interactions.

Here are its parameters with fit to the whole dataset:

(setv params (OrderedDict (cache (kwc try-model "fit" n-modeling-subjects n-modeling-cheated
  :!include-interactions :+include-party-params))))
(lc [[name value] (.items params)] [name (rd value)])
Intercept -5.106
CA 0.064
FL -0.196
KS -0.247
OK -0.430
known_female -3.226
unknown_gender -0.934
age -1.059
age_squared -0.978
DEM -0.198
REP 0.191
GRE -0.044
LBT 0.085

We can use them to create a little table of the predicted probability of a 25-year-old male New Yorker cheating:

(defn p [x] (get params x))
(kwc sorted :key second (lc [party keep-parties]
  (, party (int (/ 1 (ilogit (+
    (p "Intercept")
    ;(p "NY")
    (* (p "age") (first (transformed-age 25)))
    (* (p "age_squared") (second (transformed-age 25)))
    (if (= party "np") 0 (p party)))))))))
REP 182
LBT 202
np 220
GRE 230
DEM 268

We see that Democrats have the lowest probability of cheating, but differently from the simple graph above, Greens and libertarians are between Republicans and Democrats rather than cheating more than Republicans. The differences here are generally much smaller than those in the graph; the biggest difference is between Republicans and Democrats, which has Democrats cheat about a third less than Republicans.

Matching voter records across years

First attempt

As a test run, we consider NY 2010 and 2012. I compute how many records each dataset has that the other doesn't have by just subtracting the number of shared records.

  • Matching by rn
    • 2010: 13,026,768 records
    • 2012: 14,500,804 records
    • Shared: 13,017,759
    • 2010-only records: 9,009
    • 2012-only records: 1,483,045
  • Matching by name, zip, addrnum1, addrnum2
    • 2010: 12,852,502 distinct records
    • 2012: 14,283,937 distinct records
    • Shared: 10,854,796
    • 2010-only records: 1,997,706
    • 2012-only records: 3,429,141

Matching by rn, 1,385,648 of the pairs of records have a different addrnum1 or addrnum2.

Matching by rn, 12,999,937 of the 13,017,473 (99.87%) pairs of records have matching birthdates.

2010 voters matched to AM: 20,191

  • By rn, 20,171 of these are also present in 2012.
  • And 1,803 have a new addrnum1 or addrnum2.
    • sqlite3 voter.sqlite 'attach "voter-2010.sqlite" as V10; select count(*) from (select rn, addrnum1, addrnum2 from V10.T where am_user notnull) as T10 join T as T12 using (rn) where T10.addrnum1 != T12.addrnum1 or T10.addrnum2 != T12.addrnum2
  • 1,803 / 20,171 = .09

2010 voters not matched to AM:

  • By rn, 12,997,588 are present in 2012.
  • And 1,383,845 of those have a new addrnum1 or addrnum2.
  • 1,383,845 / 12,997,588 = .11

Notice that this difference is in the wrong direction: 2010 New York voters were slightly less likely to have a different address in 2012 if they were on AM.

We might do this better by handling time more explicitly on both sides, using transaction dates to date AM usage and record modification dates (when they exist) to date voter records.

Using transaction dates

I hoped to use registration dates as an indicator of when each voter record was last updated. However, this doesn't seem to be correct, at least for New York, because sqlite3 voter.sqlite 'attach "voter-2010.sqlite" as V10; select V10.T.registered < V12T.registered as updated, V10.T.addrnum1 != V12T.addrnum1 as moved, count(*) from (T as V12T inner join V10.T using (rn)) group by updated, moved returns:

0 0 11639406
0 1 1076701
1 0 11582
1 1 289784

which shows us that quite a few seemingly non-updated 2012 records have new addresses. So instead, let's just use January 1st, 2012, as the threshold date. (Or rather, January 3rd, to provide some wiggle room for time-zone shenanigans.)

attach "voter-2010.sqlite" as V10;
attach "amcc.sqlite" as AMCC;
select
  V12T.am_user notnull and early_enough as am_before,
  V12T.addrnum1 != V10.T.addrnum1 as moved,
  count(*)
  from (T as V12T inner join V10.T using (rn))
      left outer join (select
         usernum,
         min(time) < cast(strftime("%s", "20120103") as integer)
             as early_enough
--         cast(strftime("%Y%m%d", min(time) - 2*24*60*60, "unixepoch", "utc") as integer)
--             as first_am
         from AMCC.T
         group by usernum)
      on V12T.am_user = usernum
  where V12T.am_user isnull or V12T.am_user != -1
  group by am_before, moved;
0 0 11632713
0 1 1363900
1 0 16329
1 1 2337
  • Among moved, 2337 / 1363900 = 1 / 584 used AM before 2012
  • Among non-moved, 16329 / 11632713 = 1 / 712 used AM before 2012
  • Among AM users before 2012, 2337 / 16329 = 14% moved
  • Among non-AM users before 2012, 1363900 / 11632713 = 12% moved

That's in the right direction, at least.

I'm still matching up AM users with voters using 2012 voter addresses. Should I be using the 2010 voter addresses instead? Should I use both somehow?