Birthdates in voter data

  • CA
    • 1931-01-01: 17,195
    • 1900-01-02 to 1901-01-01: each date has 50 or more voters, compared to 1 on 1901-01-03
  • FL
    • 1942-10-10: 674, compared to 573 at most in the rest of the month
  • KS
    • 1901-01-01: 32, compared to 1s and 0s in the surrounding months
  • NY
    • 1901-01-01, 1950-01-01: 3,024 and 1,695, compared to less than 1,000 elsewhere
    • 1921-01-01: 529, compared to at most 204 in the surrounding months
  • OK
    • There are 44 unique impossible dates: 19080000 19100000 19110000 19130000 19140000 19160000 19170000 19180000 19190000 19200000 19210000 19220000 19230000 19240000 19250000 19260000 19270000 19280000 19290000 19300000 19310000 19320000 19330000 19340000 19340931 19350000 19360000 19370000 19380000 19390000 19400000 19410000 19420000 19430000 19450000 19460100 19470000 19480000 19480014 19490000 19510000 19530000 19550000 19640000
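One way to flag values like these is to try building a calendar date from each 8-digit string. This is only a minimal sketch, assuming birthdates are stored as YYYYMMDD strings as in the Oklahoma list above; the function name is_valid_yyyymmdd is hypothetical, and the sample list holds four of the values above plus one ordinary date for contrast.

```python
from datetime import date

def is_valid_yyyymmdd(s: str) -> bool:
    """Return True if s is an 8-digit string naming a real calendar date."""
    if len(s) != 8 or not s.isdigit():
        return False
    try:
        # date() raises ValueError for month 0, day 0, September 31, etc.
        date(int(s[:4]), int(s[4:6]), int(s[6:]))
    except ValueError:
        return False
    return True

# Four of the Oklahoma values above, plus one ordinary date for contrast.
samples = ["19080000", "19340931", "19460100", "19480014", "19421010"]
impossible = [s for s in samples if not is_valid_yyyymmdd(s)]
print(impossible)
```

Placeholder dates such as 1901-01-01 pass this check, so they still have to be found by looking at counts per date, as in the other states.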

AM and voter data


Total voters 48,852,975

Match rate:

State                           California     Florida      Kansas    New York    Oklahoma
Unaffiliated voters              2,710,719   2,700,397     522,094   3,016,645     229,494
Democrats                        7,936,325   4,988,433     432,858   7,134,846     879,343
Republicans                      5,297,443   4,378,861     773,346   3,484,717     844,305
Greens                             114,126       6,143           0      32,811           0
Libertarians                       110,808      12,016      11,690       4,710           0
Voters in other party            2,004,915     429,399           0     796,528           3
Total registered voters         18,174,336  12,515,249   1,739,988  14,470,257   1,953,145
Population                      38,041,430  19,317,568   2,885,905  19,570,261   3,814,820
Percent covered                         48          65          60          74          51
AM users matched with voters        32,457      17,928       2,642      21,776       2,072
Total AM users                      92,058      43,987       6,282      54,203       6,603
Percent matched                         35          41          42          40          31

Match rates range from 31% to 42%.
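As a check on the two percentage rows, here is a small sketch that recomputes them from the counts in the table. It assumes "Percent covered" means registered voters divided by population and "Percent matched" means matched AM users divided by all AM users in the state; the helper name pct is hypothetical.

```python
# Counts copied from the table above.
registered = {"CA": 18_174_336, "FL": 12_515_249, "KS": 1_739_988,
              "NY": 14_470_257, "OK": 1_953_145}
population = {"CA": 38_041_430, "FL": 19_317_568, "KS": 2_885_905,
              "NY": 19_570_261, "OK": 3_814_820}
am_matched = {"CA": 32_457, "FL": 17_928, "KS": 2_642, "NY": 21_776, "OK": 2_072}
am_total = {"CA": 92_058, "FL": 43_987, "KS": 6_282, "NY": 54_203, "OK": 6_603}

def pct(part: int, whole: int) -> int:
    """Percentage of whole made up by part, rounded to a whole number."""
    return round(100 * part / whole)

percent_covered = {s: pct(registered[s], population[s]) for s in registered}
percent_matched = {s: pct(am_matched[s], am_total[s]) for s in am_total}
total_voters = sum(registered.values())

print(total_voters, percent_covered, percent_matched)
```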

Here's a breakdown of gender and gender imputation.

Proportion of genders missing 0.177
Proportion of missing genders imputed 0.872
Proportion of genders still missing 0.023
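The three gender figures appear to fit together: the share still missing should be the share originally missing times the share of those that imputation failed to fill. A quick arithmetic sketch, using the rounded values printed above:

```python
missing = 0.177        # proportion of genders missing before imputation
imputed_share = 0.872  # proportion of those missing genders that were imputed

# Whatever imputation did not fill is still missing.
still_missing = missing * (1 - imputed_share)
print(f"{still_missing:.3f}")
```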
(setv n-bad-age (.sum ($ (ss counts (.isnull $age)) voters)))
["Proportion of ages invalid" (format (/ n-bad-age n-total) ".03f")]
Proportion of ages invalid 0.010

Here is the number of registered voters per party and state (ignoring the AM data).

I state party voters
0 CA DEM 7,936,325
1 CA REP 5,297,443
2 CA np 2,710,719
3 CA DS 1,114,635
4 CA AI 479,378
5 CA OTH 271,806
6 CA GRE 114,126
7 CA LBT 110,808
8 CA PF 63,014
9 CA MIS 61,684
10 CA REF 7,786
11 CA AME 3,379
12 CA NAT 2,271
13 CA WWP 313
14 CA JP 146
15 CA CTP 113
16 CA CPC 96
17 CA WP 50
18 CA PIR 31
19 CA CP 31
20 CA CMP 31
21 CA HUM 27
22 CA HPC 14
23 CA MMW 11
24 CA OP 9
25 CA PPC 9
26 CA NRP 8
27 CA TVP 7
28 CA ACP 7
29 CA UCA 6
30 CA MCP 4
31 CA FED 4
32 CA NMB 4
33 CA DP 4
34 CA AMC 4
35 CA SEU 4
36 CA WFP 4
37 CA SAP 3
38 CA LRU 3
39 CA CPP 3
40 CA EJP 3
41 CA POT 2
42 CA GSP 2
43 CA EGA 2
44 CA UMP 2
45 CA APP 1
46 CA UCB 1
47 CA NSP 1
48 CA U08 1
49 CA ATP 1
50 FL DEM 4,988,433
51 FL REP 4,378,861
52 FL np 2,700,397
53 FL INT 280,485
54 FL UNK 59,740
55 FL IDP 58,517
56 FL NRS 25,928
57 FL LBT 12,016
58 FL GRE 6,143
59 FL REF 2,037
60 FL CPF 1,119
61 FL AIP 452
62 FL TPF 391
63 FL FPP 336
64 FL ECO 166
65 FL PSL 122
66 FL PFP 45
67 FL FSW 29
68 FL JPF 12
69 FL OBJ 12
70 FL AEL 4
71 FL FWP 3
72 FL SOC 1
73 KS REP 773,346
74 KS np 522,094
75 KS DEM 432,858
76 KS LBT 11,690
77 NY DEM 7,134,846
78 NY REP 3,484,717
79 NY np 3,016,645
80 NY IND 557,885
81 NY CON 183,041
82 NY WOR 55,420
83 NY GRE 32,811
84 NY LBT 4,710
85 NY FDM 110
86 NY RTH 58
87 NY TXP 6
88 NY APP 5
89 NY SWP 3
90 OK DEM 879,343
91 OK REP 844,305
92 OK np 229,494
93 OK AE 3

Florida and California have quite a few voters with undocumented party codes.

The web page for New York's Independence Party states that "The Party's leadership recognizes that individuals do sometimes unwittingly register as members of the Independence Party when their intent was to register to vote as a 'blank'." That may explain why CA, FL, and NY all have so many people in an "Independence Party" or "Independent Party" when I've never even heard of such a party before. Here's the number of people in each such party:

(ss x (.isin $party (qw IND IDP INT AI)))
I state party voters
4 CA AI 479,378
53 FL INT 280,485
55 FL IDP 58,517
80 NY IND 557,885

Below, voters shows the number of registered voters for each state and party, and AMr shows the reciprocal of the Ashley Madison match rate (e.g., 500 means that 1 in 500 such voters were matched to an Ashley Madison user).

(setv x (.dropna (.sum (.groupby
  (getl (drop-unused-cats (ss counts (.isin $party keep-parties)))
    : (qw state party voters am_users))
  (qw state party)))))
(setv ($ x AMr) (wc x (/ $voters $am_users)))
(setv x (.drop x "am_users" 1))
(setv xR (.reset-index x))
(setv ($ xR state) (.map ($ xR state) state-names))
(.to-csv xR "/tmp/amr.csv")
(.applymap x (λ (if (numeric? it) (format (int (round it)) ",") it)))
state party voters AMr
CA np 2,710,719 527
CA DEM 7,936,325 703
CA REP 5,297,443 476
CA GRE 114,126 399
CA LBT 110,808 260
FL np 2,700,397 629
FL DEM 4,988,433 1,057
FL REP 4,378,861 542
FL GRE 6,143 410
FL LBT 12,016 300
KS np 522,094 587
KS DEM 432,858 1,007
KS REP 773,346 604
KS LBT 11,690 285
NY np 3,016,645 627
NY DEM 7,134,846 901
NY REP 3,484,717 485
NY GRE 32,811 566
NY LBT 4,710 236
OK np 229,494 702
OK DEM 879,343 1,573
OK REP 844,305 712

Here's a plot. (The y-axis is upside-down so that higher points mean more Ashley Madison users.)


In all five states, Dems cheat less than other party categories (including the more liberal Greens). In all four states with a libertarian party, libertarians cheat more than other party categories.

Here are the AM rates by state alone:

(setv x (getl (.sum (.groupby counts "state")) : (qw voters am_users)))
(setv ($ x AMr) (wc x (/ $voters $am_users)))
;(setv x (.drop x "am_users" 1))
;(setv xR (.reset-index x))
;(setv ($ xR state) (.map ($ xR state) state-names))
;(.to-csv xR "/tmp/amr.csv")
(ordf (.applymap x (λ (if (numeric? it) (format (int (round it)) ",") it)))
state voters am_users AMr
CA 18,174,336 32,457 560
KS 1,739,988 2,642 659
NY 14,470,257 21,776 665
FL 12,515,249 17,928 698
OK 1,953,145 2,072 943
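For reference, AMr is just voters divided by matched AM users. A minimal sketch recomputing the state-level column from the counts in the table above (plain Python here, not the notebook's own pandas code):

```python
# Counts copied from the state table above.
voters = {"CA": 18_174_336, "KS": 1_739_988, "NY": 14_470_257,
          "FL": 12_515_249, "OK": 1_953_145}
am_users = {"CA": 32_457, "KS": 2_642, "NY": 21_776, "FL": 17_928, "OK": 2_072}

# AMr: one matched AM user per this many registered voters.
amr = {state: round(voters[state] / am_users[state]) for state in voters}

# Lower AMr means a higher match rate, so sort ascending.
ranked = sorted(amr, key=amr.get)
print(amr, ranked)
```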

Party membership differs by gender and age. If we consider only males, and we control for age, do we still see these AM usage differences by party? Below is a crude graph where we group men by year of birth.

It looks like it, pretty much.


Basic information about the data we'll use for modeling:

(setv d (am-usage-filter counts))
(setv base-rate (/ (.sum ($ d am_users)) (.sum ($ d voters))))
  ["Sample size" (format (.sum ($ d voters)) ",")]
  ["AM users"    (format (.sum ($ d am_users)) ",")]
  ["Base rate" (+ "1 in " (str (int (round (/ 1 base-rate)))))]
  ["Base MSE" (round (* base-rate (- 1 base-rate)) 8)]]
Sample size 44,172,769
AM users 69,023
Base rate 1 in 640
Base MSE 0.00156013
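The base MSE is what you get by predicting the base rate p for every voter: AM users contribute a squared error of (1 - p)^2 and everyone else contributes p^2, which averages out to p(1 - p). A short sketch of that arithmetic with the two counts above:

```python
n_voters = 44_172_769
n_am = 69_023

base_rate = n_am / n_voters
one_in = round(1 / base_rate)

# Mean squared error of predicting base_rate for everyone:
# a fraction base_rate of voters (the AM users) are off by (1 - base_rate),
# and the remaining fraction are off by base_rate.
base_mse = base_rate * (1 - base_rate) ** 2 + (1 - base_rate) * base_rate ** 2

print(one_in, f"{base_mse:.8f}")
```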
(setv cv-results (.set-index cv-results "Model"))
Model           Description                                     Terms  MSE          p0    p1
Trivial         No predictors                                       1  0.001560127  ^640  ^640
Parties         Predictors: party only                              5  0.001559982  ^640  ^604
Demo            Predictors: state, gender, age                      8  0.001555791  ^642  ^232
DemoParties     Predictors: state, gender, age, party              12  0.001555600  ^642  ^226
IntDemo         Demo, plus all first-order interactions            22  0.001555783  ^642  ^231
IntDemoParties  DemoParties, plus all first-order interactions     51  0.001555565  ^642  ^224

Here p0 is the reciprocal of the mean predicted probability among all voters who didn't actually use AM, and p1 is the same thing for voters who actually did use AM.
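To make the definition concrete, here is a toy sketch of how p0 and p1 could be computed from per-voter predictions. The four predicted probabilities and outcomes are made up for illustration, and the function name reciprocal_mean is hypothetical.

```python
def reciprocal_mean(probs):
    """Reciprocal of the mean of a list of predicted probabilities."""
    return 1 / (sum(probs) / len(probs))

# Hypothetical toy data: one predicted probability and one outcome per voter.
predicted = [0.001, 0.002, 0.004, 0.008]
used_am = [False, False, True, True]

p0 = reciprocal_mean([p for p, y in zip(predicted, used_am) if not y])
p1 = reciprocal_mean([p for p, y in zip(predicted, used_am) if y])

print(round(p0), round(p1))
```

A model that separates users from non-users well has a p1 much smaller than p0, since it assigns higher probabilities to the actual users.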

None of the models can improve much on the base rate for p0. Parties somewhat improves p1 (1 in 604 versus 1 in 640), while all the other nontrivial models substantially improve p1 (to about 1 in 230). Adding parties to the demographic predictors helps, and interaction terms help a tiny bit more.

Below are the coefficients of the DemoParties and IntDemoParties models.

Let's make predictions for a 40-year-old male New Yorker.

Using the DemoParties and IntDemoParties models:

Party DemoParties IntDemoParties
Libertarian ^117 ^98
Republican ^152 ^138
Green ^180 ^219
unaffiliated ^184 ^189
Democratic ^219 ^223
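Reading ^N as a predicted probability of 1 in N (an assumption carried over from the p0 and p1 columns earlier), two parties can be compared by the ratio of their reciprocals. A small sketch; the helper name relative_reduction is hypothetical.

```python
# Reciprocal predicted probabilities copied from the table above.
one_in = {
    "DemoParties": {"Libertarian": 117, "Republican": 152, "Green": 180,
                    "unaffiliated": 184, "Democratic": 219},
    "IntDemoParties": {"Libertarian": 98, "Republican": 138, "Green": 219,
                       "unaffiliated": 189, "Democratic": 223},
}

def relative_reduction(one_in_a: float, one_in_b: float) -> float:
    """Fraction by which a 1-in-one_in_a rate is lower than a 1-in-one_in_b rate."""
    # rate_a / rate_b = (1 / one_in_a) / (1 / one_in_b) = one_in_b / one_in_a
    return 1 - one_in_b / one_in_a

dem_vs_rep = {
    model: round(relative_reduction(p["Democratic"], p["Republican"]), 2)
    for model, p in one_in.items()
}
print(dem_vs_rep)
```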

Old modeling (with SGD)

We see that Democrats have the lowest probability of cheating, but unlike in the simple graph above, Greens and libertarians fall between Republicans and Democrats rather than cheating more than Republicans. The differences here are generally much smaller than those in the graph; the biggest difference is between Republicans and Democrats, with Democrats cheating about a third less than Republicans.

Matching voter records across years

First attempt

As a test run, we consider NY 2010 and 2012. To count the records that each dataset has and the other lacks, I subtract the number of shared records from each dataset's total.

  • Matching by rn
    • 2010: 13,026,768 records
    • 2012: 14,500,804 records
    • Shared: 13,017,759
    • 2010-only records: 9,009
    • 2012-only records: 1,483,045
  • Matching by name, zip, addrnum1, addrnum2
    • 2010: 12,852,502 distinct records
    • 2012: 14,283,937 distinct records
    • Shared: 10,854,796
    • 2010-only records: 1,997,706
    • 2012-only records: 3,429,141
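The counting above can be sketched with sets of keys: the shared count is the size of the intersection, and each "only" count is that dataset's distinct total minus the shared count. The function name compare_keys and the toy keys are hypothetical; the last two lines redo the subtraction for the rn figures reported above.

```python
def compare_keys(keys_a, keys_b):
    """Count shared keys and keys found in only one of two datasets."""
    a, b = set(keys_a), set(keys_b)
    shared = len(a & b)
    return {"shared": shared, "a_only": len(a) - shared, "b_only": len(b) - shared}

# Hypothetical toy keys standing in for rn values.
toy_2010 = ["r1", "r2", "r3"]
toy_2012 = ["r2", "r3", "r4", "r5"]
result = compare_keys(toy_2010, toy_2012)

# The same subtraction applied to the rn counts reported above.
only_2010 = 13_026_768 - 13_017_759
only_2012 = 14_500_804 - 13_017_759
print(result, only_2010, only_2012)
```

For the composite key, each element of the set would be a tuple such as (name, zip, addrnum1, addrnum2), which is also why the totals there are counts of distinct records.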

Matching by rn, 1,385,648 of the pairs of records have a different addrnum1 or addrnum2.

Matching by rn, 12,999,937 of the 13,017,473 (99.87%) pairs of records have matching birthdates.

2010 voters matched to AM: 20,191

  • By rn, 20,171 of these are also present in 2012.
  • And 1,803 have a new addrnum1 or addrnum2.
    • sqlite3 voter.sqlite 'attach "voter-2010.sqlite" as V10; select count(*) from (select rn, addrnum1, addrnum2 from V10.T where am_user notnull) as T10 join T as T12 using (rn) where T10.addrnum1 != T12.addrnum1 or T10.addrnum2 != T12.addrnum2'
  • 1,803 / 20,171 = .09

2010 voters not matched to AM:

  • By rn, 12,997,588 are present in 2012.
  • And 1,383,845 of those have a new addrnum1 or addrnum2.
  • 1,383,845 / 12,997,588 = .11
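The two rates side by side, as a small sketch using the counts above:

```python
# 2010 voters also present in 2012 (matched by rn), split by AM match.
am_moved, am_total = 1_803, 20_171
other_moved, other_total = 1_383_845, 12_997_588

am_rate = am_moved / am_total           # share of AM-matched voters with a new address
other_rate = other_moved / other_total  # same share among everyone else

print(f"{am_rate:.3f} {other_rate:.3f}")
```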

Notice that this difference is in the wrong direction: 2010 New York voters were slightly less likely to have a different address in 2012 if they were on AM.

We might do this better by handling time more explicitly on both sides, using transaction dates to date AM usage and record modification dates (when they exist) to date voter records.

Using transaction dates

I hoped to use registration dates as an indicator of when each voter record was last updated. However, this doesn't seem to be correct, at least for New York, because sqlite3 voter.sqlite 'attach "voter-2010.sqlite" as V10; select V10.T.registered < V12T.registered as updated, V10.T.addrnum1 != V12T.addrnum1 as moved, count(*) from (T as V12T inner join V10.T using (rn)) group by updated, moved' returns:

0 0 11639406
0 1 1076701
1 0 11582
1 1 289784

which shows us that quite a few seemingly non-updated 2012 records have new addresses. So instead, let's just use January 1st, 2012, as the threshold date. (Or rather, January 3rd, to provide some wiggle room for time-zone shenanigans.)

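Cross-tabulating AM use before that date (first column) against having moved (second column) gives these counts:

0 0 11632713
0 1 1363900
1 0 16329
1 1 2337
  • Among moved, 2337 / (2337 + 1363900) = 1 / 585 used AM before 2012
  • Among non-moved, 16329 / (16329 + 11632713) = 1 / 713 used AM before 2012
  • Among AM users before 2012, 2337 / (2337 + 16329) = 12.5% moved
  • Among non-AM users before 2012, 1363900 / (1363900 + 11632713) = 10.5% moved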
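Here is a sketch of these proportions computed from the four cells, using each group's total (both of its cells) as the denominator. It assumes the first column of the table is "used AM before 2012" and the second is "moved", which is how the bullets read.

```python
# (used AM before 2012, moved) -> number of voter records
counts = {(0, 0): 11_632_713, (0, 1): 1_363_900, (1, 0): 16_329, (1, 1): 2_337}

# Share who moved, within AM users and within everyone else.
moved_given_am = counts[1, 1] / (counts[1, 0] + counts[1, 1])
moved_given_no_am = counts[0, 1] / (counts[0, 0] + counts[0, 1])

# "1 in N used AM", within movers and within non-movers.
one_in_moved = (counts[0, 1] + counts[1, 1]) / counts[1, 1]
one_in_stayed = (counts[0, 0] + counts[1, 0]) / counts[1, 0]

print(f"{moved_given_am:.1%} {moved_given_no_am:.1%}")
print(round(one_in_moved), round(one_in_stayed))
```

Dividing by the complementary cell alone gives odds rather than proportions, which come out slightly larger; the direction of the comparison is the same either way.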

That's in the right direction, at least.

I'm still matching up AM users with voters using 2012 voter addresses. Should I be using the 2010 voter addresses instead? Should I use both somehow?
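For anyone rerunning the updated-versus-moved cross-tabulation from earlier in this section, here is a sketch of the same attach-and-join pattern using Python's sqlite3 module. The two in-memory databases and their rows are hypothetical stand-ins for voter.sqlite and voter-2010.sqlite, and storing registered as ISO-formatted text is an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")                 # stand-in for voter.sqlite (2012)
con.execute("ATTACH DATABASE ':memory:' AS V10")  # stand-in for voter-2010.sqlite
con.execute("CREATE TABLE main.T (rn TEXT, registered TEXT, addrnum1 TEXT)")
con.execute("CREATE TABLE V10.T (rn TEXT, registered TEXT, addrnum1 TEXT)")

# Hypothetical toy rows: (rn, registration date, street number).
con.executemany("INSERT INTO V10.T VALUES (?, ?, ?)", [
    ("a", "2001-05-01", "10"),
    ("b", "2003-02-11", "22"),
    ("c", "1999-09-09", "5"),
    ("d", "2008-01-01", "7"),   # absent from the 2012 data
])
con.executemany("INSERT INTO main.T VALUES (?, ?, ?)", [
    ("a", "2001-05-01", "10"),  # same date, same address
    ("b", "2003-02-11", "40"),  # same date, new address
    ("c", "2011-06-30", "8"),   # later date, new address
    ("e", "2012-03-03", "1"),   # absent from the 2010 data
])

rows = con.execute("""
    SELECT T10.registered < T12.registered AS updated,
           T10.addrnum1 != T12.addrnum1 AS moved,
           count(*)
    FROM main.T AS T12 JOIN V10.T AS T10 USING (rn)
    GROUP BY updated, moved
    ORDER BY updated, moved
""").fetchall()
print(rows)
```

The explicit ORDER BY is only there to make the output order predictable; the grouping itself is the same as in the shell command.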

Diversity of names in voter data

One odd explanation I thought of for the finding that Republicans use AM more than Democrats is that Republicans may have less variety in full names (i.e., combination of first name and last name), inflating the number of false positives. In particular, Republicans being whiter than Democrats suggests there should be a smaller pool of names.

The 2012 voter records have 21,371,805 Democrats and 14,778,672 Republicans.

Among them there are 11,457,810 distinct full names for Democrats and 8,287,618 for Republicans. The number of voters per name is thus

  "Dem" (rd (/ 21,371,805 11,457,810))
  "Rep" (rd (/ 14,778,672 8,287,618))}
K value
Dem 1.865
Rep 1.783

Democrats actually have slightly more voters per name than Republicans, which is the opposite of the direction I expected. In any case, the difference is small, probably too small to make a substantial difference in terms of false positives.
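The voters-per-name figure is the number of voters divided by the number of distinct full names. A minimal sketch, first on made-up names and then on the counts above; the function name voters_per_name and the toy list are hypothetical.

```python
def voters_per_name(full_names):
    """Average number of voters sharing each distinct (first, last) name."""
    return len(full_names) / len(set(full_names))

# Hypothetical toy data: four voters, three distinct full names.
toy = [("JOHN", "SMITH"), ("JOHN", "SMITH"), ("MARY", "JONES"), ("ANA", "LOPEZ")]
toy_ratio = voters_per_name(toy)

# (voters, distinct full names) per party, from the figures above.
reported = {"Dem": (21_371_805, 11_457_810), "Rep": (14_778_672, 8_287_618)}
ratios = {party: round(v / n, 3) for party, (v, n) in reported.items()}
print(toy_ratio, ratios)
```

One caveat with this measure is that it tends to rise with group size, since a larger group has more chances to repeat a name, so comparing groups of different sizes is not quite like for like.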


These analyses consider only user numbers we found in the credit-card data.


From srcdump/ashleymadison, repository ashley.git, revision HEAD, file common/pinflib/amlib/AMLIB/AMLIB_SelectOptions.class.php.

This defines the numeric codes for "preferences" (opento in am_am), "tastes" (turnsmeon), and "desires" (lookingfor); gender; the various seeking types; etc.

For each code type, the codes are pretty much equivalent between the four gender-seeking classes, although not every class has access to every code, and there are some minor differences in wording. am_am has some codes that aren't listed, but these generally don't appear in profiles created before 2009, suggesting that the ability to indicate them on the website was removed.

Frequency tables

I female n
0 nan 12
1 0.0 720,634
2 1.0 18,022
(thousep (user-var-freq "ethnicity"))
ethnicity n
Caucasian (white) 596,751
Hispanic 45,340
African American (black) 40,826
Other 23,026
Rather Not Say 12,747
Asian 12,020
East Indian 4,666
Middle Eastern 2,144
First Nations 1,134
N/A 14

missing (count) 3
<1 (count) 47
.025 quantile (y) 23
median (y) 41
.975 quantile (y) 62
max (y) 91
