Appendix: Cross-cultural predictors of rape

Created 30 Sep 2013 • Last modified 5 Feb 2014

It's difficult to tell how much the analyses here use the same data as those of Sanday (1981b). However, Sanday's analyses are pretty simplistic (just Pearson correlations), so these are probably better.

All the below code is in R.


In this appendix, I will examine cross-cultural predictors of rape using the Standard Cross-Cultural Sample.

I got sscsfac.Rdata from, which I found a link to in the SCCS codebook.

qw = function(...) {m = = FALSE); as.character(m[["..."]])}
virtue = qw(v330, v331, v332, v333, v827, v828, v829, v830)

What should be use for our dependent variable? The following look appropriate:

Frequency of Rape
Rape: Incidents Reports, Or Thought of as Means of Punishment Women, or Part of Ceremony

v667 is described by the coder (Sanday, 1981a, p. 253) thus: "Incidents of rape are reported (suggesting that rape is not uncommon) or rape is thought of as a means for punishing women, or rape is part of a ceremony." It is a dichotomous variable, but going by Palmer (1989), a value of Absent doesn't indicate the complete absence of rape. Rather, I guess, we should regard Present as indicating more common or more thoroughly institutionalized rape than Absent.

Sexual-inhibition predictors

Here are the candidate predictors:

Talk about Sex
Sex Believed Dangerous
Frequency of Premarital Sex—Male
Frequency of Premarital Sex—Female
an Explicit View That Sexual Activity Is Dangerous Or Contaminating
v330 to v333
Inculcation of the value of sexual restraint in children
v827 to v830
Value of sexual expression and non-restraint in adolescents

(v330 to v333 and v827 to v830 are collectively virtue.)

Let's see how many cases there are for each combination of IV and DV. I'm not going to attempt any sort of missing-value interpolation here, so we need complete cases for whatever variables we want to use.

t(sapply(c("v159", "v161", "v166", "v167", "v602", virtue), function(predictor)
    sapply(qw(v173, v174, v667), function(crit)
        sum(![[predictor]]) & ![[crit]])))))
  v173 v174 v667
v159 16 13 37
v161 11 5 19
v166 30 24 64
v167 29 23 67
v602 15 13 32
v330 34 26 84
v331 33 25 82
v332 35 26 90
v333 36 26 89
v827 35 30 84
v828 36 30 83
v829 33 29 82
v830 36 30 83

v159, v161, and v602 have too few cases. All the DVs do, too, except for v667. There we have about 65 cases for the two premarital-sex variables (v166 and v167) and about 80 cases for the eight virtue variables. The virtue variables are probably worth inspecting independently of the premarital-sex variables so we don't have to lose those twenty-odd cases. Let's consider the premarital-sex variables first.

Premarital-sex frequency

d = na.omit(sccsfac[qw(v166, v167, v667)])
d = data.frame(
    # Recode v166 and v167 so higher values mean more premarital
    # sex.
    prem.male = 4 - as.integer(d$v166),
    prem.female = 4 - as.integer(d$v167),
    rape = d$v667 == "Present")
c("Complete cases" = nrow(d))
Complete cases 63

It turns out that prem.male and prem.female are generally equal, and neither varies much:

with(d, table(prem.male, prem.female))
  0 1 2 3
0 7 0 0 0
1 0 6 0 0
2 0 2 12 0
3 0 1 2 33

Suppose we then just use prem.female.

with(d, table(prem.female, rape))
0 3 4
1 5 4
2 4 10
3 16 17

It looks like rape is about 50% true across with board with an oddball at prem.female == 2. Let's try logistic regression just for laughs.

model = glm(rape ~ prem.female, data = d,
    family = binomial(link = "logit"))

Not surprisingly, this model is completely useless. The probability it predicts of rape being true is greater than 1/2 for every case:

round(d = 3, min(predict(model, type = "response")))

Conclusion: across societies, premarital-sex frequency is not predictive of rape.

Virtue variables

d = na.omit(sccsfac[c(virtue, "v667")])
c("Complete cases" = nrow(d))
Complete cases 67

This number is similar to that of the previous analysis, but unfortunately, the intersection of cases is substantially smaller:

nrow(na.omit(sccsfac[c(virtue, "v166", "v167", "v667")]))

So we'll do an analysis with just the virtue variables and the DV (v667).

The virtue variables have many high correlations:

corm = cor(d[virtue], method = "spearman")
corm[!upper.tri(corm)] = NA
round(d = 2, corm)
  v330 v331 v332 v333 v827 v828 v829 v830
v330   0.94 0.77 0.75 -0.47 -0.50 -0.46 -0.60
v331     0.79 0.81 -0.55 -0.56 -0.50 -0.63
v332       0.90 -0.51 -0.52 -0.51 -0.62
v333         -0.60 -0.60 -0.58 -0.69
v827           0.92 0.69 0.74
v828             0.65 0.80
v829               0.89

Let's try summarizing them with a principal component or two. We'll do PCA based on the Spearman correlation matrix, since these variables are all on rating scales.

dvr = d[virtue]
for (i in seq_along(dvr))
    dvr[,i] = rank(dvr[,i])
round(d = 2, eigen(cor(dvr))$values / ncol(dvr))
  0.71 0.15 0.06 0.04 0.02 0.01 0.01 0.01

The first PC accounts for the bulk of the variance (71%).

virt.pca = prcomp(dvr, center = T, scale = T)
round(d = 2, virt.pca$rotation[,1:3])
  PC1 PC2 PC3
v330 0.34 0.42 -0.07
v331 0.36 0.37 -0.13
v332 0.35 0.35 0.08
v333 0.37 0.24 0.05
v827 -0.34 0.40 0.46
v828 -0.35 0.38 0.47
v829 -0.33 0.36 -0.64
v830 -0.37 0.27 -0.35

Here are loadings from the first three PCs. On the first, the loadings are such that positive values of the PC indicate more sexual restraint. The second is more opaque (its pattern is consistent with the male–female distinction, but not very strongly) and the third doesn't account for much variance (it distinguishes between the "sexual expression" and "sexual nonrestraint" variables, which the authors say should be added together anyway). So let's just use the first as a general measure of sexual restraint.

dm = data.frame(
    restraint = virt.pca$x[,"PC1"],
    rape = d$v667 == "Present")
model = glm(rape ~ restraint, data = dm,
    family = binomial(link = "logit"))

This time, at least, we have predicted probabilities on both sides of .5:

table(predict(model, type = "response") >= .5)

So how well can this model fit its own training data?

round(d = 2,
    mean((predict(model, type = "response") >= .5) == model$y))

Amazingly, the model has achieved a training-set accuracy of less than 50%.

Okay, that was a bust; how about adding PC2 for an additional 15% of the variance in the original IVs?

dm = data.frame(
    restraint.pc1 = virt.pca$x[,"PC1"],
    restraint.pc2 = virt.pca$x[,"PC2"],
    rape = d$v667 == "Present")
model = glm(rape ~ restraint.pc1 * restraint.pc2, data = dm,
    family = binomial(link = "logit"))
round(d = 2,
    mean((predict(model, type = "response") > .5) == model$y))

That's not much bigger than the base rate in the training set:

round(d = 2,
    max(mean(model$y), mean(!model$y)))

I'm not going to bother checking test error.

Conclusion: across societies, attitudes about sexual expression in children are not predictive of rape.

Sexism predictors

Here are the candidate predictors:

Intermediate Or Local Political Leaders
Leadership Posts in Kinship Or Extended Family Units
Double Standard in Regard to Premarital Sex
Double Standard in Regard to Extramarital Sex
Voice of the Potential Bride And Groom in Marriage Decisions
Relative Ease of Initiating Divorce
Wife to Husband Institutionalized Deference (guttman Scale)
an Explicit View That Men Should And Do Dominate Their Wives
Physical Punishment of the Spouse Condoned
Belief That Women Are Generally Inferior to Men
Domestic Authority Scale (for women)
Female Power Guttman Scale constructed from 657-662
Ideology of Male Toughness
Male Segregation
Ideology of Male Superiority

(v670, "Composite of Male Dominance", overlaps with v667.)

Again, let's see how many cases there are for each combination of IV and DV.

t(sapply(qw(v582, v583, v596, v597, v604, v610, v615, v621, v620, v626, v632, v663, v664, v665, v1767), function(predictor)
    sapply(qw(v173, v174, v667), function(crit)
        sum(![[predictor]]) & ![[crit]])))))
  v173 v174 v667
v582 17 11 32
v583 10 7 26
v596 17 13 36
v597 19 15 39
v604 18 14 39
v610 21 17 43
v615 19 15 39
v621 17 12 29
v620 18 14 31
v626 21 17 43
v632 21 17 43
v663 29 23 83
v664 20 17 73
v665 24 19 71
v1767 20 16 47

As in the previous section, v173 and v174 have few cases. v667 has more, but still less than 50, with the notable exceptions of v663, v664, and v665.

Here's the sorted number of cases for v667:

sort(dec = T, sapply(qw(v582, v583, v596, v597, v604, v610, v615, v621, v620, v626, v632, v663, v664, v665, v1767), function(predictor)
    sum(![[predictor]]) & !$v667))))
v663 83
v664 73
v665 71
v1767 47
v610 43
v626 43
v632 43
v597 39
v604 39
v615 39
v596 36
v582 32
v620 31
v621 29
v583 26

It turns out, if you explore the dataset, that the situation with respect to how much missingness there is on which combinations of variables is complicated and confusing. Ultimately, I picked the three predictors that I considered most interesting: v626, v663, and v1767. When considering v1767, we might as well consider v663 at the same time, since there's just a handful of cases that are missing on v663 but not on v1767 or the DV. So we have two lines of analysis.

Belief in female inferiority

Look at the cross-classification table:

table(na.omit(sccsfac[qw(v626, v667)]))
  Absent Present
Yes 1 10
No such belief 18 14

This table suggests that in almost all cultures with less rape, there is no "Belief That Women Are Generally Inferior to Men". In cultures with more rape, on the other hand, such a belief is reasonably likely, although perhaps still less likely than a 50% chance.

Here is the table with probabilities of rape as a function of belief:

round(d = 2, prop.table(margin = 1,
    table(na.omit(sccsfac[qw(v626, v667)]))))
  Absent Present
Yes 0.09 0.91
No such belief 0.56 0.44

Without the belief, higher rape is 44% likely, but with it, it's 91% likely. Interesting.

c(accuracy = round(d = 2, with(na.omit(sccsfac[qw(v626, v667)]),
    mean((v626 == "Yes") == (v667 == "Present")))))
accuracy 0.65

The accuracy rate is still not very impressive, since in the no-such-belief case, the rape rate is around 50%.

I won't do a separate (i.e., cross-validatory) estimate of predictive accuracy because I don't think of this model as having any free parameters. (If this IV is to be nontrivially dichotomously used to predict this DV, No such belief must map to Absent and Yes to Present. Vice versa wouldn't make sense.)

Conclusion: among cultures with a belief that women are generally inferior to men, rape is likely to be reasonably common or institutionalized. Among cultures without such a belief, rape rates vary more widely.

Female power and male superiority

Here's the contingency table for female power and rape:

table(na.omit(sccsfac[qw(v663, v667)]))
  Absent Present
All items absent 3 4
Flexible marriage mores only 3 3
Plus Female nondomestic production 1 2
Plus demand for female produce 1 6
Plus female economic control 4 8
Plus female political participation 24 9
Plus female solidarity groups 7 8

An inconsistent picture. However, the modal level of v663 also seems to be one in which the rape rate is distant from 50%. So let's see how well we can do with logistic regression.

dm = with(na.omit(sccsfac[qw(v663, v667)]), data.frame(
    female.power = as.integer(v663),
    rape = v667 == "Present"))
model = glm(rape ~ female.power, data = dm,
    family = binomial(link = "logit"))
round(d = 2,
    mean((predict(model, type = "response") > .5) == model$y))

Not great.

Suppose we add ideology of male superiority. It has only three levels, so let's treat it as nominal.

dm = with(na.omit(sccsfac[qw(v663, v1767, v667)]), data.frame(
    female.power = as.integer(v663),
    male.sup = v1767,
    rape = v667 == "Present"))
model = glm(rape ~ female.power * male.sup, data = dm,
    family = binomial(link = "logit"))
round(d = 2,
    mean((predict(model, type = "response") > .5) == model$y))

Better. But all the predictive ability is coming from v1767, actually:

model = glm(rape ~ male.sup, data = dm,
    family = binomial(link = "logit"))
round(d = 2,
    mean((predict(model, type = "response") > .5) == model$y))

So let's use just that.

d = with(na.omit(sccsfac[qw(v1767, v667)]), data.frame(
    male.sup = ordered(levels = qw(None, Weak, Strong),
        ifelse(v1767 == "strongly articulated ideology of male superiority", "Strong",
        ifelse(v1767 == "weakly articulated ideology of male superiority", "Weak",
    rape = v667 == "Present"))
None 16 11
Weak 2 5
Strong 4 9

Here it is with probabilities:

round(d = 2, prop.table(table(d), 1))
None 0.59 0.41
Weak 0.29 0.71
Strong 0.31 0.69

You can see that ideologies of male superiority, weak or strong, predict rape at 70% probability, whereas societies without such an ideology are only 41% likely to have lots of rape.

round(d = 2, mean(with(d,
    (male.sup %in% qw(Weak, Strong)) == rape)))

Let's estimate predictive accuracy. The only free parameter I would consider the model to have is whether Weak implies TRUE or FALSE. None will be fixed at FALSE and Strong at TRUE.

cv.folds = 10
d$fold = sample(rep(factor(1 : cv.folds), len = nrow(d)))
results = c(recursive = T, lapply(1 : cv.folds, function(fold)
   {test = d[d$fold == fold,]
    train = d[d$fold != fold,]
    weak.rate = mean(subset(train, male.sup == "Weak")$rape)
    weak.pred =
        if (weak.rate < .5) F
        else if (weak.rate > .5) T
        else sample(c(T, F), 1)
    with(test, rape ==
        ifelse(male.sup == "None", F,
        ifelse(male.sup == "Strong", T,
round(d = 2, mean(results))

The cross-validatory folds always assigned Weak to TRUE, so this is the same number as the training-set accuracy.

Conclusion: across cultures, an ideology of male superiority predicts more rape.


Palmer, C. T. (1989). Is rape a cultural universal? A re-examination of the ethnographic data. Ethnology, 28(1), 1–16. Retrieved from

Sanday, P. R. (1981a). Female power and male dominance: On the origins of sexual inequality. Cambridge: Cambridge University Press. ISBN 978-0-521-23618-8.

Sanday, P. R. (1981b). The socio-cultural context of rape: A cross-cultural study. Journal of Social Issues, 37(4), 5–27. doi:10.1111/j.1540-4560.1981.tb01068.x