mSTUDY notebook

Created 24 Sep 2016 • Last modified 4 Jan 2017

General information

Go to Amy Ragsdale for questions about the data.

None of the subjects use injection drugs.

Viral load level

From: Brendan Quinn
Date: 14 September 2016
To: Jesse L. Clark

I'm working on a HIV paper with Steve Shoptaw and he asked me to get in touch with you to ask what level you would use for undetectable viral load?

From: Jesse L. Clark
Date: 14 September 2016
To: Brendan Quinn

It depends on the machine used for testing. The standard we use at ucla has a lower limit of detection of 20 copies per mL. The machines commonly used in South America have a limit of 50 copies. In the past (about 10 or 15 years ago) the standard lower limit was 400 copies. If you are using recent data from a developed country setting you should be safe assuming the upper limit of detection is 20 copies per mL.

Thinking

The first kind of DV to consider is HIV transmission risk.

  • In HIV-positive subjects, this is viral load (dichotomized as detectable or undetectable)
  • In HIV-negative subjects, this is receptive anal sex with an HIV-positive or HIV-unknown partner (ideally, we would focus on unprotected such sex, but that wasn't measured)

Possible framings of the problem:

  • Use only the DVs in the last timepoint available for each subject as the DV to be predicted. (So the DVs at previous timepoints are essentially IVs, from the perspective of the predictive procedure.)
  • Treat every timepoint as a potential case to be predicted, but allow the model to see all past timepoints. This rules out conventional k-fold cross-validation (if in one fold you have timepoint 1 in training and timepoint 2 in testing, you'll have it the other way around in another fold). I guess the simplest way to do this is a variation on leave-one-out where we consider one observation at a time and allow everything as training data except that observation and future observations for the same subject.

Model extensions to consider:

  • Age and race

Simple analyses

Race

(setv d (.append
  (.mean
    (getl obs1 : (filt (.startswith it "race_") obs1.columns)))
  (pd.Series :index ["race_hispanic"] (.mean
    (= (ss ($ obs1 race_hispanic) (pd.notnull $)) "Yes")))))
(rd 2 (.sort-values d :ascending F))
I value
race_black 0.49
race_hispanic 0.45
race_white 0.29
race_nat_am 0.08
race_pacific 0.02
race_asian 0.02
race_indian 0.01

Only black, Hispanic, and white have sufficiently high prevalences for them to have much predictive value.

Age

(setv v (valcounts (.astype (ss ($ obs1 age_at_baseline) (pd.notnull $)) int)))
(plt.bar (- v.index .5) v.values :width 1 :color "black" :edgecolor "none")

ages.png

Interesting how there's a gap at 29. Otherwise, looks okay.

Rates of self-reported drug usage

(setv d (. (.apply (.groupby obs "hiv_status") (λ
  (.mean (~ (.isin
    (getl it : (filt (.startswith it "drug_sr_") it.columns))
    (qw Never Once)))))) T))
(rd 2 (ordf d (.mean d 1)))
I HIV- HIV+
drug_sr_heroin 0.04 0.01
drug_sr_other 0.05 0.02
drug_sr_crack 0.07 0.04
drug_sr_mdma 0.09 0.06
drug_sr_party 0.06 0.12
drug_sr_rx 0.13 0.06
drug_sr_pde5_inhib 0.10 0.10
drug_sr_powder_cocaine 0.15 0.08
drug_sr_alkyl_nitrite 0.22 0.25
drug_sr_meth 0.22 0.47
drug_sr_cigarettes 0.46 0.47
drug_sr_cannabis 0.51 0.44
drug_sr_alcohol_heavy 0.61 0.46
drug_sr_alcohol 0.81 0.72

Here's the proportion of observations at which a subject had used the drug in question more than once (or for the alcohol items, even once). Only a few (alkyl nitrite, meth, cigarettes, cannabis, and alcohol) have usage in more than a fifth of observations in either group. Another four drugs (PDE5 inhibitors, other party drugs, perscription drugs, and powder cocaine) have usage between 10% and 20% in at least one of the two groups. Two of these drugs (other party drugs and perscription drugs) are rather vague classes of drugs and hence seem unlikely to be useful for analysis. In summary, we should probably concern ourselves with alcohol, cannabis, cigarettes, meth, alklyl nitrite, powder cocaine, and PDE5 inhibitors.

(rd 2 (.apply (.groupby obs "hiv_status") (λ
  (.mean (.any (~ (.isin
    (getl it : (qw
      drug_sr_heroin drug_sr_other drug_sr_crack drug_sr_mdma drug_sr_party drug_sr_rx))
    (qw Never Once))) :axis 1)))))
hiv_status value
HIV- 0.23
HIV+ 0.22

Here's the proportion of subjects who used at least one of these rarer drugs more than once.

Urine tests

(setv d (. (pd.concat :axis 1 (rmap [[sr urine] [
    ["drug_sr_meth" "drug_urine_meth"]
    ["drug_sr_mdma" "drug_urine_mdma"]
    ["drug_sr_cannabis" "drug_urine_cannabis"]
    [["drug_sr_powder_cocaine" "drug_sr_crack"] "drug_urine_cocaine"]
    ["drug_sr_heroin" "drug_urine_opiate"]
    ["drug_sr_alkyl_nitrite" "drug_urine_nitrite"]]]
  (setv deny (if (coll? sr)
    (.all (= (getl obs : sr) "Never") :axis 1)
    (= (getl obs : sr) "Never")))
  (valcounts (getl (get obs deny) : urine)))) T))
(setv ($ d prop) (rd 2 (wc d (/ $Positive (+ $Positive $Negative)))))
(getl (.rename d :index (λ (cut it (len "drug_urine_")))) :
  (qw N/A Positive Negative prop))
I N/A Positive Negative prop
meth 1 11 488 0.02
mdma 2 16 678 0.02
cannabis 1 48 336 0.12
cocaine 1 4 619 0.01
opiate 2 9 779 0.01
nitrite 1 11 552 0.02

This shows the urine tests results among observations for which the subject denied using each drug. These dishonesty rates are pretty low. The one for cannabis is probably higher because physiological tests for cannabis have much longer temporal ranges than those for other drugs.

There's some slippage for heroin because heroin isn't actually an opiate, just an opioid, so I don't know if the opiate urine test could in fact detect it.

Cross-tabulation of self-reported drug usage and the DV

HIV-negative

(setv d (pd.melt :id_vars ["risky_sex"] :var_name "drug"
  (getl obs-hiv- : (+ ["risky_sex"] (filt
    (and (.startswith it "drug_sr_") (not-in "alcohol" it) (not-in "cigarettes" it))
    obs.columns)))))

(sns.factorplot :data d :kind "count"
  :row "drug" :x "value" :hue "risky_sex"
  :hue_order (qw Safe Risky)
  :palette {"Risky" "red" "Safe" "green"})

risky_sex_by_drug_bars.png

(setv l (rmap [dc (filt
    (and (.startswith it "drug_sr_") (not-in "alcohol" it) (not-in "cigarettes" it)) obs.columns)]
  (setv x (.apply (.groupby obs-hiv- dc) (λ
    (.mean (= ($ it risky_sex) "Risky")))))
  (setv x.name (cut x.index.name (len "drug_sr_")))
  x))
(rd 2 (pd.DataFrame l))
I Never Once Rarely Monthly Weekly Daily
meth 0.20 0.43 0.37 0.60 0.35 0.72
mdma 0.25 0.48 0.25 0.25 0.00 0.83
cannabis 0.26 0.18 0.45 0.33 0.30 0.21
powder_cocaine 0.25 0.33 0.30 0.29 0.20 1.00
crack 0.26 0.67 0.00 0.25 0.27 0.50
heroin 0.26 0.60 0.20 0.67 1.00 0.60
party 0.26 0.25 0.25 0.43 0.75 0.80
rx 0.23 0.37 0.52 0.33 0.50 0.50
pde5_inhib 0.26 0.15 0.50 0.27 0.50 0.50
alkyl_nitrite 0.22 0.32 0.40 0.39 0.40 0.62
other 0.26 0.38 0.14 0.75 0.00 0.71
(rd 2 (.apply (.groupby obs-hiv- "drug_sr_alcohol") (λ
  (.mean (= ($ it risky_sex) "Risky")))))
drug_sr_alcohol value
Never 0.29
Monthly or less 0.24
2 to 4 times a month 0.27
2 to 3 times a week 0.29
4 or more times a week 0.30
(rd 2 (.apply (.groupby obs-hiv- "drug_sr_alcohol_heavy") (λ
  (.mean (= ($ it risky_sex) "Risky")))))
drug_sr_alcohol_heavy value
Never 0.26
Less than monthly 0.23
Monthly 0.37
Weekly 0.27
Daily or almost daily 0.32
(rd 2 (.apply (.groupby obs-hiv- "drug_sr_cigarettes") (λ
  (.mean (= ($ it risky_sex) "Risky")))))
drug_sr_cigarettes value
Never 0.25
Occasionally(<1 cigarette/day) 0.30
Less than half pack/day 0.35
At least half pack, but less than 1 pack/day 0.14
1 pack/day or more, but less than 2 packs/day 0.39
2+ packs/day 0.50

Daily usage of any drug (other than cannabis or alcohol), although uncommon, seems particularly associated with risky sex.

HIV-positive

(setv d (pd.melt :id_vars ["hiv_load_detect"] :var_name "drug"
  (getl obs-hiv+ : (+ ["hiv_load_detect"] (filt
    (and (.startswith it "drug_sr_") (not-in "alcohol" it) (not-in "cigarettes" it))
    obs.columns)))))

(sns.factorplot :data d :kind "count"
  :row "drug" :x "value" :hue "hiv_load_detect"
  :hue_order (qw Undetectable Detectable)
  :palette {"Detectable" "red" "Undetectable" "green"})

hiv_load_by_drug_bars.png

(setv l (rmap [dc (filt
    (and (.startswith it "drug_sr_") (not-in "alcohol" it) (not-in "cigarettes" it)) obs.columns)]
  (setv x (.apply (.groupby obs-hiv+ dc) (λ
    (.mean (= ($ it hiv_load_detect) "Detectable")))))
  (setv x.name (cut x.index.name (len "drug_sr_")))
  x))
(rd 2 (pd.DataFrame l))
I Never Once Rarely Monthly Weekly Daily
meth 0.61 0.50 0.64 0.79 0.80 0.89
mdma 0.68 0.76 0.83 0.33 1.00 0.00
cannabis 0.68 0.64 0.77 0.55 0.79 0.67
powder_cocaine 0.66 0.85 0.86 0.83 0.75  
crack 0.67 0.91 1.00 1.00 0.50 1.00
heroin 0.68 0.67 1.00     1.00
party 0.67 0.87 0.81 0.50 0.78 1.00
rx 0.68 0.60 1.00 1.00 0.50 0.86
pde5_inhib 0.67 0.83 1.00 0.56 0.69 1.00
alkyl_nitrite 0.69 0.61 0.74 0.62 0.70 1.00
other 0.68 1.00 0.67 0.50 1.00 1.00
(rd 2 (.apply (.groupby obs-hiv+ "drug_sr_alcohol") (λ
  (.mean (= ($ it hiv_load_detect) "Detectable")))))
drug_sr_alcohol value
Never 0.58
Monthly or less 0.66
2 to 4 times a month 0.79
2 to 3 times a week 0.76
4 or more times a week 0.71
(rd 2 (.apply (.groupby obs-hiv+ "drug_sr_alcohol_heavy") (λ
  (.mean (= ($ it hiv_load_detect) "Detectable")))))
drug_sr_alcohol_heavy value
Never 0.64
Less than monthly 0.78
Monthly 0.77
Weekly 0.70
Daily or almost daily 0.55
(rd 2 (.apply (.groupby obs-hiv+ "drug_sr_cigarettes") (λ
  (.mean (= ($ it hiv_load_detect) "Detectable")))))
drug_sr_cigarettes value
Never 0.63
Occasionally(<1 cigarette/day) 0.68
Less than half pack/day 0.79
At least half pack, but less than 1 pack/day 0.78
1 pack/day or more, but less than 2 packs/day 0.54
2+ packs/day  

(Blank cells in the tables mean there were no HIV+ subjects with that level of drug use.)

These results overall look similar to the one for HIV- subjects and risky sex. Most drugs, the chief exceptions being alcohol and cigarettes, are associated with higher viral load at higher doses.

Prediction

HIV-negative

(setv results-freq (pred "risky_sex_receive" "Risky" "freq" obs-hiv-))
(show-pred-table results-freq)
I MSE p when y=0 p when y=1
trivial-between 0.20674 0.290 0.287
trivial-within 0.20762 0.276 0.328
logistic 0.20325 0.275 0.327
logistic-ridge 0.19846 0.279 0.315
logistic-lasso 0.19629 0.283 0.333
I MSE p when y=0 p when y=1
trivial-between 0.20674 0.290 0.287
trivial-within 0.20762 0.276 0.328
logistic 0.20165 0.273 0.331
logistic-ridge 0.19860 0.279 0.317
logistic-lasso 0.19634 0.283 0.332
logistic-random 0.20065 0.244 0.318

This table uses dose-response models. The mean squared error (MSE) is from a variation on leave-one-out cross-validation. The drugs items are coded as monthly frequencies.

All the regression models do better than the trivial models (with don't use any of the predictors, except for subject for trivial-within). The best performer is logistic-lasso, which is logistic regression with a lasso (L1) penalty.

(setv results-binary (pred "risky_sex_receive" "Risky" "any_or_none" obs-hiv-))
(show-pred-table results-binary)
I MSE p when y=0 p when y=1
trivial-between 0.20674 0.290 0.287
trivial-within 0.20762 0.276 0.328
logistic 0.19973 0.266 0.341
logistic-ridge 0.20916 0.292 0.300
logistic-lasso 0.25219 0.421 0.381
logistic-random 0.19987 0.247 0.330

This table is similar but uses dichotomous models, where each drug was dichotomized as any use vs. no use.

Here, curiously, the penalized regression models underperform the trivial models. The best is logistic.

So the best-performing dose-response model (logistic-lasso, MSE .196) does slightly better than the best-performing binary model (logistic, MSE .200).

(plt.xlim [0 1])
(sns.kdeplot (get results-freq "ypp" "logistic-lasso") :bw .1
  :label "freq")
(sns.kdeplot (get results-binary "ypp" "logistic") :bw .1
  :label "binary")

hiv-neg-ypps.png

Here is a density plot of the predicted probabilities across all observations, comparing the best-performing dose-response model to the best-performing binary model. The dose-response model is simpler in the sense that it outputs a smaller range of probabilities. This may be due to the effect of the lasso.

(.map
  (pd.Series
    (get results-freq "coefs_all_obs" "logistic-lasso")
    :index (+ ["Intercept"]
      (filt (!= it "s") (get results-freq "x_vars"))))
  (λ (if (= it 0) "" (format it ".3f"))))
I value
Intercept -0.687
t  
race_white  
race_black -0.283
race_other -0.048
age_at_baseline -0.216
drug_sr_alcohol  
drug_sr_alcohol_heavy  
drug_sr_cigarettes  
drug_sr_cannabis -0.310
drug_sr_meth 0.675
drug_sr_alkyl_nitrite  
drug_sr_powder_cocaine 0.461
drug_sr_pde5_inhib  

Here are the coefficients of the dose-response model. Several coefficients are 0 (shown as blank) because of the lasso penalty. t is the effect of time. Drug frequencies and age were standardized to have SD 1/2. Since this is a logistic-regression model, the coefficients are on the scale of log odds ratios.

We see that both powder cocaine and methamphetamine are associated with a higher probability of risky sex the more they are used. For cannabis, there's a negative effect.

HIV-positive: risky sex

(setv results-freq (pred "risky_sex_transmit_d" "Risky" "freq" obs-hiv+))
(show-pred-table results-freq)
I MSE p when y=0 p when y=1
trivial-between 0.20248 0.280 0.276
trivial-within 0.20837 0.279 0.313
logistic 0.20799 0.275 0.292
logistic-ridge 0.20440 0.282 0.275
logistic-lasso 0.25052 0.500 0.498
logistic-random 0.20828 0.246 0.270
(setv results-binary (pred "risky_sex_transmit_d" "Risky" "any_or_none" obs-hiv+))
(show-pred-table results-binary)
I MSE p when y=0 p when y=1
trivial-between 0.20248 0.280 0.276
trivial-within 0.20837 0.279 0.313
logistic 0.20369 0.268 0.308
logistic-ridge 0.20322 0.281 0.276
logistic-lasso 0.25000 0.500 0.500
logistic-random 0.20461 0.250 0.292

trivial-between is the winner among both the dose-response and binary models.

HIV-positive: viral load detectability

(setv results-freq (pred "hiv_load_detect" "Detectable" "freq" obs-hiv+))
(show-pred-table results-freq)
I MSE p when y=0 p when y=1
trivial-between 0.21398 0.697 0.693
trivial-within 0.20034 0.668 0.739
logistic 0.21984 0.678 0.697
logistic-ridge 0.21913 0.700 0.691
logistic-lasso 0.21842 0.629 0.625
logistic-random 0.20929 0.684 0.770

Here the winner is a trivial model, trivial-within.

(setv results-binary (pred "hiv_load_detect" "Detectable" "any_or_none" obs-hiv+))
(show-pred-table results-binary)
I MSE p when y=0 p when y=1
trivial-between 0.21398 0.697 0.693
trivial-within 0.20034 0.668 0.739
logistic 0.21466 0.667 0.705
logistic-ridge 0.21096 0.671 0.704
logistic-lasso 0.20972 0.662 0.703
logistic-random 0.20561 0.669 0.777

trivial-within is again the best performer.