# Comorbid notebook

Created 2 Aug 2016 • Last modified 5 Oct 2017

## The context of the data

All results with cell sizes 10 or below need approval to be distributed in any manner.

Jennifer:

- "for the comorbidity analysis, we have only been using 2010 data since it's the most recent we have. The 3 analysis files (all_*) were created using the 2010 claims."
- "the three all_* files will be best for you to work with, so you don't have to deal with the raw claims quite yet. When we asked for the data, we didn't get all Medicare/Medicaid enrollees, we only got those who had an ARV claim or HIV diagnosis at any point. Then we further refine our sample using the algorithm. Due to the nature of claims, we also have to limit to people enrolled for the full year, are fee-for-service (rather than managed care), and are in California the entire year"
- "Regarding the ID variables, bene_id is the Medicare ID so we use it for Medicare only and Duals and msis_id is the Medicaid ID which we use for Medicaid only patients. Throughout the years, Medicaid patients have gradually also been assigned a bene_id, but there are still some people who don't have one. The reason we have bene_id_10 is because we found inconsistencies in IDs when merging across years, this version is needed to merge with raw claims for that year. The bene_id variable allows for consistency in merging across years.… There are a few other caveats with the ID variables we can discuss at a later point when needed."
- The three files have the same variables

### Redacted claims

Jennifer: "Claims relating to substance abuse were redacted for our 2009 and 2010 data, thus these payments are not included in our individual claims. Thus, we will find lower total costs for those who have redacted claims."

Eligibility | Claim Type | # Suppressed Claims | Total Claims | % Claims Suppressed | # Beneficiaries w/ Suppressed Claims | Total Beneficiaries in Claims | % Beneficiaries w/ Suppressed Claims | Medicaid Payment (MDCD_PYMT_AMT) Suppressed Claims | Total Medicaid Payment (MDCD_PYMT_AMT) in Claims |
---|---|---|---|---|---|---|---|---|---|

All | Inpatient | 773,353 | 9,378,620 | 8.25% | 417,977 | 6,106,602 | 6.84% | 3,040,765,005 | 35,166,806,936 |

Long-Term Care | 225,235 | 33,319,656 | 0.68% | 38,945 | 1,648,083 | 2.36% | 522,839,791 | 62,508,895,726 | |

Other Services | 37,376,167 | 2,520,159,210 | 1.48% | 1,415,671 | 62,063,859 | 2.28% | 1,706,728,920 | 213,731,343,805 | |

Total | 38,374,755 | 2,562,857,486 | 1.50% | 1,628,983 | 62,381,748 | 2.61% | 5,270,333,716 | 311,407,046,467 | |

Aged | Inpatient | 19,737 | 951,402 | 2.07% | 16,697 | 618,562 | 2.70% | 41,584,740 | 1,973,545,032 |

Long-Term Care | 71,432 | 22,972,515 | 0.31% | 5,426 | 1,081,340 | 0.50% | 151,269,024 | 36,690,837,529 | |

Other Services | 441,310 | 239,166,210 | 0.18% | 24,220 | 4,100,416 | 0.59% | 22,440,257 | 25,928,333,797 | |

Total | 532,479 | 263,090,127 | 0.20% | 41,666 | 4,258,713 | 0.98% | 215,294,021 | 64,592,716,358 | |

Disabled / Blind | Inpatient | 389,806 | 2,831,711 | 13.77% | 215,676 | 1,418,478 | 15.20% | 1,806,188,438 | 16,188,598,664 |

Long-Term Care | 98,952 | 9,466,201 | 1.05% | 17,048 | 457,139 | 3.73% | 242,493,452 | 23,612,634,633 | |

Other Services | 15,469,683 | 772,861,048 | 2.00% | 549,705 | 9,493,018 | 5.79% | 674,012,800 | 90,720,882,400 | |

Total | 15,958,441 | 785,158,960 | 2.03% | 658,766 | 9,524,493 | 6.92% | 2,722,694,690 | 130,522,115,697 | |

Child | Inpatient | 27,627 | 2,314,122 | 1.19% | 21,927 | 1,794,121 | 1.22% | 124,069,187 | 7,485,361,209 |

Long-Term Care | 42,109 | 414,447 | 10.16% | 10,291 | 72,559 | 14.18% | 98,818,837 | 1,267,625,658 | |

Other Services | 3,514,205 | 977,750,886 | 0.36% | 220,256 | 32,530,394 | 0.68% | 241,240,915 | 54,673,886,767 | |

Total | 3,583,941 | 980,479,455 | 0.37% | 235,895 | 32,561,585 | 0.72% | 464,128,939 | 63,426,873,634 | |

Adult | Inpatient | 303,819 | 2,941,126 | 10.33% | 144,368 | 2,052,181 | 7.03% | 908,646,835 | 8,006,862,338 |

Long-Term Care | 8,499 | 119,370 | 7.12% | 4,976 | 20,899 | 23.81% | 15,911,261 | 131,464,388 | |

Other Services | 16,739,091 | 480,568,052 | 3.48% | 586,975 | 15,494,187 | 3.79% | 730,157,393 | 37,339,227,221 | |

Total | 17,051,409 | 483,628,548 | 3.53% | 647,872 | 15,573,669 | 4.16% | 1,654,715,489 | 45,477,553,947 |

(Only the claims that contain the substance abuse codes are removed; not the beneficiary and all claims associated with that beneficiary)

The "MAX Uniform Eligibility Code - For Month of Service" (EL_MAX_ELGBLTY_CD_MO) variable from each claim was used to determine eligibility grougs.

- Aged: 11, 21, 31, 41, 51
- Blind/Disabled: 12, 22, 32, 42, 52
- Child: 14, 16, 24, 34, 44, 48, 54
- Adult: 15, 17, 25, 35, 45, 55, 3A

## Descriptives

These demographic tables use the same sample inclusion criteria as the association analyses described below.

```
(display-demog "mcare_only")
```

I | value |
---|---|

Female, proportion | .048 |

Age, minimum | 31 |

Age, median | 56 |

Age, maximum | 94 |

Race, white, proportion | .761 |

Race, black, proportion | .070 |

Race, Hispanic, proportion | .117 |

Race, other, proportion | .052 |

Urban, proportion | .949 |

High-volume HIV care provider, proportion | .582 |

Disabled, proportion | .773 |

```
(display-demog "dual")
```

I | value |
---|---|

Female, proportion | .118 |

Age, minimum | 21 |

Age, median | 50 |

Age, maximum | 90 |

Race, white, proportion | .527 |

Race, black, proportion | .212 |

Race, Hispanic, proportion | .223 |

Race, other, proportion | .038 |

Urban, proportion | .944 |

High-volume HIV care provider, proportion | .661 |

Disabled, proportion | .955 |

## Prediction of total costs

These cross-validation analyses of predictive accuracy use all beneficiaries from 2010. The predictors are gender, race, age, age squared, disabled status, and each of the comorbid conditions.

mcaid_only | mcare_only | dual | |
---|---|---|---|

Trivial (predict median) | 18,918 | 22,280 | 26,439 |

OLS (no log transformation) | 17,819 | 21,613 | 23,801 |

OLS | 17,455 | 22,399 | 22,470 |

Ridge regression | 17,503 | 22,377 | 22,448 |

Elastic net | 17,422 | 22,267 | 22,428 |

Elastic net w/ all 1st-order interactions | 17,304 | 22,497 | 22,407 |

Quantile regression | 17,053 | 20,279 | 22,327 |

Quantile regression w/ lasso | 17,090 | 20,597 | 22,354 |

Quantile regression w/ lasso, some interactions | 17,152 | 20,832 | 22,420 |

Quantile regression w/ lasso, inpatient | 17,338 | 21,394 | 25,516 |

Random forest | 18,940 | 23,274 | 24,231 |

[Min MAE, gender and race] | 18,834 | 22,207 | 26,408 |

[Min MAE, all IVs but age] | 11,594 | 12,977 | 13,811 |

Here, for each of the three datasets and several models, we have the cross-validated mean absolute error (MAE) in predicting total expenditure per subject, in dollars. This means that, e.g., a MAE of $15,000 implies that the model's point prediction of a beneficiary's expenditures would have a mean distance from the true value of $15,000.

Except where marked, each model log-transforms the DV for model-fitting, then exp-transforms the predictions. You can see that this trick increases predictive accuracy (according to MAE) quite a bit.

The best-performing model in each case is unregularized quantile regression. In the case of Medicare-only beneficiares, there is a large improvement upon OLS.

Regularization (as the lasso) is necessary for the more complex quantile regression models because the model-fitting procedure has problems finding a solution otherwise. "Quantile regression w/ lasso, some interactions" interacts age and age squared with each comorbidity, and gender with each race. "Quantile regression w/ lasso, inpatient" uses logistic regression to predict whether the patient had any inpatient costs, and then applies a separately trained quantile regression model depending on each patient's predicted inpatient status.

The last two rows give a kind of lower bound on possible MAE for the demographic IVs and all IVs (demographic variables plus comorbidity flags), respectively. They are simply the MAE achieved when each subject's true value is predicted with the median among all subjects with precisely the same IV values (except for age, since it's continuous, unlike all the other IVs). So, unless age is very informative, we can't get MAEs below these numbers, and we probably won't be able to get MAEs very near them, either; getting a MAE that low in training would almost certainly lead to overfitting in testing.

## Prediction of individual cost types

### Thinking

Let's log the DV, as before.

Let's compare trivial models (i.e. predicting the median and ignoring all IVs) to quantile regression. Let's also compare models that have only the demographic variables as IVs to models that have comorbidities. And, let's compare models with a dummy variable for having Medicaid to twinned models (one for Medicare only, one for duals). That means we have…

- Trivial, single
- Trivial, twinned
- Quantile regression, demographic IVs, single
- Quantile regression, demographic IVs, twinned
- Quantile regression, all IVs, single
- Quantile regression, all IVs, twinned

…for each of the three cost types, and then there's the question of how to handle the (zero-inflated) inpatient costs. All the above methods could be applied unaltered, or you could precede them with a logistic-regression model that decides whether the patient has nonzero inpatient costs, and we can imagine four ways to use this logistic-regression model based on the choice of IVs (demographic-only versus all) and how we deal with Medicare-only versus duals (dummy variable versus twinned models). Ouch! But from talking to Arleen and Jennifer, I guess we'd best bite the bullet and consider all 16 nontrivial models for inpatient costs.

Since twin models didn't prevail for outpatient or drug costs, let's forget about those. That gives us the following models for inpatient costs:

- Fully trivial: just guess the median
- Quantile regression only (demographic IVs)
- Quantile regression only (all IVs)
- Logistic regression for nonzero (demographic IVs), then guess the conditional median
- Logistic regression for nonzero (demographic IVs), then quantile regression (demographic IVs)
- Logistic regression for nonzero (demographic IVs), then quantile regression (all IVs
- Logistic regression for nonzero (all IVs), then guess the conditional median
- Logistic regression for nonzero (all IVs), then quantile regression (demographic IVs)
- Logistic regression for nonzero (all IVs), then quantile regression (all IVs)

### Results

#### Absolute error

(display-pred-cv pred-cv-orx-mae)

I | DV | IVs | twin | MAE |
---|---|---|---|---|

0 | y_outpatient | trivial | single | 6,565 |

1 | y_outpatient | trivial | twin | 6,555 |

2 | y_outpatient | demog | single | 6,495 |

3 | y_outpatient | demog | twin | 6,487 |

4 | y_outpatient | demog+cm | single | 6,146 |

5 | y_outpatient | demog+cm | twin | 6,165 |

6 | y_drugs | trivial | single | 12,943 |

7 | y_drugs | trivial | twin | 12,932 |

8 | y_drugs | demog | single | 12,701 |

9 | y_drugs | demog | twin | 12,683 |

10 | y_drugs | demog+cm | single | 12,581 |

11 | y_drugs | demog+cm | twin | 12,595 |

The best-performing model for both DVs is a non-twinned model that includes all IVs.

With a normal bias correction (Newman, 1993):

I | DV | IVs | twin | MAE |
---|---|---|---|---|

0 | y_outpatient | trivial | single | 6565 |

1 | y_outpatient | trivial | twin | 6555 |

2 | y_outpatient | demog | single | 7445 |

3 | y_outpatient | demog | twin | 7438 |

4 | y_outpatient | demog+cm | single | 7364 |

5 | y_outpatient | demog+cm | twin | 7374 |

6 | y_drugs | trivial | single | 12943 |

7 | y_drugs | trivial | twin | 12932 |

8 | y_drugs | demog | single | 29643 |

9 | y_drugs | demog | twin | 32039 |

10 | y_drugs | demog+cm | single | 29322 |

11 | y_drugs | demog+cm | twin | 31205 |

With the smearing estimate of bias (Newman, 1993):

I | DV | IVs | twin | MAE |
---|---|---|---|---|

0 | y_outpatient | trivial | single | 6565 |

1 | y_outpatient | trivial | twin | 6555 |

2 | y_outpatient | demog | single | 7920 |

3 | y_outpatient | demog | twin | 7934 |

4 | y_outpatient | demog+cm | single | 7135 |

5 | y_outpatient | demog+cm | twin | 7180 |

6 | y_drugs | trivial | single | 12943 |

7 | y_drugs | trivial | twin | 12932 |

8 | y_drugs | demog | single | 13079 |

9 | y_drugs | demog | twin | 13090 |

10 | y_drugs | demog+cm | single | 12922 |

11 | y_drugs | demog+cm | twin | 12971 |

With a second smear for the top decile (Buntin & Zaslavsky, 2004):

I | DV | IVs | twin | MAE |
---|---|---|---|---|

0 | y_outpatient | trivial | single | 6565 |

1 | y_outpatient | trivial | twin | 6555 |

2 | y_outpatient | demog | single | 7296 |

3 | y_outpatient | demog | twin | 7177 |

4 | y_outpatient | demog+cm | single | 6913 |

5 | y_outpatient | demog+cm | twin | 6924 |

6 | y_drugs | trivial | single | 12943 |

7 | y_drugs | trivial | twin | 12932 |

8 | y_drugs | demog | single | 13486 |

9 | y_drugs | demog | twin | 13500 |

10 | y_drugs | demog+cm | single | 13604 |

11 | y_drugs | demog+cm | twin | 13469 |

(.round pred-cv-inpatient)

I | DV | prob_IVs | amount_IVs | MAE |
---|---|---|---|---|

0 | y_inpatient | trivial | trivial | 8489 |

1 | y_inpatient | trivial | demog | 8489 |

2 | y_inpatient | trivial | demog+cm | 9832 |

3 | y_inpatient | demog | trivial | 8497 |

4 | y_inpatient | demog | demog | 8505 |

5 | y_inpatient | demog | demog+cm | 8501 |

6 | y_inpatient | demog+cm | trivial | 7809 |

7 | y_inpatient | demog+cm | demog | 7777 |

8 | y_inpatient | demog+cm | demog+cm | 6792 |

By a substantial margin, the most complex model wins. How about that.

Here's coefficients and CIs for the winning models fit to all the data (or all the data with nonzero inpatient costs, in the case of y_inpatient_amount).

(rd 2 (np.exp (cbind (get pred-coefci "y_outpatient") (get pred-coefci "y_drugs") (get pred-coefci "y_inpatient_amount"))))

I | estimate | lo | hi | estimate | lo | hi | estimate | lo | hi |
---|---|---|---|---|---|---|---|---|---|

Intercept | 2151.02 | 1804.09 | 2405.39 | 21981.99 | 20522.46 | 23625.38 | 3697.35 | 2320.70 | 5560.81 |

has_medicaid | 1.21 | 1.15 | 1.28 | 1.04 | 1.02 | 1.08 | 1.02 | 0.80 | 1.28 |

female | 1.19 | 1.11 | 1.29 | 0.89 | 0.86 | 0.92 | 0.93 | 0.74 | 1.09 |

race_black | 0.84 | 0.79 | 0.89 | 0.88 | 0.85 | 0.91 | 1.04 | 0.91 | 1.28 |

race_hispanic | 0.84 | 0.79 | 0.88 | 0.95 | 0.92 | 0.99 | 0.98 | 0.83 | 1.27 |

race_other_nonwhite | 0.85 | 0.75 | 0.94 | 0.99 | 0.93 | 1.04 | 1.05 | 0.69 | 1.51 |

urban | 1.00 | 0.93 | 1.15 | 1.05 | 0.99 | 1.10 | 0.98 | 0.74 | 1.25 |

hiv_docvol_50plus | 1.11 | 1.05 | 1.16 | 1.08 | 1.05 | 1.10 | 0.90 | 0.78 | 1.07 |

disabled | 1.05 | 0.93 | 1.17 | 1.04 | 0.98 | 1.10 | 1.17 | 0.78 | 1.67 |

cm_Congestive_heart_failure | 1.26 | 1.13 | 1.37 | 0.95 | 0.88 | 1.02 | 1.23 | 1.00 | 1.52 |

cm_Cardiac_arrhythmias | 1.37 | 1.22 | 1.50 | 1.01 | 0.96 | 1.07 | 1.45 | 1.26 | 1.74 |

cm_Valvular_disease | 1.27 | 1.10 | 1.44 | 1.02 | 0.94 | 1.12 | 1.08 | 0.85 | 1.41 |

cm_Peripheral_vascular_disorders | 1.33 | 1.21 | 1.52 | 1.05 | 0.98 | 1.12 | 1.19 | 0.92 | 1.51 |

cm_Hypertension__uncomplicated | 1.22 | 1.17 | 1.28 | 1.06 | 1.03 | 1.09 | 1.35 | 1.14 | 1.64 |

cm_Hypertension__complicated | 1.41 | 1.26 | 1.61 | 0.91 | 0.85 | 0.98 | 1.14 | 0.94 | 1.50 |

cm_Paralysis | 1.64 | 1.31 | 2.02 | 1.01 | 0.83 | 1.18 | 2.02 | 1.45 | 2.74 |

cm_Other_neurological_disorders | 1.40 | 1.24 | 1.53 | 1.04 | 0.98 | 1.11 | 1.66 | 1.44 | 2.09 |

cm_Pulmonary_circulation_disorders | 1.01 | 0.85 | 1.37 | 1.16 | 1.07 | 1.35 | 1.25 | 0.96 | 1.75 |

cm_Chronic_pulmonary_disease | 1.36 | 1.30 | 1.46 | 1.07 | 1.03 | 1.10 | 1.37 | 1.20 | 1.65 |

cm_Diabetes__uncomplicated | 1.14 | 1.07 | 1.23 | 1.10 | 1.07 | 1.15 | 1.14 | 0.89 | 1.33 |

cm_Diabetes__complicated | 1.26 | 1.12 | 1.47 | 0.94 | 0.86 | 1.03 | 1.04 | 0.85 | 1.40 |

cm_Hypothyroidism | 1.33 | 1.22 | 1.44 | 1.12 | 1.07 | 1.19 | 1.08 | 0.84 | 1.33 |

cm_Renal_failure | 1.40 | 1.29 | 1.51 | 1.07 | 1.02 | 1.11 | 1.25 | 0.94 | 1.50 |

cm_Liver_disease | 1.41 | 1.32 | 1.50 | 1.04 | 1.00 | 1.07 | 1.40 | 1.16 | 1.64 |

cm_Peptic_ulcer_disease | 1.32 | 1.00 | 1.79 | 0.88 | 0.61 | 1.06 | 1.58 | 1.15 | 2.22 |

cm_Lymphoma | 1.68 | 1.43 | 1.88 | 0.99 | 0.91 | 1.07 | 1.53 | 1.16 | 2.05 |

cm_Metastatic_cancer | 2.25 | 1.41 | 2.86 | 1.05 | 0.90 | 1.18 | 1.66 | 1.07 | 2.95 |

cm_Solid_tumor_without_metastasis | 1.76 | 1.56 | 1.92 | 1.05 | 1.00 | 1.11 | 1.20 | 0.91 | 1.52 |

cm_Rheumatoid_arthritis | 1.27 | 1.11 | 1.49 | 1.00 | 0.88 | 1.11 | 0.81 | 0.58 | 1.21 |

cm_Coagulopathy | 1.13 | 1.01 | 1.35 | 0.97 | 0.88 | 1.04 | 1.50 | 1.19 | 1.84 |

cm_Coagulopathy_hemophilia | 20.39 | 3.07 | 40.03 | 1.25 | 1.00 | 1.68 | 2.28 | 1.00 | 4.29 |

cm_Blood_loss_anemia | 1.13 | 0.93 | 1.35 | 0.99 | 0.77 | 1.26 | 0.90 | 0.73 | 1.81 |

cm_Deficiency_anemia | 1.66 | 1.53 | 1.81 | 1.08 | 1.01 | 1.12 | 1.24 | 1.00 | 1.44 |

cm_Obesity | 1.12 | 1.03 | 1.26 | 0.98 | 0.92 | 1.07 | 1.56 | 1.11 | 2.01 |

cm_Weight_loss | 1.37 | 1.25 | 1.50 | 1.05 | 1.00 | 1.12 | 1.72 | 1.40 | 1.95 |

cm_Fluid_and_electrolyte_disorders | 1.17 | 1.07 | 1.27 | 0.94 | 0.89 | 1.00 | 2.31 | 2.01 | 2.67 |

age_std | 1.04 | 0.98 | 1.10 | 1.04 | 1.01 | 1.07 | 0.87 | 0.71 | 1.00 |

age_std2 | 1.04 | 0.97 | 1.09 | 0.85 | 0.83 | 0.89 | 1.07 | 0.91 | 1.20 |

From left to right, these are the coefficients and 95% confidence limits for: outpatient costs, drug costs, and conditional inpatient costs. Everything has been already antilogged, so 1.27 means an effect of multiplying the median by 1.27. I excluded the intercept since it's of limited interest in a model that's effectively multiplicative.

I couldn't find any preexisting method to find CIs for lasso-regularized quantile regression, so I using bootstrapping. Some simulations convinced me that the resulting CIs indeed have coverage probabilities near 95%.

(rd 2 (np.exp (get pred-coefci "y_inpatient_prob")))

I | estimate | lo | hi |
---|---|---|---|

Intercept | 0.06 | 0.04 | 0.09 |

has_medicaid | 1.49 | 1.29 | 1.73 |

female | 1.10 | 0.91 | 1.32 |

race_black | 1.00 | 0.86 | 1.17 |

race_hispanic | 0.96 | 0.83 | 1.12 |

race_other_nonwhite | 0.83 | 0.61 | 1.11 |

urban | 1.48 | 1.12 | 1.97 |

hiv_docvol_50plus | 0.92 | 0.82 | 1.04 |

disabled | 0.81 | 0.60 | 1.10 |

cm_Congestive_heart_failure | 1.50 | 1.13 | 1.99 |

cm_Cardiac_arrhythmias | 3.85 | 3.04 | 4.87 |

cm_Valvular_disease | 1.88 | 1.28 | 2.75 |

cm_Peripheral_vascular_disorders | 1.32 | 1.00 | 1.72 |

cm_Hypertension__uncomplicated | 1.87 | 1.65 | 2.13 |

cm_Hypertension__complicated | 2.81 | 2.08 | 3.80 |

cm_Paralysis | 4.79 | 2.85 | 8.20 |

cm_Other_neurological_disorders | 3.86 | 3.02 | 4.92 |

cm_Pulmonary_circulation_disorders | 2.24 | 1.36 | 3.71 |

cm_Chronic_pulmonary_disease | 2.65 | 2.27 | 3.09 |

cm_Diabetes__uncomplicated | 1.02 | 0.85 | 1.22 |

cm_Diabetes__complicated | 1.09 | 0.79 | 1.49 |

cm_Hypothyroidism | 1.21 | 0.96 | 1.52 |

cm_Renal_failure | 0.93 | 0.74 | 1.18 |

cm_Liver_disease | 1.76 | 1.52 | 2.04 |

cm_Peptic_ulcer_disease | 2.13 | 1.01 | 4.49 |

cm_Lymphoma | 2.63 | 1.79 | 3.86 |

cm_Metastatic_cancer | 1.77 | 0.82 | 3.86 |

cm_Solid_tumor_without_metastasis | 2.01 | 1.60 | 2.51 |

cm_Rheumatoid_arthritis | 1.29 | 0.89 | 1.86 |

cm_Coagulopathy | 4.73 | 3.36 | 6.70 |

cm_Coagulopathy_hemophilia | 0.84 | 0.29 | 2.22 |

cm_Blood_loss_anemia | 3.66 | 1.74 | 7.95 |

cm_Deficiency_anemia | 1.19 | 0.94 | 1.50 |

cm_Obesity | 1.81 | 1.33 | 2.46 |

cm_Weight_loss | 2.10 | 1.70 | 2.58 |

cm_Fluid_and_electrolyte_disorders | 9.89 | 7.99 | 12.30 |

age_std | 0.59 | 0.51 | 0.69 |

age_std2 | 1.03 | 0.90 | 1.19 |

Here are the coefficients and confidence intervals for the logistic-regression model for nonzero costs. Again, I've antilogged everything, so you're looking at odds ratios instead of log odds ratios.

(sns.set-style "white") (for [spi (range 4)] (setv ax (plt.subplot 1 4 (inc spi))) (.xaxis.grid ax T) (setv dv (get (qw y_outpatient y_drugs y_inpatient_amount y_inpatient_prob) spi)) (setv d (.drop (get pred-coefci dv) ["Intercept" "cm_Coagulopathy_hemophilia"])) (setv d (getl d (cut (sorted d.index :key (λ (.startswith it "cm_"))) None None -1))) (plt.axvline :x 0 :color "black" :zorder 1 :linewidth .5) (.hlines ax (range (len d)) ($ d lo) ($ d hi) :zorder 2) (.scatter ax ($ d estimate) (range (len d)) :zorder 3 :edgecolor "none") (plt.ylim [-.5 (+ (len d) -1 .5)]) (plt.xlim [-1.02 3]) (plt.xticks [-1 0 1 2]) (.set-xlabel ax dv) (if spi (.yaxis.set_major_locator (plt.gca) (plt.NullLocator))) (plt.yticks (range (len d)) (if spi [] (list d.index)))) (sns.despine :left T :right T :top T :bottom T)

Here is a pictorial representation of the above two tables. The dots show the estimates and the horizontal line segments show the confidence intervals. This time, I haven't antilogged the coefficients, so the scale is visually symmetric for increases and decreases. (Antilogging numbers that range from -∞ to +∞ compresses the whole (-∞, 0) range to (0, 1) and further expands anything in (0, ∞).) And I excluded `cm_Coagulopathy_hemophilia`

because it's so big it would require expanding the x-axis a lot just to accommodate it. Clearly, as in the tables, I'd need to clean up all the labels for publication.

#### Squared error

(display-pred-cv pred-cv-orx-rmse)

I | DV | IVs | twin | RMSE |
---|---|---|---|---|

0 | y_outpatient | trivial | single | 29,107 |

1 | y_outpatient | trivial | twin | 29,111 |

2 | y_outpatient | demog | single | 29,103 |

3 | y_outpatient | demog | twin | 29,111 |

4 | y_outpatient | demog+cm | single | 28,193 |

5 | y_outpatient | demog+cm | twin | 28,265 |

6 | y_drugs | trivial | single | 22,569 |

7 | y_drugs | trivial | twin | 22,533 |

8 | y_drugs | demog | single | 22,567 |

9 | y_drugs | demog | twin | 22,531 |

10 | y_drugs | demog+cm | single | 22,559 |

11 | y_drugs | demog+cm | twin | 22,526 |

Here I use lasso-regularized linear regression, and a multiplicative bias correction factor that I choose with another linear regression model in place of a smearing estimate or the normal bias correction factor or the like.

Here's what you'd get with OLS, without the lasso:

I | DV | IVs | twin | RMSE |
---|---|---|---|---|

0 | y_outpatient | trivial | single | 29,107 |

1 | y_outpatient | trivial | twin | 29,111 |

2 | y_outpatient | demog | single | 29,088 |

3 | y_outpatient | demog | twin | 29,104 |

4 | y_outpatient | demog+cm | single | 30,169 |

5 | y_outpatient | demog+cm | twin | 30,244 |

6 | y_drugs | trivial | single | 22,569 |

7 | y_drugs | trivial | twin | 22,533 |

8 | y_drugs | demog | single | 22,458 |

9 | y_drugs | demog | twin | 22,393 |

10 | y_drugs | demog+cm | single | 22,377 |

11 | y_drugs | demog+cm | twin | 22,381 |

## Association of individual cost types

Here we have two tables of regression, one for each beneficiary status. Outpatient costs, inpatient costs (only among subjects with nonzero inpatient costs), drug costs, and subtotals (the sum of outpatient, inpatient, and drug costs) are fit with quantile regression: the central tendency is the median, the base error is the mean absolute deviation from the median (MAD) and the model error is the mean absolute error (MAE). The probability of nonzero inpatient costs is fit with logistic regression: the central tendency is the mean, the base error is the variance, and the model error is the mean squared error (i.e., the Brier score, which is a proper scoring rule; Brier, 1950; Bröcker, 2009).

For the quantile-regression models, the DVs are untransformed, so each coefficient can be interpreted directly as a linear increase of the conditional median, in dollars.

A few subjects were missing on the urban-rural variable. They were simply dropped.

Among Medicare beneficiaries, we only include subjects with Part D coverage for the full year.

Each row below "Intercept" shows the coefficient for the given variable. `age_std`

is age standardized to have SD 1/2 (per Gelman, 2008), and `age_std2`

is its square, standardized again.

```
(display-assoc "mcare_only")
```

I | subtotal | outpatient | inpatient_isnonzero | inpatient_nonzero | drugs |
---|---|---|---|---|---|

(n) | 1,551 | 1,551 | 1,551 | 183 | 1,551 |

(central tendency) | 34,016 | 3,808 | 0.12 | 15,203 | 27,093 |

(base error) | 18,482 | 6,324 | 0.10 | 18,425 | 11,932 |

(model error) | 16,203 | 5,389 | 0.06 | 14,869 | 11,281 |

Intercept | 19,166 | 894 | −3.62 | 22,879 | 23,241 |

female | 50 | 348 | 1.18 | 115 | −4,339 |

age_std | 1,991 | 336 | −0.46 | −373 | −851 |

age_std2 | −2,031 | 213 | −0.44 | 2,509 | −4,673 |

race_black | −5,277 | −968 | −0.33 | −3,376 | −4,969 |

race_hispanic | −3,487 | −453 | −0.04 | 2,385 | −4,198 |

race_other_nonwhite | −2,003 | −371 | −0.59 | −2,548 | −608 |

urban | 4,608 | 1,110 | 0.18 | −15,012 | 1,729 |

hiv_docvol_50plus | 2,246 | 173 | 0.02 | −2,427 | 2,556 |

disabled | 5,719 | 352 | −0.28 | 4,695 | 937 |

cm_Congestive_heart_failure | 3,176 | 857 | −0.21 | 4,331 | −539 |

cm_Cardiac_arrhythmias | 7,122 | 2,890 | 1.47 | 6,490 | 1,553 |

cm_Valvular_disease | 12,428 | 2,043 | 0.03 | 7,577 | 5,468 |

cm_Peripheral_vascular_disorders | 1,664 | 1,431 | 0.19 | 10,999 | 247 |

cm_Hypertension__uncomplicated | 2,092 | 521 | 0.78 | 975 | 1,159 |

cm_Hypertension__complicated | 1,725 | 1,308 | 0.94 | −3,613 | −1,078 |

cm_Paralysis | 33,092 | 5,713 | 3.11 | 4,841 | 1,152 |

cm_Other_neurological_disorders | 5,711 | 2,401 | 1.14 | 2,748 | 1,810 |

cm_Pulmonary_circulation_disorders | 18,202 | 5,635 | 2.57 | −894 | 3,046 |

cm_Chronic_pulmonary_disease | 5,519 | 1,567 | 0.45 | 10,382 | 1,836 |

cm_Diabetes__uncomplicated | 3,488 | 490 | 0.30 | −413 | 3,930 |

cm_Diabetes__complicated | 8,099 | 2,404 | −1.35 | −496 | −266 |

cm_Hypothyroidism | 8,869 | 2,156 | 1.05 | 443 | 2,125 |

cm_Renal_failure | 3,892 | 1,719 | 0.32 | 2,724 | 244 |

cm_Liver_disease | 2,575 | 1,649 | 0.36 | −2,368 | −284 |

cm_Peptic_ulcer_disease | 2,511 | 0 | −12.88 | 0 | 2,871 |

cm_Lymphoma | 12,123 | 6,075 | 0.45 | 23,220 | −4,035 |

cm_Metastatic_cancer | 24,862 | 10,077 | 0.61 | 0 | −21 |

cm_Solid_tumor_without_metastasis | 4,271 | 3,615 | 0.72 | 5,131 | 308 |

cm_Rheumatoid_arthritis | 7,853 | 4,074 | 0.93 | −1,214 | 641 |

cm_Coagulopathy | 7,510 | 2,391 | 1.98 | 8,816 | 1,029 |

cm_Coagulopathy_hemophilia | 58,655 | 97,335 | 0.40 | 15,886 | 7,497 |

cm_Blood_loss_anemia | 13,421 | 3,237 | 1.19 | 14,520 | 4,753 |

cm_Deficiency_anemia | 3,748 | 736 | −0.23 | 29,003 | −379 |

cm_Obesity | 7 | 1,089 | −0.16 | 23,943 | −5,062 |

cm_Weight_loss | 9,330 | 1,762 | 0.90 | −384 | 3,017 |

cm_Fluid_and_electrolyte_disorders | 13,241 | 3,773 | 3.02 | 1,062 | −2,818 |

```
(display-bnc "mcare_only")
```

n_comorbs | subjects | median_subtotal |
---|---|---|

0 | 536 | 27,777 |

1 | 461 | 33,427 |

2 | 252 | 37,191 |

3 | 138 | 37,552 |

4 | 52 | 42,583 |

5 | 43 | 49,880 |

6 | 28 | 74,897 |

7 | 21 | 67,386 |

≥ 8 | 20 | 112,836 |

```
(display-assoc "dual")
```

I | subtotal | outpatient | inpatient_isnonzero | inpatient_nonzero | drugs |
---|---|---|---|---|---|

(n) | 6,137 | 6,137 | 6,137 | 1,242 | 6,137 |

(central tendency) | 35,630 | 4,674 | 0.20 | 18,692 | 27,347 |

(base error) | 25,091 | 7,444 | 0.16 | 29,595 | 13,295 |

(model error) | 20,224 | 5,901 | 0.10 | 25,463 | 12,920 |

Intercept | 26,602 | 2,447 | −2.95 | 4,919 | 23,926 |

female | −1,234 | 970 | 0.12 | −763 | −2,407 |

age_std | −1,060 | −91 | −0.47 | −3,166 | 538 |

age_std2 | −4,113 | −63 | 0.17 | 12 | −2,803 |

race_black | −3,072 | −404 | 0.04 | −877 | −3,308 |

race_hispanic | −1,153 | −459 | −0.04 | 1,078 | −1,055 |

race_other_nonwhite | 354 | −96 | 0.01 | −315 | 670 |

urban | 2,962 | −292 | 0.10 | 3,097 | 1,116 |

hiv_docvol_50plus | 211 | 174 | −0.13 | 10 | 1,294 |

disabled | −971 | 224 | −0.00 | −1,842 | 810 |

cm_Congestive_heart_failure | 11,874 | 3,107 | 0.39 | 4,336 | −1,170 |

cm_Cardiac_arrhythmias | 10,714 | 3,127 | 1.14 | 6,130 | 195 |

cm_Valvular_disease | 15,066 | 5,068 | 0.65 | 14,762 | 406 |

cm_Peripheral_vascular_disorders | 11,217 | 3,283 | 0.14 | 8,595 | 2,559 |

cm_Hypertension__uncomplicated | 2,944 | 878 | 0.70 | 2,925 | 1,202 |

cm_Hypertension__complicated | 14,482 | 5,381 | 0.97 | 3,746 | −163 |

cm_Paralysis | 21,317 | 5,188 | 1.06 | 6,961 | 2,361 |

cm_Other_neurological_disorders | 16,184 | 3,367 | 1.51 | 8,319 | 1,875 |

cm_Pulmonary_circulation_disorders | 18,709 | 3,687 | −0.00 | 10,189 | 5,691 |

cm_Chronic_pulmonary_disease | 7,827 | 2,356 | 1.14 | 2,334 | 1,800 |

cm_Diabetes__uncomplicated | 2,705 | 435 | 0.14 | −1,572 | 2,745 |

cm_Diabetes__complicated | 8,199 | 2,768 | 0.16 | 4,942 | 479 |

cm_Hypothyroidism | 4,939 | 1,488 | 0.28 | −3,439 | 3,123 |

cm_Renal_failure | 6,878 | 1,849 | −0.12 | 4,742 | 1,254 |

cm_Liver_disease | 5,160 | 1,785 | 0.62 | 3,092 | 1,041 |

cm_Peptic_ulcer_disease | 19,533 | 5,758 | 1.13 | 54,846 | −3,050 |

cm_Lymphoma | 9,799 | 2,303 | 0.88 | 3,007 | −1,233 |

cm_Metastatic_cancer | 30,821 | 14,990 | 1.41 | 2,750 | 1,361 |

cm_Solid_tumor_without_metastasis | 9,499 | 4,782 | 0.57 | 5,440 | 1,195 |

cm_Rheumatoid_arthritis | 3,466 | 1,216 | −0.15 | −221 | 1,750 |

cm_Coagulopathy | 20,571 | 2,699 | 1.35 | 13,974 | −2,334 |

cm_Coagulopathy_hemophilia | 170,713 | 79,780 | −0.23 | 16,538 | 12,508 |

cm_Blood_loss_anemia | 8,795 | 4,863 | 1.26 | 12,058 | −1,788 |

cm_Deficiency_anemia | 11,435 | 4,041 | 0.07 | 4,719 | 2,502 |

cm_Obesity | 3,297 | 156 | 0.78 | 4,450 | 434 |

cm_Weight_loss | 17,755 | 2,807 | 0.76 | 12,035 | 3,392 |

cm_Fluid_and_electrolyte_disorders | 18,912 | 3,560 | 2.39 | 4,532 | −1,214 |

```
(display-bnc "dual")
```

n_comorbs | subjects | median_subtotal |
---|---|---|

0 | 2,208 | 28,840 |

1 | 1,539 | 34,133 |

2 | 950 | 39,367 |

3 | 531 | 47,967 |

4 | 305 | 56,389 |

5 | 205 | 62,527 |

6 | 131 | 88,344 |

7 | 93 | 100,588 |

8 | 58 | 116,936 |

9 | 35 | 115,445 |

10 | 29 | 164,307 |

11 | 17 | 223,938 |

12 | 19 | 190,814 |

≥ 13 | 17 | 274,452 |

## Methods paper

If we chose models on the basis of fit rather than predictive accuracy, then we know a priori that the most complex models would win. This means that for outpatient and drug costs, twin models with all IVs would win (which is similar to the real winning models, non-twin with all IVs), whereas for inpatient costs, all IVs would win (which is the same result we got with predictive accuracy).

We might also compare model selection based on predictive accuracy to model selection based on p-values. In particular, we can use the simple-minded method of trying all the predictors and keeping the significant ones. But how shall I test the significance of the coefficients? `quantreg`

doesn't supply significance tests for models fit with a lasso penalty. Let's use the method of Redden, Fernández, and Allison (2004).

Scott: "Thinking of discussion points: Predictive analytics have not been big in HIV research in part because of smaller data sets we typically collect. However, large data sets, such as Medicaid / Medicare data, are becoming more of the norm than the exception in HIV research. Think of ATN and other large scale studies. NIH push to make data sets publicly available. Virtual cohorts with medical chart data. Basically, predictive analytic tools are needed."

Rob suggested trying a Box-Cox, but this would probably be a hassle to use with quantile regression or lasso linear regression, because it means there's another kind of parameter to fit. Besides, Box-Cox sort of ruins the interpretability of the coefficients, so it's not a good choice when coefficient interpretation is a goal.

I found that I wasn't able to get the sort of improvements on square error that I was able to get on absolute error. My impression is that this is basically a consequence of the high skew of the DVs: a few high values affect square error more than absolute error, so models that optimize mean square error will make a stupider sort of compromise than models that optimize absolute error. This could be discussed in the method paper with a simple, mathematically clean example, and with the RMSEs for outpatient costs.

### Outline

- Describe prediction and contrast it with association (as in Arfer & Luhmann, 2017)
- Describe the method I used in the first paper on Comorbid
- Contrast the models thus selected with those that would be selected if we just fit models with all the IVs and then dropped IVs with non-significant coefficients
- We can contrast the models in terms of predictive accuracy as well as just the coefficients

- Contrast the fits of the models with their predictive accuracies
- Discussion: predictive methods are more useful these days now that large datasets are more widely available (e.g., the ATN, virtual cohorts with medical charts, increasing interest in making datasets publicly accessible)

## References

Arfer, K. B., & Luhmann, C. C. (2017). Time-preference tests fail to predict behavior related to self-control. *Frontiers in Psychology, 8*(150). doi:10.3389/fpsyg.2017.00150

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. *Monthly Weather Review, 78*(1), 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. *Quarterly Journal of the Royal Meteorological Society, 135*(643), 1512–1519. doi:10.1002/qj.456

Buntin, M. B., & Zaslavsky, A. M. (2004). Too much ado about two-part models and transformation? *Journal of Health Economics, 23*(3), 525–542. doi:10.1016/j.jhealeco.2003.10.005

Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. *Statistics in Medicine, 27*(15), 2865–2873. doi:10.1002/sim.3107

Newman, M. C. (1993). Regression analysis of log-transformed data: Statistical bias and its correction. *Environmental Toxicology and Chemistry, 12*(6), 1129–1133. doi:10.1002/etc.5620120618

Redden, D. T., Fernández, J. R., & Allison, D. B. (2004). A simple significance test for quantile regression. *Statistics in Medicine, 23*(16), 2587–2597. doi:10.1002/sim.1839