Comorbid notebook

The context of the data

All results with cell sizes 10 or below need approval to be distributed in any manner.

Jennifer:

"for the comorbidity analysis, we have only been using 2010 data since it's the most recent we have. The 3 analysis files (all_*) were created using the 2010 claims."
"the three all_* files will be best for you to work with, so you don't have to deal with the raw claims quite yet. When we asked for the data, we didn't get all Medicare/Medicaid enrollees, we only got those who had an ARV claim or HIV diagnosis at any point. Then we further refine our sample using the algorithm. Due to the nature of claims, we also have to limit to people enrolled for the full year, are fee-for-service (rather than managed care), and are in California the entire year"
"Regarding the ID variables, bene_id is the Medicare ID so we use it for Medicare only and Duals and msis_id is the Medicaid ID which we use for Medicaid only patients. Throughout the years, Medicaid patients have gradually also been assigned a bene_id, but there are still some people who don't have one. The reason we have bene_id_10 is because we found inconsistencies in IDs when merging across years, this version is needed to merge with raw claims for that year. The bene_id variable allows for consistency in merging across years.… There are a few other caveats with the ID variables we can discuss at a later point when needed."
The three files have the same variables

Redacted claims

Jennifer: "Claims relating to substance abuse were redacted for our 2009 and 2010 data, thus these payments are not included in our individual claims. Thus, we will find lower total costs for those who have redacted claims."

Eligibility	Claim Type	# Suppressed Claims	Total Claims	% Claims Suppressed	# Beneficiaries w/ Suppressed Claims	Total Beneficiaries in Claims	% Beneficiaries w/ Suppressed Claims	Medicaid Payment (MDCD_PYMT_AMT) Suppressed Claims	Total Medicaid Payment (MDCD_PYMT_AMT) in Claims
All	Inpatient	773,353	9,378,620	8.25%	417,977	6,106,602	6.84%	3,040,765,005	35,166,806,936
	Long-Term Care	225,235	33,319,656	0.68%	38,945	1,648,083	2.36%	522,839,791	62,508,895,726
	Other Services	37,376,167	2,520,159,210	1.48%	1,415,671	62,063,859	2.28%	1,706,728,920	213,731,343,805
	Total	38,374,755	2,562,857,486	1.50%	1,628,983	62,381,748	2.61%	5,270,333,716	311,407,046,467
Aged	Inpatient	19,737	951,402	2.07%	16,697	618,562	2.70%	41,584,740	1,973,545,032
	Long-Term Care	71,432	22,972,515	0.31%	5,426	1,081,340	0.50%	151,269,024	36,690,837,529
	Other Services	441,310	239,166,210	0.18%	24,220	4,100,416	0.59%	22,440,257	25,928,333,797
	Total	532,479	263,090,127	0.20%	41,666	4,258,713	0.98%	215,294,021	64,592,716,358
Disabled / Blind	Inpatient	389,806	2,831,711	13.77%	215,676	1,418,478	15.20%	1,806,188,438	16,188,598,664
	Long-Term Care	98,952	9,466,201	1.05%	17,048	457,139	3.73%	242,493,452	23,612,634,633
	Other Services	15,469,683	772,861,048	2.00%	549,705	9,493,018	5.79%	674,012,800	90,720,882,400
	Total	15,958,441	785,158,960	2.03%	658,766	9,524,493	6.92%	2,722,694,690	130,522,115,697
Child	Inpatient	27,627	2,314,122	1.19%	21,927	1,794,121	1.22%	124,069,187	7,485,361,209
	Long-Term Care	42,109	414,447	10.16%	10,291	72,559	14.18%	98,818,837	1,267,625,658
	Other Services	3,514,205	977,750,886	0.36%	220,256	32,530,394	0.68%	241,240,915	54,673,886,767
	Total	3,583,941	980,479,455	0.37%	235,895	32,561,585	0.72%	464,128,939	63,426,873,634
Adult	Inpatient	303,819	2,941,126	10.33%	144,368	2,052,181	7.03%	908,646,835	8,006,862,338
	Long-Term Care	8,499	119,370	7.12%	4,976	20,899	23.81%	15,911,261	131,464,388
	Other Services	16,739,091	480,568,052	3.48%	586,975	15,494,187	3.79%	730,157,393	37,339,227,221
	Total	17,051,409	483,628,548	3.53%	647,872	15,573,669	4.16%	1,654,715,489	45,477,553,947

(Only the claims that contain the substance abuse codes are removed; not the beneficiary and all claims associated with that beneficiary)

The "MAX Uniform Eligibility Code - For Month of Service" (EL_MAX_ELGBLTY_CD_MO) variable from each claim was used to determine eligibility grougs.

Aged: 11, 21, 31, 41, 51
Blind/Disabled: 12, 22, 32, 42, 52
Child: 14, 16, 24, 34, 44, 48, 54
Adult: 15, 17, 25, 35, 45, 55, 3A

Descriptives

These demographic tables use the same sample inclusion criteria as the association analyses described below.

(display-demog "mcare_only")

I	value
Female, proportion	.048
Age, minimum	31
Age, median	56
Age, maximum	94
Race, white, proportion	.761
Race, black, proportion	.070
Race, Hispanic, proportion	.117
Race, other, proportion	.052
Urban, proportion	.949
High-volume HIV care provider, proportion	.582
Disabled, proportion	.773

(display-demog "dual")

I	value
Female, proportion	.118
Age, minimum	21
Age, median	50
Age, maximum	90
Race, white, proportion	.527
Race, black, proportion	.212
Race, Hispanic, proportion	.223
Race, other, proportion	.038
Urban, proportion	.944
High-volume HIV care provider, proportion	.661
Disabled, proportion	.955

Prediction of total costs

These cross-validation analyses of predictive accuracy use all beneficiaries from 2010. The predictors are gender, race, age, age squared, disabled status, and each of the comorbid conditions.

	mcaid_only	mcare_only	dual
Trivial (predict median)	18,918	22,280	26,439
OLS (no log transformation)	17,819	21,613	23,801
OLS	17,455	22,399	22,470
Ridge regression	17,503	22,377	22,448
Elastic net	17,422	22,267	22,428
Elastic net w/ all 1st-order interactions	17,304	22,497	22,407
Quantile regression	17,053	20,279	22,327
Quantile regression w/ lasso	17,090	20,597	22,354
Quantile regression w/ lasso, some interactions	17,152	20,832	22,420
Quantile regression w/ lasso, inpatient	17,338	21,394	25,516
Random forest	18,940	23,274	24,231
[Min MAE, gender and race]	18,834	22,207	26,408
[Min MAE, all IVs but age]	11,594	12,977	13,811

Here, for each of the three datasets and several models, we have the cross-validated mean absolute error (MAE) in predicting total expenditure per subject, in dollars. This means that, e.g., a MAE of $15,000 implies that the model's point prediction of a beneficiary's expenditures would have a mean distance from the true value of $15,000.

Except where marked, each model log-transforms the DV for model-fitting, then exp-transforms the predictions. You can see that this trick increases predictive accuracy (according to MAE) quite a bit.

The best-performing model in each case is unregularized quantile regression. In the case of Medicare-only beneficiares, there is a large improvement upon OLS.

Regularization (as the lasso) is necessary for the more complex quantile regression models because the model-fitting procedure has problems finding a solution otherwise. "Quantile regression w/ lasso, some interactions" interacts age and age squared with each comorbidity, and gender with each race. "Quantile regression w/ lasso, inpatient" uses logistic regression to predict whether the patient had any inpatient costs, and then applies a separately trained quantile regression model depending on each patient's predicted inpatient status.

The last two rows give a kind of lower bound on possible MAE for the demographic IVs and all IVs (demographic variables plus comorbidity flags), respectively. They are simply the MAE achieved when each subject's true value is predicted with the median among all subjects with precisely the same IV values (except for age, since it's continuous, unlike all the other IVs). So, unless age is very informative, we can't get MAEs below these numbers, and we probably won't be able to get MAEs very near them, either; getting a MAE that low in training would almost certainly lead to overfitting in testing.

Prediction of individual cost types

Thinking

Let's log the DV, as before.

Let's compare trivial models (i.e. predicting the median and ignoring all IVs) to quantile regression. Let's also compare models that have only the demographic variables as IVs to models that have comorbidities. And, let's compare models with a dummy variable for having Medicaid to twinned models (one for Medicare only, one for duals). That means we have…

Trivial, single
Trivial, twinned
Quantile regression, demographic IVs, single
Quantile regression, demographic IVs, twinned
Quantile regression, all IVs, single
Quantile regression, all IVs, twinned

…for each of the three cost types, and then there's the question of how to handle the (zero-inflated) inpatient costs. All the above methods could be applied unaltered, or you could precede them with a logistic-regression model that decides whether the patient has nonzero inpatient costs, and we can imagine four ways to use this logistic-regression model based on the choice of IVs (demographic-only versus all) and how we deal with Medicare-only versus duals (dummy variable versus twinned models). Ouch! But from talking to Arleen and Jennifer, I guess we'd best bite the bullet and consider all 16 nontrivial models for inpatient costs.

Since twin models didn't prevail for outpatient or drug costs, let's forget about those. That gives us the following models for inpatient costs:

Fully trivial: just guess the median
Quantile regression only (demographic IVs)
Quantile regression only (all IVs)
Logistic regression for nonzero (demographic IVs), then guess the conditional median
Logistic regression for nonzero (demographic IVs), then quantile regression (demographic IVs)
Logistic regression for nonzero (demographic IVs), then quantile regression (all IVs
Logistic regression for nonzero (all IVs), then guess the conditional median
Logistic regression for nonzero (all IVs), then quantile regression (demographic IVs)
Logistic regression for nonzero (all IVs), then quantile regression (all IVs)

Results

Absolute error

(display-pred-cv pred-cv-orx-mae)

I	DV	IVs	twin	MAE
0	y_outpatient	trivial	single	6,565
1	y_outpatient	trivial	twin	6,555
2	y_outpatient	demog	single	6,495
3	y_outpatient	demog	twin	6,487
4	y_outpatient	demog+cm	single	6,146
5	y_outpatient	demog+cm	twin	6,165
6	y_drugs	trivial	single	12,943
7	y_drugs	trivial	twin	12,932
8	y_drugs	demog	single	12,701
9	y_drugs	demog	twin	12,683
10	y_drugs	demog+cm	single	12,581
11	y_drugs	demog+cm	twin	12,595

The best-performing model for both DVs is a non-twinned model that includes all IVs.

With a normal bias correction (Newman, 1993):

I	DV	IVs	twin	MAE
0	y_outpatient	trivial	single	6565
1	y_outpatient	trivial	twin	6555
2	y_outpatient	demog	single	7445
3	y_outpatient	demog	twin	7438
4	y_outpatient	demog+cm	single	7364
5	y_outpatient	demog+cm	twin	7374
6	y_drugs	trivial	single	12943
7	y_drugs	trivial	twin	12932
8	y_drugs	demog	single	29643
9	y_drugs	demog	twin	32039
10	y_drugs	demog+cm	single	29322
11	y_drugs	demog+cm	twin	31205

With the smearing estimate of bias (Newman, 1993):

I	DV	IVs	twin	MAE
0	y_outpatient	trivial	single	6565
1	y_outpatient	trivial	twin	6555
2	y_outpatient	demog	single	7920
3	y_outpatient	demog	twin	7934
4	y_outpatient	demog+cm	single	7135
5	y_outpatient	demog+cm	twin	7180
6	y_drugs	trivial	single	12943
7	y_drugs	trivial	twin	12932
8	y_drugs	demog	single	13079
9	y_drugs	demog	twin	13090
10	y_drugs	demog+cm	single	12922
11	y_drugs	demog+cm	twin	12971

With a second smear for the top decile (Buntin & Zaslavsky, 2004):

I	DV	IVs	twin	MAE
0	y_outpatient	trivial	single	6565
1	y_outpatient	trivial	twin	6555
2	y_outpatient	demog	single	7296
3	y_outpatient	demog	twin	7177
4	y_outpatient	demog+cm	single	6913
5	y_outpatient	demog+cm	twin	6924
6	y_drugs	trivial	single	12943
7	y_drugs	trivial	twin	12932
8	y_drugs	demog	single	13486
9	y_drugs	demog	twin	13500
10	y_drugs	demog+cm	single	13604
11	y_drugs	demog+cm	twin	13469

(.round pred-cv-inpatient)

I	DV	prob_IVs	amount_IVs	MAE
0	y_inpatient	trivial	trivial	8489
1	y_inpatient	trivial	demog	8489
2	y_inpatient	trivial	demog+cm	9832
3	y_inpatient	demog	trivial	8497
4	y_inpatient	demog	demog	8505
5	y_inpatient	demog	demog+cm	8501
6	y_inpatient	demog+cm	trivial	7809
7	y_inpatient	demog+cm	demog	7777
8	y_inpatient	demog+cm	demog+cm	6792

By a substantial margin, the most complex model wins. How about that.

Here's coefficients and CIs for the winning models fit to all the data (or all the data with nonzero inpatient costs, in the case of y_inpatient_amount).

(rd 2 (np.exp (cbind
  (get pred-coefci "y_outpatient")
  (get pred-coefci "y_drugs")
  (get pred-coefci "y_inpatient_amount"))))

I	estimate	lo	hi	estimate	lo	hi	estimate	lo	hi
Intercept	2151.02	1804.09	2405.39	21981.99	20522.46	23625.38	3697.35	2320.70	5560.81
has_medicaid	1.21	1.15	1.28	1.04	1.02	1.08	1.02	0.80	1.28
female	1.19	1.11	1.29	0.89	0.86	0.92	0.93	0.74	1.09
race_black	0.84	0.79	0.89	0.88	0.85	0.91	1.04	0.91	1.28
race_hispanic	0.84	0.79	0.88	0.95	0.92	0.99	0.98	0.83	1.27
race_other_nonwhite	0.85	0.75	0.94	0.99	0.93	1.04	1.05	0.69	1.51
urban	1.00	0.93	1.15	1.05	0.99	1.10	0.98	0.74	1.25
hiv_docvol_50plus	1.11	1.05	1.16	1.08	1.05	1.10	0.90	0.78	1.07
disabled	1.05	0.93	1.17	1.04	0.98	1.10	1.17	0.78	1.67
cm_Congestive_heart_failure	1.26	1.13	1.37	0.95	0.88	1.02	1.23	1.00	1.52
cm_Cardiac_arrhythmias	1.37	1.22	1.50	1.01	0.96	1.07	1.45	1.26	1.74
cm_Valvular_disease	1.27	1.10	1.44	1.02	0.94	1.12	1.08	0.85	1.41
cm_Peripheral_vascular_disorders	1.33	1.21	1.52	1.05	0.98	1.12	1.19	0.92	1.51
cm_Hypertension__uncomplicated	1.22	1.17	1.28	1.06	1.03	1.09	1.35	1.14	1.64
cm_Hypertension__complicated	1.41	1.26	1.61	0.91	0.85	0.98	1.14	0.94	1.50
cm_Paralysis	1.64	1.31	2.02	1.01	0.83	1.18	2.02	1.45	2.74
cm_Other_neurological_disorders	1.40	1.24	1.53	1.04	0.98	1.11	1.66	1.44	2.09
cm_Pulmonary_circulation_disorders	1.01	0.85	1.37	1.16	1.07	1.35	1.25	0.96	1.75
cm_Chronic_pulmonary_disease	1.36	1.30	1.46	1.07	1.03	1.10	1.37	1.20	1.65
cm_Diabetes__uncomplicated	1.14	1.07	1.23	1.10	1.07	1.15	1.14	0.89	1.33
cm_Diabetes__complicated	1.26	1.12	1.47	0.94	0.86	1.03	1.04	0.85	1.40
cm_Hypothyroidism	1.33	1.22	1.44	1.12	1.07	1.19	1.08	0.84	1.33
cm_Renal_failure	1.40	1.29	1.51	1.07	1.02	1.11	1.25	0.94	1.50
cm_Liver_disease	1.41	1.32	1.50	1.04	1.00	1.07	1.40	1.16	1.64
cm_Peptic_ulcer_disease	1.32	1.00	1.79	0.88	0.61	1.06	1.58	1.15	2.22
cm_Lymphoma	1.68	1.43	1.88	0.99	0.91	1.07	1.53	1.16	2.05
cm_Metastatic_cancer	2.25	1.41	2.86	1.05	0.90	1.18	1.66	1.07	2.95
cm_Solid_tumor_without_metastasis	1.76	1.56	1.92	1.05	1.00	1.11	1.20	0.91	1.52
cm_Rheumatoid_arthritis	1.27	1.11	1.49	1.00	0.88	1.11	0.81	0.58	1.21
cm_Coagulopathy	1.13	1.01	1.35	0.97	0.88	1.04	1.50	1.19	1.84
cm_Coagulopathy_hemophilia	20.39	3.07	40.03	1.25	1.00	1.68	2.28	1.00	4.29
cm_Blood_loss_anemia	1.13	0.93	1.35	0.99	0.77	1.26	0.90	0.73	1.81
cm_Deficiency_anemia	1.66	1.53	1.81	1.08	1.01	1.12	1.24	1.00	1.44
cm_Obesity	1.12	1.03	1.26	0.98	0.92	1.07	1.56	1.11	2.01
cm_Weight_loss	1.37	1.25	1.50	1.05	1.00	1.12	1.72	1.40	1.95
cm_Fluid_and_electrolyte_disorders	1.17	1.07	1.27	0.94	0.89	1.00	2.31	2.01	2.67
age_std	1.04	0.98	1.10	1.04	1.01	1.07	0.87	0.71	1.00
age_std2	1.04	0.97	1.09	0.85	0.83	0.89	1.07	0.91	1.20

From left to right, these are the coefficients and 95% confidence limits for: outpatient costs, drug costs, and conditional inpatient costs. Everything has been already antilogged, so 1.27 means an effect of multiplying the median by 1.27. I excluded the intercept since it's of limited interest in a model that's effectively multiplicative.

I couldn't find any preexisting method to find CIs for lasso-regularized quantile regression, so I using bootstrapping. Some simulations convinced me that the resulting CIs indeed have coverage probabilities near 95%.

(rd 2 (np.exp (get pred-coefci "y_inpatient_prob")))

I	estimate	lo	hi
Intercept	0.06	0.04	0.09
has_medicaid	1.49	1.29	1.73
female	1.10	0.91	1.32
race_black	1.00	0.86	1.17
race_hispanic	0.96	0.83	1.12
race_other_nonwhite	0.83	0.61	1.11
urban	1.48	1.12	1.97
hiv_docvol_50plus	0.92	0.82	1.04
disabled	0.81	0.60	1.10
cm_Congestive_heart_failure	1.50	1.13	1.99
cm_Cardiac_arrhythmias	3.85	3.04	4.87
cm_Valvular_disease	1.88	1.28	2.75
cm_Peripheral_vascular_disorders	1.32	1.00	1.72
cm_Hypertension__uncomplicated	1.87	1.65	2.13
cm_Hypertension__complicated	2.81	2.08	3.80
cm_Paralysis	4.79	2.85	8.20
cm_Other_neurological_disorders	3.86	3.02	4.92
cm_Pulmonary_circulation_disorders	2.24	1.36	3.71
cm_Chronic_pulmonary_disease	2.65	2.27	3.09
cm_Diabetes__uncomplicated	1.02	0.85	1.22
cm_Diabetes__complicated	1.09	0.79	1.49
cm_Hypothyroidism	1.21	0.96	1.52
cm_Renal_failure	0.93	0.74	1.18
cm_Liver_disease	1.76	1.52	2.04
cm_Peptic_ulcer_disease	2.13	1.01	4.49
cm_Lymphoma	2.63	1.79	3.86
cm_Metastatic_cancer	1.77	0.82	3.86
cm_Solid_tumor_without_metastasis	2.01	1.60	2.51
cm_Rheumatoid_arthritis	1.29	0.89	1.86
cm_Coagulopathy	4.73	3.36	6.70
cm_Coagulopathy_hemophilia	0.84	0.29	2.22
cm_Blood_loss_anemia	3.66	1.74	7.95
cm_Deficiency_anemia	1.19	0.94	1.50
cm_Obesity	1.81	1.33	2.46
cm_Weight_loss	2.10	1.70	2.58
cm_Fluid_and_electrolyte_disorders	9.89	7.99	12.30
age_std	0.59	0.51	0.69
age_std2	1.03	0.90	1.19

Here are the coefficients and confidence intervals for the logistic-regression model for nonzero costs. Again, I've antilogged everything, so you're looking at odds ratios instead of log odds ratios.

(sns.set-style "white")

(for [spi (range 4)]
  (setv ax (plt.subplot 1 4 (inc spi)))
  (.xaxis.grid ax T)
  (setv dv (get (qw y_outpatient y_drugs y_inpatient_amount y_inpatient_prob) spi))
  (setv d (.drop (get pred-coefci dv)
    ["Intercept" "cm_Coagulopathy_hemophilia"]))
  (setv d (getl d (cut (sorted d.index :key (λ (.startswith it "cm_"))) None None -1)))
  (plt.axvline :x 0 :color "black" :zorder 1 :linewidth .5)
  (.hlines ax (range (len d)) ($ d lo) ($ d hi) :zorder 2)
  (.scatter ax ($ d estimate) (range (len d)) :zorder 3 :edgecolor "none")
  (plt.ylim [-.5 (+ (len d) -1 .5)])
  (plt.xlim [-1.02 3])
  (plt.xticks [-1 0 1 2])
  (.set-xlabel ax dv)
  (if spi
    (.yaxis.set_major_locator (plt.gca) (plt.NullLocator)))
    (plt.yticks (range (len d)) (if spi [] (list d.index))))

(sns.despine :left T :right T :top T :bottom T)

Here is a pictorial representation of the above two tables. The dots show the estimates and the horizontal line segments show the confidence intervals. This time, I haven't antilogged the coefficients, so the scale is visually symmetric for increases and decreases. (Antilogging numbers that range from -∞ to +∞ compresses the whole (-∞, 0) range to (0, 1) and further expands anything in (0, ∞).) And I excluded cm_Coagulopathy_hemophilia because it's so big it would require expanding the x-axis a lot just to accommodate it. Clearly, as in the tables, I'd need to clean up all the labels for publication.

Squared error

(display-pred-cv pred-cv-orx-rmse)

I	DV	IVs	twin	RMSE
0	y_outpatient	trivial	single	29,107
1	y_outpatient	trivial	twin	29,111
2	y_outpatient	demog	single	29,103
3	y_outpatient	demog	twin	29,111
4	y_outpatient	demog+cm	single	28,193
5	y_outpatient	demog+cm	twin	28,265
6	y_drugs	trivial	single	22,569
7	y_drugs	trivial	twin	22,533
8	y_drugs	demog	single	22,567
9	y_drugs	demog	twin	22,531
10	y_drugs	demog+cm	single	22,559
11	y_drugs	demog+cm	twin	22,526

Here I use lasso-regularized linear regression, and a multiplicative bias correction factor that I choose with another linear regression model in place of a smearing estimate or the normal bias correction factor or the like.

Here's what you'd get with OLS, without the lasso:

I	DV	IVs	twin	RMSE
0	y_outpatient	trivial	single	29,107
1	y_outpatient	trivial	twin	29,111
2	y_outpatient	demog	single	29,088
3	y_outpatient	demog	twin	29,104
4	y_outpatient	demog+cm	single	30,169
5	y_outpatient	demog+cm	twin	30,244
6	y_drugs	trivial	single	22,569
7	y_drugs	trivial	twin	22,533
8	y_drugs	demog	single	22,458
9	y_drugs	demog	twin	22,393
10	y_drugs	demog+cm	single	22,377
11	y_drugs	demog+cm	twin	22,381

Association of individual cost types

Here we have two tables of regression, one for each beneficiary status. Outpatient costs, inpatient costs (only among subjects with nonzero inpatient costs), drug costs, and subtotals (the sum of outpatient, inpatient, and drug costs) are fit with quantile regression: the central tendency is the median, the base error is the mean absolute deviation from the median (MAD) and the model error is the mean absolute error (MAE). The probability of nonzero inpatient costs is fit with logistic regression: the central tendency is the mean, the base error is the variance, and the model error is the mean squared error (i.e., the Brier score, which is a proper scoring rule; Brier, 1950; Bröcker, 2009).

For the quantile-regression models, the DVs are untransformed, so each coefficient can be interpreted directly as a linear increase of the conditional median, in dollars.

A few subjects were missing on the urban-rural variable. They were simply dropped.

Among Medicare beneficiaries, we only include subjects with Part D coverage for the full year.

Each row below "Intercept" shows the coefficient for the given variable. age_std is age standardized to have SD 1/2 (per Gelman, 2008), and age_std2 is its square, standardized again.

(display-assoc "mcare_only")

I	subtotal	outpatient	inpatient_isnonzero	inpatient_nonzero	drugs
(n)	1,551	1,551	1,551	183	1,551
(central tendency)	34,016	3,808	0.12	15,203	27,093
(base error)	18,482	6,324	0.10	18,425	11,932
(model error)	16,203	5,389	0.06	14,869	11,281
Intercept	19,166	894	−3.62	22,879	23,241
female	50	348	1.18	115	−4,339
age_std	1,991	336	−0.46	−373	−851
age_std2	−2,031	213	−0.44	2,509	−4,673
race_black	−5,277	−968	−0.33	−3,376	−4,969
race_hispanic	−3,487	−453	−0.04	2,385	−4,198
race_other_nonwhite	−2,003	−371	−0.59	−2,548	−608
urban	4,608	1,110	0.18	−15,012	1,729
hiv_docvol_50plus	2,246	173	0.02	−2,427	2,556
disabled	5,719	352	−0.28	4,695	937
cm_Congestive_heart_failure	3,176	857	−0.21	4,331	−539
cm_Cardiac_arrhythmias	7,122	2,890	1.47	6,490	1,553
cm_Valvular_disease	12,428	2,043	0.03	7,577	5,468
cm_Peripheral_vascular_disorders	1,664	1,431	0.19	10,999	247
cm_Hypertension__uncomplicated	2,092	521	0.78	975	1,159
cm_Hypertension__complicated	1,725	1,308	0.94	−3,613	−1,078
cm_Paralysis	33,092	5,713	3.11	4,841	1,152
cm_Other_neurological_disorders	5,711	2,401	1.14	2,748	1,810
cm_Pulmonary_circulation_disorders	18,202	5,635	2.57	−894	3,046
cm_Chronic_pulmonary_disease	5,519	1,567	0.45	10,382	1,836
cm_Diabetes__uncomplicated	3,488	490	0.30	−413	3,930
cm_Diabetes__complicated	8,099	2,404	−1.35	−496	−266
cm_Hypothyroidism	8,869	2,156	1.05	443	2,125
cm_Renal_failure	3,892	1,719	0.32	2,724	244
cm_Liver_disease	2,575	1,649	0.36	−2,368	−284
cm_Peptic_ulcer_disease	2,511	0	−12.88	0	2,871
cm_Lymphoma	12,123	6,075	0.45	23,220	−4,035
cm_Metastatic_cancer	24,862	10,077	0.61	0	−21
cm_Solid_tumor_without_metastasis	4,271	3,615	0.72	5,131	308
cm_Rheumatoid_arthritis	7,853	4,074	0.93	−1,214	641
cm_Coagulopathy	7,510	2,391	1.98	8,816	1,029
cm_Coagulopathy_hemophilia	58,655	97,335	0.40	15,886	7,497
cm_Blood_loss_anemia	13,421	3,237	1.19	14,520	4,753
cm_Deficiency_anemia	3,748	736	−0.23	29,003	−379
cm_Obesity	7	1,089	−0.16	23,943	−5,062
cm_Weight_loss	9,330	1,762	0.90	−384	3,017
cm_Fluid_and_electrolyte_disorders	13,241	3,773	3.02	1,062	−2,818

(display-bnc "mcare_only")

n_comorbs	subjects	median_subtotal
0	536	27,777
1	461	33,427
2	252	37,191
3	138	37,552
4	52	42,583
5	43	49,880
6	28	74,897
7	21	67,386
≥ 8	20	112,836

(display-assoc "dual")

I	subtotal	outpatient	inpatient_isnonzero	inpatient_nonzero	drugs
(n)	6,137	6,137	6,137	1,242	6,137
(central tendency)	35,630	4,674	0.20	18,692	27,347
(base error)	25,091	7,444	0.16	29,595	13,295
(model error)	20,224	5,901	0.10	25,463	12,920
Intercept	26,602	2,447	−2.95	4,919	23,926
female	−1,234	970	0.12	−763	−2,407
age_std	−1,060	−91	−0.47	−3,166	538
age_std2	−4,113	−63	0.17	12	−2,803
race_black	−3,072	−404	0.04	−877	−3,308
race_hispanic	−1,153	−459	−0.04	1,078	−1,055
race_other_nonwhite	354	−96	0.01	−315	670
urban	2,962	−292	0.10	3,097	1,116
hiv_docvol_50plus	211	174	−0.13	10	1,294
disabled	−971	224	−0.00	−1,842	810
cm_Congestive_heart_failure	11,874	3,107	0.39	4,336	−1,170
cm_Cardiac_arrhythmias	10,714	3,127	1.14	6,130	195
cm_Valvular_disease	15,066	5,068	0.65	14,762	406
cm_Peripheral_vascular_disorders	11,217	3,283	0.14	8,595	2,559
cm_Hypertension__uncomplicated	2,944	878	0.70	2,925	1,202
cm_Hypertension__complicated	14,482	5,381	0.97	3,746	−163
cm_Paralysis	21,317	5,188	1.06	6,961	2,361
cm_Other_neurological_disorders	16,184	3,367	1.51	8,319	1,875
cm_Pulmonary_circulation_disorders	18,709	3,687	−0.00	10,189	5,691
cm_Chronic_pulmonary_disease	7,827	2,356	1.14	2,334	1,800
cm_Diabetes__uncomplicated	2,705	435	0.14	−1,572	2,745
cm_Diabetes__complicated	8,199	2,768	0.16	4,942	479
cm_Hypothyroidism	4,939	1,488	0.28	−3,439	3,123
cm_Renal_failure	6,878	1,849	−0.12	4,742	1,254
cm_Liver_disease	5,160	1,785	0.62	3,092	1,041
cm_Peptic_ulcer_disease	19,533	5,758	1.13	54,846	−3,050
cm_Lymphoma	9,799	2,303	0.88	3,007	−1,233
cm_Metastatic_cancer	30,821	14,990	1.41	2,750	1,361
cm_Solid_tumor_without_metastasis	9,499	4,782	0.57	5,440	1,195
cm_Rheumatoid_arthritis	3,466	1,216	−0.15	−221	1,750
cm_Coagulopathy	20,571	2,699	1.35	13,974	−2,334
cm_Coagulopathy_hemophilia	170,713	79,780	−0.23	16,538	12,508
cm_Blood_loss_anemia	8,795	4,863	1.26	12,058	−1,788
cm_Deficiency_anemia	11,435	4,041	0.07	4,719	2,502
cm_Obesity	3,297	156	0.78	4,450	434
cm_Weight_loss	17,755	2,807	0.76	12,035	3,392
cm_Fluid_and_electrolyte_disorders	18,912	3,560	2.39	4,532	−1,214

(display-bnc "dual")

n_comorbs	subjects	median_subtotal
0	2,208	28,840
1	1,539	34,133
2	950	39,367
3	531	47,967
4	305	56,389
5	205	62,527
6	131	88,344
7	93	100,588
8	58	116,936
9	35	115,445
10	29	164,307
11	17	223,938
12	19	190,814
≥ 13	17	274,452

Methods paper

If we chose models on the basis of fit rather than predictive accuracy, then we know a priori that the most complex models would win. This means that for outpatient and drug costs, twin models with all IVs would win (which is similar to the real winning models, non-twin with all IVs), whereas for inpatient costs, all IVs would win (which is the same result we got with predictive accuracy).

We might also compare model selection based on predictive accuracy to model selection based on p-values. In particular, we can use the simple-minded method of trying all the predictors and keeping the significant ones. But how shall I test the significance of the coefficients? quantreg doesn't supply significance tests for models fit with a lasso penalty. Let's use the method of Redden, Fernández, and Allison (2004).

Scott: "Thinking of discussion points: Predictive analytics have not been big in HIV research in part because of smaller data sets we typically collect. However, large data sets, such as Medicaid / Medicare data, are becoming more of the norm than the exception in HIV research. Think of ATN and other large scale studies. NIH push to make data sets publicly available. Virtual cohorts with medical chart data. Basically, predictive analytic tools are needed."

Rob suggested trying a Box-Cox, but this would probably be a hassle to use with quantile regression or lasso linear regression, because it means there's another kind of parameter to fit. Besides, Box-Cox sort of ruins the interpretability of the coefficients, so it's not a good choice when coefficient interpretation is a goal.

I found that I wasn't able to get the sort of improvements on square error that I was able to get on absolute error. My impression is that this is basically a consequence of the high skew of the DVs: a few high values affect square error more than absolute error, so models that optimize mean square error will make a stupider sort of compromise than models that optimize absolute error. This could be discussed in the method paper with a simple, mathematically clean example, and with the RMSEs for outpatient costs.

Outline

Describe prediction and contrast it with association (as in Arfer & Luhmann, 2017)
Describe the method I used in the first paper on Comorbid
Contrast the models thus selected with those that would be selected if we just fit models with all the IVs and then dropped IVs with non-significant coefficients
- We can contrast the models in terms of predictive accuracy as well as just the coefficients
Contrast the fits of the models with their predictive accuracies
Discussion: predictive methods are more useful these days now that large datasets are more widely available (e.g., the ATN, virtual cohorts with medical charts, increasing interest in making datasets publicly accessible)

References

Arfer, K. B., & Luhmann, C. C. (2017). Time-preference tests fail to predict behavior related to self-control. Frontiers in Psychology, 8(150). doi:10.3389/fpsyg.2017.00150

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643), 1512–1519. doi:10.1002/qj.456

Buntin, M. B., & Zaslavsky, A. M. (2004). Too much ado about two-part models and transformation? Journal of Health Economics, 23(3), 525–542. doi:10.1016/j.jhealeco.2003.10.005

Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine, 27(15), 2865–2873. doi:10.1002/sim.3107

Newman, M. C. (1993). Regression analysis of log-transformed data: Statistical bias and its correction. Environmental Toxicology and Chemistry, 12(6), 1129–1133. doi:10.1002/etc.5620120618

Redden, D. T., Fernández, J. R., & Allison, D. B. (2004). A simple significance test for quantile regression. Statistics in Medicine, 23(16), 2587–2597. doi:10.1002/sim.1839