Real World Health Care Data Analysis - Uwe Siebert

Contents

Chapter 3: Data Examples and Simulations

3.1 Introduction

3.2 The REFLECTIONS Study

3.3 The Lindner Study

3.4 Simulations

3.5 Analysis Data Set Examples

3.5.1 Simulated REFLECTIONS Data

3.5.2 Simulated PCI Data

3.6 Summary

References

3.1 Introduction

In this chapter, we present the core data sets that are used as examples throughout the book and demonstrate how to simulate data that mimic an existing data set. Simulations are a common tool for examining and comparing the operating characteristics of different statistical methods. One must know the true value of the parameter of interest when assessing how well a particular method performs. In simulations, as opposed to a case study from actual data, the true parameter values are known, and one can test the performance of methods across various data scenarios specified by the researcher. However, real world data are complex, with intricate distributions, correlations among the many variables, missing data patterns, and so on. Published simulations are often performed with a limited number of variables, using known parametric functions to generate values and assuming simple or no correlations between covariates and simple or no missing data patterns. Thus, simulations based on actual data that retain the complex correlations and missing data patterns, often called “plasmode simulations” (Gadbury et al. 2008, Franklin et al. 2014), can provide a superior test of how methods perform under real world data settings.

This chapter is structured as follows. Sections 3.2 and 3.3 present background information about two observational studies (REFLECTIONS and Lindner) that serve as the basis of analyses throughout the book. Section 3.4 discusses options for simulating real world data from an existing study data set. Section 3.5 presents the SAS code and the analysis data sets generated for use in the later chapters.

3.2 The REFLECTIONS Study

The Real World Examination of Fibromyalgia: Longitudinal Evaluation of Cost and Treatments (REFLECTIONS) study was a prospective observational study conducted between 2008 and 2011 at 58 clinical sites in the United States and Puerto Rico (Robinson et al. 2012). The primary objective of the study was to examine the burden of illness, treatment patterns, and outcomes for patients initiating new treatments for fibromyalgia. Data was collected via physician surveys, a clinical report form completed at the baseline office visit, and computer-assisted telephone patient interviews at five time points over the one-year study. The physician surveys collected information about the clinical site and lead physician, including physician demographics and practice characteristics. At the baseline visit, data from a thorough clinical summary of the patient was captured. This included demographics, medical history, socio-economic and work/disability status, and treatment. Phone surveys at baseline and throughout the study included information from the patient regarding changes in treatments and disease severity using multiple validated patient rating scales.

The study enrolled a total of 1700 patients, of whom 1575 met the criteria for the analysis data set. A summary of the demographics and baseline patient characteristics is provided in Section 3.5. One analysis conducted from the REFLECTIONS data was an examination of outcomes for patients initiating opioid treatments. Peng et al. (2015) used propensity score matching to compare Brief Pain Inventory (BPI) scores and other outcomes over the one-year follow-up period for patients initiating opioids versus those initiating other treatments for fibromyalgia. We use this example to demonstrate the creation of two simulated data sets based on the REFLECTIONS data: a one-observation-per-patient data set used to demonstrate various propensity score-based analyses in Chapters 4–10 and a longitudinal analysis data set used to demonstrate marginal structural model and replicates analysis methods in Chapters 11 and 12.

3.3 The Lindner Study

The Lindner study was also a prospective observational study (Kereiakes et al. 2000). It was conducted in 1997 at a single site, the Lindner Center for Research and Education, Christ Hospital, Cincinnati, Ohio. Lindner staff members used their research database system to store detailed patient data, including patient feedback and survival information from at least six consecutive months of telephone follow-up.

Lindner doctors were high-volume practitioners of interventional cardiology involving percutaneous coronary intervention (PCI). Specifically, all Lindner operators performed >200 PCIs/year, and their average was 280 PCIs per operator in 1997. The only viable alternative to some PCI procedures is open-heart surgery, such as a coronary artery bypass graft (CABG).

Follow-up analyses of 1472 consecutive PCIs performed at the Lindner Center in 1997 found that their research database contained the “initial” PCIs for 996 distinct patients. Of these patients, 698 (roughly 70% of the 996) had received usual PCI care augmented with planned or rescue use of a new “blood thinner” treatment and are considered the treated group in later analyses. The remaining 298 patients (roughly 30% of the 996) did not receive the blood thinner during their initial PCI at Lindner in 1997; these 298 patients constitute the “usual PCI care alone” treatment cohort (control group). Details of the variables included in the data set are provided in Section 3.5.2. The simulated PCI15K data set is used in the example analyses of Chapter 7 (stratification), Chapter 14 (generalizability), and Chapter 15 (personalized medicine).

3.4 Simulations

The term “plasmode” has come to represent data that is based on real data (Gadbury et al. 2008). In our case, we wanted a data set that contained no actual patient data, so that we could share it freely and readers could implement the various approaches in this book without confidentiality or ownership issues. However, we also wanted data that was truly representative of real world health care research, maintaining the complex correlation structures and addressing common research interests. Thus, “plasmode” simulations based on the REFLECTIONS and Lindner studies were used to generate the data sets used in the remainder of this book. In particular, the method of rank transformations of Conover and Iman (1976) as implemented by Wicklin (2013) serves as the basis for the programs.

3.5 Analysis Data Set Examples

3.5.1 Simulated REFLECTIONS Data

The Peng et al. (2015) analysis from the REFLECTIONS study included 1575 patients in 3 treatment groups based on their treatment at initiation: opioid treatments (378), non-narcotic opioid-like treatments (215), and all other treatments (982). Each patient had up to 5 visits including baseline. Tables 3.1 and 3.2 list the key variables in the original analysis data set from which the simulated data was formed.

Table 3.1: List of Patient-wise Variables

Variable Name	Variable Label
SubjID	Subject Number
Cohort	Cohort
Gender	Gender
Age	Age in years
BMI_B	BMI at Baseline
Race	Race
Insurance	Insurance
DrSpecialty	Doctor Specialty
Exercise	Exercise
InptHosp	Inpatient hospitalization in last 12 months
MissWorkOth	Other missed paid work to help your care in last 12 months
UnPdCaregiver	Have you used an unpaid caregiver in last 12 months
PdCaregiver	Have you hired a caregiver in last 12 months
Disability	Have you received disability income in last 12 months
SymDur	Duration (in years) of symptoms
DxDur	Time (in years) since initial Dx
TrtDur	Time (in years) since initial Trtmnt
PhysicalSymp_B	PHQ 15 total score at Baseline
FIQ_B	FIQ Total Score at Baseline
GAD7_B	GAD7 total score at Baseline
MFIpf_B	MFI Physical Fatigue at Baseline
MFImf_B	MFI Mental Fatigue at Baseline
CPFQ_B	CPFQ Total Score at Baseline
ISIX_B	ISIX total score at Baseline
SDS_B	SDS total score at Baseline

Table 3.2: List of Visit-wise Variables

Variable Name	Variable Label
Visit	Visit
OPIyn	Opioids use continued/started at this visit
SatisfCare	Satisfaction with Overall Fibro Treatment
SatisfMed	Satisfaction with Prescribed Medication
PHQ8	PHQ8 total score
BPIPain	BPI Pain score
BPIInterf	BPI Interference score

For the REFLECTIONS simulated data set, simulation was performed separately for each treatment cohort. First, the original data set was transformed from a vertical format (one observation per patient per time point) into a horizontal format (one record per patient). Next, a cohort-specific data set was created by random sampling (with replacement) from each original variable. The sample sizes were 240, 140, and 620 for the opioid, non-narcotic opioid, and other treatment cohorts, respectively. The SAS/IML programming language was used to implement the Iman-Conover method following the code of Wicklin (2013), as shown in Program 3.1, using the sampled data (A) and the desired between-variable rank correlations (C).

Program 3.1: Iman-Conover Method to Create a Simulated REFLECTIONS Data Set

/* Use Iman-Conover method to generate MV data with known marginals
   and known rank correlation.
   Run inside a PROC IML session in which A (the sampled data) and
   C (the target rank-correlation matrix) are already defined. */
start ImanConoverTransform(Y, C);
   X = Y;
   N = nrow(X);
   R = J(N, ncol(X));
   /* compute normal scores of each column */
   do i = 1 to ncol(X);
      h = quantile("Normal", rank(X[,i])/(N+1));
      R[,i] = h;
   end;
   /* these matrices are transposes of those in Iman & Conover */
   Q = root(corr(R));
   P = root(C);
   S = solve(Q,P);
   M = R*S;   /* M has rank correlation close to target C */
   /* reorder columns of X to have same ranks as M.
      In Iman-Conover (1982), the matrix is called R_B. */
   do i = 1 to ncol(M);
      rank = rank(M[,i]);
      tmp = X[,i];
      call sort(tmp);
      X[,i] = tmp[rank];
   end;
   return( X );
finish;

X = ImanConoverTransform(A, C);
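The two inputs to Program 3.1 could be prepared along the following lines. This is a minimal sketch rather than the authors' code: the toy matrix orig, the seed, and the sizes nOrig and nSim are hypothetical stand-ins for one cohort of the wide-format data, and it assumes that each variable is resampled independently, as the phrase "from each original variable" suggests.

```sas
proc iml;
call randseed(20240101);               /* arbitrary seed, for reproducibility */

/* Hypothetical stand-in for one cohort of the original wide-format data:
   200 patients, 3 correlated continuous variables (not REFLECTIONS data) */
nOrig = 200;
z = j(nOrig, 1, .);
e = j(nOrig, 3, .);
call randgen(z, "Normal");             /* shared component induces correlation */
call randgen(e, "Normal");
orig = (z + e[,1]) || (2*z + e[,2]) || (e[,3] - z);

/* C: target rank correlations, estimated from the original data */
C = corr(orig, "Spearman");

/* A: each variable sampled independently, with replacement */
nSim = 120;                            /* hypothetical cohort sample size */
A = j(nSim, ncol(orig), .);
do k = 1 to ncol(orig);
   A[,k] = colvec(sample(orig[,k], nSim, "Replace"));
end;

/* Independent sampling breaks the link between columns, so the rank
   correlations of A are near zero until the columns are reordered */
corrA = corr(A, "Spearman");
print C[format=6.3 label="Target rank correlations (C)"];
print corrA[format=6.3 label="Rank correlations of A before reordering"];
quit;
```

Comparing the two printed matrices shows why the reordering step is needed: sampling reproduces the marginal distributions but not the correlation structure, which the Iman-Conover transform then restores approximately.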

The three cohort-specific simulated matrices (X) were concatenated, and dropout and missing data were then imposed at random to reflect the amount of dropout and missingness observed in the actual REFLECTIONS data. Finally, the structure of the simulated data was converted from horizontal back to vertical.
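The dropout and restructuring steps might look like the following DATA step sketch. It is an illustration under stated assumptions, not the program used for the book: the data set names wide and long, the toy pain scores, and both probabilities (0.10 for dropout before a visit, 0.05 for an intermittently missing value) are hypothetical, whereas the actual rates were chosen to match the REFLECTIONS data.

```sas
/* Hypothetical wide-format input: one row per patient, five visit-wise
   pain scores (random toy values, not REFLECTIONS data) */
data wide;
   call streaminit(12345);                 /* arbitrary seed */
   array p[5] BPIPain1-BPIPain5;
   do SubjID = 1 to 10;
      do v = 1 to 5;
         p[v] = round(10 * rand("Uniform"), 0.1);
      end;
      output;
   end;
   drop v;
run;

/* Impose dropout and intermittent missingness, then return to the
   vertical format. Both rates are assumed values for illustration. */
data long;
   set wide;
   array p[5] BPIPain1-BPIPain5;
   if _n_ = 1 then call streaminit(2024);  /* arbitrary seed */
   lastVisit = 5;
   do v = 2 to 5;                          /* baseline is always observed */
      if rand("Bernoulli", 0.10) then do;  /* patient leaves before visit v */
         lastVisit = v - 1;
         leave;
      end;
   end;
   do Visit = 1 to lastVisit;              /* one output row per attended visit */
      BPIPain = p[Visit];
      if Visit > 1 and rand("Bernoulli", 0.05) then BPIPain = .;
      output;
   end;
   keep SubjID Visit BPIPain;
run;

proc print data=long noobs;
run;
```

Dropout here is monotone (a patient who leaves contributes no later rows), which produces a falling number of patients per visit, while the second random draw produces missing values among the visits that were attended.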

The distributions of the variables were almost identical for the real and simulated data, as shown by the descriptive statistics in Tables 3.3 and 3.4. This is expected because the Iman-Conover algorithm simply rearranges the elements within each column of the data matrix; the small differences that remain reflect the random sampling step and the smaller simulated sample.
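The rearrangement property can be checked on a single column with a few lines of SAS/IML. The sketch below isolates the reordering step of Program 3.1; the values in x and the random scores in m are hypothetical, with m standing in for one column of the matrix M.

```sas
proc iml;
call randseed(1);                          /* arbitrary seed */
x = {4.2, 7.5, 1.0, 3.3, 9.8, 5.1};        /* hypothetical column of sampled data */
m = j(nrow(x), 1, .);
call randgen(m, "Normal");                 /* stand-in for a column of M */

tmp = x;
call sort(tmp);                            /* values of x in ascending order */
xNew = tmp[rank(m)];                       /* same ranks as m, same values as x */

sortedNew = xNew;
call sort(sortedNew);
sameValues = all(sortedNew = tmp);         /* 1 if xNew is a permutation of x */

print x xNew;
print (mean(x))[label="Mean before"] (mean(xNew))[label="Mean after"];
print sameValues[label="Same set of values (1 = yes)"];
quit;
```

Because xNew contains exactly the values of x in a different order, any marginal summary (mean, standard deviation, percentages) is unchanged by the reordering.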

Table 3.3: Comparison of Actual and Simulated REFLECTIONS Data for One Observation per Patient Variables

Variable	Level	Statistic	Real	Simulated
All		N	1575	1000
Cohort	NN opioid	ColPctN	13.65	14.00
	opioid	ColPctN	24.00	24.00
	other	ColPctN	62.35	62.00
Gender	female	ColPctN	94.54	93.20
	male	ColPctN	5.46	6.80
Race	Caucasian	ColPctN	83.62	82.30
	Other	ColPctN	16.38	17.70
Insurance	private/combination	ColPctN	78.10	75.70
	public/no insurance	ColPctN	21.90	24.30
Doctor Specialty	Other Specialty	ColPctN	17.65	17.60
	Primary Care	ColPctN	15.87	15.70
	Rheumatology	ColPctN	66.48	66.70
Exercise	No	ColPctN	10.03	11.00
	Yes	ColPctN	89.97	89.00
Inpatient hospitalization in last 12 months	No	ColPctN	89.84	90.70
	Yes	ColPctN	10.16	9.30
Other missed paid work to help your care in last 12 months	No	ColPctN	77.71	79.60
	Yes	ColPctN	22.29	20.40
Have you used an unpaid caregiver in last 12 months	No	ColPctN	62.86	60.50
	Yes	ColPctN	37.14	39.50
Have you hired a caregiver in last 12 months	No	ColPctN	95.56	95.70
	Yes	ColPctN	4.44	4.30
Have you received disability income in last 12 months	No	ColPctN	70.86	72.30
	Yes	ColPctN	29.14	27.70
Age in years		NMiss	0	0
		Mean	50.45	50.12
		Std	11.71	11.56
BMI at Baseline		NMiss	0	0
		Mean	31.30	31.36
		Std	7.34	7.01
Duration (in years) of symptoms		NMiss	216	133
		Mean	10.28	10.03
		Std	9.26	9.02
Time (in years) since initial Dx		NMiss	216	133
		Mean	5.73	5.29
		Std	6.27	6.05
Time (in years) since initial Trtmnt		NMiss	216	133
		Mean	5.22	5.26
		Std	6.02	6.18
PHQ 15 total score at Baseline		NMiss	0	0
		Mean	13.81	14.03
		Std	4.64	4.79
FIQ Total Score at Baseline		NMiss	0	0
		Mean	54.54	54.56
		Std	13.43	13.47
GAD7 total score at Baseline		NMiss	0	0
		Mean	10.81	10.64
		Std	5.77	5.67
MFI Physical Fatigue at Baseline		NMiss	0	0
		Mean	13.09	13.00
		Std	2.28	2.17
MFI Mental Fatigue at Baseline		NMiss	0	0
		Mean	11.51	11.52
		Std	2.38	2.49
CPFQ Total Score at Baseline		NMiss	0	0
		Mean	26.51	26.62
		Std	6.44	6.43
ISIX total score at Baseline		NMiss	0	0
		Mean	17.64	17.91
		Std	5.97	5.74
SDS total score at Baseline		NMiss	0	0
		Mean	18.27	18.28
		Std	7.50	7.56

Table 3.4: Comparison of Actual and Simulated REFLECTIONS Data for Visit-wise Variables

Variable	Level	Statistic	Real	Simulated
Visit 1		N	1575	1000
Opioids use	No	ColPctN	76.00	76.00
	Yes	ColPctN	24.00	24.00
Satisfaction with Overall Fibro Treatment	.	ColPctN	5.33	6.10
	1	ColPctN	12.13	12.10
	2	ColPctN	20.95	19.70
	3	ColPctN	25.27	24.20
	4	ColPctN	22.86	24.30
	5	ColPctN	13.46	13.60
Satisfaction with Prescribed Medication	.	ColPctN	10.03	9.80
	1	ColPctN	7.43	6.80
	2	ColPctN	15.81	15.60
	3	ColPctN	31.68	31.90
	4	ColPctN	23.75	24.30
	5	ColPctN	11.30	11.60
PHQ8 total score		NMiss	0	0
		Mean	13.07	13.14
		Std	6.04	6.02
BPI Pain score		NMiss	0	0
		Mean	5.51	5.54
		Std	1.74	1.76
BPI Interference score		NMiss	0	0
		Mean	6.08	6.00
		Std	2.17	2.15

Variable	Level	Statistic	Real	Simulated
Visit 2		N	1575	1000
Opioids use	(blank)	ColPctN	3.11	2.70
	No	ColPctN	71.05	70.10
	Yes	ColPctN	25.84	27.20
Satisfaction with Overall Fibro Treatment	.	ColPctN	5.65	4.80
	1	ColPctN	16.13	16.60
	2	ColPctN	25.33	26.50
	3	ColPctN	27.30	28.10
	4	ColPctN	18.48	17.00
	5	ColPctN	7.11	7.00
Satisfaction with Prescribed Medication	.	ColPctN	6.29	6.10
	1	ColPctN	11.37	10.50
	2	ColPctN	24.38	24.00
	3	ColPctN	30.48	31.90
	4	ColPctN	19.56	20.50
	5	ColPctN	7.94	7.00
PHQ8 total score		NMiss	50	22
		Mean	11.88	11.86
		Std	5.92	5.75
BPI Pain score		NMiss	62	47
		Mean	5.33	5.34
		Std	1.92	1.94
BPI Interference score		NMiss	49	36
		Mean	5.54	5.50
		Std	2.36	2.40

Variable	Level	Statistic	Real	Simulated
Visit 3		N	1483	950
Opioids use	(blank)	ColPctN	4.99	5.05
	No	ColPctN	68.37	65.37
	Yes	ColPctN	26.64	29.58
Satisfaction with Overall Fibro Treatment	.	ColPctN	8.50	6.63
	1	ColPctN	16.66	16.74
	2	ColPctN	25.62	25.47
	3	ColPctN	26.50	26.84
	4	ColPctN	16.45	16.84
	5	ColPctN	6.27	7.47
Satisfaction with Prescribed Medication	.	ColPctN	8.02	9.47
	1	ColPctN	12.74	13.47
	2	ColPctN	23.40	21.58
	3	ColPctN	31.63	31.89
	4	ColPctN	17.87	16.32
	5	ColPctN	6.34	7.26
PHQ8 total score		NMiss	74	44
		Mean	12.18	12.31
		Std	6.22	6.30
BPI Pain score		NMiss	95	52
		Mean	5.23	5.13
		Std	1.97	1.98
BPI Interference score		NMiss	74	51
		Mean	5.47	5.64
		Std	2.43	2.36

Variable	Level	Statistic	Real	Simulated
Visit 4		N	1378	888
Opioids use	(blank)	ColPctN	3.85	4.62
	No	ColPctN	67.85	66.10
	Yes	ColPctN	28.30	29.28
Satisfaction with Overall Fibro Treatment	.	ColPctN	8.13	9.91
	1	ColPctN	18.87	16.55
	2	ColPctN	25.47	25.23
	3	ColPctN	27.07	28.38
	4	ColPctN	15.46	15.20
	5	ColPctN	5.01	4.73
Satisfaction with Prescribed Medication	.	ColPctN	7.84	6.98
	1	ColPctN	13.13	14.41
	2	ColPctN	26.85	25.34
	3	ColPctN	31.20	29.95
	4	ColPctN	15.89	17.23
	5	ColPctN	5.08	6.08
PHQ8 total score		NMiss	56	34
		Mean	11.48	11.65
		Std	6.06	6.12
BPI Pain score		NMiss	72	48
		Mean	5.20	5.15
		Std	2.00	2.05
BPI Interference score		NMiss	53	40
		Mean	5.39	5.59
		Std	2.47	2.47

Variable	Level	Statistic	Real	Simulated
Visit 5		N	1189	773
Opioids use	(blank)	ColPctN	0.25	0.13
	No	ColPctN	68.21	67.53
	Yes	ColPctN	31.54	32.34
Satisfaction with Overall Fibro Treatment	.	ColPctN	3.03	3.36
	1	ColPctN	16.82	14.62
	2	ColPctN	27.75	27.30
	3	ColPctN	28.85	30.53
	4	ColPctN	16.06	16.04
	5	ColPctN	7.49	8.15
Satisfaction with Prescribed Medication	.	ColPctN	4.79	4.79
	1	ColPctN	13.46	12.42
	2	ColPctN	27.33	25.49
	3	ColPctN	33.56	35.58
	4	ColPctN	14.89	15.14
	5	ColPctN	5.97	6.60
PHQ8 total score		NMiss	0	0
		Mean	11.91	11.70
		Std	6.26	6.27
BPI Pain score		NMiss	18	11
		Mean	5.16	5.10
		Std	2.06	2.08
BPI Interference score		NMiss	1	0
		Mean	5.31	5.34
		Std	2.47	2.53

Figure 3.1 presents the full distribution of a continuous variable (BPI Pain score) for the real and simulated data by visit.

Figure 3.1: Histograms of BPI Pain Scores by Visit for Actual and Simulated REFLECTIONS Data

Figures 3.2 and 3.3 present the rank-correlation matrices for the actual and simulated data sets. The correlation patterns are well preserved in the simulated data, though the associations are slightly weaker. This is consistent with the Iman-Conover method, which approximates rather than exactly reproduces the desired rank correlations.

Figure 3.2: Rank-correlation Matrix for Actual REFLECTIONS Data

Figure 3.3: Rank-correlation Matrix for Simulated REFLECTIONS Data

In addition to the visit-wise simulated REFLECTIONS data described previously (used for Chapters 11 and 12), we created a one observation per patient version of the data set with variables as shown in Table 3.5. This is referred to as the REFL data set and is used in Chapters 4–6 and 8–10.

Table 3.5: REFL Data Set Variables

Variable Name	Variable Label
SubjID	Subject Number
Cohort	Cohort
Gender	Gender
Age	Age in years
BMI_B	BMI at Baseline
Race	Race
Insurance	Insurance
DrSpecialty	Doctor Specialty
Exercise	Exercise
InptHosp	Inpatient hospitalization in last 12 months
MissWorkOth	Other missed paid work to help your care in last 12 months
UnPdCaregiver	Have you used an unpaid caregiver in last 12 months
PdCaregiver	Have you hired a caregiver in last 12 months
Disability	Have you received disability income in last 12 months
SymDur	Duration (in years) of symptoms
DxDur	Time (in years) since initial Dx
TrtDur	Time (in years) since initial Trtmnt
SatisfCare_B	Satisfaction with Overall Fibro Treatment over past month
BPIPain_B	BPI Pain score at Baseline
BPIInterf_B	BPI Interference score at Baseline
PHQ8_B	PHQ8 total score at Baseline
PhysicalSymp_B	PHQ 15 total score at Baseline
FIQ_B	FIQ Total Score at Baseline
GAD7_B	GAD7 total score at Baseline
MFIpf_B	MFI Physical Fatigue at Baseline
MFImf_B	MFI Mental Fatigue at Baseline
CPFQ_B	CPFQ Total Score at Baseline
ISIX_B	ISIX total score at Baseline
SDS_B	SDS total score at Baseline
BPIPain_LOCF	BPI Pain score LOCF
BPIInterf_LOCF	BPI Interference score LOCF

3.5.2 Simulated PCI Data

The objective in simulating a new PCI data set from the observational data was primarily to produce a larger data set allowing us to more effectively illustrate the unsupervised, nonparametric Local Control alternative to conventional propensity score stratification (Chapter 7) and machine learning methods (Chapter 15). Starting from the observational data on 996 patients who received their initial PCI at Ohio Heart Health, Lindner Center, Christ Hospital, Cincinnati (Kereiakes et al. 2000), we generated this much larger data set via plasmode simulation. The simulated data set contains 11 variables on 15,487 patients with no missing values and is referred to as the PCI15K simulated data set. The key variables in the data set are described in Table 3.6. The treatment cohort for later analyses is represented by the variable THIN and the outcomes by SURV6MO (binary) and CARDCOST (continuous). Because the process for generating simulated data was described in detail for the REFLECTIONS example, only a brief summary and a listing of the final simulated data set variables are provided for the PCI15K data set.

Table 3.6: PCI Simulated Data Set Variables

Variable Name	Variable Label
patid	Patient ID number: 1 to 15487
surv6mo	Binary PCI Survival variable: 1 => survival for at least six months following PCI, 0 => survival for less than six months
cardcost	Cardiac related costs incurred within six months of patient’s initial PCI; numerical values in 1998 dollars; costs were truncated by death for the 404 patients with surv6mo = 0
thin	Numeric treatment selection indicator: thin = 0 implies usual PCI care alone; thin = 1 implies usual PCI care augmented by either planned or rescue treatment with the new blood thinning agent
stent	Coronary stent deployment; numeric, with 1 meaning YES and 0 meaning NO
height	Height in centimeters; numeric integer from 133 to 198
female	Female gender; numeric, with 1 meaning YES and 0 meaning NO
diabetic	Diabetes mellitus diagnosis; numeric, with 1 meaning YES and 0 meaning NO
acutemi	Acute myocardial infarction within the previous 7 days; numeric, with 1 meaning YES and 0 meaning NO
ejfract	Left ejection fraction; numeric value from 17 percent to 77 percent
ves1proc	Number of vessels involved in the patient’s initial PCI procedure; numeric integer from 0 to 5

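Tables 3.7 and 3.8 summarize the outcome data from the original Lindner data and the simulated PCI data. The results are similar, with narrower group differences in the simulated data: the difference in six-month survival is 2.52 percentage points (versus 3.45 in the original data), and the difference in average cardiac-related cost is $300 (versus $1,513). In Chapters 7, 14, and 15, the PCI simulated data set is used for analysis and is named PCI15K.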

Table 3.7: Lindner Study (Kereiakes et al. 2000)

Cohort	Patients	Number Surviving Six Months	Percent Surviving Six Months	Average Cardiac Related Cost
Trtm = 0	298	283	94.97%	$14,614
Trtm = 1	698	687	98.42%	$16,127

Table 3.8: PCI Blood Thinner Simulation

Cohort	Patients	Number Surviving Six Months	Percent Surviving Six Months	Average Cardiac Related Cost
Thin = 0	8476	8158	96.25%	$15,343
Thin = 1	7011	6925	98.77%	$15,643

3.6 Summary

In this chapter, two observational studies were introduced: the REFLECTIONS one-year study of patients with fibromyalgia and the Lindner study of patients undergoing PCI. The concept of plasmode simulations, where one builds a simulated data set that retains the same variables and correlation structure as the original data, was introduced and applied to the REFLECTIONS and Lindner data sets. SAS/IML code for the application to the REFLECTIONS data was provided, and the resulting simulated data were shown to closely match the distributions and correlation patterns of the original data. These two data sets (simulated REFLECTIONS and PCI15K) are used throughout the remainder of the book to demonstrate the various methods for real world data analysis presented in each chapter.

References

Austin P (2008). Goodness-of-fit Diagnostics for the Propensity Score Model When Estimating Treatment Effects Using Covariate Adjustment With the Propensity Score. Pharmacoepi & Drug Safety 17: 1202-1217.

Conover WJ and Iman RL (1976). Rank Transformations in Discriminant Analysis.

Franklin JM, Schneeweiss S, Polinski JM, Rassen J (2014). Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases. Comput Stat Data Anal 72:219-226.

Gadbury GL, Xiang Q, Yang L, Barnes S, Page GP, Allison DB (2008). Evaluating Statistical Methods Using Plasmode Data Sets in the Age of Massive Public Databases: An Illustration Using False Discovery Rates. PLoS Genet 4(6): e1000098.

Kereiakes DJ, Obenchain RL, Barber BL, Smith A, McDonald M, Broderick TM, Runyon JP, Shimshak TM, Schneider JF, Hattemer CH, Roth EM, Whang DD, Cocks DL, Abbottsmith CW (2000). Abciximab provides cost effective survival advantage in high volume interventional practice. American Heart J 140: 603-610.

Peng X, Robinson RL, Mease P, Kroenke K, Williams DA, Chen Y, Faries D, Wohlreich M, McCarberg B, Hann D (2015). Long-Term Evaluation of Opioid Treatment in Fibromyalgia. Clin J Pain 31: 7-13.

Robinson RL, Kroenke K, Mease P, Williams DA, Chen Y, D’Souza D, Wohlreich M, McCarberg B (2012). Burden of Illness and Treatment Patterns for Patients with Fibromyalgia. Pain Medicine 13:1366-1376.

Wicklin R (2013). Simulating Data with SAS®. Cary, NC: SAS Institute Inc.
