Real World Health Care Data Analysis - Uwe Siebert

Contents

Chapter 5: Before You Analyze – Feasibility Assessment

5.1 Introduction

5.2 Best Practices for Assessing Feasibility: Common Support

5.2.1 Walker’s Preference Score and Clinical Equipoise

5.2.2 Standardized Differences in Means and Variance Ratios

5.2.3 Tipton’s Index

5.2.4 Proportion of Near Matches

5.2.5 Trimming the Population

5.3 Best Practices for Assessing Feasibility: Assessing Balance

5.3.1 The Standardized Difference for Assessing Balance at the Individual Covariate Level

5.3.2 The Prognostic Score for Assessing Balance

5.4 Example: REFLECTIONS Data

5.4.1 Feasibility Assessment Using the REFLECTIONS Data

5.4.2 Balance Assessment Using the REFLECTIONS Data

5.5 Summary

References

5.1 Introduction

This chapter demonstrates the final pieces of the design phase, which is the second stage in the four-stage process proposed by Bind and Rubin (Bind and Rubin 2017, Rubin 2007) and described as our best practice in Chapter 1. Specifically, this stage covers the assessment of feasibility of the research and confirmation that balance can be achieved by the planned statistical adjustment for confounders. It is assumed at this point that you have a well-defined research question, estimand, draft analysis plan, and draft propensity score (or other adjustment method) model. Both graphical and statistical analyses are presented along with SAS code and are applied as an example using the REFLECTIONS data.

In a broad sense, a feasibility assessment examines whether the existing data are sufficient to meet the research objectives using the planned analyses. That is, given the research objectives and the estimand of interest (see Chapters 1 and 2), are the data and planned analyses able to produce reliable and valid estimates?

Girman et al. (2013) summarized multiple pre-analysis issues that should be addressed before undertaking any comparative analysis of observational data. One focus of that work was evaluating the potential for unmeasured confounding relative to the expected effect size (we will address this in Chapter 13). The Duke-Margolis Real-World Evidence Collaborative on the potential use of RWE for regulatory purposes (Berger et al. 2017) comments that “if the bias is too great or confounding cannot be adequately adjusted for then a randomized design may be best suited to generate evidence fit for regulatory review.” To address this basic concern with confounding, we focus the feasibility analysis in this chapter on two key analytic issues: confirming that the target population of inference is feasible with the current data (common support, positivity assumption, clinical equipoise, and so on) and assessing the ability to address confounders (measured and unmeasured). Both issues relate to core assumptions required for the validity of causal inference based on propensity score analyses.

For instance, while researchers often want to perform analyses that are broadly generalizable, such as an analysis of the full population of patients in the database, a lack of overlap in the covariate distributions of the different treatment groups may simply not allow a quality causal inference analysis over the full sample. If there is no common support (no overlap in the covariate space between the treatment groups), a key assumption necessary for unbiased comparative observational analyses is violated. Feasibility analysis can guide researchers toward comparisons and target populations that the data in hand can support.

Secondly, valid analyses require that the data are sufficient to allow statistical adjustment for bias due to confounding. The primary goal of a propensity score-based analysis is to reduce the bias in comparative observational data analysis that is due to measured confounders. The statistical adjustment must balance the two treatment groups with regard to all key covariates that may be related to both the outcome and treatment selection, such as age, gender, and disease severity measures. The success of the propensity score is judged by the balance in the covariate distributions that it produces between the two treatment groups (D’Agostino 2007). For this reason, assessing the balance produced by the propensity score has become a standard and critical piece of any best practice analysis.

Note that the feasibility and balance assessments are conducted as part of the design stage of the analysis. That is, such assessments can use the baseline data and thus are conducted “outcome free.” If the design phase is completed and documented prior to accessing the outcome data, then consumers of the data can be assured that no manipulation of the models was undertaken in order to produce a better result. Of course, this assessment may be an iterative process in order to find a target population of inference with sufficient overlap and a propensity model that produces good balance in measured confounders. As this feasibility assessment does not depend on outcomes data, the statistical analysis plan can then be finalized and documented after learning from the baseline data but prior to accessing the outcome data.

5.2 Best Practices for Assessing Feasibility: Common Support

Through the process of deriving the study objectives and the estimand, researchers will have determined a target population of inference. By this we mean the population of patients to which the results of the analysis should generalize. However, for valid causal analysis there must be sufficient overlap in baseline patient characteristics between the treatment groups. This overlap is referred to as the “common support.” There is no guarantee that the common support observed in the data is similar to the target population of inference desired by the researchers. The goal of this section is to demonstrate approaches that help assess whether there is sufficient overlap between the patient populations in each treatment group to allow valid inference to a target population of interest.

Multiple quantitative approaches have been proposed to assess the similarity of baseline characteristics between the patients in one treatment group versus another. Imbens and Rubin (2015) state that differences in the covariate distributions between treatment groups will manifest in some difference of the corresponding propensity score distributions. Thus, comparisons of the propensity score distributions can provide a simple summary of the similarities of patient characteristics between treatments, and such comparisons have become a common part of feasibility assessments.

Thus, as a tool for feasibility assessment, we propose a graphical display comparing the overlap in the two propensity score distributions, supplemented with the following statistics, discussed in the sections that follow, which provide quantitative guidance on the selection of methods and the population of inference:

● Walker’s preference score (clinical equipoise)

● standardized differences of means

● variance ratios

● Tipton’s index

● proportion of near matches

Specific guidance for interpreting each summary statistic is provided in the sections that follow. In addition, guidance on trimming non-overlapping regions of the propensity distributions to obtain a common support is discussed.

5.2.1 Walker’s Preference Score and Clinical Equipoise

Walker et al. (2013) discuss the concept of clinical equipoise as a necessary condition for quality comparative analyses. They define equipoise as “a balance of opinion in the treating community about what really might be the best treatment for a given class of patients.” When there is equipoise, there is better balance between the treatments on measured covariates, less reliance on statistical adjustment, and, perhaps more importantly, potentially less likelihood of strong unmeasured confounding. Empirical equipoise is observed similarity in the types of patients receiving each treatment in the baseline population. Walker et al. argue that “Empirical equipoise is the condition in which comparative observational studies can be pursued with a diminished concern for confounding by indication …” To quantify empirical equipoise, they proposed the preference score, F, a transformation of the propensity score that standardizes for the market share of each treatment,

ln(F / (1 − F)) = ln(PS / (1 − PS)) − ln(P / (1 − P)),

where F and PS are the preference and propensity scores for Treatment A and P is the proportion of patients receiving Treatment A. A patient with a preference score of 0.5 is no more or less likely to receive Treatment A than the overall market share would suggest. As a rule of thumb, it is acceptable to pursue a causal analysis if at least half of the patients in each treatment group have a preference score between 0.3 and 0.7 (Walker et al. 2013).
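As an illustration, the preference score and the rule-of-thumb check can be computed directly from estimated propensity scores. The following is a minimal Python sketch (the book's examples use SAS; the function names and toy values here are our own):

```python
import math

def preference_score(ps, p_treated):
    """Walker's preference score F, defined by
    logit(F) = logit(PS) - logit(P),
    where P is the overall proportion of patients on Treatment A."""
    logit_ps = math.log(ps / (1.0 - ps))
    logit_p = math.log(p_treated / (1.0 - p_treated))
    lf = logit_ps - logit_p
    return 1.0 / (1.0 + math.exp(-lf))  # back-transform from the logit scale

def equipoise_ok(prefs_a, prefs_b, lo=0.3, hi=0.7):
    """Walker's rule of thumb: at least half of the patients in each
    treatment group should have a preference score in [0.3, 0.7]."""
    in_range = lambda prefs: sum(lo <= f <= hi for f in prefs) / len(prefs)
    return in_range(prefs_a) >= 0.5 and in_range(prefs_b) >= 0.5

# A patient whose propensity score equals the market share of Treatment A
# (here 30%) has a preference score of exactly 0.5: no preference beyond
# market share.
f = preference_score(0.30, 0.30)  # 0.5
```

In practice, the propensity scores would come from the fitted propensity model, and the check would be run separately on each treatment group's patients.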

5.2.2 Standardized Differences in Means and Variance Ratios

Imbens and Rubin (2015) show that it is theoretically sufficient to assess imbalance in the propensity score distributions, as differences in the expectation, dispersion, or shape of the covariate distributions will be represented in the propensity score. Thus, comparing the distributions of the propensity scores for each treatment group has been proposed to help assess the overall feasibility and balance questions. In practice, the standardized difference in mean propensity scores along with the ratio of propensity score variances have been proposed as summary measures to quantify the difference in the distributions (Austin 2009, Stuart et al. 2010). The standardized difference in means (sdm) is defined by Austin (2009) as the absolute difference in the mean propensity score for each treatment divided by a pooled estimate of the standard deviation of the propensity scores:

sdm = |mean(PS_A) − mean(PS_B)| / √((s_A² + s_B²) / 2),

where s_A² and s_B² are the sample variances of the propensity scores in each treatment group.
Austin suggests that standardized differences > 0.1 indicate meaningful imbalance, while Stuart proposes a less stringent cutoff of 0.25. As two very different distributions can still produce a standardized difference in means of zero (Tipton 2014), it is advisable to supplement the sdm with the variance ratio. The variance ratio statistic is simply the variance of the propensity scores for the treated group divided by the variance of the propensity scores for the control group. An acceptable range of 0.5 to 2.0 for the variance ratio has been cited (Austin 2009).
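Both summary statistics are straightforward to compute from the estimated propensity scores. A minimal Python sketch (illustrative only, not the book's SAS code; the toy scores are invented):

```python
from statistics import mean, variance

def standardized_diff(ps_a, ps_b):
    """Absolute standardized difference in mean propensity scores,
    using the pooled standard deviation sqrt((s_A^2 + s_B^2)/2)
    in the denominator (Austin 2009)."""
    pooled_sd = ((variance(ps_a) + variance(ps_b)) / 2.0) ** 0.5
    return abs(mean(ps_a) - mean(ps_b)) / pooled_sd

def variance_ratio(ps_a, ps_b):
    """Variance of treated-group propensity scores over control-group variance."""
    return variance(ps_a) / variance(ps_b)

ps_a = [0.42, 0.55, 0.61, 0.48, 0.52]  # invented treated-group scores
ps_b = [0.40, 0.50, 0.58, 0.46, 0.49]  # invented control-group scores
acceptable = (standardized_diff(ps_a, ps_b) < 0.1
              and 0.5 < variance_ratio(ps_a, ps_b) < 2.0)
```

The same two functions are reused later for covariate-level balance checks, applied to each covariate rather than to the propensity scores.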

5.2.3 Tipton’s Index

Tipton (2014) proposed an index comparing the similarity of two cohorts as part of work in the generalizability literature to assess how well re-weighting methods are able to generalize results from one population to another. Tipton showed that, under certain conditions, her index is a combination of the standardized difference and ratio of variance statistics. Thus, the Tipton index improves on using only the standardized difference by also detecting differences in scale between the distributions. The Tipton Index (TI) is calculated by the following formula applied to the distributions of the propensity scores for each treatment group:

TI = Σ_{j=1}^{k} √(w_Aj × w_Bj),

where, for strata j = 1 to k, w_Aj is the proportion of Treatment A patients in stratum j (with Σ_j w_Aj = 1) and w_Bj is the proportion of Treatment B patients in stratum j (with Σ_j w_Bj = 1). The recommended number of strata for calculating the index is based on the total sample size. The index takes on values from 0 to 1, with very high values indicating good overlap between the distributions. As a rule of thumb, an index score > 0.90 is roughly similar to the combination of a standardized mean difference < 0.25 and a ratio of variances between 0.5 and 2.0.
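Given propensity scores stratified into k bins, the index reduces to a short calculation. A minimal Python sketch (the equal-width binning and the k=10 default are our illustrative choices, not a prescription from the text):

```python
def tipton_index(ps_a, ps_b, k=10):
    """Tipton's index: sum over strata j of sqrt(w_Aj * w_Bj), where
    w_gj is the proportion of group g's propensity scores falling in
    stratum j. Strata here are k equal-width bins on [0, 1]."""
    def stratum_weights(ps):
        counts = [0] * k
        for p in ps:
            j = min(int(p * k), k - 1)  # map a score of exactly 1.0 into the top bin
            counts[j] += 1
        return [c / len(ps) for c in counts]
    w_a = stratum_weights(ps_a)
    w_b = stratum_weights(ps_b)
    return sum((x * y) ** 0.5 for x, y in zip(w_a, w_b))
```

Identical distributions give an index of 1, and fully disjoint distributions give 0, matching the interpretation above.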

5.2.4 Proportion of Near Matches

Imbens and Rubin (2015) propose a pair of summary measures based on individual patient differences to assess whether the overlap in baseline patient characteristics between treatments is sufficient to allow for statistical adjustment. The two proposed measures are the proportion of subjects in Treatment A having at least one similar matching subject in Treatment B and the proportion of subjects in Treatment B having at least one similar match in Treatment A. A subject is said to have a similar match if there is a subject in the other treatment group with a linearized propensity score within 0.1 of that subject’s linearized propensity score. The linearized propensity score (lps) is defined as lps = ln(ps / (1 − ps)), where ps is the propensity score for the patient given their baseline covariates. Note that this statistic is most relevant when matching with replacement is used as the analytical method.
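As an illustration, the two proportions can be computed directly on the linearized scale. A minimal Python sketch (function names and toy data are our own):

```python
import math

def lps(ps):
    """Linearized propensity score: the logit of the propensity score."""
    return math.log(ps / (1.0 - ps))

def near_match_proportion(ps_from, ps_to, caliper=0.1):
    """Proportion of subjects in the first group having at least one
    subject in the second group whose linearized propensity score is
    within the caliper (0.1, per Imbens and Rubin)."""
    lps_to = [lps(p) for p in ps_to]
    matched = sum(
        any(abs(lps(p) - q) <= caliper for q in lps_to) for p in ps_from
    )
    return matched / len(ps_from)

# Compute both directions: A relative to B, and B relative to A.
ps_a, ps_b = [0.50, 0.55, 0.92], [0.48, 0.52, 0.60]
prop_a = near_match_proportion(ps_a, ps_b)
prop_b = near_match_proportion(ps_b, ps_a)
```

Low values in either direction signal subjects with no counterpart in the other treatment group, which is the concern this statistic is designed to surface.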

5.2.5 Trimming the Population

Patients in the tails of the propensity score distributions are often trimmed, or removed, from the analysis data set. One reason is to ensure satisfaction of the positivity assumption: that each patient’s probability of being assigned to either treatment is greater than 0 and less than 1. This is one of the key assumptions for causal inference using observational data. Secondly, when weighting-based analyses are performed, patients in the tails of the propensity distributions can have extremely large weights. This can inflate the variance and make the results rely on a handful of patients.

While many ad hoc approaches exist, Crump et al. (2009) and Baser (2007) proposed and evaluated a systematic approach to trimming the analysis population. This approach balances the increase in variance due to the reduced sample size (after trimming) against the decrease in variance from removing patients who lack matches in the opposite treatment group (and who thus have large weights in an adjusted analysis). Specifically, the algorithm finds the subset of patients with propensity scores between α and 1 − α that minimizes the variance of the estimated treatment effect. Crump et al. (2009) state that for many scenarios the simple rule of trimming to an analysis data set including all estimated propensity scores between 0.1 and 0.9 is near optimal.

However, in some scenarios, the sample size is large and efficiency in the analysis is of less concern than excluding patients from the analysis. In keeping with the positivity assumption (see Chapter 2), a commonly used approach is to trim only (1) the Treatment A (treated) patients with propensity scores above the maximum propensity score in the Treatment B (control) group; and (2) the Treatment B patients with propensity scores below the minimum propensity score in the Treatment A group.
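Both trimming rules described above amount to simple filters on the estimated propensity scores. A minimal Python sketch (function names are ours; this mirrors, but is not, the PSMATCH implementation):

```python
def trim_crump(ps, alpha=0.1):
    """Crump et al. rule of thumb: keep only patients with estimated
    propensity scores in [alpha, 1 - alpha] (default 0.1 to 0.9)."""
    return [p for p in ps if alpha <= p <= 1.0 - alpha]

def trim_minmax(ps_treated, ps_control):
    """Min-max common support: drop treated patients above the control
    group's maximum score and control patients below the treated
    group's minimum score."""
    hi = max(ps_control)
    lo = min(ps_treated)
    return ([p for p in ps_treated if p <= hi],
            [p for p in ps_control if p >= lo])
```

In a real analysis these filters would be applied to patient records rather than bare scores, keeping outcomes and covariates aligned with the retained subjects.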

The PSMATCH procedure in SAS can easily implement the Crump rule of thumb, the min-max procedure, and other variations using the REGION= option (Crump: REGION=ALLOBS(PSMIN=0.1 PSMAX=0.9); min-max: REGION=CS(EXTEND=0)). We fully implement the Crump algorithm in Chapter 10 for scenarios with more than two treatment groups, where it is difficult to visually assess the overlap in the distributions. In this chapter, we follow the approaches available in the PSMATCH procedure.

Recently, Li et al. (2016) proposed the concept of overlap weights to limit the need to trim the population simply to avoid large weights in the analysis. The overlap weights define an alternative target population of inference in addition to the ATE and ATT populations: they up-weight patients in the center of the combined propensity distributions and down-weight patients in the tails. This is discussed in more detail in Chapter 8, but is mentioned here to emphasize that the need for trimming depends on the target population and the planned analysis method (for example, matching with calipers trims the population by definition). At a minimum, in keeping with the importance of the positivity assumption, we recommend trimming using the minimum/maximum method available in PSMATCH.

5.3 Best Practices for Assessing Feasibility: Assessing Balance

Once it has been determined that a comparative analysis is feasible, one last critical step in the design stage of the research is to confirm the success of the statistical adjustment (for example, the propensity score) for measured confounders. The success of a propensity score model is judged by the degree to which it balances the measured covariates between the treatment groups. Austin (2009) argues that comparing the balance between treatment groups for each and every potential confounder after the propensity adjustment is the best approach, and that assessment of the propensity distributions alone is informative but not sufficient for this step. A good statistic for this balance assessment should also be a characteristic of the observed sample and independent of sample size (Ho et al. 2007). Thus, the common practice of comparing baseline characteristics using hypothesis tests, which are highly dependent on sample size, is not recommended (Austin 2009). For these reasons, computing the standardized difference for each covariate has become the gold standard approach to assessing the balance produced by the propensity score. However, simply demonstrating similar means for two distributions does not imply that the distributions are similar. Thus, further steps providing a fuller understanding of the comparability of the covariate distributions between treatments are recommended. In addition, ensuring similar distributions of each covariate in each treatment group does not ensure that interactions between covariates are the same in each treatment group.

We follow a modified version of Austin’s (2009) recommendations as our best practice for balance assessment. For each potential confounder:

1. Compute the absolute standardized differences of the mean and the variance ratio.

2. Compare the absolute standardized differences of the mean and the variance ratios using the following:

a. Rule of thumb: absolute standardized differences < 0.1 and variance ratios between 0.5 and 2.0 indicate acceptable balance.

b. Optional additional examination: compute the expected distribution of standardized differences and variance ratios under the assumption of balance (sdm = 0, variance ratio = 1) and assess the observed values in relation to the expected distribution.

3. Repeat steps 1 and 2 to compute and assess the standardized mean differences and variance ratios for 2-way interactions.

4. As a final check, graphically assess differences in the full distribution of each covariate between treatments using displays such as a Q-Q plot.

Of course, one could follow these instructions and substitute different statistics in each step – such as a formal Kolmogorov-Smirnov test to compare the distributions of the covariates instead of the graphical approach – or supplement the Q-Q plots with statistics for the mean and maximum deviation from the 45-degree line, as Ho et al. (2007) suggest. However, the goal is clear. A thorough check confirming that the covariate distributions are similar between the treatment groups is necessary for quality comparative analysis of observational data.
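Steps 1 through 3 above can be sketched as a simple screening loop. The following Python sketch is illustrative only (the book works in SAS; the thresholds are the rules of thumb cited above, and the covariate data would come from the study's baseline records):

```python
from itertools import combinations
from statistics import mean, variance

def balance_stats(x_a, x_b):
    """Absolute standardized difference and variance ratio for one covariate."""
    pooled_sd = ((variance(x_a) + variance(x_b)) / 2.0) ** 0.5
    sdm = abs(mean(x_a) - mean(x_b)) / pooled_sd
    return sdm, variance(x_a) / variance(x_b)

def check_balance(cov_a, cov_b, sdm_max=0.1, vr_lo=0.5, vr_hi=2.0):
    """Screen each covariate and each 2-way product interaction against
    the rules of thumb; cov_a / cov_b map covariate name -> values for
    the treated and control groups. Returns the names failing a check."""
    items = [(name, cov_a[name], cov_b[name]) for name in cov_a]
    for n1, n2 in combinations(list(cov_a), 2):  # step 3: 2-way interactions
        items.append((n1 + "*" + n2,
                      [x * y for x, y in zip(cov_a[n1], cov_a[n2])],
                      [x * y for x, y in zip(cov_b[n1], cov_b[n2])]))
    flagged = []
    for name, x_a, x_b in items:
        sdm, vr = balance_stats(x_a, x_b)
        if sdm >= sdm_max or not (vr_lo < vr < vr_hi):
            flagged.append(name)
    return flagged
```

Any flagged covariate would then get the closer graphical look described in step 4.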

In practice, the above steps may indicate some imbalance on select covariates and the balance assessment might become an iterative process. Imbalance on some covariates – such as those known to be strongly predictive of the outcome measure – may be more critical to address than imbalance on others. If imbalance is observed, then researchers have several options including revising the propensity model, using exact matching or stratification on a critical covariate, trimming the population, and modifying the analysis plan to incorporate the covariates with imbalance into the analysis phase to address the residual imbalance.
