Individual Participant Data Meta-Analysis

4.2.7 Developing a Data Dictionary for the IPD

In addition to preparing a list of variables that will be required for the analyses, it is important to consider carefully how best to define, collect and store these in an appropriate and unambiguous manner. The development of a detailed data dictionary for an IPD meta‐analysis project effectively establishes the structure of the meta‐analysis database, facilitates processing of IPD from each trial and ensures that the analyses can proceed as planned, with the greatest degree of flexibility. It also helps guide the trial teams in the preparation of IPD prior to transfer, and gives them the responsibility for modifying variables, lessening the likelihood of misinterpretation or coding errors. However, trial teams may not have the time to adhere to the data dictionary, and they should not be compelled to do so, particularly if their resources for preparing the IPD are limited. In such instances, it is advisable that the central research team accepts trial IPD in any reasonable, workable form, and takes responsibility for reformatting and re‐coding it according to the data dictionary.

Table 4.1 provides an excerpt from a data dictionary used in an IPD meta‐analysis examining the effects of chemoradiation for cervical cancer.93 Age at randomisation was collected straightforwardly as a continuous variable, with a missing data code of 999. Tumour stage was collected as a categorical variable with a single code for each stage and sub‐stage, and with a missing data code of 9. This afforded the greatest flexibility for subsequent analysis, as the sub‐stages could be used as supplied, or collapsed into broader stage categories as needed. While trial eligibility criteria indicate which participants a trial intends to recruit, it is worth suggesting a wider range of possibilities in the data dictionary, because recruitment of some ineligible participants might be inevitable. This could arise, for example, if eligibility is predicated on a positive diagnostic test, and false positives are identified at subsequent review, or as a result of a later diagnostic procedure. In the aforementioned cervical cancer IPD meta‐analysis, women with stage IVB disease were not eligible for any of the included trials. However, they were sometimes randomised erroneously, because initial clinical staging did not identify them as such, but subsequent surgical staging did, and so the data dictionary allowed for that possibility. If particular participant characteristics are collected on different scales, then it may be possible to convert to a common scale. In the cervical cancer IPD meta‐analysis, the included trials recorded performance status on different scales, so the data dictionary made clear that all were permitted, and these were later converted into a common meta‐analysis scale.

Table 4.1 Excerpt from a data dictionary developed for an IPD meta‐analysis of chemoradiation for cervical cancer.93

Source: Claire Vale and Jayne Tierney.

| Variable | Variable name | Definition |
| --- | --- | --- |
| Age at randomisation | Age | Numeric; age in years; 999 = unknown |
| Tumour stage | TumStage | Numeric; tumour stage categories: 1 = Stage Ia, 2 = Stage Ib, 3 = Stage IIa, 4 = Stage IIb, 5 = Stage IIIa, 6 = Stage IIIb, 7 = Stage IVa, 8 = Stage IVb, 9 = unknown |
| Performance status | PerfStat | Numeric; provide the data as defined in the trial and supply full details of the system used |
| Survival status | SurvStat | Numeric; 0 = Alive, 1 = Dead |
| Date of death or last follow‐up | DOLF | Date in dd/mm/yy format; unknown day = ‐‐/mm/yy; unknown month = ‐‐/‐‐/yy; unknown date = ‐‐/‐‐/‐‐ |
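As a minimal sketch of how the codes in Table 4.1 might be handled once IPD are received, the Python below decodes the missing data code for age and collapses tumour sub‐stages into broader stage categories. The function names and the collapsing rule (dropping the a/b suffix) are illustrative assumptions, not part of the original data dictionary.

```python
from typing import Optional

# Codes copied from Table 4.1. Everything else here is a hypothetical helper.
STAGE_LABELS = {
    1: "Ia", 2: "Ib", 3: "IIa", 4: "IIb",
    5: "IIIa", 6: "IIIb", 7: "IVa", 8: "IVb",
}
AGE_MISSING = 999
STAGE_MISSING = 9


def decode_age(raw: int) -> Optional[int]:
    """Return age in years, or None when the missing data code was supplied."""
    return None if raw == AGE_MISSING else raw


def broad_stage(code: int) -> Optional[str]:
    """Collapse a sub-stage code (1-8) into a broad stage (I-IV).

    Returns None for the 'unknown' code and rejects codes outside the dictionary.
    """
    if code == STAGE_MISSING:
        return None
    if code not in STAGE_LABELS:
        raise ValueError(f"unexpected tumour stage code: {code}")
    # Assumed grouping rule: strip the trailing a/b sub-stage letter.
    return STAGE_LABELS[code].rstrip("ab")
```

Because the sub‐stage codes are stored as supplied, the collapsed categories can be derived at analysis time without losing the original detail.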

The data dictionary should use accepted coding conventions wherever possible, not only to facilitate the provision of data by trial teams, but also to avoid errors. For example, for binary and time‐to‐event outcomes, 0 is most commonly used to indicate no event, and 1 to indicate an event has happened. For time‐to‐event outcomes such as survival in cancer, or time free of seizures in epilepsy, it is important to collect the three component variables that make up the outcome for each participant (Table 4.1). These comprise: a variable that indicates whether an event has happened (e.g. a death or a seizure); another that provides the date the event happened (e.g. date of death or date of seizure); and finally one that gives the date that the participant was last assessed for the outcome of interest (e.g. the date last seen in clinic). If an event has not occurred, the latter allows the participant to be included in the analysis, and censored at that time‐point. Together with the date of randomisation, these variables allow the time to event for each participant to be calculated, and provide the greatest flexibility for data checking (Section 4.5), risk of bias assessment (Section 4.6) and analysis (Part 2). Alternatively, the date of event and date of last follow‐up (censoring time) can be collected as a composite. As a bare minimum, the collection of an indicator variable for the occurrence of an event (yes/no) and the time to event (or censoring) will suffice. In fact, the latter may be all that trial teams are able to provide, for example, if they originate from a country or institute bound by stringent data protection regulations, or if the data are downloaded from a repository that prohibits the supply of exact dates to help preserve participant confidentiality.
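The calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `time_to_event` and its argument names are hypothetical, time is measured in days, and partially unknown dates (such as ‐‐/mm/yy) are not handled.

```python
from datetime import date
from typing import Optional, Tuple


def time_to_event(
    randomised: date,
    event: int,
    event_date: Optional[date],
    last_seen: Optional[date],
) -> Tuple[int, int]:
    """Return (days since randomisation, event indicator).

    event follows the usual convention: 0 = no event (censored), 1 = event.
    """
    if event == 1:
        if event_date is None:
            raise ValueError("event recorded but no event date supplied")
        end = event_date
    elif event == 0:
        if last_seen is None:
            raise ValueError("no event and no date of last assessment")
        end = last_seen  # participant is censored at last assessment
    else:
        raise ValueError(f"event indicator must be 0 or 1, got {event}")

    days = (end - randomised).days
    if days < 0:
        # A basic data check: follow-up cannot precede randomisation.
        raise ValueError("event or censoring date precedes randomisation")
    return days, event
```

Holding the component dates separately, rather than only the derived time, is what makes checks like the negative‐time test possible.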

Special care is needed to avoid ambiguity in the data dictionary; otherwise, it will lead to ambiguity in the IPD supplied for each trial, and then in the IPD meta‐analysis database. For example, for an IPD meta‐analysis of the effects of anti‐platelet therapy for pre‐eclampsia in pregnancy,97 the data dictionary suggested that severe maternal morbidity be coded as a single variable. Unintentionally, this did not allow for the provision of more than one type of morbidity for an individual woman, which could occur, for example, if she had eclampsia followed by a stroke (Table 4.2). In the same meta‐analysis, a missing data code of 9 was used for gestation at randomisation, which meant that (although unlikely) any women randomised at nine weeks’ gestation could potentially be regarded mistakenly as having missing gestation information (Table 4.2). Thus, an unambiguous missing data code such as 99 or, even better, a negative integer such as –9 would have been preferable. Furthermore, it is prudent to discriminate between different types of missing data, such as missing for the participant (e.g. –9 or 9), not applicable to the participant (e.g. –8 or 8) or not collected for the trial (e.g. –7 or 7). For example, in an IPD meta‐analysis of progesterone for pre‐term birth,79 if a baby was stillborn, certain baby outcomes were coded as 8 to signify that they could not be collected, and as 9 to indicate a true missing value. Although this could be inferred from the birth data, coding the IPD in this way made it easier to calculate the proportions of missing data and to cross‐check.
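To illustrate why negative missing data codes are unambiguous, the sketch below separates a raw value into the observed value and a reason for missingness. The codes –9, –8 and –7 are the examples given above. The `decode` helper and the reason labels are assumptions made for illustration.

```python
from typing import Optional, Tuple

# Negative codes cannot collide with a valid measurement such as
# gestation in completed weeks, which is never negative.
MISSING_REASONS = {
    -9: "missing for the participant",
    -8: "not applicable to the participant",
    -7: "not collected for the trial",
}


def decode(raw: int) -> Tuple[Optional[int], Optional[str]]:
    """Split a raw coded value into (observed value, missing reason)."""
    if raw in MISSING_REASONS:
        return None, MISSING_REASONS[raw]
    return raw, None
```

With this scheme a woman randomised at nine weeks' gestation is recorded as 9 and kept as a real value, while –9 is recognised as missing.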

Table 4.2 Excerpt from a data dictionary developed for an IPD meta‐analysis project examining the effects of anti‐platelets for prevention of pre‐eclampsia in pregnancy.97

Source: Lesley Stewart and Lisa Askie, based on the data dictionary used by Askie et al.97

| Variable | Definition | Issue |
| --- | --- | --- |
| Severe maternal morbidity | 1 = none, 2 = stroke, 3 = renal failure, 4 = liver failure, 5 = pulmonary oedema, 6 = disseminated intravascular coagulation, 7 = HELLP syndrome, 8 = eclampsia, 9 = not recorded | Collection as a single variable did not allow for the provision of more than one morbidity for the same woman |
| Gestation at randomisation | Gestation in completed weeks; 9 = unknown | A woman could be randomised at 9 weeks' gestation |
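One possible way to avoid the first issue in Table 4.2 is to collect a separate 0/1 indicator for each morbidity, so that more than one can be recorded for the same woman. The sketch below assumes hypothetical variable names. It is not the coding used in the original project.

```python
from typing import Dict, Iterable

# One indicator per morbidity listed in Table 4.2 (names are illustrative).
MORBIDITIES = (
    "stroke",
    "renal_failure",
    "liver_failure",
    "pulmonary_oedema",
    "dic",        # disseminated intravascular coagulation
    "hellp",      # HELLP syndrome
    "eclampsia",
)


def morbidity_flags(recorded: Iterable[str]) -> Dict[str, int]:
    """Return a 0/1 indicator for every morbidity, given those recorded."""
    recorded = set(recorded)
    unknown = recorded - set(MORBIDITIES)
    if unknown:
        raise ValueError(f"unrecognised morbidity: {sorted(unknown)}")
    return {name: int(name in recorded) for name in MORBIDITIES}
```

A woman who had eclampsia followed by a stroke would then have both indicators set to 1, and "none" corresponds to all indicators being 0.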
