Proposal for improved reporting of the RECOVERY trial
Taking advantage of the many good things pre-print servers offer, and in a spirit of open science, we wrote with André Gillibert a peer review of the first paper to come out of the RECOVERY trial. The pre-print and this comment can be found on the medRxiv pre-print server.
André GILLIBERT (M.D.)1, Florian NAUDET (M.D., Ph.D.)2
1/ Department of Biostatistics, CHU Rouen, F-76000 Rouen, France
2/ Univ Rennes, CHU Rennes, Inserm, CIC 1414 (Centre d’Investigation Clinique de Rennes), F-35000 Rennes, France
We read with interest the pre-print of the article entitled “Effect of Dexamethasone in Hospitalized Patients with COVID-19: Preliminary Report”. It reports the preliminary results of a large-scale randomized clinical trial (RCT) conducted in 176 hospitals in the United Kingdom. To our knowledge, it is the largest pragmatic RCT comparing curative treatments for COVID-19. The 28-day survival endpoint is objective, clinically relevant and should not be influenced by the measurement bias that may be caused by the open-label design. While 2,315 study protocols about COVID-19 had been registered on ClinicalTrials.gov as of June 24th 2020, Recovery is, to our knowledge, the only randomized clinical trial on COVID-19 that has succeeded in including more than ten thousand patients. The open-label design and simple electronic case report form (e-CRF) may have helped to include a non-negligible proportion of all COVID-19 patients hospitalized in the United Kingdom (UK). Indeed, as of June 24th 2020, approximately 43,000 patients had died of COVID-19 in hospital in the UK, of whom approximately 0.24 × 11,500 = 2,760, that is, more than 6% of all hospital deaths from COVID-19, were included in the Recovery study.
Having read with interest version 6.0 of the publicly available study protocol (https://www.recoverytrial.net/files/recovery-protocol-v6-0-2020-05-14.pdf), we had hoped for more details in the reporting of the methods and results of this trial, and we take advantage of the open peer review process offered by pre-print servers to suggest improving some aspects of the reporting before the final peer-reviewed publication. Please find below some easy-to-answer comments that may help to improve the article overall.
Interim analyses and multiple treatment arms
The first information needed concerns interim analyses. The protocol (version 6.0) specifies that the trial is adaptive and that randomization arms may be added, removed or paused according to decisions of the Trial Steering Committee (TSC), basing its decisions on interim analyses performed by the Data Monitoring Committee (DMC) and communicated when “the randomised comparisons in the study have provided evidence on mortality that is strong enough [...] to affect national and global treatment strategies” (protocol, page 16, section 4.4, 2nd paragraph). The Supplementary Materials of the manuscript specify that “the independent Data Monitoring Committee reviews unblinded analyses of the study data and any other information considered relevant at intervals of around 2 weeks”. This suggests that many interim analyses may have been performed from the start (March 9th) to the end (June 8th) of the study.
Statistically, interim analyses that are not properly taken into account generate an inflation of the type I error rate, which may be increased further by the multiple treatment arms. Methods such as triangular tests make it possible to control the type I error rate. Most methods of controlling the type I error rate in interim analyses require that the maximal sample size be defined a priori and that the timing and number of interim analyses be pre-planned. This protocol being adaptive, new arms were added, implying new statistical tests in interim analyses, and no pre-defined sample size, as seen on page 2 of the protocol: “[...] it may be possible to randomise several thousand with mild disease [...], but realistic, appropriate sample sizes could not be estimated at the start of the trial.” This makes control of the type I error rate difficult. The fact that the study was stopped at the final analysis rather than at an interim analysis, as we understand from the current draft, does not remove the type I error rate inflation. The multiple treatment arms lead to another inflation of the type I error rate.
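The magnitude of this inflation can be illustrated with a small Monte Carlo sketch (our own toy model, not the RECOVERY data): a two-arm trial with no true treatment effect, re-analyzed every time a new batch of patients accrues, and declared positive at the first look where |z| > 1.96.

```python
import numpy as np

rng = np.random.default_rng(0)
nsim, n_looks, n_per_look = 4000, 10, 50  # 10 looks, 50 new patients per arm per look

any_look, first_look = 0, 0
for _ in range(nsim):
    # The null hypothesis is true: identical outcome distributions in both arms
    a = rng.normal(size=(n_looks, n_per_look))
    b = rng.normal(size=(n_looks, n_per_look))
    n_cum = n_per_look * np.arange(1, n_looks + 1)
    # z-statistic on the accumulated data at each look (known unit variance)
    z = (np.cumsum(a.sum(axis=1)) - np.cumsum(b.sum(axis=1))) / np.sqrt(2 * n_cum)
    first_look += abs(z[0]) > 1.96
    any_look += np.any(np.abs(z) > 1.96)

print(f"type I error, single analysis:          {first_look / nsim:.3f}")  # ~0.05
print(f"type I error, stop at first |z| > 1.96: {any_look / nsim:.3f}")    # ~0.19
```

With 10 unadjusted looks the error rate roughly quadruples, consistent with the classical results of Armitage, McPherson and Rowe; multiple treatment arms multiply the number of such tests again.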
The current manuscript does not specify any procedure to address these problems. The Statistical Analysis Plans (SAP) V1.0 (in section 5.5) and V1.1 (in section 5.6) specify that “Evaluation of the primary trial (main randomisation) and secondary randomisation will be conducted independently and no adjustment be made for these. Formal adjustment will not be made for multiple treatment comparisons, the testing of secondary and subsidiary outcomes, or subgroup analyses.”, and nothing is specified about interim analyses. Therefore, we conclude that no P-value adjustment for multiple testing has been performed, neither for the multiple treatment arms nor for the interim analyses. If an interim analysis assessing 4 to 6 treatment arms at the 5% significance level was performed every 2 weeks from March to June, up to 50 tests may have been performed, leading to a major inflation of the type I error rate. In our opinion, the best way to assess, and perhaps correct, the type I error rate inflation is to report with maximal transparency every interim analysis that has been performed, with the following information:
- The date of the interim analysis and the number of patients included at that stage
- Whether the interim analysis was planned (e.g. every 2 weeks, as specified in the supplementary material) or unplanned (e.g. due to an external event, for instance the article by Mehra et al. about hydroxychloroquine published in The Lancet, doi:10.1016/S0140-6736(20)31180-6), and if exceptional, why
- Which statistical analyses, on which randomization arms, were performed at each stage
- If predefined, what criteria (statistical or not) would have led to early stopping of a randomization arm for lack of efficacy, and what criteria would have led to stopping for proven efficacy
- If statistical criteria were not predefined, whether the DMC provided a rationale for its choice to communicate or not the results to the TSC; if yes, could that rationale be provided?
- The results of the statistical analyses performed at each step
- The decision of the DMC to communicate or not the results to the TSC, and which results were reported, as the case may be
The information about interim analyses and multiple randomization arms will help to assess whether the inflation of the type I error rate is severe or not. A post hoc multiple testing adjustment, taking into account the many randomized treatments and interim analyses, should be attempted and discussed, even though there may be technical issues due to the adaptive nature of the protocol.
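For the multiple-arm part of such an adjustment (the interim-analysis part would additionally require group-sequential methods such as alpha-spending functions), a Holm step-down correction is one simple option. A minimal sketch, using purely hypothetical P-values for five arm comparisons (not RECOVERY results):

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted P-values; controls the family-wise error rate
    under any dependence structure between the tests."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # multiply the k-th smallest P-value by (m - k + 1), enforce monotonicity
        running_max = max(running_max, min(1.0, (m - rank) * p[idx]))
        adj[idx] = running_max
    return adj

# Hypothetical P-values for 5 treatment-arm comparisons (illustration only)
pvals = [0.0007, 0.04, 0.11, 0.32, 0.77]
adj = holm_adjust(pvals)
print(adj)  # smallest becomes 5 * 0.0007 = 0.0035; the 0.04 becomes 0.16
```

Under such a correction, a nominal P-value of 0.04 would no longer be significant at the 5% level once five comparisons are acknowledged, which illustrates how much the interpretation can change.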
Adjustment for age
An adjustment for age (in three categories: <70 years, 70-79 years, ≥80 years; see the legend of table S2) in a Cox model was performed for the comparison of dexamethasone to standard of care in the article. This adjustment was not specified in version 6.0 of the protocol but was, according to the manuscript, “added once the imbalance in age (a key prognostic factor) became apparent”. This is confirmed by the addition of the words “However, in the event that there are any important imbalances between the randomised groups in key baseline subgroups (see section 5.4), emphasis will be placed on analyses that are adjusted for the relevant baseline characteristic(s).” in section 5.5, page 16, of the SAP V1.1 of June 20th, compared to the SAP V1.0 of June 9th, which specified a log-rank test. The SAP V1.0 of June 9th may have been written before the database was analyzed (data cut June 10th), but the SAP of June 20th has probably been written after preliminary analyses were performed. This is consistent with the words “became apparent” in the manuscript. Therefore, in our opinion, this adjustment must be considered a post hoc analysis rather than the main analysis. Moreover, even though the SAP V1.1 specifies that an “important imbalance” will lead to an “emphasis” on adjusted analyses, it does not change the primary analysis (see section 5.1.1, page 14). It is not clear what “important imbalance” means. To interpret this, we performed statistical tests to assess the balance of the key baseline subgroups specified in the SAP V1.1 (see section 5.4):
- Risk group (three risk groups with approximately equal numbers of deaths based on factors recorded at randomisation). Its distribution is shown in figure S2. A chi-square test on the distribution of risk groups in the dexamethasone (1255/500/349) and usual care (2680/926/715) groups leads to a P-value of 0.092. A chi-square test for trend yields a P-value of 0.23.
- Requirement for respiratory support at randomisation (none; oxygen only; ventilation or ECMO). P-value = 0.89 for the chi-square test and P-value = 0.86 for the chi-square test for trend.
- Time since illness onset (≤7 days; >7 days). P-value = 0.17.
- Age (<70; 70-79; 80+ years). P-value = 0.016 for the chi-square test, P-value = 0.019 for the chi-square test for trend.
- Sex (male; female). P-value = 0.97 for the chi-square test.
- Ethnicity (White; Black, Asian or Minority Ethnic). No data found.
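For reproducibility, the first of these balance tests can be recomputed directly from the risk-group counts we read off figure S2 (a sketch using scipy):

```python
from scipy.stats import chi2_contingency

# Risk-group counts (figure S2): dexamethasone vs usual care,
# three prognostic risk groups
table = [[1255, 500, 349],
         [2680, 926, 715]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p:.3f}")  # P ≈ 0.092, as stated above
```

The other tests can be reproduced the same way from the corresponding count tables of the pre-print.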
The criterion defining an “important imbalance” seems to be statistical significance at the 0.05 threshold; however, this should have been stated, and tests for all the other variables should have been provided too.
First, from a theoretical point of view, this adjustment was not necessary since the study was randomized; if the exact condition of imbalance triggering the adjustment had been pre-specified in the protocol or SAP before the imbalance was known, it could have induced a very slight reduction of the type I error rate and power. However, as it was performed when the imbalance was known, there is a risk that the sign of the imbalance (i.e. higher age in the dexamethasone group) influenced the choice of adjustment. Indeed, an adjustment conditional on a higher age in the dexamethasone group will increase the estimated effect of dexamethasone in these conditions, and so produce an inflation of the type I error rate. If the same conditional adjustment were further considered for other prognostic variables, the inflation could be even higher.
Unless there is strong evidence that the amendment to the SAP was made without knowledge of the sign of the imbalance (higher age in the dexamethasone group), we suggest that the primary analysis be kept as originally planned, without adjustment, and that the age adjustment be performed in a sensitivity analysis only. Whether the sign of the imbalance was known is unclear in the last version of the SAP (V1.1, June 20th) and in the manuscript. In addition, in an open-label trial, it is always better to stick to the protocol.
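The mechanism we describe, choosing whether to adjust only after seeing the direction of the imbalance, can be illustrated by a simulation (our own toy model with arbitrary numbers, not the RECOVERY data): under a null treatment effect, adjusting for a strongly prognostic covariate only when the treated arm happens to be disadvantaged by it inflates the one-sided type I error rate for a benefit.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nsim, crit = 100, 10000, 1.6449   # patients per arm; one-sided 5% critical value

def z_unadjusted(y_t, y_c):
    """Two-sample z-statistic (large-sample approximation)."""
    se = np.sqrt(y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c))
    return (y_t.mean() - y_c.mean()) / se

def z_adjusted(y, treat, age):
    """z-statistic of the treatment coefficient in OLS of y on [1, treat, age]."""
    X = np.column_stack([np.ones_like(y), treat, age])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rej = 0
for _ in range(nsim):
    age_t, age_c = rng.normal(size=n), rng.normal(size=n)
    # Null: no treatment effect; age is strongly prognostic (higher age -> worse outcome)
    y_t = age_t + 0.5 * rng.normal(size=n)
    y_c = age_c + 0.5 * rng.normal(size=n)
    if age_t.mean() > age_c.mean():
        # Adjust only when the treated arm happens to be older
        y = np.concatenate([y_t, y_c])
        treat = np.r_[np.ones(n), np.zeros(n)]
        age = np.concatenate([age_t, age_c])
        z = z_adjusted(y, treat, age)
    else:
        z = z_unadjusted(y_t, y_c)
    rej += z < -crit   # declare a treatment benefit (lower outcome is better)

print(f"one-sided type I error: {rej / nsim:.3f} (above the nominal 0.05)")
```

Each branch of the rule picks the analysis more favorable to the treatment: the adjusted analysis remains valid when the treated arm is older, but the unadjusted analysis is kept precisely when its bias points toward benefit, so the overall error rate exceeds 5%.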
Results in other treatment arms
The manuscript specifies that “the Steering Committee closed recruitment to the dexamethasone arm since enrollment exceeded 2000 patients.” It is not stated whether any other treatment arm has exceeded 2000 patients, or whether the study is still ongoing. The results of treatment arms that have been stopped should be provided (all arms having enrolled more than 2000 patients?). If not, the number of patients randomized in the other treatment arms should, at least, be reported. If the study has been completely stopped, all treatments should be analyzed and reported, unless there is a specific reason not to do so; that reason should be stated as the case may be. These data would be useful to provide evidence on other molecules. They would also clarify the number of statistical tests that have or have not been performed, providing more information about the overall inflation of the type I error rate.
The paragraph about the sample size suggests that inclusions were planned, at some point, to stop when 2000 patients had been included in the dexamethasone arm. The amended protocol (May 14th), the SAP V1.0 (June 9th) and the SAP V1.1 (June 20th, 4 days after the results were officially announced) all have a paragraph about the sample size, but all specify that the sample size is not fixed, and none specifies any criterion for stopping the trial based on sample size. There are 2104 patients included in this arm, which is substantially larger than the target of 2000 patients. The exact chronology and methodology should be clarified: when was the sample size computed, and what was the exact criterion for stopping recruitment? Could the document (internal report?) related to this sample size calculation and the statistical or non-statistical decision to stop recruitment be published in supplementary material?
Indeed, assessment of the type I error rate requires knowing exactly when and why recruitment was stopped: stopping for a low inclusion rate of new patients or for reaching a target sample size cannot be interpreted in the same way as stopping for high efficacy observed at an interim analysis.
Future of the protocol
With the new evidence about dexamethasone, the protocol will probably be stopped or will evolve. Future recruitment may slow down, as the peak of the epidemic curve in the United Kingdom has passed. The past, present and future of the protocol also need to be known to assess the actual type I error rate. Indeed, future analyses that have not yet been performed influence the overall type I error rate. That is why we suggest that the authors provide the daily or weekly inclusion rate from March to June and discuss the future of the study.
Loss to follow-up
Table S1 shows that follow-up forms were received for 1940/2104 (92.2%) patients of the dexamethasone group and 3973/4321 (91.9%) patients of the usual care group. The patients without follow-up forms (8.0% overall) may either be lost to follow-up or have been included in the last 28 days before June 10th 2020 (data cut). The manuscript mentions that 4.8% of patients “had not been followed for 28 days by the time of the data cut”, suggesting that 8.0% − 4.8% = 3.2% of patients were lost to follow-up, but that is our own interpretation. We suggest that the authors report the actual number of patients lost to follow-up and how their data were imputed or analyzed. The number of patients lost to follow-up may differ between outcomes. For instance, if the Office for National Statistics (ONS) data were used for vital status assessment, there should be no loss to follow-up on that outcome.
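Recomputing these proportions directly from the Table S1 counts gives:

```python
# Recompute the follow-up proportions from the Table S1 counts quoted above
dex_total, dex_forms = 2104, 1940
usual_total, usual_forms = 4321, 3973

missing = (dex_total - dex_forms) + (usual_total - usual_forms)  # 164 + 348 = 512
overall = missing / (dex_total + usual_total)                    # 512 / 6425
not_yet_followed = 0.048   # "had not been followed for 28 days" (manuscript)

print(f"forms missing overall:      {overall:.1%}")                     # 8.0%
print(f"apparent loss to follow-up: {overall - not_yet_followed:.1%}")  # ~3.2%
```

The counts quoted above thus imply that about 8.0% of forms are missing overall, leaving roughly 3.2% unexplained by the data cut.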
The current manuscript only specifies the data of the web-based case report form (e-CRF), filled in by hospital staff, as the source of information, suggesting that it is the only source of information about vital status. The document entitled “Definition and Derivation of Baseline Characteristics and Outcomes”, provided at https://www.recoverytrial.net/files/recovery-outcomes-definitions-v1-0.pdf, specifies many other sources. For instance, vital status had to be assessed from the Office for National Statistics (ONS). Other sources, including the Secondary Uses Service Admitted Patient Care (SUSAPC) and the e-CRF, could be used for interim analyses. The ONS was considered the defining source (most reliable). Whether the ONS data were used or not should be clarified. If the ONS data were used, statistics of agreement between the two data sources (e-CRF and ONS) could be provided to help assess the quality of the data. If the ONS data were not used, this deviation from the planned protocol should be documented.
The manuscript, as well as the recovery-outcomes-definitions-v1-0.pdf file, specifies that the follow-up form of the e-CRF is completed at “the earliest of (i) discharge from acute care (ii) death, or (iii) 28 days after the main randomisation”. If the follow-up form is not updated further, patients discharged alive before day 28 (e.g. at day 14) may have incomplete vital status information at day 28. The following information should be specified:
1. Whether the follow-up form of the e-CRF had to be updated by hospital staff at day 28 for these patients
2. If the response to (1) is yes, whether there was a means to distinguish a patient lost to follow-up at day 28 (form not updated) from a patient discharged and alive at day 28 (form updated to “alive at day 28”)
3. If the response to (2) is yes, how many patients discharged before day 28 were lost to follow-up at day 28
4. If the response to (2) is yes, how their vital status at day 28 was imputed or managed in models with censoring (log-rank, Kaplan-Meier, Cox)
Of course, this information is especially needed if the ONS and SUSAPC data have not been used.
The quality of the vital status information is critical in such a large-scale, open-label, multicentric trial, because there is a risk that one or more centers selectively report deaths, biasing the primary analysis.
Inclusion distribution by center
A multicentric study provides stronger evidence than a single-center study, but sometimes a few centers include most of the patients, with a risk of low-quality data or selection bias. The very high number of patients included in the Recovery trial suggests that many centers included many patients, but the distribution of inclusions per center could nevertheless be reported.
The protocol specifies that “in some hospitals, not all treatment arms will be available (e.g. due to manufacturing and supply shortages); and at some times, not all treatment arms will be active (e.g. due to lack of relevant approvals and contractual agreements).” This is further clarified in the SAP V1.0 (section 2.4.2, Exclusion criteria, page 8) by the sentence “If one or more of the active drug treatments is not available at the hospital or is believed, by the attending clinician, to be contraindicated (or definitely indicated) for the specific patient, then this fact will be recorded via the web-based form prior to randomisation; random allocation will then be between the remaining (or indicated) arms.”, showing that randomization arms may be closed on an individual basis, when the patient is included, on the grounds of contraindication or definite indication. It seems that the “standard of care” group could not be removed and that at least one other randomization arm had to be kept, as suggested by the words “random allocation will then be between the remaining arms (in a 2:1:1:1, 2:1:1 or 2:1 ratio)” in section 2.9.1, page 11, of the SAP V1.0. Even the exclusion of a single randomization arm can lead to imbalance between groups. For instance, if physicians believed that a treatment was contraindicated for the most severe patients, only non-severe patients could be randomized to that treatment's arm, while the most severe patients would be randomized to the other arms. Several things can be done to assess and fix this bias. First, report how many times this feature was used and which randomization arms were most often excluded. If it was used many times, provide the pattern of use, which helps to assess whether it was a collective measure (e.g. a 2-week period of shortage of a treatment in a center → no major selection bias) or an individual measure. If its use was rare, a sensitivity analysis could simply exclude these patients.
If it was frequent, we suggest a statistical method to analyze these data without bias, based on the following principles: patients randomized between three randomization arms A, B and C (population X) are comparable for the comparison of A to B. Patients randomized between A, B and D (population Y) are comparable for the comparison of A to B. Population X and population Y may differ but, within each population, A can be compared to B. Therefore, the within-X comparison of A to B and the within-Y comparison of A to B are both valid and can be meta-analyzed to assess a global difference between A and B. This can simply be done with an adjustment on the population (X or Y) in a fixed-effects multivariable model. Pooling of the X and Y populations should not be performed without adjustment.
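A toy numerical example (hypothetical counts, our own illustration) shows why pooling without adjustment is biased when the populations differ:

```python
# Toy illustration with hypothetical counts. Stratum X: arms A and B available;
# stratum Y: B contraindicated/unavailable, so its (more severe) patients
# could only be randomized to A. Cells are (n, deaths) per arm.
stratum_X = {"A": (2000, 400), "B": (1000, 200)}   # 20% mortality in both arms
stratum_Y = {"A": (1000, 400)}                     # 40% mortality, no B arm

def rate(cell):
    n, deaths = cell
    return deaths / n

# Naive pooling of A over both strata, compared to B (present only in X):
pooled_A = (stratum_X["A"][0] + stratum_Y["A"][0],
            stratum_X["A"][1] + stratum_Y["A"][1])
naive_diff = rate(pooled_A) - rate(stratum_X["B"])          # +6.7 points: spurious harm

# Valid comparison: compare A with B only within strata containing both arms
within_diff = rate(stratum_X["A"]) - rate(stratum_X["B"])   # 0: no true difference

print(f"naive pooled A-B difference:   {naive_diff:+.3f}")
print(f"within-stratum A-B difference: {within_diff:+.3f}")
```

The within-stratum differences can then be combined with inverse-variance weights, which is exactly what the fixed-effects adjustment on the population indicator achieves.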
A second problem with randomization exists, although the dexamethasone arm is the least affected. Randomization arms were added in this adaptive trial. When a new randomization arm is added, new patients may be randomized to this arm and fewer patients are randomized to the other arms. Consequently, the distribution of dates of inclusion may differ between groups. This may have some impact on mortality at two levels: (1) the medical indication for hospitalization may have evolved as the epidemic evolved, with hospitalization reserved for the most severe patients at the peak of the epidemic and maybe wider hospitalization criteria at the start of the epidemic, and (2) the case mix of patients included in the Recovery trial may have evolved. Indeed, even if centers should have included as many patients as possible as soon as the inclusion criteria were met, it is possible that they included only part of the eligible patients and that this part evolved over time. This bias can easily be assessed and fixed: the curves of inclusions in the different arms and the mortality rate in the Recovery trial can be drawn as a function of date (from March to June), and an adjustment on the date of inclusion may be performed in a sensitivity analysis.
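A toy example with hypothetical counts (our own illustration) shows the calendar-time artefact and how adjustment on the inclusion period removes it:

```python
# Toy illustration: a new arm opens only in period 2, after background
# mortality has fallen from 30% to 20%. Cells are (n, deaths).
control = {"period1": (1000, 300), "period2": (1000, 200)}
new_arm = {"period2": (1000, 200)}

crude_control = (control["period1"][1] + control["period2"][1]) / 2000  # 25%
crude_new = new_arm["period2"][1] / new_arm["period2"][0]               # 20%
print(f"crude difference:    {crude_new - crude_control:+.2f}")  # -0.05: spurious benefit

# Comparing within the shared inclusion period removes the artefact
period2_control = control["period2"][1] / control["period2"][0]         # 20%
print(f"period-2 difference: {crude_new - period2_control:+.2f}")  # +0.00: no effect
```

In practice the same idea is implemented by stratifying or adjusting on the date (or period) of inclusion in the sensitivity analysis we suggest.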
Recovery is the study with the best methodology that we have seen on curative treatments for COVID-19, and we salute the initiative of transparently publishing the protocol, its amendments, the statistical analysis plan and the first draft of the report. We hope that our reporting suggestions will be taken into account in the final version of the paper. We think that discussing these points will qualify the interpretation of the results, further improve the transparent approach adopted by the designers of the study and improve the reliability of the conclusions. We expect high-quality reporting of the final results, with full transparency on interim analyses, statistical analysis plans and statistical analysis reports. We hope that these comments are helpful, and again we acknowledge that this study is not only outstanding in terms of the importance of its results but is also a stellar example for the whole field of therapeutic research. We invite other researchers to provide comments on this article and engage in Open Science.