Reply to the Letter to the Editor: “An analysis of 11.3 million screening tests examining the association between recall and cancer detection rates in the English NHS breast cancer screening programme”

by R. G. Blanks, R. M. Given-Wilson, S. L. Cohen, J. Patnick, R. J. Alison, M. G. Wallis (roger.blanks@ndph.ox.ac.uk)


Thank you for your interest in our paper. We note that you plan an additional analysis extending the data, which will potentially include the data used here plus, we assume, another two years of data (2016/17 and 2017/18) as well as interval cancer data. It was our hope that the paper would provoke further discussion on recall rate and that, by producing models that quantify the diminishing returns from higher recall rates, the balance between benefit and harm could be made explicit. Furthermore, such models can be updated as appropriate when more information becomes available or when changes in technology occur.

Our analysis of the data suggests two key findings. Firstly, increasing the recall rate above 2.6% at prevalent screens and 4% at incident screens does not result in the detection of any more grade 3 invasive cancers. Crucially, this suggests that the rate of grade 3 invasive interval cancers following incident screens will be independent of the recall rate above 2.4%. This could be tested with a further analysis. Secondly, above about 4% at incident screens and 7% at prevalent screens, the only appreciable increase in detection is in low/intermediate grade DCIS (all high grade DCIS being detected earlier). Again, this could be subject to further evaluation. We have endeavoured to provide as much information as possible from the unique dataset used in the paper. However, in response to your letter we have included an additional figure for prevalent screens (see below) showing 95% confidence limits around the cancer detection rate for each individual screening unit, together with the fitted model; this information can be considered along with Figure 1 in the paper. Note that the incident screen data is based on greater numbers of screens than the prevalent screen data, and the observed rates are therefore more statistically stable. The vast majority of confidence limits cross the fitted line, and indeed the graph itself could have a QA use in identifying any unit whose 95% confidence limits lie wholly below the fitted line. We believe the models are a reasonable summary of the association between detection and recall rates. Furthermore, our conclusion that very high recall rates only increase the detection of non-invasive cancers is supported by comparisons between American and UK data, as reported in the paper.
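As an illustration of the possible QA use described above, the following minimal sketch computes 95% confidence limits around a unit's cancer detection rate and checks whether they lie wholly below a fitted value. The interval method is not stated in this reply, so the Wilson score interval is an assumption made here; the function names and the counts in the usage lines are hypothetical and are not data from the paper.

```python
import math

def wilson_interval_per_1000(cancers, screens, z=1.96):
    """Wilson score limits for a detection rate, expressed per 1000 screens.

    Assumption: the Wilson method is chosen for illustration only; the
    method used for the published figure is not stated in the text.
    """
    if screens <= 0:
        raise ValueError("screens must be positive")
    if not 0 <= cancers <= screens:
        raise ValueError("cancers must lie between 0 and screens")
    p = cancers / screens
    denom = 1 + z * z / screens
    centre = (p + z * z / (2 * screens)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / screens + z * z / (4 * screens * screens)
    ) / denom
    return 1000 * (centre - half_width), 1000 * (centre + half_width)

def wholly_below_fitted(cancers, screens, fitted_per_1000):
    """True when the upper confidence limit is below the fitted rate."""
    _, upper = wilson_interval_per_1000(cancers, screens)
    return upper < fitted_per_1000

# Hypothetical unit: 40 cancers in 10,000 screens, compared with a
# hypothetical fitted value of 8 per 1000 at that unit's recall rate.
lower, upper = wilson_interval_per_1000(40, 10_000)
flagged = wholly_below_fitted(40, 10_000, fitted_per_1000=8.0)
```

A unit would be flagged only when its whole interval sits under the fitted line, which is a stricter criterion than a low point estimate alone.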

Our findings suggest that many small grade 3 invasive cancers destined to become interval cancers are not there (i.e. have not entered the pre-clinical detectable phase), are not visible (occult), or are visible but not recognisable as possible cancers (i.e. look benign) using current practices and technology. This would explain why simply increasing recall rate does not influence the grade 3 detection rate.

It is important in terms of screening standards to ask the question: why do we see such variation in recall rates in England? Firstly, screening units are trained or led by individuals who have their own concept of what is an appropriate sensitivity/specificity trade-off. This happens because the current recall rate standards are effectively that anything less than 10% at prevalent screens and anything less than 7% at incident screens is acceptable, provided detection rates are met. This allows a wide range of recall rates and allows some units to use high recall rates to ‘chase’ the possibility of detecting a few more grade 3 invasive cancers (with little success, our models suggest). Secondly, there is a wide variation in screening unit size in England (allowed for in the models), and the smallest unit screens only one-tenth the number of women screened by the largest unit. The smallest units have statistically unstable cancer detection rates, but not recall rates. This can mean that during any one year the cancer detection rate may be low by chance and trigger a quality assurance action. This could result in an increase in recall rate to improve the detection rate, and over time this will lead to some smaller units having excessively high recall rates, which is exactly what we see.
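The point that small units have unstable detection rates but comparatively stable recall rates follows from simple binomial variation: cancers are far rarer than recalls, so the same number of screens yields a much larger relative error. The sketch below illustrates this under the assumption of independent binomial outcomes. The rates and unit sizes are hypothetical round numbers chosen for illustration (only the ten-fold size ratio comes from the text), and `relative_standard_error` is a name introduced here.

```python
import math

def relative_standard_error(rate, screens):
    """Binomial standard error of an observed rate, as a fraction of the rate."""
    if not 0 < rate < 1:
        raise ValueError("rate must be strictly between 0 and 1")
    if screens <= 0:
        raise ValueError("screens must be positive")
    return math.sqrt((1 - rate) / (rate * screens))

# Hypothetical illustrative values, not figures from the paper.
detection_rate = 0.008   # 8 cancers per 1000 screens
recall_rate = 0.07       # 7% of screens recalled
small_unit = 2_000       # screens per year in a small unit
large_unit = 20_000      # a unit ten times larger

for screens in (small_unit, large_unit):
    det = relative_standard_error(detection_rate, screens)
    rec = relative_standard_error(recall_rate, screens)
    print(f"{screens} screens: detection {det:.3f}, recall {rec:.3f}")
```

Under these assumptions the relative error of the detection rate is about three times that of the recall rate at any unit size, and the small unit's errors are larger than the large unit's by a factor of the square root of ten.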

On the further specific points you raise: yes, the Dutch recall rate is lower than that of all English units. However, inclusion or exclusion of the Dutch datapoint does not change the model, i.e. the Dutch datapoint is predicted by the model fitted to the English data only. The Dutch data is only used at prevalent screens and not incident screens because the screening interval is different. We have allowed for age, but we have assumed that background incidence at a specific age is similar between the Netherlands and England. Now that screening has been introduced, this assumption is probably impossible to verify. The suggested target ranges of 4.0%–7.0% at prevalent screens and 2.6%–4.0% at incident screens, which correspond to 95% and 99% of the modelled maximum value, are wholly within the English observed rates, and we are not suggesting that the models should be used to set a target range at very low recall rates outside the English experience. We agree that, in Figure 1, for prevalent screens it is not possible to reliably predict the exact shape of the curve below a recall rate of 4%, other than that it reduces to (0, 0).
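To show how a target range can be read off as the recall rates at which a fitted curve reaches 95% and 99% of its maximum, here is a minimal sketch. The functional form of the fitted models is not given in this reply, so the saturating curve D(r) = Dmax(1 - exp(-k r)), which passes through (0, 0), is purely an assumption for illustration, as are the function name and the value of `k`; none of this should be read as the models used in the paper.

```python
import math

def recall_rate_at_fraction(fraction, k):
    """Recall rate (%) at which the assumed curve reaches `fraction` of Dmax.

    Assumed curve (illustrative only): D(r) = Dmax * (1 - exp(-k * r)).
    Solving D(r) = fraction * Dmax gives r = -ln(1 - fraction) / k.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    return -math.log(1 - fraction) / k

k = 1.0  # hypothetical rate constant per percentage point of recall
lower_target = recall_rate_at_fraction(0.95, k)  # 95% of modelled maximum
upper_target = recall_rate_at_fraction(0.99, k)  # 99% of modelled maximum
```

Note that Dmax cancels out of the calculation: with this assumed form the target range depends only on how quickly the curve saturates, not on the height of the plateau.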

A practical difficulty is how to go about changing practice at screening units in a long-established screening programme. The Dutch programme had a high specificity designed into it, whereas the English and American programmes did not. Asking units to increase recall rates (and thereby increase false positive recalls) or to decrease recall rates (and thereby ‘miss’ the occasional grade 1 or grade 2 invasive cancer) will understandably meet resistance, as it is not usual practice. It can of course be argued that it is the very lack of scientific justification for, and the breadth of, the current recall rate targets that has led to the wide range of practice we see between screening units. There is no doubt that providing further evidence, together with a re-appraisal and extension of the existing evidence, could be important in achieving a consensus on the optimum recall rate range. However, there is no substitute for an interventional study in which some units are encouraged to change recall rates under close supervision. We do not underestimate the difficulty of this.

Figure: Prevalent screen cancer detection rate per 1000 against recall rate (%), with 95% confidence intervals around the detection rate for each screening unit and the fitted model, which passes through the origin (0, 0)