Assessment of Manual Workload Limits in Gynecologic Cytology
Reconciling Data From 3 Major Prospective Trials of Automated Screening Devices

Andrew A. Renshaw MD, Tarik M. Elsheikh MD
DOI: http://dx.doi.org/10.1309/AJCP23YDFKKQNUAO. Pages 428–433. First published online: 1 April 2013

Abstract

Previous prospective studies have shown different results when comparing automated and manual screening of gynecologic cytology. The results of 3 large prospective studies were reviewed, and relative sensitivity was used as a gold standard. No significant differences could be shown in relative sensitivity between the ThinPrep Imaging System and the FocalPoint GS Imaging System (P > .05). When manual screening was restricted to less than 6 hours per day, 50 or fewer slides per day, and at least 6 minutes per slide (<10 slides/h), the relative sensitivity for automation was significantly lower for atypical squamous cells of undetermined significance and above (ASC+) (0.81; 95% confidence interval [CI], 0.79–0.83) than when manual screening was not restricted (1.07; 95% CI, 1.03–1.10). All 3 sites that screened 10 or more slides per hour manually had a relative sensitivity for automation that was significantly higher for high-grade squamous intraepithelial lesions and above (HSIL+) than for the remaining groups who screened less than 10 slides per hour (1.40 [95% CI, 1.22–1.60] vs 0.97 [95% CI, 0.95–1.00]). These results suggest that location finding of abnormalities (ASC+) may be more strongly associated with time spent screening per day, whereas classification/interpretation skills (HSIL+) may depend on time spent on an individual case. There is no evidence that automated screening devices are more sensitive than manual screening performed at lower well-defined workloads. More restricted workloads (≤41 slides/d, ≤4.5 h/d) for manual screening may perform significantly better than automated screening devices as measured by histologic cervical intraepithelial neoplasia 2 and above.

Key Words
  • Gynecologic cytology
  • Pap smear
  • Workload
  • Sensitivity
  • Performance

The results of previous large prospective trials using automated screening devices for gynecologic cytology are inconsistent. The ThinPrep Imaging System (TIS; Hologic, Marlborough, MA) was shown to significantly increase sensitivity at the threshold of atypical squamous cells of undetermined significance and above (ASC+) but not at high-grade squamous intraepithelial lesions and above (HSIL+).1 In contrast, the FocalPoint GS Imaging System (GS; Becton Dickinson, Franklin Lakes, NJ) has been shown to increase sensitivity at the threshold of HSIL+ but not at ASC+.2 Finally, the recent Manual Assessment Versus Automated Reading in Cytology (MAVARIC) study showed that both devices were significantly less sensitive than manual screening, using a histologic diagnosis of cervical intraepithelial neoplasia 2 and above (CIN2+). In that study there was also no significant difference between the 2 automated devices, although the study was underpowered for this comparison. Moreover, manual screening identified more cases as abnormal than did automated screening in all abnormal cytology categories.3,4

These results may be viewed as confusing if one evaluates only the results of the automation arm of the study compared with the manual arm. Nevertheless, we have previously shown that for manual screening, workload per day is correlated with screening sensitivity,5 and we noticed that there were significant differences in workload in the manual arms of these studies. As a result, we wondered whether the differences in relative sensitivity seen in these 3 studies could be explained by differences in workload and corresponding sensitivity in the manual arms of the studies.

Materials and Methods

Studies were included if the sensitivity of automated and manual screening was determined using a double-blind crossover method and workload data were available. Three studies met these criteria: the ThinPrep Imaging System US Food and Drug Administration (FDA) trial,1 the FocalPoint GS Imaging System FDA trial,2 and the MAVARIC study.3,4

Results presented in the package insert of the TIS1 and GS2 were reviewed. This provided us with data representing 7 sites where the studies were conducted (4 sites from the TIS study and 3 from the GS study). The results of the MAVARIC study, which combined data from both instruments, were also analyzed as a single site.3,4

Both the TIS and GS studies were double-blind crossover studies performed at 4 different sites in the United States in which the performance of automated screening was compared with that of manual screening. Manual screening was performed under the limitations of the Clinical Laboratory Improvement Amendments of 1988.6 The gold standard in each study was a consensus cytologic diagnosis for each slide. Consensus was determined in all discrepant cases and in a subset of concordant cases by a panel of experts. One site in the GS study (F3) did not include workload data and was excluded. The MAVARIC study was also a double-blind crossover study performed in the United Kingdom (UK) in which the performance of both the TIS and GS systems was compared with manual screening. Manual screening was performed under the limitations of standard practices for UK screening programs. In keeping with UK guidance, cytotechnologists were only allowed to undertake microscopy for up to a maximum of 5 hours per day.7 This study also included a separate manual arm that confirmed that the sensitivity of the manual arm in the paired study was similar to manual screening alone. Data from this manual-only arm were not analyzed in this article.

For this review, we report both the net increase and the relative increase in sensitivity for detecting abnormal results at varying thresholds. Net sensitivity is defined as the absolute sensitivity of automated screening minus the absolute sensitivity of manual screening, as reported in both the TIS and GS package inserts. These data are presented for descriptive purposes and for comparison with the original data. Relative sensitivity is the ratio of the sensitivity of the automated arm to that of the manual arm. For relative sensitivity, a 95% confidence interval (CI) is also reported. Comparisons were made between 2 proportions using a significance level of P < .05.
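As a minimal sketch, a relative sensitivity and an approximate 95% CI can be computed with the standard log-ratio (Katz) method for two proportions. The counts below are hypothetical, not taken from the trials, and the independent-proportions formula is a simplification: the crossover design is paired, and a method accounting for within-slide correlation would tighten the interval.

```python
import math

def relative_sensitivity_ci(tp_auto, n_auto, tp_manual, n_manual, z=1.96):
    """Ratio of two sensitivities with an approximate 95% CI
    (log-ratio / Katz method for two independent proportions)."""
    sens_auto = tp_auto / n_auto        # sensitivity of the automated arm
    sens_manual = tp_manual / n_manual  # sensitivity of the manual arm
    rr = sens_auto / sens_manual
    # standard error of log(RR) for two independent proportions
    se = math.sqrt((1 - sens_auto) / tp_auto + (1 - sens_manual) / tp_manual)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper

# Hypothetical example: automation detects 850 of 1000 abnormal slides,
# manual screening detects 800 of the same 1000 slides
rr, lower, upper = relative_sensitivity_ci(850, 1000, 800, 1000)
# rr = 1.0625; the CI excludes 1.0, so the difference would be significant
```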

In the MAVARIC study, in contrast to the TIS and GS studies, the primary end point was histologic CIN2+ rather than consensus diagnosis of the cytology. As a result, there was no final consensus cytology diagnosis for all cases in this study. To provide an estimate of the relative sensitivity using cytology as a gold standard, we assumed that every abnormal result was a true positive and again calculated net and relative sensitivity.

Throughout this study, each slide was counted as 1 slide regardless of how it was screened. Only SurePath (Becton Dickinson) and ThinPrep slides were included in the analysis.

Workload was calculated as follows. For the TIS and GS studies, the average hours screened per day were reported in the studies. The number of slides screened per hour was calculated by taking the extrapolated slides screened for an 8-hour day and dividing by 8. Minutes per slide were calculated by dividing 60 by the average slides per hour. Total slides screened per day were calculated by multiplying the average slides per hour by the average hours screened per day. For the MAVARIC study, the average number of hours screened per day and the average minutes per slide were reported; total slides per day were calculated by dividing the screening time per day by the minutes per slide.
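The workload arithmetic for the TIS and GS studies can be sketched as follows; the figures are hypothetical illustrations, not values from any of the study sites.

```python
def workload_metrics(slides_per_8h_day, hours_screened_per_day):
    """Derive slides/hour, minutes/slide, and slides/day from the
    extrapolated 8-hour slide count and the reported hours screened."""
    slides_per_hour = slides_per_8h_day / 8
    minutes_per_slide = 60 / slides_per_hour
    slides_per_day = slides_per_hour * hours_screened_per_day
    return slides_per_hour, minutes_per_slide, slides_per_day

# Hypothetical site: 80 slides extrapolated to an 8-hour day, 5 hours screened
sph, mps, spd = workload_metrics(80, 5)
# → 10 slides/h, 6 min/slide, 50 slides/day
```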

Although no other workload data for manual screening are available from trials of automation, a body of work describes workload and sensitivity using rapid prescreening as a gold standard.8–10 This includes data for more than 40 individual cytotechnologists (see the individual studies for details). Although differences in the gold standard for determining sensitivity may limit the conclusions we can draw from these data, we thought it would be instructive to at least compare the data in this previous work with that derived in the current article to see whether they are at least not inconsistent. To do this, we plotted all of the data points for both studies and compared the linear (Spearman) correlation determined using only the data from rapid prescreening with that using all of the data combined. To compare the current estimates of sensitivity and manual screening workloads with these prior results, we set the absolute sensitivity of screening using automated devices at 85%,1 and we used the net difference in sensitivity of each manual arm from the automated arm to determine an absolute sensitivity.

Results

Data are summarized as follows. Table 1 shows descriptive data of the cases included in each study. Table 2 shows workload for each study. Table 3 reports sensitivity for each study site. Table 4 reports relative sensitivity and CIs for grouped sites based on specific characteristics.

In the MAVARIC study, there was no final consensus cytology diagnosis for all cases in this study. To address this, in Table 3 we assumed that every abnormal result was a true positive and again calculated net and relative sensitivity. As a result, the cytology gold standard showed a –7.4% decrease in net sensitivity for a diagnosis of moderate squamous dysplasia and above using automation (relative risk, 0.94), which compares well with the histologic gold standard of –8.0% for CIN2+. This method estimated a –26.0% decrease in net sensitivity for a diagnosis of any abnormality (equivalent to ASC+) (relative risk, 0.80) using automation.

There was significant overlap in the 95% CIs of relative sensitivity for each site at both the ASC+ and HSIL+ thresholds. For ASC+, only the MAVARIC study was significantly different at the 95% CI level. No 95% CI failed to overlap with at least 1 other site at the threshold of HSIL+. When the TIS sites were grouped together and compared with the grouped GS sites (see Table 4), there were again no significant differences at either ASC+ (relative sensitivity, 1.09 [95% CI, 1.05–1.13] vs 1.00 [95% CI, 0.96–1.06] for TIS and GS, respectively) or HSIL+ (relative sensitivity, 1.09 [95% CI, 1.00–1.20] vs 1.26 [95% CI, 1.06–1.50] for TIS and GS, respectively).

Overall, there was at least a 10% improvement in net sensitivity using automation in 5 of 8 sites using either ASC+ or HSIL+ as a threshold (T2, T3, T4, F2, F4). This included 2 sites using ASC+ as a threshold (T2, T3) and 4 sites using HSIL+ as a threshold (T3, T4, F2, F4).

When manual screening was restricted to less than 6 hours per day, 50 or fewer slides per day, and at least 6 minutes per slide (<10 slides/h) (F1 and M), no improvement of more than 10% in net sensitivity was identified with automated screening. Relative sensitivity for this group was significantly lower for ASC+ (0.81; 95% CI, 0.79–0.83) than for the remaining groups (1.07; 95% CI, 1.03–1.10; see Table 4). The difference in relative sensitivity for HSIL+ was smaller but still significant (1.00 [95% CI, 0.98–1.02] vs 1.19 [95% CI, 1.10–1.29]; see Table 4).

All 3 sites that screened 10 or more slides per hour (T3, F2, F4) had at least a 10% improvement in net sensitivity for HSIL+ using automation. The relative sensitivity for HSIL+ was significantly higher for this group than for all other sites combined (1.40 [95% CI, 1.22–1.60] vs 0.97 [95% CI, 0.95–1.00]; see Table 4). The relative sensitivity for ASC+ was lower but still significantly higher than for all other groups combined (1.06 [95% CI, 1.02–1.11] vs 0.87 [95% CI, 0.85–0.88]).

Previous studies using rapid prescreening to measure sensitivity, including more than 40 individual cytotechnologists, have shown a linear correlation of y = –0.329x + 99.8.8–10 When absolute sensitivity from the current article is added to that of the prior studies, the equation for the linear correlation remains very similar (y = –0.311x + 99.6; Figure 1), despite the difference in method used to determine sensitivity and the range of workloads examined.

Figure 1

Sensitivity of screening vs manual workload for gynecologic cytology using data from automated trials. y = −0.3113x + 99.649; R2 = 0.4146.
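As a sketch of how a pooled fit of this kind can be reproduced, an ordinary least-squares line can be fit to (workload, sensitivity) pairs. The points below are hypothetical values placed near the published pooled line for illustration only; they are not the study data.

```python
import numpy as np

# Hypothetical (slides/day, sensitivity %) points lying near the
# published pooled fit y = -0.311x + 99.6 (illustration only)
workload = np.array([30.0, 41.0, 50.0, 70.0, 82.0, 100.0])
sensitivity = np.array([90.3, 86.9, 84.1, 77.9, 74.1, 68.5])

# Degree-1 polyfit returns (slope, intercept) of the least-squares line
slope, intercept = np.polyfit(workload, sensitivity, 1)
```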

Discussion

Results of the previous large prospective studies using TIS and GS systems have been inconsistent. TIS showed increased sensitivity at the ASC+ threshold but not at HSIL+, whereas GS showed increased sensitivity at the HSIL+ threshold but not at ASC+.1,2 On the other hand, more recent results from the MAVARIC study showed that neither one improved sensitivity over manual screening, and no significant difference could be demonstrated between the 2 devices, although the study was underpowered for this last conclusion.4

In the current study, we report combined data from all 3 studies and reached 4 main conclusions. First, we are unable to show a significant difference between the TIS and GS instruments. Second, manual workloads restricted to less than 6 hours per day and 50 or fewer slides per day were significantly more sensitive than all other workloads combined at both ASC+ and HSIL+, but to a greater degree at ASC+. Third, when manual screening was performed in less than 6.0 minutes per slide, there was a significant decrease in sensitivity compared with automation at both HSIL+ and ASC+, but more so at HSIL+. Finally, the performance of manual screening in these studies is not incompatible with the performance demonstrated using rapid prescreening.

Overall, our analysis is consistent with the contention of the MAVARIC study that there is no significant difference in the performance of the 2 automated screening devices, and that neither device significantly improves sensitivity when compared with manual screening performed at the highly restricted workloads used in some sites in these trials. However, our analysis strongly suggests that the sensitivity of manual screening is highly dependent on workload. For automation to show improvement, there must be a deficit of sensitivity in the manual arm; if the manual arm is highly sensitive and misses few cases, it will be difficult for automation to improve upon it. Our data show that increased overall sensitivity (as shown by an increase in ASC+) is achieved when manual screening is limited to 50 or fewer slides per day and 6 hours or less of screening per day; automation cannot improve upon manual screening at these workloads. In addition, an improvement in classification (as shown by an increase in HSIL+ with or without increased ASC+) is achieved when an average manual screening speed of less than 10 slides per hour is used.

Other studies have demonstrated that a sensitivity of 95% to 100% can be reached with manual screening, using rapid prescreening as a gold standard, with more restricted workload limits (limiting workload to 30 slides/d).5 This is in contrast to studies performed with automated devices, which have an absolute sensitivity estimated to be approximately 80% to 85%.1,2 This is supported by data from the MAVARIC study, which demonstrated that manual workloads of less than 41 slides per day and screening for 4.5 hours or less per day performed significantly better than automated screening devices. Whether the cost of this intensive manual screening is worth the increased sensitivity,11 especially when alternatives such as human papillomavirus testing are available,12,13 is not clear. For example, because one of the major costs of gynecologic screening is the cost of cytotechnologist screening time, this change in workload may have a major impact on the cost of the test. If a cytotechnologist who routinely screens 82 slides per day is forced to screen only 41 slides per day, the cost of his or her services doubles. Reducing the workload to only 30 slides per day would increase this cost even more. However, recent recommendations have suggested extended intervals for Papanicolaou (Pap) smears, and it is possible that a more sensitive Pap smear may allow these intervals to be extended further, perhaps reducing the overall cost.

Our analysis suggests that a very likely cause for the differences in performance in these 3 trials is differences in the performance of the manual arms rather than the automation arms. However, this does not mean that the automated arms are not affected by workload. Indeed, previous studies have clearly shown that automated devices are affected by workload.9 Nevertheless, assuming that the cytotechnologists involved in this study did not screen additional cases on the same day that they screened for the studies, we note that the actual workload in many of the automated arms in these studies was relatively light, the majority of sites involved in these studies screened for 6 hours or less per day (7 of 8 sites), and the vast majority of the latter screened for 5.1 hours or less per day (6 of 7 sites) (Table 2). Other studies using longer periods of screening hours per day showed significant decreases in screening performance with these devices.14,15 This suggests that screening using automated devices may be affected in the same way as manual screening (ie, sensitivity begins to drop after 4 or 5 hours of screening). In addition, other factors that may contribute to decreased sensitivity of automated devices, including epithelial cell abnormality–adjusted workload rate9,16,17 and time of day,18 were not addressed in the current study. Further studies appear warranted.

The results from various studies on workload led to the recent consensus guidelines on workload released by the American Society of Cytopathology (ASC), which recommended that cytotechnologists’ gynecologic screening workday should not exceed 8 hours, including breaks totaling 1 hour, and that laboratory average cytotechnologist workload should not exceed 70 slides per day (slides counted per FDA 2011 bulletin recommendation; ie, imaged-only slides count as “one-half” and manually screened slides count as “1”).19 Although there remain rare occasions in which an individual cytotechnologist or laboratory may be able to perform well despite increasing workload, these are exceptions and are not well documented; hence, ASC recommendations were suggested as a “laboratory average” rather than an individual quota.

It is well known that screening a slide requires 2 separate skills—location-finding skills to identify an existing abnormality on a slide and classification skills to interpret and correctly classify the abnormality. Our analysis suggests that location finding (increased ASC+) may be more strongly associated with time spent screening per day, whereas classification skills (increased HSIL+ with or without increased ASC+) may depend on the length of time spent on an individual case. Further study of these features appears warranted.

Finally, it is possible that other factors may also be important in explaining the differences among these 3 trials. There are significant differences in how cases were screened in the MAVARIC trial compared with the other trials and with practice in the United States in general. It is unclear whether the MAVARIC trial used the 15% quality control review that was an integral part of the GS trial. On the other hand, the MAVARIC trial did make successful use of the "no further review" function (in which a portion of cases are signed out as negative without manual screening), which was not used in the GS trial. In addition, rapid rescreen was an important part of the MAVARIC trial but was not used in either the TIS or the GS trial, although the impact of this appears to have been relatively small. The end point in the MAVARIC trial was a histologic diagnosis of CIN2+ rather than a cytologic consensus diagnosis. Finally, the MAVARIC trial was performed in a setting with much more restrictive workload limits, both within the trial itself and for screening in general. All of these features make the MAVARIC trial unique.

In conclusion, we have shown that improvements in screening sensitivity in the 3 prospective study trials of automated screening devices are strongly correlated with differences in workload in the manual arms of those trials. Although the performance of the automated devices is similar to that of manual screening using moderately restricted workload limits, more severely restricted workloads allow manual screening to perform significantly better than automated screening.

References
