Wide inter-reader variations exist in cancer diagnosis with mammography. The retrospective identification of missed cancers and the inter-reader variations in cancer diagnosis reported both clinically and in experimental studies demonstrate that human factors are a major limitation to consistent outcomes for imaging modalities. Inter-reader variation and its associated errors can result in false negatives, false positives and over-diagnosis. False negative diagnosis prevents early detection and treatment of cancer, which may negatively affect survival outcomes. False positive diagnosis has been shown to cause patient anxiety and results in additional examination and cost. Over-diagnosis of the disease may result in overtreatment, which may further expose patients to risk from ionizing radiation and treatment. Evidence is available that early diagnosis of breast cancer is associated with a 30% - 40% reduction in mortality from the disease. Improving reader performance may also reduce recommendations for further diagnostic work-up such as additional imaging and biopsy and lower the cost of screening for breast cancer. Therefore, it is important to identify strategies to improve early detection and characterization of breast cancer with mammograms, and to improve reader performance in the diagnosis of the disease. The current work aims to identify factors that may improve reader performance and potentially improve the ability of radiologists to detect and characterize lesions on mammograms.
The literature demonstrates a considerable degree of radiologist error and inter-radiologist variability in mammography interpretation. Studies have shown that the proportion of breast cancers missed on mammography ranges from 1.3% to 39%. Depending on the type and radiographic presentation of cancer, error rates may increase to 45%, and errors are common with subtle mammographic lesions such as architectural distortion. Furthermore, some lesions may be visible in a mammogram and seen by radiologists, but may be overlooked because they are atypical. Thus, substantial proportions of missed or unreported malignant lesions can be seen on mammograms retrospectively. Even when malignant lesions are visible, some breast readers dismiss them due to insufficient prompts generated by such lesions or variability in the knowledge and perceptions of readers with regard to the prompts. Therefore, reader errors arise not only because of inadequate search, but also due to perceptual and decision-making errors. Thus, the variability in search, perceptual, and decision-making patterns of radiologists may also be responsible for the wide inter-reader variability in the detection of potentially visible breast cancer and its characterization as benign or malignant.
Inter-reader variability in mammography interpretation has been shown to be a global phenomenon, and underlines the need for practical approaches to improve cancer detection using mammography, including technological factors, reader characteristics, and other interventions. An understanding of parameters that limit breast cancer detection with mammography and ways of improving mammography performance may be crucial to reducing false positive and false negative diagnoses as well as inter-reader variability. This will in turn facilitate early treatment and further reduce mortality from the disease. Whilst previous research investigated the relationship between radiologists’ performance and reader characteristics in the UK, the USA and Australia, this work will measure for the first time Jordanian reader performance in reading mammograms and will determine whether the key reader characteristics that increase the detection of breast cancer are the same as previously reported. The data should contribute insights towards improving the service women receive and help reduce radiology reporting variability in the future.
Institutional ethics review board approval was obtained (Grant No. 20170326). This study was conducted in Amman, Jordan.
2.1. Image Set
The test set comprised 60 mammographic cases with a total of 240 images. Each case consisted of four images: craniocaudal (CC) and mediolateral oblique (MLO) projections of the left and right breasts.
Twenty of the cases had biopsy-proven cancer, either ductal carcinoma in situ or invasive cancer, with four of these cases containing multiple lesions. The remaining forty cases were normal, as confirmed by follow-up mammograms produced two years later. The normal cases contained incidental benign findings including calcified duct ectasia, calcified oil cysts, benign calcified fibroadenoma and intramammary lymph nodes.
2.2. Radiologist’s Experience Details
A total group of 27 board-certified radiologists randomly participated in this study. Self-reported experience parameters including age, number of years since qualification as a radiologist, number of years reading mammograms, number of mammograms read per year and number of hours reading mammograms per week were recorded (Table 1).
2.3. Test Environment
Radiologists interpreted the images in a 180 m2 room with walls painted in light grey and brown matte colours to minimize specular reflection. A built-in Integrated Front Sensor (IFS) measures brightness and grayscale tones to calibrate to DICOM Part 14. A calibrated photometer (Model Konica Minolta CL-200, Ramsey, NJ) was used to assess ambient light, which was maintained at around 20 - 30 lux. Specifications of the workstation used for the work, such as monitor model, size, video card and calibration, are described in Table 2.
Table 1. Mean and standard deviation (SD) of years certified, years reading mammograms, number of mammograms per year, number of mammograms per week, and other-modality score, along with upper and lower 95% CI of the mean.
2.4. Study Description
Radiologists were asked to localize and assess breast abnormalities according to the BI-RADS assessment categories used in Australia. The software platform used in the test was the BreastScreen Reader Assessment Strategy (BREAST), which permits reading of digital images, marking of lesion location and assignment of an assessment category to breast lesions. The assessment categorization involved giving any perceived lesion a score of 2 (benign), 3 (equivocal), 4 (suspicious) or 5 (malignant). No information concerning the number of abnormal or normal cases was provided, and the test software was explained to all radiologists before commencing the test. No time limit was imposed for the assessment of images, and radiologists could freely access the panning, zooming and windowing post-processing tools. After a decision had been reached, radiologists located any perceived lesion, using a mouse-controlled cursor, on a laptop that simultaneously presented the same image as the one displayed on the high-resolution monitors. If the decision about the case was “normal”, radiologists could simply click on “next case” and the category score 1 (negative) would automatically be recorded for this case.
The web-based software provided general instructions on the process of reviewing, lesion marking and rating of the mammograms. Information on the confidence level ratings to be used in the study was also provided to the readers. A short survey was included as part of the software to gather some general details on the participants’ demographics and clinical involvement. An overall demonstration of the software was given to each reader before the start of any readings. This platform allows radiologists to assess a mammographic test set and obtain feedback on their performance, with the radiologists’ correct decisions and errors matched against the truth, as shown in Figure 1.
Table 2. Workstation specifications.
2.5. Data Analysis
The numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for each reader were counted. Sensitivity and specificity were then calculated. Sensitivity was calculated by dividing the number of TPs by the sum of TPs and FNs (TP/(TP + FN)). Specificity was calculated as a ratio of TN and the sum of FP and TN (TN/(FP + TN)). We also calculated location sensitivity (the proportion of true positives marked in the correct location as defined by a 75 pixel radius from the centre of the lesion). Jackknife Alternative Free-Response Receiver Operating Characteristic (JAFROC) software (Version 4.1) was used to calculate JAFROC figure of merit (FOM) values. A power analysis showed that with the sample size used in this study (60 cases and 27 radiologists) the detectable differences were 0.04, 0.07, and 0.05 for JAFROC, location sensitivity, and specificity, respectively, at 80% power.
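The per-reader metrics defined above can be illustrated with a minimal Python sketch. The function names and the example counts are hypothetical and are not taken from the study data. Treating a mark at exactly 75 pixels as a correct localization is also an assumption, since the text does not say whether the boundary is included.

```python
import math


def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Specificity = TN / (FP + TN)."""
    return tn / (fp + tn)


def is_correct_location(mark_xy, lesion_xy, radius_px=75):
    """True if a reader's mark lies within radius_px of the lesion centre.

    Assumption: the boundary (distance == radius_px) counts as correct.
    """
    dx = mark_xy[0] - lesion_xy[0]
    dy = mark_xy[1] - lesion_xy[1]
    return math.hypot(dx, dy) <= radius_px


# Hypothetical counts for one reader on a 20-cancer / 40-normal test set.
tp, fn, tn, fp = 15, 5, 32, 8
print(sensitivity(tp, fn))  # 0.75
print(specificity(tn, fp))  # 0.8
```

Location sensitivity would then be the proportion of true positives for which `is_correct_location` returns `True`.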
Radiologists’ performance was calculated using the previous metrics and correlated against key reader characteristics such as experience, qualifications, frequency of reading other modalities per week, and breast reading practices using Spearman rank correlation. Further analysis included a stepwise linear regression to assess the independent effect on JAFROC scores of the experience parameters found to be significant.
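As an illustration of the Spearman analysis, the sketch below computes the rank correlation coefficient using only the Python standard library. The study itself used SPSS; this is not the authors' implementation, and the function names and sample values are hypothetical. Ties are given average ranks, and the coefficient is the Pearson correlation of the two rank vectors.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical readers: mammograms read per year versus JAFROC score.
volume = [200, 350, 500, 800, 1200]
jafroc = [0.70, 0.74, 0.73, 0.81, 0.85]
print(spearman_rho(volume, jafroc))
```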
Additional analyses were performed to further assess key characteristics for specific mammographic reading volumes by categorizing readers into two subgroups on the basis of the number of mammographic readings per year: fewer than 500, and 500 or more. JAFROC, location sensitivity, and specificity values for the two subgroups were compared using the t-test.
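The subgroup comparison can be sketched as follows. The text states only that a t-test was used, so the pooled-variance (Student) independent-samples form shown here is an assumption, as is placing readers with exactly 500 readings in the higher-volume group. All names and values are hypothetical, and the sketch returns the test statistic and degrees of freedom without computing a p-value.

```python
from statistics import mean, variance


def split_by_volume(readers, threshold=500):
    """Split (volume_per_year, score) pairs into low and high volume groups.

    Assumption: readers at exactly the threshold fall in the high group.
    """
    low = [score for vol, score in readers if vol < threshold]
    high = [score for vol, score in readers if vol >= threshold]
    return low, high


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    std_err = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (mean(a) - mean(b)) / std_err, df


# Hypothetical (readings per year, JAFROC score) pairs.
readers = [(200, 0.70), (350, 0.74), (450, 0.72),
           (500, 0.80), (800, 0.81), (1200, 0.85)]
low, high = split_by_volume(readers)
t_stat, df = pooled_t_statistic(low, high)
print(t_stat, df)
```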
All statistical analyses were performed using IBM SPSS Statistics (version 22.0 for Mac; SPSS). Results were considered statistically significant when the p-value was ≤0.05.
Figure 1. Example of BREAST interface showing readers’ selection and the true location of cancer within the breast.
Mean JAFROC, location sensitivity and specificity scores across all 27 readers are shown in Table 3, along with upper and lower 95% confidence intervals of the mean.
Higher performance in terms of JAFROC scores was directly related to the number of years since professional qualification (r = 0.433; p = 0.024), the number of years reading breast images (r = 0.62; p = 0.001) and the number of mammography images read per year (r = 0.69; p = 0.001). On the other hand, higher performance was inversely linked to the frequency of reading other modalities per week (r = −0.48; p = 0.010). No other correlations were statistically significant (Table 4).
For JAFROC, the stepwise regression revealed that the combination of a positive predictor, the number of mammography images read per year (r2 = 0.416, p = 0.001), and a negative predictor, the frequency of reading other modalities per week (r2 = −0.608, p = 0.008), was more accurately predictive of JAFROC than was either variable alone. The regression equation was JAFROC = 0.780 + (0.009 × Y) − (0.003 × H), where Y is the number of mammograms read per year and H is the frequency of reading other modalities per week.
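The fitted line can be evaluated with a short Python sketch using the intercept and coefficients reported above. The text does not state the units or coding of Y and H, so the inputs below are purely illustrative and the function name is hypothetical.

```python
def predicted_jafroc(y, h):
    """Evaluate the reported regression line for JAFROC.

    y: mammograms read per year, in the (unstated) units used by the authors.
    h: frequency of reading other modalities per week, same caveat.
    """
    return 0.780 + 0.009 * y - 0.003 * h


# Illustrative inputs only; they are not values from the study.
print(predicted_jafroc(2, 1))
```

With both predictors set to zero the function returns the intercept, 0.780. Each unit increase in Y raises the prediction by 0.009 and each unit increase in H lowers it by 0.003.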
Compared with the 14 (52%) readers who always maintained a total interpretive volume of at least 500 mammograms per year, the 13 (48%) readers who consistently had volume less than 500 mammograms per year experienced an 11% reduction in JAFROC (Table 5).
Table 3. Mean TPs, specificity, JAFROC, and location sensitivity scores along with upper and lower 95% CI of the mean.
Table 4. Spearman correlation analysis of JAFROC, location sensitivity and specificity values with reader parameters. r values are shown in the table and p values are given in parentheses. Values shown in bold font are statistically significant.
Table 5. Correlation analysis of JAFROC, location sensitivity, and specificity values for radiologists reading fewer than and more than 500 mammograms per year (national requirement).
The variability in radiologists’ performance when reading mammograms is a concern across both screening and diagnostic mammography. Identifying causal factors for this variability is a first step towards optimising diagnostic efficacy. It is generally accepted that the experience of the radiologist is a determinant of performance. Training, number of years since qualification, years of interpreting mammograms and/or the number of mammograms read per year have been used as criteria for assessing radiologists’ performance. Many studies have assessed the impact of volume read per year on cancer detection, with conflicting outcomes. Some studies have shown that a higher volume read per year increases performance and has potential for the optimization of screening mammography programs. Other studies have reported decreased or unchanged radiologist performance irrespective of the volume read per year.
The current work investigated variations in diagnostic accuracy among readers who are currently involved in reporting breast images in Jordan. Higher levels of reader performance were found to be linked to the number of years as a certified radiologist, years of experience and hours of reading per week.
The results of this work show that, although the number of cases read per year increased the ability of radiologists to correctly detect cancer in mammograms, it did not prevent them from making false positive errors (reporting the presence of cancer where there is none). It has been shown that perception of cancer and diagnostic decision-making rely on the reader’s previous knowledge and experience. Therefore, improvement in sensitivity could be attributed to increased exposure to a wide range of mammographic features of cancer from the increased number of cases read. The heterogeneity of the breast parenchyma and the mimicking of cancer by normal tissue may be partly implicated in the higher number of false positives. Because of the medicolegal implications of false-negative diagnosis, radiologists tend to report perturbations in the mammogram that may be suspicious for cancer and so increase their recall rates. Additionally, radiologists who participated in this study were assessed in a “laboratory” setting, not in their normal clinical setting. In such a setting, radiologists tend to expect more abnormal cases, prompting them to interpret suspicious-looking normal parenchymal perturbations as cancer.
Although we found that increases in the number of mammography images read per year are associated with higher performance, previous work has shown widely varying results. Such discrepancies in findings may be explained, at least in part, by the different methods employed. In addition, most studies were based on a selected sample of radiologists or excluded some radiologists on the basis of their experience or their volume. Finally, some studies did not adjust for potential confounders such as ambient light and viewing conditions. The US and Canada have similar interpretive volume requirements of at least 480 mammograms per year. Our results provide evidence in support of this annual requirement.
The number of mammography images read per year is also associated with improvement in location sensitivity. Whether mammography screening accuracy is affected by the degree of a radiologist’s involvement in the diagnostic investigation of abnormal screening mammograms, including imaging and biopsies, is an important question in need of further study.
One educational project that offers readers educational experiences and feedback is BREAST. The matching of errors against the truth provides radiologists with feedback about the nature of lesions missed. They can review their correct and incorrect cases and thus learn from the feedback. This may in turn facilitate tailoring of training regimens to improve mammographic interpretation performance. Feedback on performance is also very useful to employers, enabling them to identify areas where their employees need further education. Previous research has demonstrated that test-set reading interventions like BREAST are useful as they provide immediate feedback on correct and incorrect diagnoses. It is hoped that multiple interventions like BREAST will significantly improve the accuracy of mammography interpretation.
It should be acknowledged that there are limitations in the work. Firstly, the number of cases assessed was relatively small, and the case mix was not typical of a screening environment, having many more abnormal cases than would normally be expected. Also, prior cases were not included, which could have had some influence on the results.
In summary, radiologists’ performance improves with an increasing number of mammograms read per year and with a greater focus of their duties on mammogram reading. Interventional educational programs, such as BREAST, could be applied to compensate for low reading volumes and help to expand the radiological skills necessary to accurately identify breast patterns and lesions. The results have potential implications for breast screening efficacy and women’s anxiety.
1) Appreciation and thanks are due to Jordan University of Science and Technology for their research grant (Grant No. 20170326). 2) The authors acknowledge the kind support of the BreastScreen Reader Assessment Strategy (BREAST) for providing a platform. 3) The authors would like to thank Jordan Breast Cancer Program for their support towards the completion of this study.