Journal of Cytology

Year : 2022  |  Volume : 39  |  Issue : 3  |  Page : 110--115

Diagnostic accuracy and agreement between inter- and intra-observers in the absence of clinical history for liquid-based preparation of gynecology specimens

Nur Amirah Roslan1, Mohd Nazri Abu1, Farid Ridhuan Ismail2
1 Centre for Medical Laboratory Technology Studies, Faculty of Health Sciences, Universiti Teknologi MARA, Selangor Branch, Puncak Alam Campus, 42300 Puncak Alam, Selangor, Malaysia
2 Malaysia Ministry of Health Training Institute Kuala Lumpur (Medical Laboratory Technology), D/A Institute Medical Research, Jalan Pahang, 50588 Kuala Lumpur, Malaysia

Correspondence Address:
Dr. Mohd Nazri Abu
Centre for Medical Laboratory Technology Studies, Faculty of Health Sciences, Universiti Teknologi MARA, Selangor Branch, Puncak Alam Campus, 42300 Puncak Alam, Selangor


Context: The clinical history in cytology is the best source of information to ensure the accuracy of diagnosis, helping a slide observer to interpret and relate their findings when screening gynecology slides. Aims: This study aims to evaluate the performance of slide observers in blindly screening gynecology slides without any information on clinical history. Setting and Design: A correlational study was conducted at the cytology laboratory, Universiti Teknologi MARA Selangor, Puncak Alam Campus. Methods and Materials: Fifty-seven liquid-based preparation (LBP) slides from gynecology specimens were screened blindly by five slide observers, all Medical Laboratory Technology students who had completed the cytology course. Statistical Analysis Used: Inter- and intra-observer reliability were measured using Fleiss' kappa and Cohen's kappa, respectively, while the diagnostic accuracy without a clinical history was determined by the receiver operating characteristic (ROC) curve. Results: The value of Fleiss' kappa (κ) was 0.221, which represents a fair strength of inter-observer agreement. Intra-observer reliability for each slide observer was analyzed using Cohen's kappa statistic; the kappa value varied between 0.116 and 0.696, indicating slight-to-substantial intra-observer agreement. Additionally, a sensitivity of 94.28%, a specificity of 72.40%, a positive predictive value (PPV) of 37.28%, a negative predictive value (NPV) of 72.40%, a likelihood ratio of 14.43, and a diagnostic accuracy of 75.09% were recorded. Conclusions: The students (slide observers) from the Centre of Medical Laboratory Technology Studies who took part in this study were able to interpret, classify, and diagnose the LBP gynecologic cytopathological cases into several categories (negative for intraepithelial lesion or malignancy [NILM] and epithelial cell abnormalities [ECA]) based on the 2001 Bethesda System reporting guideline.

How to cite this article:
Roslan NA, Abu MN, Ismail FR. Diagnostic accuracy and agreement between inter- and intra-observers in the absence of clinical history for liquid-based preparation of gynecology specimens. J Cytol 2022;39:110-115

How to cite this URL:
Roslan NA, Abu MN, Ismail FR. Diagnostic accuracy and agreement between inter- and intra-observers in the absence of clinical history for liquid-based preparation of gynecology specimens. J Cytol [serial online] 2022 [cited 2022 Nov 26 ];39:110-115
Available from:



Introduction

The common practice of slide observers in the cytology laboratory is to ensure that clinical history, such as hormonal status, treatment history, and cervical condition, is received together with the slides in the laboratory.[1] The clinical history and patient demographic data are related to the laboratory findings to obtain accurate results.[2] A previous study revealed that the absence of clinical history in cytological interpretation leads to lower diagnostic accuracy because the slide observer underdiagnoses malignant lesions.[3] Precision and accuracy are determined by comparing inter- and intra-observer accuracy and reliability.[4] Intra-observer observations are those made by the same observer.[4],[5]

In contrast, inter-observer observations are those made by different observers.[4],[5] This study assessed the capacity of new screeners to produce an accurate diagnosis through their experience and abilities in order to determine their reliability.[6] Unavoidable variations due to inter- and intra-observer differences may be regarded as an inherent feature of the reporting system.[7] Failure to accurately measure the variation seen in a study can lead to erroneous interpretation, which in turn affects the reporting laboratory's performance quality and patient treatment.[8],[9]

The Centre for Medical Laboratory Technology Studies, Faculty of Health Sciences, Universiti Teknologi MARA (UiTM), received donated slides for teaching purposes from the local government hospital without complete documentation, such as demographic data and clinical history.[5]

In this study, the students were instructed to blindly screen gynecology slides without any knowledge of clinical history in order to mimic the current practice in the teaching laboratory. Hence, the effect of the absence of clinical history on diagnostic accuracy and on inter- and intra-observer agreement in gynecology specimens was determined.

Subjects and Methods

Cases selection

Fifty-seven liquid-based preparation (LBP) slides from gynecology specimens, already stained with Papanicolaou (Pap) stain, were selected randomly from the cytology slide storage at the Universiti Teknologi MARA Selangor Puncak Alam campus. The slide selection criteria were an unbroken slide and a satisfactory LBP smear. Each slide was relabeled randomly with a number from 1 to 57.

Selection of slide observer

All screeners participated voluntarily and were chosen based on the inclusion criteria. The study was approved by the Research Ethics Committee, Institute of Research Management and Innovation (RMI), UiTM (reference no: 600-RMI (5/1/16)). Five slide observers were chosen randomly to screen the gynecology slides.[1],[5] Since this study required screening LBP slides without clinical history, the slide observers were required to have completed the cytology course. Moreover, the slide observers had to be able to recognize cellular morphology characteristics and to have at least one year of experience in screening cytology slides.

Screening session

Each observer blindly screened all LBP slides using the 2001 Bethesda System and classified each slide into one of the categories on a standardized form, as shown in [Table 1]: (35) negative for intraepithelial lesion or malignancy (NILM); (9) fungal organisms morphologically consistent with Candida spp.; (7) shift in flora suggestive of bacterial vaginosis; (2) atypical squamous cells of undetermined significance (ASC-US); (1) low-grade squamous intraepithelial lesion (LSIL); (3) squamous cell carcinoma (SCC); and (1) atypical glandular cells (AGC). The screening comprised two parts, sessions A and B, using the same 57 gynecology specimens. Screening session A was conducted to obtain the agreement between slide observers for the inter-observer reliability test, whereas screening session B was conducted to obtain results from the same slide observers on a different occasion for the intra-observer reliability test. The data were then tabulated by category and compared with the actual reference result to determine inter-observer and intra-observer reliability and diagnostic accuracy.[10],[11]

Statistical analysis

The data were analyzed using Statistical Package for the Social Sciences (SPSS) version 25.0 software for Windows (IBM Corp, Armonk, New York, USA).[3] Fleiss' kappa was used to evaluate inter-observer reliability, while Cohen's kappa was used to evaluate intra-observer reliability, that is, agreement on categorical scales in diagnosing gynecology cases based on the cellular morphological characteristics of the LBP slides. Moreover, a receiver operating characteristic (ROC) curve was constructed, and the area under the curve was measured, to determine the diagnostic accuracy of the slide observers' interpretation of gynecology cases without clinical history.{Table 1}
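For readers who wish to reproduce the two agreement statistics outside SPSS, the following is a minimal sketch of their textbook definitions in plain Python. It is not the SPSS procedure used in this study. The function names and the toy category labels are illustrative assumptions, and no study data are reproduced.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two rating sequences over the same cases
    (e.g. one observer's session A and session B diagnoses)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ratings_a)
    # Observed agreement: share of cases given the same category both times.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from the marginal category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) when p_e == 1, i.e. a single category only.
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings_per_case):
    """Fleiss' kappa; each inner list holds one category label per observer."""
    n_cases = len(ratings_per_case)
    n_raters = len(ratings_per_case[0])
    if n_raters < 2 or any(len(c) != n_raters for c in ratings_per_case):
        raise ValueError("every case needs the same number (>= 2) of ratings")
    category_totals = Counter()
    p_bar = 0.0
    for case in ratings_per_case:
        counts = Counter(case)
        category_totals.update(counts)
        # Proportion of agreeing observer pairs for this case.
        agreeing = sum(k * k for k in counts.values()) - n_raters
        p_bar += agreeing / (n_raters * (n_raters - 1))
    p_bar /= n_cases
    # Chance agreement from the overall category proportions.
    total = n_cases * n_raters
    p_e = sum((t / total) ** 2 for t in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)


# Toy, hypothetical labels (not study data).
session_a = ["NILM", "NILM", "ASC-US", "LSIL"]
session_b = ["NILM", "ASC-US", "ASC-US", "LSIL"]
print(round(cohen_kappa(session_a, session_b), 3))
print(round(fleiss_kappa([["A", "A", "B"], ["A", "B", "B"]]), 3))
```

In this sketch, Cohen's kappa compares one observer's two sessions, whereas Fleiss' kappa takes all observers' ratings for each case from a single session.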


Results

The total number of cases diagnosed by the five slide observers in the interpretation of gynecology specimens is shown in [Table 2]. Unanimous agreement among all slide observers was reached in 7/57 cases (12.28%). Slide observers disagreed within one category in 12 cases (21.05%). In 38 cases (66.67%), the disagreement among the five slide observers spanned more than one category. Fleiss' kappa was calculated to determine the level of agreement between the five slide observers in diagnosing gynecology cases on LBP for screening session A. The value of Fleiss' kappa (κ) = 0.221 represents a fair strength of inter-observer agreement in classifying cytologic diagnosis in gynecology specimens. Intra-observer reliability for each slide observer was analyzed using Cohen's kappa statistic and interpreted according to the range of agreement strengths tabulated in [Table 3]. The data were collected from screening sessions A and B, where the same cases were given for diagnosis but on different occasions.{Table 2}{Table 3}

The calculated values of the operating parameters used in this study for all slide observers when clinical history was not provided are shown in [Table 4]. An increase in sensitivity was accompanied by a decrease in specificity, as the two parameters are inversely related. Furthermore, the likelihood ratio (LR) value indicates the increase from pre-test to post-test probability. The interpretation of gynecology cases on LBP without clinical history was statistically significant (P < 0.01) for all slide observers.{Table 4}
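The operating parameters named above follow from a two-by-two table of observer diagnosis against reference result. The sketch below uses the standard definitions, and the positive likelihood ratio is taken as sensitivity / (1 - specificity). The function name and the example counts are hypothetical and are not taken from [Table 4].

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard operating parameters from a 2x2 table.

    tp/fn: reference-abnormal cases called abnormal/normal by the observer.
    tn/fp: reference-normal cases called normal/abnormal by the observer.
    """
    def ratio(num, den):
        # Return NaN instead of raising when a row or column is empty.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        # Positive likelihood ratio: how much a positive call raises the odds.
        "lr_positive": ratio(sensitivity, 1 - specificity),
    }


# Hypothetical counts for one observer over 57 slides (not study data).
example = diagnostic_metrics(tp=6, fp=14, tn=36, fn=1)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With few abnormal cases, as in this hypothetical table, a high sensitivity can coexist with a low PPV because false positives outnumber true positives.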

The overall performance of all slide observers, with a reference line, is shown in [Figure 1]. The area under each colored line indicates an individual observer's diagnostic accuracy in interpreting gynecology cases on LBP without knowledge of clinical history. The closer the curve follows the left-hand border and then the top border of the ROC space, the higher the accuracy of diagnosis. The mean accuracy (Az) value for all slide observers was 0.834, with a standard error of 0.067.{Figure 1}
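As an illustration of what the area under the ROC curve measures, the sketch below computes the empirical area as the probability that a reference-abnormal slide receives a higher diagnostic rank than a reference-normal slide, with ties counted as one half. This equals the trapezoidal area under the empirical ROC curve. The ordinal ranking of diagnoses is an assumption made for the example only and is not the coding used in SPSS for this study.

```python
def roc_auc(reference_labels, observer_scores):
    """Empirical area under the ROC curve (Mann-Whitney formulation).

    reference_labels: 1 for reference-abnormal, 0 for reference-normal.
    observer_scores: ordinal rank of the observer's diagnosis (higher = more
    abnormal); the ranking scheme is a hypothetical choice for this sketch.
    """
    pos = [s for y, s in zip(reference_labels, observer_scores) if y == 1]
    neg = [s for y, s in zip(reference_labels, observer_scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one abnormal and one normal case")
    # Count abnormal/normal pairs ranked correctly; ties contribute one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: two normal and two abnormal reference cases.
print(roc_auc([0, 0, 1, 1], [1, 2, 2, 3]))  # 0.875
```

A value of 0.5 corresponds to the diagonal reference line, and a value of 1.0 corresponds to a curve that follows the left-hand and top borders of the ROC space.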


Discussion

Inter-observer reliability

In this study, five students from the Centre for Medical Laboratory Technology Studies volunteered as slide observers to diagnose gynecology cases on LBP without the patient's clinical history, with interpretation according to the 2001 Bethesda System reporting guidelines. The diagnoses were evaluated using Fleiss' kappa to assess the agreement between the slide observers. Our results demonstrated a fair strength of inter-observer agreement in diagnosing gynecology cases (P < 0.001). This finding is similar to other studies on inter-observer variability, which found an average kappa value corresponding to a fair strength of agreement in detecting cervical epithelial abnormalities.[5],[10]

The inter-observer reliability covered the negative for intraepithelial lesion or malignancy (NILM) and epithelial cell abnormalities (ECA) categories in cervical Pap smears on liquid-based preparation. Unlike the presence of organisms, such as Candida spp. or bacterial vaginosis, the benign cellular changes associated with atrophy and reactive metaplastic squamous cells are significant diagnostic pitfalls in diagnosing atypical squamous cells of undetermined significance (ASC-US) in the ECA category.[3],[12] Atrophic cells had a regular nuclear membrane and fine chromatin. Reactive metaplastic squamous cells had a slightly high nuclear-to-cytoplasmic (N:C) ratio, a smooth nuclear membrane, and spiderweb-like cytoplasm with a nucleolus [Figure 2]. In contrast, ASC-US cytomorphology shows a slightly irregular nuclear membrane, nuclear enlargement of two to three times that of normal intermediate squamous cells, and fine, evenly distributed chromatin. The atrophic cells and reactive metaplastic squamous cells were classified under NILM. The kappa value for NILM was higher than that for the ASC-US category, indicating that some of the slide observers were able to distinguish benign from precancerous cells based solely on cell morphology characteristics. Additionally, the organisms observed and agreed upon by all slide observers were Candida spp. and bacterial vaginosis. Our findings on the diagnosis of gynecology cases based on cellular characteristics suggest that the teaching technique used during practical sessions, with donated slides from the hospital, helped students acquire adequate knowledge and skills in screening gynecology slides.{Figure 2}

Intra-observer reliability

Good intra-observer reliability for all categories in the diagnosis of gynecology Pap smears has been described for cytology, with or without an experienced cytopathologist.[2] Although we did not replicate the previously reported good intra-observer reliability, our kappa values varied between 0.116 and 0.696, indicating slight to substantial intra-observer agreement. The slide observers diagnosed each gynecology case twice over a period of two months to ensure that they did not remember the diagnosis made during the first screening session. These results indicate that some of the slide observers consistently recognized the cell morphology characteristics and changes in both screening sessions A and B. Intra-observer reliability in classifying the interpretation of gynecology cases was statistically significant for the slide observers. The result provides evidence for the study's first objective: some slide observers were able to make the same diagnosis in screening sessions A and B, and were able to recognize the cells and interpret them in the diagnosis of gynecology cases on LBP.
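The strength-of-agreement labels used in this paper (slight, fair, substantial) match the widely used Landis and Koch bands. The sketch below assumes that [Table 3] follows those bands, which is an assumption on our part. The function name is illustrative.

```python
def agreement_strength(kappa):
    """Map a kappa value to a Landis and Koch strength-of-agreement label.

    Assumes the conventional bands: <0 poor, 0.00-0.20 slight, 0.21-0.40 fair,
    0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0:
        return "poor"
    # Upper bounds are inclusive, so 0.20 is "slight" and 0.21 is "fair".
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label


# Kappa values reported in this study.
for value in (0.221, 0.116, 0.696):
    print(value, agreement_strength(value))
```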

Diagnostic accuracy

The reliability results above show that some slide observers could make the same diagnosis as other observers, or the same diagnosis on different occasions. However, reliability testing cannot determine the accuracy of the diagnosis for each case, since it does not compare the diagnosis with the actual reference result. Thus, operating parameters such as true positive and true negative were used to measure the accuracy of the diagnosis in gynecology cases for each slide observer. For true positives, some of the slide observers were able to make a correct diagnosis without misinterpreting any true-positive case. The results indicate that the slide observers recognized the changes or abnormal cells. For example, LSIL is a precancerous condition of squamous cells, with or without associated human papillomavirus (HPV) infection. LSIL cells can occur singly and show an intermediate squamous cell-type cytoplasm [Figure 3]. The nucleus is enlarged to more than three times that of normal intermediate nuclei, and the nuclear-to-cytoplasmic ratio is slightly increased. The nuclei are hyperchromatic, with evenly distributed chromatin and smooth or irregular nuclear membranes. Binucleation or multinucleation is common, and nucleoli are absent or inconspicuous. LSIL must be diagnosed when the cells present with nuclear abnormalities and koilocytosis.[13]{Figure 3}

Further findings in the present study are false positives and false negatives, which are significant errors in screening tests. A false positive is a test result that mistakenly indicates that a specific condition is present, while a false negative is a test result that mistakenly indicates that a specific condition is absent. In this study, slide observer 5 had the fewest misinterpretations in screening gynecology slides, with only five cases diagnosed as false positive. For example, several cases of reactive squamous cells were mistakenly diagnosed as ASC-US, and an atrophic case was misinterpreted as ASC-H, a category not represented among the cases in this study. Reactive squamous cells, atrophic cells, and ASC-US have been described above, but ASC-H has not. ASC-H cells usually occur singly or in a crowded sheet pattern with round pale nuclei, 1.2–2.5 times larger than normal intermediate cells, metaplastic cells, and lower nuclear-to-cytoplasmic ratio. They also have prominent nuclear irregularity and nuclear grooves.

Moreover, slide observer 1 misinterpreted a NILM case as SCC, and slide observer 4 misdiagnosed a bacterial vaginosis case as atypical glandular cells (AGC). Generally, AGC cells occur in sheets with crowded cells and nuclear overlapping, with or without pseudostratification. The nuclear abnormalities include enlargement of the nucleus up to three times that of normal endocervical or endometrial nuclei, mild nuclear hyperchromasia, chromatin irregularity, and occasional presence of nucleoli. Distinct borders of AGC cells are frequently discernible. The criteria for cancerous cells are different from those for benign cells. Therefore, the slide observers likely had difficulty identifying the cells based on cell morphology characteristics in some cases.

Taken together, the true positive, true negative, false positive, and false negative results characterize the accuracy of the diagnostic procedure, since each case result was compared with the actual reference result obtained from the hospital that donated the slides. Furthermore, other parameters such as sensitivity and specificity were used to assess the diagnostic accuracy of interpreting gynecology cases when no patient clinical history was provided during the screening session.

Generally, a cervical Pap smear screening test has to maximize diagnostic sensitivity at the expense of specificity. This study shows that the mean sensitivity was 94.28% and the mean specificity was 72.40%, which indicates a trade-off between sensitivity and specificity. Another study reported pooled sensitivity, specificity, and positive and negative predictive values of 93%, 73%, 90%, and 73%, respectively, for detecting low-grade cervical intraepithelial neoplasia, atypia, or HPV changes.[11] A further study reported sensitivity ranging from 69% to 97% and specificity from 9% to 46% in detecting normal and abnormal squamous cells in Pap smears.[3]

Additionally, our study's overall diagnostic accuracy in LBP for gynecology cases was 75.09%, consistent with other published data on diagnostic accuracy in cervical Pap smears.[3] This is an essential finding for understanding how slide observers interpret gynecology cases when they depend solely on cell morphological characteristics, without knowledge of the patient's clinical history. The results reflect the teaching technique, in which experienced cytology lecturers delivered the knowledge using slides blindly in practical sessions. This study showed that most slide observers could identify and classify malignant and benign diagnoses without being provided the patient's clinical history, which would otherwise help the slide observers in evaluating the gynecology sample.


Limitations

One of the limitations of this study was the selection of gynecology cases. Since this study required liquid-based preparation samples, it was difficult to obtain slides that met the selection criteria, for example, unbroken slides. The availability of liquid-based preparation slides in the slide storage is limited compared to conventional Pap smears. The actual diagnosis of each randomly selected liquid-based preparation slide was recorded only after the screening session was completed. Consequently, many of the cases turned out to be benign, where the diagnosis is easily made because the cytology criteria for benign cases are understood as normal cell morphological characteristics. Thus, malignant and benign cases were not equally distributed in this study.


Conclusion

In conclusion, a diagnostic accuracy of 75.09% was obtained in interpreting gynecology cases, although the diagnosis was made without the patient's clinical history. Therefore, students who took part in this study as slide observers could interpret and classify gynecology cases into several categories based on the 2001 Bethesda System reporting guideline during the screening of the liquid-based preparation slides. The fair agreement in reliability testing indicates that some slide observers could classify cases into the same categories consistently in several cases. Furthermore, with a basic knowledge of screening gynecology slides, MLT students (slide observers) from the Centre of Medical Laboratory Technology Studies were competent to screen and diagnose cytology slides.


Acknowledgment

The author wishes to thank Dr. Mohd Nazri Bin Abu for his valuable suggestions for the study, Dr. Khairil Anuar Bin Md Isa for his guidance in statistical analysis, and the Centre for Medical Laboratory Technology Studies for the facilities.

Financial support and sponsorship


Conflicts of interest

There are no conflicts of interest.


References

1. Hwang H, Follen M, Guillaud M, Scheurer M, MacAulay C, MacAulay C, et al. Cervical cytology reproducibility and associated clinical and demographic factors. Diagn Cytopathol 2020;48:35-42.
2. Oyedokun A, Adeloye D, Balogun O. Clinical history-taking and physical examination in medical practice in Africa: still relevant? Croat Med J 2016;57:605-7.
3. Raab SS, Oweity T, Hughes JH, Salomao DR, Kelley CM, Flynn CM, et al. Effect of clinical history on diagnostic accuracy in the cytologic interpretation of bronchial brush specimens. Am J Clin Pathol 2000;114:78-83.
4. Mohamad SN, Abu MN, Ahmad M, Roslan NA, Daklan N, Roslan NN, et al. The effect of the absence of clinical history and demographic data of genitourinary cases in discovering agreement between inter- and intra-observers. Health Scope 2020;3:51-3.
5. Roslan NN, Abu MN, Abd Malek NN, Roslan NA, Alihad NA, Mohamad SN, et al. Effect of absence clinical history in diagnostic accuracy of thyroid fine-needle aspiration cytology. Mal J Med Health Sci 2021;17(SUPP 3):162-7.
6. Maxim LD, Niebo R, Utell MJ. Screening tests: A review with examples. Inhal Toxicol 2014;26:811-28. Erratum in: Inhal Toxicol 2019;31:298.
7. Alikhassi A, Esmaili Gourabi H, Baikpour M. Comparison of inter- and intra-observer variability of breast density assessments using the fourth and fifth editions of Breast Imaging Reporting and Data System. Eur J Radiol Open 2018;5:67-72.
8. Mrazek C, Lippi G, Keppel MH, Felder TK, Oberkofler H, Haschke-Becher E, et al. Errors within the total laboratory testing process, from test selection to medical decision-making-A review of causes, consequences, surveillance and solutions. Biochem Med (Zagreb) 2020;30:020502.
9. Lubin I, Astles J, Shahangian S, Madison B, Parry R, Schmidt R, et al. Bringing the clinical laboratory into the strategy to advance diagnostic excellence. Diagnosis 2021;8:281-94.
10. Gupta DK, Komaromy-Hiller G, Raab SS, Nath ME. Interobserver and intraobserver variability in the cytologic diagnosis of normal and abnormal metaplastic squamous cells in Pap smears. Acta Cytol 2001;45:697-703.
11. Repp AC, Deitz DE, Boles SM, Deitz SM, Repp CF. Differences among common methods for calculating interobserver agreement. J Appl Behav Anal 1976;9:109-13.
12. Kudva V, Guruvare S, Prasad K, Kulkarni K, TS P, Kamath A, et al. Inter-observer variability among gynaecologist in manual cervix image analysis for detection of cervical epithelial abnormalities. Clin Epidemiol Glob Health 2019;7:500-3.
13. Cobucci R, Maisonnette M, Macêdo E, Santos Filho FC, Rodovalho P, Nóbrega MM, et al. Pap test accuracy and severity of squamous intraepithelial lesion. Indian J Cancer 2016;53:74-6.