Functional gastrointestinal disorders (FGIDs) are common clinical entities in children and adolescents.1,2 There are no biological markers for FGIDs, and diagnosis relies on symptom-based criteria. In children older than 10 years of age, the Rome IV committee recommends self-report of symptoms to establish the diagnosis and to assess outcomes in clinical trials (patient-reported outcomes).3 The Questionnaire on Pediatric Functional Gastrointestinal Disorders, Rome IV version (QPGS-IV), was adapted from the Questionnaire on Pediatric Gastrointestinal Symptoms by Walker et al4 to facilitate diagnosis. The QPGS-IV has been used to estimate the prevalence of FGIDs in multiple pediatric studies.1,5

Studies comparing the prevalence of FGIDs in children using the Rome III and Rome IV criteria have shown differences in prevalence predominantly in abdominal migraine, irritable bowel syndrome (IBS), and functional dyspepsia.1,2 It is important to understand whether those differences resulted from modifications of the Rome criteria or from errors inherent to the measures. All previous epidemiological studies were based on a single survey per respondent.1,2 Relying on a one-time self-report measure does not account for the reliability or validity of the instrument, nor for the ability of the respondent to complete the survey. Therefore, we cannot definitively establish whether a repeated measure would have yielded a different response. This is particularly important in children, who may have poor recall of symptoms, a limited understanding of the questions, or may lose interest and become distracted while completing the survey. A common measure of reliability is intra-rater reliability (consistency), in which the same subject is given the same measure on more than one occasion (test-retest reliability).

The objective of our study was to examine the intra-rater reliability of the diagnosis of an FGID in children and adolescents.
In addition, we examined the intra-rater reliability for 2 major diagnostic groups and explored the intra-rater reliability for individual FGIDs.
We conducted a prospective school-based cohort study evaluating the intra-rater reliability of FGID diagnoses as measured by the QPGS-IV at a public school in Cali, Colombia. A Spanish version of the QPGS-IV was already available to us, as we had translated the questionnaire for a previous study according to the Rome Foundation guidelines for translation and localization.1,6 In short, 2 bilingual physicians provided a reverse translation of the questionnaire. After the initial translation, the questionnaire was adapted by a randomly selected focus group of 20 clinic patients from Cali, equally divided between sexes and ranging in age from 8 to 18 years. The adapted version was then translated back into English, and a member of the research team who was not involved in the translation reviewed the final version to ensure fidelity to the original English version of the QPGS-IV.

Informative material, a questionnaire covering the child's medical history, and consent and assent forms were sent to the homes of schoolchildren/adolescents between 11 and 18 years of age. The need for the second questionnaire was explained to the families in accordance with the aim of our study: we explained that we wanted to investigate whether the children answered similarly 2 days apart. Children with reported organic gastrointestinal disorders (eg, gastritis, inflammatory bowel disease, and Hirschsprung disease), gastrointestinal complaints that could mimic FGIDs by causing abdominal pain, or comorbid painful conditions frequently associated with FGIDs were excluded. Children of eligible families that consented completed the self-report Spanish version of the QPGS-IV in class on day 0 (baseline) and on day 2 (48 hours later).
After completion of both questionnaires, children who did not follow the instructions on the questionnaire, whether because of misunderstanding, inadequate reading comprehension, or inattention, were excluded from the analyses. Sections of the questionnaire instructed in bold letters that a specific answer to a previous question should prompt the respondent to skip a specific section. Children who failed to follow those instructions were considered unable to complete the questionnaire accurately.
In each classroom and for each administration, 2 members of our research team distributed the self-report paper surveys. Parents were not present in the classroom. The researchers provided instructions on completing the survey without disclosing the objective of the study and repeated these instructions after 48 hours. As the accuracy of children's recall of symptoms has been questioned,7,8 we chose a short interval between survey administrations to reduce the likelihood that the content of the responses would change, as the questions had a 30-day reference period. On the other hand, the interval was considered long enough to decrease the likelihood that a child would automatically repeat the answers on the second administration based on recall of recent answers.
The questionnaires were reviewed to assess the prevalence of FGIDs and the intra-rater reliability of the diagnoses. The intra-rater reliability was analyzed for the presence of an FGID in general, for the presence of an FGID in one of the major diagnostic groups (functional abdominal pain disorders and functional defecation disorders), and for individual FGID diagnoses. The intra-rater reliability was not analyzed when fewer than 5.0% of the children were diagnosed with a particular disorder, because of difficulties in interpreting the reliability measurements: with a high percentage of responses falling into a single nominal category (in this case "no"), the percentage of agreement is by definition high, and so is the expected/chance agreement, which can result in low kappa values that do not necessarily reflect low rates of overall agreement.9
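This behavior of kappa under skewed category distributions can be illustrated with a small worked example (the counts below are hypothetical and are not data from this study): when a diagnosis is rare, raw agreement is high by construction, yet chance agreement is also high, so the chance-corrected kappa can be low.

```python
# Illustration of the kappa behavior described above: with a rare
# diagnosis, raw agreement is high but chance-corrected agreement is low.
# Cell counts are hypothetical, not data from this study.

def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table of paired yes/no ratings.
    a = yes/yes, b = yes/no, c = no/yes, d = no/no."""
    n = a + b + c + d
    po = (a + d) / n                                        # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance agreement
    return po, pe, (po - pe) / (1 - pe)

# 100 children, diagnosis present in ~5% on each administration
po, pe, kappa = cohen_kappa(a=1, b=4, c=4, d=91)
print(f"agreement = {po:.0%}, chance agreement = {pe:.1%}, kappa = {kappa:.2f}")
# 92% raw agreement, yet kappa is only ~0.16 ("none" on common scales)
```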
The statistical analysis included measures of central tendency and dispersion (mean, standard deviation, and percentage) and measures reflecting the intra-rater reliability. The percentages of agreement and Cohen's kappa coefficients (κ), including 95% confidence intervals, were calculated. Kappa values for agreement were interpreted according to the following magnitude guidelines: 0.00-0.20 (none), 0.21-0.39 (minimal), 0.40-0.59 (weak), 0.60-0.79 (moderate), 0.80-0.90 (strong), and > 0.90 (almost perfect).10
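As a concrete illustration of these agreement statistics, the sketch below computes the percentage of agreement, Cohen's κ, and an approximate 95% confidence interval for the overall FGID diagnosis. The 2 × 2 cell counts are reconstructed from the reported marginals (43 and 31 positive diagnoses, 83.1% agreement among 118 children) and are illustrative; the standard error uses the common large-sample approximation, so the interval is approximate.

```python
import math

# Paired yes/no ratings for "any FGID" on day 0 and day 2; the cell counts
# are reconstructed from the reported marginals (43 and 31 positive
# diagnoses, 83.1% agreement, n = 118) and are illustrative.
a, b, c, d = 27, 16, 4, 71   # yes/yes, yes/no, no/yes, no/no
n = a + b + c + d

po = (a + d) / n                                         # observed agreement
pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2    # chance agreement
kappa = (po - pe) / (1 - pe)

# Large-sample standard error and approximate 95% confidence interval
se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
lo, hi = kappa - 1.96 * se, kappa + 1.96 * se

print(f"agreement = {po:.1%}, kappa = {kappa:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

This reproduces the reported overall agreement of 83.1% and κ of 0.61; by the interpretation guidelines above, a κ of 0.61 falls in the moderate (0.60-0.79) band.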
The sample size was calculated by using the previously reported prevalence rate for an FGID diagnosis of 21.2%.1 Preliminary outcomes on intra-rater reliability within a 2-week interval by van Tilburg et al11 showed chance-corrected agreements mostly in the range of 0.64-0.78. Because of our shorter time interval, we expected an average agreement rate of around 75.0%. The minimum acceptable level of agreement was set at 40.0%. In accordance with guidelines on the minimum sample size requirements for the kappa agreement test, we used a two-tailed test with 80.0% power at α = 0.05 and calculated that a sample of 53 subjects was needed to establish agreement between 2 ratings for the presence of an FGID diagnosis in general.12 The sample size needed to measure the intra-rater reliability of the 2 largest diagnostic subgroups based on their reported prevalence (functional abdominal pain disorders 8.2% and functional defecation disorders 10.8%) was 117 subjects. As we did not know how many children would be excluded, we oversampled to obtain enough data. The sample sizes calculated for the subgroup of nausea and vomiting disorders and for individual FGIDs exceeded our possible study sample.
The study was approved by the school's teachers and principal and by the local Institutional Review Board (No. 024-2019).
We invited 287 children to participate in the study (Figure). Parents of 269 children (93.7%) gave consent. Of those, 37 children (13.8%) were excluded from participation. After exclusions, 232 children completed the first questionnaire; 17 of them (7.3%) were not present to complete the second questionnaire. Both questionnaires were completed by 215 children, of whom 97 (45.1%) were excluded from the analysis for not following the instructions of the questionnaire. In total, data from 118 children were analyzed (mean age 15.0 ± 1.8 years; 58.5% boys). Children who were excluded for not following the instructions had a mean age of 14.6 ± 1.7 years, and 45.8% were boys; neither characteristic differed significantly from the included children. The mean response time for the questionnaires was 12.0 ± 6.0 minutes.
On the first day of testing, 43 children (36.4%) met the diagnostic criteria for at least 1 FGID (Table 1). The most common diagnosis was functional dyspepsia (19 children, 16.1%). Nine children (7.6%) met the diagnostic criteria for 2 FGIDs; the most common combination was functional dyspepsia and functional constipation (5 children, 4.2%). On the second day of testing, 31 children (26.3%) met the diagnostic criteria for at least 1 FGID. The most common diagnosis was again functional dyspepsia (14 children, 11.9%). Eight children (6.8%) met the criteria for 2 FGIDs; the most common combination was again functional dyspepsia and functional constipation (4 children, 3.4%). When comparing the results of the 2 testing days, the number of children meeting the criteria for an FGID was higher on the first testing day (36.4% vs 26.3%), but this difference was not statistically significant. On both testing days, functional dyspepsia was the most common FGID, followed by functional constipation and IBS. There were no significant differences between the prevalence of any diagnosis on testing days 1 and 2.
The intra-rater reliability for the presence of an FGID diagnosis was considered moderate, with an agreement of 83.1% and a kappa value of 0.61 (Table 2). The agreement between the 2 testing days for the diagnostic group of functional abdominal pain disorders was moderate (κ = 0.65), and the agreement for the diagnostic group of functional defecation disorders was weak (κ = 0.49). We found a moderate intra-rater reliability for the diagnosis of functional dyspepsia (κ = 0.61) and a weak intra-rater reliability for IBS (κ = 0.54), functional constipation (κ = 0.46), and the postprandial distress syndrome subtype of functional dyspepsia (κ = 0.49).
This is the first study to measure the intra-rater reliability of a self-report questionnaire for the pediatric Rome IV criteria. We found that 45.1% of children were not able to follow the instructions on the questionnaire. Once those children were excluded, the intra-rater reliability of diagnosing a child with an FGID using the QPGS-IV was moderate when corrected for chance (κ = 0.61). We found a moderate intra-rater reliability for the diagnostic group of functional abdominal pain disorders (κ = 0.65) and a weak intra-rater reliability for the diagnostic group of functional defecation disorders (κ = 0.49). Although we knew that we would not be able to obtain a sample size large enough to analyze each of the diagnoses with confidence, we considered the analysis worthwhile, as it could provide useful information for future studies. The observed intra-rater reliability was moderate for diagnosing functional dyspepsia (κ = 0.65) and weak for diagnosing functional constipation (κ = 0.46), IBS (κ = 0.54), and postprandial distress syndrome (κ = 0.49). Functional dyspepsia, functional constipation, and IBS were consistently the most common diagnoses on both days of testing (day 0 and day 2).
Together, our findings speak to the utility of questionnaires for diagnosing FGIDs while providing a cautionary note for the interpretation of their results. Only 54.9% of the children in our population followed all the instructions of the questionnaire; this alone calls the reliability of the questionnaire into question. Adequate completion of the questionnaire seemed unrelated to the age of the participants. A thorough explanation of the questionnaire and coaching of the children throughout completion may improve their understanding, which could in turn improve the reliability and validity of the diagnoses. Indeed, higher intra-rater reliability rates were found for the previous version of the questionnaire (QPGS-III) after face-to-face interviews with children.13
Alternatively, it is reasonable to consider completion of the questionnaires by the parents. A previous study by van Tilburg et al11 looked into this alternative and studied the intra-rater reliability of 40 parents and 18 children. They compared the children's diagnoses as reported by the children themselves and by their parents, who completed the QPGS-III at baseline and after a 2-week follow-up. They found that children had a moderate intra-rater reliability for most diagnoses, whereas the intra-rater reliability of their parents was considerably lower for all diagnoses except functional dyspepsia. These results are in line with our intra-rater reliability scores and with those of other studies showing that children do not have lower intra-rater reliability than their parents. This indicates that completion of the questionnaire by parents is unlikely to improve the reliability of the diagnoses.14
If the results of pediatric studies do not reach perfect agreement, even when parents are included, a question could be raised as to whether the absence of perfect reliability is exclusively inherent to the pediatric Rome criteria or applies to the Rome criteria in general. A comparison with the adult literature can help solve this conundrum. Palsson et al15 assessed the intra-rater reliability of the adult Rome IV criteria as measured by the Rome IV Diagnostic Questionnaire at baseline and at 20- to 40-day follow-up in 140 adults. Compared with Palsson et al,15 we found a higher reliability in diagnosing children with functional dyspepsia (κ = 0.65 vs κ = 0.53) and a similar reliability in diagnosing children with IBS (κ = 0.54 vs κ = 0.51) and functional constipation (κ = 0.46 vs κ = 0.44). This suggests that imperfect reliability is not unique to the pediatric population.
Strengths of our study include its novelty, the high participation rate (93.7%), the low dropout rate (12.6%), and the assessment of adequate completion of the questionnaire, an aspect that had not been previously studied in children completing questionnaires based on the Rome criteria. However, multiple limitations should be considered. First, this study included children within a specific age range (11-18 years) attending 1 public school in Cali, Colombia, and the questionnaire was a Spanish translation. Therefore, the results cannot be generalized to all age groups, languages, or other geographic areas. Second, supplementary investigations for organic disease were not performed, and some of the diagnoses may therefore be inaccurate. Third, we used the exact same questionnaire to evaluate intra-rater reliability with a relatively short interval (48 hours); as a result, some children may have remembered what they filled out the first time, which may have falsely increased the observed levels of agreement. Fourth, we did not exclude children with 2 FGIDs (overlap) from the analysis, which may have caused an overestimation of the agreement in diagnosing an FGID in general. However, excluding those children would have underestimated the diagnosis of an FGID in general and could have resulted in inaccurate reliability outcomes for our other measures. Lastly, the currently reported reliability outcomes for individual diagnoses have to be interpreted with caution and should be considered preliminary. However, the inclusion of these data may be valuable for the conceptualization of the problem and to guide sample size calculations in future studies.
In conclusion, our study shows that the child-reported Spanish version of the QPGS-IV has a moderate intra-rater reliability for an FGID diagnosis in general in a Colombian population. We recommend not relying exclusively on this questionnaire to select, recruit, or evaluate children for research purposes. In addition, we advise providing a thorough explanation of the questionnaire and possibly using mock questions to ensure understanding of the questions. Larger studies are needed to investigate the accuracy, reliability, and validity of the pediatric Rome IV criteria, to assess the reliability of low-prevalence diagnoses, and to compare with previous versions of the Rome criteria.
Carlos A Velasco-Benítez, Laura M Méndez-Guzmán, and Miguel Saps: design of the work; Carlos A Velasco-Benítez and Laura M Méndez-Guzmán: acquisition and analysis of data for the work; Desiree F Baaleman, Carlos A Velasco-Benítez, Marc A Benninga, and Miguel Saps: interpretation of data for the work; Desiree F Baaleman, Carlos A Velasco-Benítez, Laura M Méndez-Guzmán, and Miguel Saps: drafted the initial manuscript; and Desiree F Baaleman, Marc A Benninga, and Miguel Saps: critically revised the manuscript for important intellectual content. All authors approve of the final version of the manuscript as submitted and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.