Table 1. WORQ measurement properties in Hansen et al.

| Measurement property | COSMIN category | Operationalisation | In Hansen et al.? |
|---|---|---|---|
| Internal consistency | Reliability | Domain-level α and/or ω for WORQ domains | Yes |
| Test–retest reliability | Reliability | ICC for domain scores with ≈14-day retest | Yes |
| Measurement error | Reliability | SEM and SDC (95%) for domain scores | Yes |
| Construct validity (hypothesis testing) | Validity | Correlations with WAI, EQ-5D-5L, 6MWT, 30-s STS¹ | Yes |
| Construct validity (known-groups) | Validity | Group differences by sick leave / disability level | Yes |
| Floor/ceiling effects | Interpretability | Extremes (%) and scale-width considerations | Yes |
| MIC (minimal important change) | Interpretability | Distribution-based interpretability metric | No |
| Structural validity | Validity | Factor structure / invariance | No |
| Responsiveness | Responsiveness | Change over time; longitudinal sensitivity to change | No |

¹ Abbreviations: WAI = Work Ability Index; EQ-5D-5L = EuroQol 5 Dimensions, 5 Levels; 6MWT = 6-Minute Walk Test; 30-s STS = 30-Second Sit-to-Stand test.
Exercise on questionnaire selection - answers
Questionnaire selection
Please read the Exercise instructions tab first. Then answer the questions in the ‘Questions’ tab.
Your research group is planning a randomized controlled trial (RCT) to evaluate a work-focused intervention versus distribution of an advice booklet for patients with work-related chronic low back pain (CLBP). The trial’s primary outcome is self-reported physical function.
You have identified the Work Rehabilitation Questionnaire (WORQ), available in a validated Danish version. Before selecting WORQ for your RCT, you want to determine whether its measurement properties are adequate for your purpose.
We will base our appraisal on the following study:
Hansen A, Lauridsen HH, Escorpizo R, Søgaard K, Søndergaard J, Schiøttz-Christensen B, et al. Reliability and Construct Validity of the Work Rehabilitation Questionnaire Domains in Patients with Persistent Low Back Pain. J Occup Rehabil. 2024 Nov 8;1–11.
The WORQ questionnaire can be downloaded here
Start with the questions in the tab ‘1. Your trial’ and, when finished, continue to the next tab.
Please write your answers to the questions in a Word document. After the course, the exercise including answers can be downloaded from the webpage as an HTML document.
Questions
1.1 Using the exercise instructions and the information in the PROM-selection vodcast:
What do you want to measure in your trial?
What is the measurement aim (i.e. discriminative, evaluative, or predictive) of the PROM in this RCT?
What does that imply in terms of which measurement properties are important to look at?
The trial’s primary outcome is self-reported physical function. A questionnaire is therefore more appropriate than an observational method (e.g. observing what the patients can actually do) or capacity tests (e.g. physical tests, accelerometers).
Because you are evaluating treatment effects, you need a PROM with an evaluative aim, i.e. one that can detect within-person change over time in work-related physical functioning, so that change can be compared between the intervention and booklet groups.
In terms of measurement properties, your instrument should be selected on the basis of:
- Content validity for your target population (relevance, comprehensiveness, comprehensibility in Danish CLBP workers)
- Feasibility in the trial (burden, mode, recall period, licensing)
- Structural validity (sound dimensional structure; preferably longitudinal invariance)
- Internal consistency for each dimension
- Responsiveness to change (ideally with an established MIC for interpretation)
- Reproducibility (primarily measurement error/SEM/SDC as we expect it to be used longitudinally)
Questions
2.1 What are the underlying constructs measured by WORQ in patients with low back pain, and how are they operationalised?
- Physical functioning
- Construct: Self-reported limitations in bodily activities relevant to work participation (e.g., walking, transfers, endurance, mobility).
- Operationalisation: 10 items scored 0–10 (0 = no difficulty; 10 = extreme difficulty), linearly transformed to 0–100; higher = worse functioning. Items include, for example, endurance, day-to-day activities, short-/long-distance walking, mobility.
- Psychological well-being
- Construct: Self-reported psychological distress affecting work functioning (e.g., mood, anxiety/depression–related difficulties).
- Operationalisation: 5 items scored 0–10, transformed to 0–100 (paper reports 0–50 range for raw domain sum, but domain scores are presented on a 0–100 interpretive scale); higher scores = worse psychological well-being.
- Cognitive ability
- Construct: Self-reported cognitive limitations relevant to work (e.g., attention, memory, planning) that may hinder performance.
- Operationalisation: 10 items scored 0–10, transformed to 0–100; higher = worse. Shows expected associations with work ability and anxiety/depression; domain defined within the ICF-based framework.
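The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the linear transformation is simply the raw item sum divided by the maximum possible sum and multiplied by 100; the function name `worq_domain_score` and the example responses are hypothetical, and handling of missing items is not covered.

```python
from typing import Sequence


def worq_domain_score(items: Sequence[float], item_max: int = 10) -> float:
    """Rescale a raw domain sum to 0-100 (higher = worse functioning).

    Assumption: linear transformation = raw sum / maximum possible sum * 100.
    """
    if not items:
        raise ValueError("at least one item response is required")
    if any(not 0 <= x <= item_max for x in items):
        raise ValueError(f"item responses must lie between 0 and {item_max}")
    raw_sum = sum(items)
    max_sum = item_max * len(items)
    return 100.0 * raw_sum / max_sum


# Hypothetical responses, not data from Hansen et al.
physical = [3, 4, 2, 5, 6, 3, 4, 2, 1, 5]  # 10 items, raw sum 35 of 100
psychological = [2, 3, 1, 4, 0]            # 5 items, raw sum 10 of 50

print(worq_domain_score(physical))       # 35.0
print(worq_domain_score(psychological))  # 20.0
```

The psychological example shows why the raw 0–50 sum and the 0–100 domain score differ by a factor of two for a 5-item domain.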
2.2 How does WORQ fit the purpose of your RCT?
Your trial has self-reported physical function as its primary outcome, and this matches the physical function subscale of WORQ well.
WORQ is explicitly designed to capture work-related functioning within the ICF framework. Its Physical functioning domain targets limitations (e.g., walking, mobility, endurance, day-to-day activities) that are directly relevant to a work-focused intervention in patients with work-related CLBP.
Lastly, WORQ is a self-administered PROM, making it practical for repeated measurement across RCT time points. It provides the patient’s perspective on functional limitations. You may consider complementing it with capacity tests (e.g., the sit-to-stand test (STS) or the 6-minute walk test (6MWT)) if objective performance is also a trial objective.
2.3 Is WORQ a generic or disease-specific PROM? How should we classify its physical function subscale?
The WORQ can be described as follows:
The WORQ is a generic, multi-dimensional PROM of work-related functioning (developed within the ICF framework). It is intended for various patient populations and assesses three domains: Physical Functioning, Psychological Well-being, and Cognitive Ability.
The physical function subscale can also best be described as generic (work-related physical functioning), which in our situation is applied to a specific population (adults with chronic LBP).
The Hansen et al. study aimed to evaluate the measurement properties of the Danish version of WORQ, specifically reliability and construct validity, to support its use in assessing work-related functioning.
Questions
3.1 COSMIN properties
Which measurement properties does Hansen et al. evaluate and how is each evaluated (i.e. which coefficients are reported)?
Which relevant properties are not evaluated?
Hansen et al. evaluate WORQ’s internal consistency, test–retest reliability, measurement error, construct validity (hypothesis testing and known-groups), and floor/ceiling effects. They do not assess structural validity, responsiveness, or the MIC in the study.
An outline of the properties tested in the study can be found in Table 1.
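Of the properties listed, floor/ceiling effects are the simplest to compute yourself: they are the percentages of respondents at the lowest and highest possible score. The sketch below is illustrative only; the function name and the scores are hypothetical, and the 15% flag is a commonly cited rule of thumb rather than a criterion taken from Hansen et al.

```python
from typing import Sequence, Tuple


def floor_ceiling(scores: Sequence[float],
                  minimum: float = 0.0,
                  maximum: float = 100.0) -> Tuple[float, float]:
    """Return (% at lowest possible score, % at highest possible score)."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    pct_lowest = 100.0 * sum(s == minimum for s in scores) / n
    pct_highest = 100.0 * sum(s == maximum for s in scores) / n
    return pct_lowest, pct_highest


# Hypothetical 0-100 domain scores (higher = worse functioning).
scores = [0, 0, 5, 12.5, 30, 45, 60, 100, 0, 20]
lowest, highest = floor_ceiling(scores)
print(lowest, highest)  # 30.0 10.0

# Rule of thumb (assumption): more than 15% at an extreme is worth flagging.
print(lowest > 15, highest > 15)  # True False
```

Because higher WORQ scores mean worse functioning, many respondents at 0 would leave little room to register improvement in a trial.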
3.2 In which patient population was WORQ evaluated in Hansen et al.?
The study evaluated WORQ in working-age adults with chronic LBP and excluded those not active in the labour market. The patients were recruited from a Spine Centre in the Region of Southern Denmark. This is a population very close to your RCT target population.
3.3 What practical considerations (feasibility) support using WORQ (Physical Function domain) in our RCT? What does not support its use?
WORQ exists in a validated Danish version and is straightforward to self-administer (electronically or on paper) at multiple RCT time points. Furthermore, scoring is simple (items 0–10; domain 0–100) and the physical function domain’s 10 items keep respondent burden modest.
However, using only one domain changes the instrument’s intended multi-domain structure; evidence of structural validity/invariance and some reliability parameters was established for the full instrument and may not generalise to an isolated subscale.
Next, you decide to look closer at the measurement properties of internal consistency and reliability. You use the COSMIN Risk of Bias checklist for this, and first you look at ‘Box 4. Internal consistency’.
Questions
3.4 What do you need to consider before evaluating Box 4 on internal consistency?
To answer this you need to look in the COSMIN Risk of Bias manual. You can find it here: COSMIN RoB manual 2.0. Find the section on ‘Internal consistency’.
In the COSMIN Risk of Bias Manual version 2.0 you need to look at section 6.3 - Internal consistency on page 167. Here is a summary of what you need to consider:
- Structural validity should be assessed before internal consistency is determined
- Internal consistency applies only to unidimensional scales
- The quality rating for internal consistency cannot be higher than the rating for structural validity
- Internal consistency statistics calculated across a multidimensional scale should be ignored
3.5 Based on COSMIN Box 4 (internal consistency), how does Hansen et al. score on each of the four checklist items? Justify your answer.
Hansen et al. did not assess structural validity in this study. Instead, they referenced a prior study in the same CLBP population (reference 11, J Occup Rehabil, 2023) that established WORQ’s factor structure.
Using that structure, they reported Cronbach’s \(\alpha\) and McDonald’s \(\omega\) for each WORQ domain (continuous scores), which meets COSMIN’s “very good” criterion for Item 1. Items 2–3 are not applicable (no dichotomous or IRT/theta scoring). For Item 4, no important methodological flaws affecting internal consistency were noted (“very good”). See Table 2 for details.
Table 2. COSMIN Box 4. Internal consistency (Hansen et al.)

| Q# | Checklist item | Findings | COSMIN rating | Rationale |
|---|---|---|---|---|
| 1 | For continuous scores: Was α or ω calculated? | Yes—domain-level Cronbach’s α and McDonald’s ω were reported. | Very good | Meets criterion: α/ω calculated for continuous domain scores. |
| 2 | For dichotomous scores: Was α or KR-20 calculated? | Not applicable (WORQ uses 0–10 items). | NA | Items are not dichotomous; KR-20 not relevant. |
| 3 | For IRT-based scores: SE(θ) or reliability of θ calculated? | Not applicable (no IRT/theta scoring). | NA | Classical scoring used; no IRT indices reported. |
| 4 | Any other important design/statistical flaws? | No important flaws noted affecting internal consistency. | Very good | Appropriate methods and reporting for internal consistency. |

Ratings follow COSMIN Box 4 categories: very good / adequate / doubtful / inadequate / NA.
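To make Item 1 concrete, Cronbach’s \(\alpha\) for a single domain can be computed from the item variances and the variance of the summed score. The sketch below uses the standard formula \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_j s_j^2}{s_{total}^2}\right)\); the function name and the small data set are hypothetical and do not reproduce the analysis in Hansen et al. (McDonald’s \(\omega\) requires a fitted factor model and is not shown).

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; responses[i][j] = score of respondent i on item j."""
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in responses])
                 for j in range(n_items)]
    # Sample variance of the summed (domain) score.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total score has zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 4 respondents x 3 items, each scored 0-10.
data = [
    [2, 3, 3],
    [4, 4, 5],
    [6, 5, 7],
    [8, 9, 8],
]
print(round(cronbach_alpha(data), 3))
```

Note that the function should be applied per domain, in line with the point that internal consistency is only meaningful for unidimensional scales.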
Lastly, you turn to evaluating the reliability of WORQ. Here you use Box 6. Reliability from the COSMIN RoB checklist.
Questions
3.6 Using COSMIN Box 6 (Reliability), score all eight items for Hansen et al. and justify each rating.
Hansen et al. performed a test–retest of WORQ domain scores with an interval of approximately 2 weeks. The retest was completed electronically under similar conditions, and ICCs (two-way, absolute agreement) were reported. Reliability analyses for dichotomous, nominal, or ordinal scores were not applicable. The answers to the eight questions are found in the table below.
COSMIN Box 6. Reliability (Hansen et al.)

| Q# | Checklist item | Finding in Hansen et al. | COSMIN rating | Justification |
|---|---|---|---|---|
| 1 | Were patients stable between repeated measurements? | Stability assumed; no explicit GRC (global rating of change) anchor reported. | Adequate | COSMIN: ‘Assumable that patients were stable’ → Adequate. |
| 2 | Was the time interval appropriate? | Retest at ~14 days. | Very good | Common and appropriate interval for PROM test–retest. |
| 3 | Were measurement conditions similar for repeated measurements? | Same self-administered mode/context across time points. | Very good | Comparable conditions/mode across T1–T2. |
| 4 | For continuous scores: was the appropriate ICC calculated (with evidence of no systematic change)? | ICC\(_{agreement}\) (two-way) reported; no explicit test of no change. | Very good | The model is appropriate: ICC\(_{agreement}\) counts systematic differences between test and retest as error. |
| 5 | For dichotomous scores: was kappa calculated? | Not applicable. | NA | WORQ items are 0–10; domains are continuous. |
| 6 | For nominal scores: was an unweighted kappa calculated? | Not applicable. | NA | No nominal outcomes. |
| 7 | For ordinal scores: was a weighted kappa calculated? | Not applicable. | NA | No ordinal rater-agreement analysis required. |
| 8 | Any other important design/statistical flaws? | No important flaws affecting reliability noted. | Very good | Design/reporting suitable for test–retest. |

GRC = Global Rating of Change. ICC\(_{agreement}\) = intraclass correlation coefficient (absolute agreement). Ratings per COSMIN categories: very good / adequate / doubtful / inadequate / NA.
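As an illustration of Item 4, a two-way ICC for absolute agreement can be computed from the mean squares of a patients × occasions table. The sketch below implements the single-measurement form, \(ICC = \frac{MS_R - MS_E}{MS_R + (k-1)MS_E + \frac{k}{n}(MS_C - MS_E)}\). That Hansen et al. used the single-measurement form is an assumption here; the function name and the test–retest scores are hypothetical.

```python
from typing import Sequence


def icc_agreement(scores: Sequence[Sequence[float]]) -> float:
    """Two-way ICC, absolute agreement, single measurement.

    scores[i][j] is the domain score of patient i at occasion j.
    """
    n = len(scores)     # number of patients
    k = len(scores[0])  # number of occasions (2 for test-retest)
    if n < 2 or k < 2:
        raise ValueError("need at least 2 patients and 2 occasions")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-patient variance
    ms_cols = ss_cols / (k - 1)                # systematic occasion effect
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual error

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical test and retest domain scores (0-100) for six patients.
test_retest = [
    [35, 40],
    [60, 55],
    [20, 25],
    [75, 70],
    [50, 52],
    [10, 18],
]
print(round(icc_agreement(test_retest), 2))
```

Because the occasion mean square enters the denominator, a systematic shift between test and retest lowers this coefficient, which is the reason the agreement form is preferred over the consistency form for test–retest designs.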
3.7 Based on what you have learned from this article, do you think the WORQ would be appropriate as a primary outcome for your RCT? Give reasons for your answer.
The WORQ physical function subscale can serve as a defensible primary PROM for a work-focused CLBP RCT, but only if you plan for interpretation of change and are transparent about using a single domain.
It has several strengths: it targets work-relevant physical functioning (your primary outcome), has solid internal consistency and acceptable test–retest reliability, is validated in Danish, and is practical for repeated administration.
On the downside, responsiveness and the MIC have not been established, and using only one domain departs from the intended multi-domain structure and needs clear justification. Lastly, you must ensure that expected effects exceed measurement error and are not masked by floor/ceiling effects.
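The comparison of an expected effect with measurement error can be sketched as follows. The code uses two widely used formulas, \(SEM = SD\sqrt{1 - ICC}\) and \(SDC_{95} = 1.96 \times \sqrt{2} \times SEM\). All numbers are hypothetical and are not taken from Hansen et al., who may have derived the SEM from variance components instead.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from a score SD and an ICC."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def sdc95(sem: float) -> float:
    """Smallest detectable change (95%) for an individual patient."""
    return 1.96 * math.sqrt(2.0) * sem


# Hypothetical inputs for a 0-100 domain score.
sd_baseline = 20.0      # assumed between-patient SD
icc = 0.84              # assumed test-retest ICC
expected_change = 15.0  # assumed within-person improvement

sem = sem_from_icc(sd_baseline, icc)
sdc = sdc95(sem)
print(round(sem, 2), round(sdc, 2))  # 8.0 22.17
print(expected_change > sdc)         # False
```

In this made-up scenario, a 15-point change in a single patient could not be distinguished from measurement error. The SDC applies to individual change; at group level the detectable change is smaller because measurement error averages out across patients.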