Exercise on questionnaire selection - answers

Questionnaire selection

Please read the Exercise instructions tab first. Then answer the questions in the ‘Questions’ tab.

Your research group is planning a randomized controlled trial (RCT) to evaluate a work-focused intervention versus distribution of an advice booklet for patients with work-related chronic low back pain (CLBP). The trial’s primary outcome is self-reported physical function.

You have identified the Work Rehabilitation Questionnaire (WORQ), available in a validated Danish version. Before selecting WORQ for your RCT, you want to determine whether its measurement properties are adequate for your purpose.

We will base our appraisal on the following study:

Hansen A, Lauridsen HH, Escorpizo R, Søgaard K, Søndergaard J, Schiøttz-Christensen B, et al. Reliability and Construct Validity of the Work Rehabilitation Questionnaire Domains in Patients with Persistent Low Back Pain. J Occup Rehabil. 2024 Nov 8;1–11.

Download Hansen et al. here

The WORQ questionnaire can be downloaded here

Start with the questions in tab ‘1. Your trial’ and, when finished, continue to the next tab.

Please write your answers to the questions in a Word document. The exercise, including answers, can be downloaded from the webpage as an HTML document after the course.

Questions

1.1 Using the exercise instructions and the information in the PROM-selection vodcast:

What do you want to measure in your trial?
What is the measurement aim (i.e. discriminative, evaluative, or predictive) of the PROM in this RCT?
What does that imply in terms of which measurement properties are important to look at?

Answer

The trial’s primary outcome is self-reported physical function. This implies that a questionnaire is more appropriate than an observational method (e.g. observing what the patients can actually do) or capacity tests (e.g. physical tests, accelerometers).

You are evaluating treatment effects, so you need a PROM with an evaluative aim, i.e. one that can detect within-person change over time in work-related physical functioning, so that change can be compared between the intervention and booklet groups.

In terms of measurement properties, your instrument should be selected on the basis of:

Content validity for your target population (relevance, comprehensiveness, comprehensibility in Danish CLBP workers)
Feasibility in the trial (burden, mode, recall period, licensing)
Structural validity (sound dimensional structure; preferably longitudinal invariance)
Internal consistency for each dimension
Responsiveness to change (ideally with an established MIC for interpretation)
Reproducibility (primarily measurement error/SEM/SDC as we expect it to be used longitudinally)
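
The measurement error parameters named in the last point can be illustrated with a small calculation. This is a minimal sketch, assuming the commonly used definitions SEM = SD × √(1 − ICC) and SDC (95%) = 1.96 × √2 × SEM; the function names and the input numbers are illustrative and are not taken from Hansen et al.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from the score SD and a reliability coefficient."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def sdc95(sem: float) -> float:
    """Smallest detectable change (95%) for an individual patient."""
    return 1.96 * math.sqrt(2.0) * sem


# Illustrative numbers only (not results from Hansen et al.)
sem = sem_from_icc(sd=20.0, icc=0.84)
print(round(sem, 2), round(sdc95(sem), 2))
```

A change in an individual score smaller than the SDC cannot be distinguished from measurement error, which is why this parameter matters for an evaluative instrument.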

Questions

2.1 What are the underlying constructs measured by WORQ in patients with low back pain, and how are they operationalised?

Answer

Physical functioning
- Construct: Self-reported limitations in bodily activities relevant to work participation (e.g., walking, transfers, endurance, mobility).
- Operationalisation: 10 items scored 0–10 (0 = no difficulty; 10 = extreme difficulty), linearly transformed to 0–100; higher = worse functioning. Items include, for example, endurance, day-to-day activities, short-/long-distance walking, mobility.
Psychological well-being
- Construct: Self-reported psychological distress affecting work functioning (e.g., mood, anxiety/depression–related difficulties).
- Operationalisation: 5 items scored 0–10, transformed to 0–100 (paper reports 0–50 range for raw domain sum, but domain scores are presented on a 0–100 interpretive scale); higher scores = worse psychological well-being.
Cognitive ability
- Construct: Self-reported cognitive limitations relevant to work (e.g., attention, memory, planning) that may hinder performance.
- Operationalisation: 10 items scored 0–10, transformed to 0–100; higher = worse. Shows expected associations with work ability and anxiety/depression; domain defined within the ICF-based framework.
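
The scoring described for the three domains can be sketched as follows. This is a minimal sketch, assuming the linear transformation is the summed item score divided by the maximum possible sum and multiplied by 100; the function name is hypothetical, and handling of missing items is deliberately left out because the rule for it is not described here.

```python
def domain_score(item_responses: list[int]) -> float:
    """Linearly rescale a summed domain score to 0-100 (higher = worse)."""
    if not item_responses:
        raise ValueError("at least one item response is required")
    if any(not 0 <= r <= 10 for r in item_responses):
        raise ValueError("each item must be scored 0-10")
    raw_sum = sum(item_responses)
    max_sum = 10 * len(item_responses)  # assumed maximum: 10 points per item
    return 100.0 * raw_sum / max_sum


# Hypothetical responses to a 5-item domain (raw range 0-50)
print(domain_score([2, 4, 6, 8, 10]))
```

For a 10-item domain the raw sum already spans 0–100, so the transformation leaves it unchanged; for a 5-item domain the raw sum is doubled.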

2.2 How does WORQ fit the purpose of your RCT?

Answer

Your trial has self-reported physical function as its primary outcome, and this matches the physical function subscale of WORQ well.

WORQ is explicitly designed to capture work-related functioning within the ICF framework. Its Physical functioning domain targets limitations (e.g., walking, mobility, endurance, day-to-day activities) that are directly relevant to a work-focused intervention in patients with work-related CLBP.

Lastly, WORQ is a self-administered PROM, making it practical for repeated measurement across RCT time points. It provides the patient’s perspective on functional limitations. You can consider complementing it with capacity tests (e.g., the sit-to-stand test (STS) or the 6-minute walk test (6MWT)) if objective performance is also a trial objective.

2.3 Is WORQ a generic or disease-specific PROM? How should we classify its physical function subscale?

Answer

The WORQ can be described as follows:

The WORQ is a generic, multi-dimensional PROM of work-related functioning (developed within the ICF framework). It is intended for various patient populations and assesses three domains: Physical Functioning, Psychological Well-being, and Cognitive Ability.

The physical function subscale can also best be described as generic (work-related physical functioning), which in our situation is applied to a specific population (adults with chronic LBP).

The Hansen et al. study aimed to evaluate the measurement properties of the Danish version of WORQ, specifically reliability and construct validity, to support its use in assessing work-related functioning.

Questions

3.1 COSMIN properties

Which measurement properties does Hansen et al. evaluate and how is each evaluated (i.e. which coefficients are reported)?
Which relevant properties are not evaluated?

Answer

Hansen et al. evaluate WORQ’s internal consistency, test–retest reliability, measurement error, construct validity (hypothesis testing and known-groups), and floor/ceiling effects. They do not assess structural validity, responsiveness, or the MIC in the study.

An outline of the properties tested in the study can be found in Table 1 below.

Table 1. WORQ measurement properties in Hansen et al.
Measurement property	COSMIN category	Operationalisation	In Hansen et al.?
Internal consistency	Reliability	Domain-level α and/or ω for WORQ domains	Yes
Test–retest reliability	Reliability	ICC for domain scores with ≈14-day retest	Yes
Measurement error	Reliability	SEM and SDC (95%) for domain scores	Yes
Construct validity — hypothesis testing	Validity	Correlations with WAI, EQ-5D-5L, 6MWT, 30-s STS¹	Yes
Construct validity — known-groups	Validity	Group differences by sick leave / disability level	Yes
Floor/ceiling effects	Interpretability	Extremes (%) and scale-width considerations	Yes
MIC (minimal important change)	Interpretability	Anchor-based threshold for change that patients consider important	No
Structural validity	Validity	Factor structure / invariance	No
Responsiveness	Responsiveness	Change over time; longitudinal sensitivity to change	No
¹ Abbreviations: WAI = Work Ability Index; EQ-5D-5L = EuroQol 5 Dimensions, 5 Levels; 6MWT = 6-Minute Walk Test; 30-s STS = 30-Second Sit-to-Stand test.
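
The floor/ceiling row in Table 1 refers to the percentage of respondents at the scale extremes. A minimal sketch of that calculation is given below, using hypothetical scores; the 15% flagging threshold is an assumed rule of thumb and is not a value stated in this exercise.

```python
def floor_ceiling_percent(scores, minimum=0.0, maximum=100.0):
    """Return (% at the minimum score, % at the maximum score)."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_pct = 100.0 * sum(1 for s in scores if s == minimum) / n
    ceiling_pct = 100.0 * sum(1 for s in scores if s == maximum) / n
    return floor_pct, ceiling_pct


# Hypothetical domain scores for ten respondents (not data from Hansen et al.)
example = [0, 0, 10, 30, 40, 50, 60, 70, 80, 100]
floor_pct, ceiling_pct = floor_ceiling_percent(example)

FLAG_THRESHOLD = 15.0  # assumed rule of thumb, not taken from the source
print(floor_pct > FLAG_THRESHOLD, ceiling_pct > FLAG_THRESHOLD)
```

Because higher WORQ scores mean worse functioning, a floor effect (many patients scoring 0) would leave little room to show improvement in a trial.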

3.2 In which patient population was WORQ evaluated in Hansen et al.?

Answer

The study evaluated WORQ in working-age adults with chronic LBP and excluded those not active in the labour market. The patients were recruited from a Spine Centre in the Region of Southern Denmark. This is a population very close to your RCT target population.

3.3 What practical considerations (feasibility) support using WORQ (Physical Function domain) in our RCT? What does not support its use?

Answer

WORQ exists in a validated Danish version and is straightforward to self-administer (electronically or on paper) at multiple RCT time points. Furthermore, scoring is simple (items 0–10; domain 0–100) and the physical function domain’s 10 items keep respondent burden modest.

However, using only one domain changes the instrument’s intended multi-domain structure; evidence of structural validity/invariance and some reliability parameters was established for the full instrument and may not generalise to an isolated subscale.

Next, you decide to look closer at the measurement properties of internal consistency and reliability. You use the COSMIN Risk of Bias checklist for this, and first you look at ‘Box 4. Internal consistency’.

Questions

3.4 What do you need to consider before evaluating Box 4 on internal consistency?

Tip

To answer this you need to look in the COSMIN Risk of Bias manual. You can find it here: COSMIN RoB manual 2.0. Find the section on ‘Internal consistency’.

Answer

In the COSMIN Risk of Bias Manual version 2.0 you need to look at section 6.3 - Internal consistency on page 167. Here is a summary of what you need to consider:

Structural validity should be assessed before internal consistency is evaluated
Internal consistency is only relevant for unidimensional scales or subscales
The quality rating for internal consistency cannot be higher than the rating for structural validity
Internal consistency statistics calculated across a multidimensional scale should be ignored

3.5 Based on COSMIN Box 4 (internal consistency), how does Hansen et al. score on each of the four checklist items? Justify your answer.

Answer

Hansen et al. did not assess structural validity in this study. Instead they referenced a prior study in the same CLBP population (reference 11, J Occup Rehabil, 2023) that established WORQ’s factor structure.

Using that structure, they reported Cronbach’s \(\alpha\) and McDonald’s \(\omega\) for each WORQ domain (continuous scores), which meets COSMIN’s “very good” criterion for Item 1. Items 2–3 are not applicable (no dichotomous or IRT/theta scoring). No important methodological flaws affecting internal consistency were noted (“very good”). See Table 2 for details.

Table 2. COSMIN Box 4. Internal Consistency (Hansen et al.)
Q#	Checklist item	Findings	COSMIN rating	Rationale
1	For continuous scores: Was α or ω calculated?	Yes—domain-level Cronbach’s α and McDonald’s ω were reported.	Very good	Meets criterion: α/ω calculated for continuous domain scores.
2	For dichotomous scores: Was α or KR-20 calculated?	Not applicable (WORQ uses 0–10 items).	NA	Items are not dichotomous; KR-20 not relevant.
3	For IRT-based scores: SE(θ) or reliability of θ calculated?	Not applicable (no IRT/theta scoring).	NA	Classical scoring used; no IRT indices reported.
4	Any other important design/statistical flaws?	No important flaws noted affecting internal consistency.	Very good	Appropriate methods and reporting for internal consistency.
Ratings follow COSMIN Box 4 categories: very good / adequate / doubtful / inadequate / NA.
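
To show what checklist item 1 is asking for, here is a minimal sketch of Cronbach’s \(\alpha\) for one domain, using the standard formula based on item variances and the variance of the summed score. The function name and the response data are hypothetical; McDonald’s \(\omega\) requires a fitted factor model and is not shown.

```python
from statistics import variance


def cronbach_alpha(responses: list[list[float]]) -> float:
    """Cronbach's alpha; responses[r][i] is respondent r's answer to item i."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the summed domain score
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Hypothetical responses (4 respondents x 3 items), not data from Hansen et al.
data = [[2, 3, 3], [4, 4, 5], [6, 5, 7], [8, 9, 8]]
print(round(cronbach_alpha(data), 3))
```

The calculation is only meaningful when the items form a unidimensional scale, which is why the considerations in 3.4 come first.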

Lastly, you turn to evaluating the reliability of WORQ. Here you use Box 6. Reliability from the COSMIN RoB checklist.

Questions

3.6 Using COSMIN Box 6 (Reliability), score all eight items for Hansen et al. and justify each rating.

Answer

Hansen et al. performed a test–retest of WORQ domain scores with an interval of \(\approx\) 2 weeks. Retest was completed electronically under similar conditions, and ICCs (two-way, absolute agreement) were reported. No dichotomous/nominal/ordinal reliability analyses were applicable. The answers to the eight questions are found in the table below.

COSMIN Box 6. Reliability (Hansen et al.)
Q#	Checklist item	Finding in Hansen et al.	COSMIN rating	Justification
1	Were patients stable between repeated measurements?	Stability assumed; no explicit GRC (global rating of change) anchor reported.	Adequate	COSMIN: ‘Assumable that patients were stable’ → Adequate.
2	Was the time interval appropriate?	Retest at ~14 days.	Very good	Common and appropriate interval for PROM test–retest.
3	Were measurement conditions similar for repeated measurements?	Same self-administered mode/context across time points.	Very good	Comparable conditions/mode across T1–T2.
4	For continuous scores: was the appropriate ICC calculated (with evidence of no systematic change)?	ICC\(_{agreement}\) (two-way) reported; no explicit test of no change.	Very good	Model is appropriate (ICC\(_{agreement}\)) and accounts for systematic error.
5	For dichotomous scores: was kappa calculated?	Not applicable.	NA	WORQ items are 0–10; domains are continuous.
6	For nominal scores: was an unweighted kappa calculated?	Not applicable.	NA	No nominal outcomes.
7	For ordinal scores: was a weighted kappa calculated?	Not applicable.	NA	No ordinal rater-agreement analysis required.
8	Any other important design/statistical flaws?	No important flaws affecting reliability noted.	Very good	Design/reporting suitable for test–retest.
GRC = Global Rating of Change. ICC\(_{agreement}\) = intraclass correlation coefficient (absolute agreement). Ratings per COSMIN categories: very good / adequate / doubtful / inadequate / NA.
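
Item 4 hinges on the ICC model. The sketch below computes a two-way, absolute-agreement ICC from ANOVA mean squares. It assumes the single-measurement form, ICC = (MSR − MSE) / (MSR + (k − 1)·MSE + k·(MSC − MSE)/n); the exact variant used by Hansen et al. is not specified here. The function name and the scores are hypothetical.

```python
def icc_agreement(scores: list[list[float]]) -> float:
    """Two-way, absolute-agreement, single-measurement ICC.

    scores[p][t] is the domain score of patient p at occasion t.
    """
    n = len(scores)      # patients
    k = len(scores[0])   # occasions (2 for test-retest)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[t] for row in scores) / n for t in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-patient mean square
    msc = ss_cols / (k - 1)               # between-occasion mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical test-retest scores: every patient scores 5 points higher at retest
shifted = [[10, 15], [20, 25], [30, 35]]
print(round(icc_agreement(shifted), 3))
```

In this example the ranking of patients is perfectly preserved, yet the coefficient falls below 1 because the systematic shift between occasions counts as error in the agreement form.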

3.7 Based on what you have learned from this article, do you think the WORQ would be appropriate as a primary outcome for your RCT? Give reasons for your answer.

Answer

The WORQ physical function subscale can serve as a defensible primary PROM for a work-focused CLBP RCT, but only if you plan for interpretation of change and are transparent about using a single domain.

Several strengths are noted: It targets work-relevant physical functioning (your primary outcome), has solid internal consistency and acceptable test–retest reliability, is validated in Danish, and is practical for repeated administration.

On the downside, responsiveness and the MIC have not been established, and using only one domain departs from the intended multi-domain structure and needs clear justification. Lastly, you must ensure that expected effects exceed measurement error and are not masked by floor/ceiling issues.