Exercise on reliability and validity answers

Back to the main page

Please read the Exercise introduction tab first. Then answer the questions in the ‘Questions’ tab.

Imagine you are part of a research project looking at the effect of a multidisciplinary intervention for patients with chronic whiplash associated disorder (WAD). These patients are seen both in:

  • the primary sector (medical doctors, physiotherapists, chiropractors) and
  • the secondary sector (outpatient hospital units)

within the Danish health care system.

To measure pain-related function, you decide to use the Neck Disability Index (NDI). The NDI has 10 items, each with 6 response options scored from 0–5. The total score is converted to a 0–100 scale, where a higher score means more disability.
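The scoring described above can be made concrete with a small sketch. This assumes the common NDI convention that the raw 0–50 sum is doubled to give the 0–100 score; the function name is ours, for illustration only:

```python
def ndi_score(responses):
    """Convert 10 NDI item responses (each scored 0-5) to the 0-100 scale.

    Assumes the common convention: the raw sum (0-50) is doubled.
    """
    if len(responses) != 10:
        raise ValueError("The NDI has exactly 10 items")
    if any(not 0 <= r <= 5 for r in responses):
        raise ValueError("Each item is scored 0-5")
    return sum(responses) * 2  # raw 0-50 sum -> 0-100 scale

# Example: a patient choosing the third response option (score 2) on every item
print(ndi_score([2] * 10))  # -> 40
```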

Because you are not yet sure whether the NDI is the right tool for your study, you want to check how reliable and valid it is in your sample.

You can view the NDI in full here: Neck Disability Index

Start with the questions in tab 1 (Internal consistency) and, when finished, continue to the next tab.

Please write your answers to the questions in a Word document. The exercise, including answers, can be downloaded after the course as an HTML document from the webpage.

You calculate the internal consistency (Cronbach’s alpha) and find the results shown in Table 1.

Table 1. Cronbach’s Alpha at item level

Item    Name             Alpha    95% CI
n1      Pain intensity   0.933    -
n2      Personal care    0.897    -
n3      Lifting          0.896    -
n4      Reading          0.898    -
n5      Headache         0.937    -
n6      Concentration    0.897    -
n7      Work             0.899    -
n8      Driving car      0.897    -
n9      Sleep            0.897    -
n10     Recreation       0.897    -
Total   All items        0.914    [0.899, 0.928]

Note: n = 300. Item rows show Cronbach’s alpha if that item is deleted; the Total row shows alpha for the full 10-item scale.

Questions

1.1 The Cronbach’s alpha for the NDI is presented in Table 1. What does this value mean for the questionnaire?

Answer

Cronbach’s alpha tells us how consistently the items in a questionnaire measure the same underlying construct (pain-related disability).

Our Cronbach’s alpha of 0.91 is very high. Usually, values between 0.7 and 0.9 are considered acceptable internal consistency. As our alpha is marginally above 0.9, some items may be too similar (i.e. overlapping). Whether this is the case should be evaluated by looking at the item wording, taking content validity into account.

Remember, alpha depends on the number of items and on how strongly they correlate: more items tend to push alpha upward, although with only 10 items this effect is limited. The sample size does not change alpha itself, but a fairly large sample (n = 300) makes the estimate precise, which is what we see in the narrow confidence interval here.

The confidence interval is quite narrow (0.90–0.93), which tells us the estimate is precise. It also shows that the “true” alpha almost certainly lies at, or just above, the upper end of the acceptable 0.7–0.9 range.
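To make the alpha calculation concrete, here is a minimal sketch in pure Python (the four respondents and three items are made up for illustration; in practice you would use dedicated software, e.g. the psych package in R, which produced the output above):

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = len(rows[0])                                   # number of items
    item_vars = [variance(col) for col in zip(*rows)]  # variance of each item
    total_var = variance([sum(row) for row in rows])   # variance of total scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Four respondents answering three items (hypothetical data)
scores = [
    [1, 2, 1],
    [2, 3, 3],
    [3, 4, 2],
    [4, 3, 4],
]
print(round(cronbach_alpha(scores), 3))  # -> 0.808
```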

1.2 Look at the alpha for n1 to n10 and explain what you see.

Answer

Alpha for n1 to n10 shows what the total Cronbach’s alpha would be if that specific item were removed from the scale. Most items here are around 0.897–0.899, which means removing them would hardly change the overall alpha of 0.91.

However, n1 (Pain intensity) and n5 (Headache) stand out. If either of these were removed, alpha would increase to about 0.933–0.937. This suggests that these two items don’t quite “fit” as smoothly with the others. Conceptually, you could argue that pain intensity and headache may capture slightly different aspects than the broader construct of “pain-related neck function”.

This doesn’t mean we should automatically throw them out. Instead, it’s a signal that items with unusually high alpha values may need a closer look. The next step would be to explore this further using, for example, a factor analysis, to see if these items load differently from the rest.
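The “alpha if item deleted” column can be reproduced by dropping one item at a time and recomputing alpha. A sketch with made-up data, constructed so that the third item fits poorly with the other two:

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = len(rows[0])
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_if_deleted(rows):
    """Alpha recomputed with each item removed in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != drop] for row in rows])
        for drop in range(k)
    ]

# Items 1 and 2 agree perfectly; item 3 is essentially noise
scores = [[1, 1, 5], [2, 2, 1], [3, 3, 4], [4, 4, 2], [5, 5, 3]]
full = cronbach_alpha(scores)
per_item = alpha_if_deleted(scores)

# Dropping the noisy third item raises alpha well above the full-scale value,
# which is the same pattern n1 and n5 show in Table 1
print(round(full, 3), round(per_item[2], 3))  # -> 0.316 1.0
```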

Your data include baseline (t1) and follow-up (t2) measurements taken 2 weeks apart. You have ensured that the respondents are stable. The summary statistics of the test-retest data are presented in Table 2.

Table 2. NDI test-retest summary statistics

N     Min score   Max score   Mean score t1   Mean score t2   Mean difference   SD of differences   95% prediction interval (±)
300   0           100         48.5            43.3            −5.2              9.8                 19.2

Questions

2.1 Draw a Bland & Altman limits of agreement plot using Table 2.

Answer

The answer can be seen in Figure 1. Notice the numbers on the x- and y-axes, and that some of the dots fall outside the LoA interval, as it is a 95% interval.

2.2 What is the measurement error with 95% prediction intervals and which unit does it have?

Answer

Measurement error (95% prediction/LoA): ± 19.2 NDI points. This value is 1.96 × SD of the (t2 − t1) differences and represents the random measurement error (the half-width of the 95% limits of agreement).

Bias (systematic difference): −5.2 NDI points.

95% limits of agreement (LoA): −5.2 ± 19.2 = [−24.4; 14.0] NDI points. The interpretation is that for a future pair of measurements on the same person, the difference (t2 − t1) is expected to lie between −24.4 and 14.0 points 95% of the time.

Units: All quantities are in NDI points (0–100), so the error and limits are directly interpretable on the original scale.

Notes: The ± 19.2 points reflect random error only (i.e. measurement error of consistency) whereas the bias (−5.2) is reported separately. These statements assume the differences are approximately normally distributed with constant variance across the measurement range.
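The arithmetic behind these numbers can be checked directly from the summary statistics in Table 2:

```python
# Summary statistics taken from Table 2
mean_diff = -5.2   # systematic bias (t2 - t1), NDI points
sd_diff = 9.8      # SD of the differences, NDI points

# Random measurement error: half-width of the 95% limits of agreement
error = 1.96 * sd_diff
lower = mean_diff - error   # lower limit of agreement
upper = mean_diff + error   # upper limit of agreement

print(round(error, 1), round(lower, 1), round(upper, 1))  # -> 19.2 -24.4 14.0
```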

2.3 Are you happy with the size of the measurement error?

Answer

The measurement error is 19.2 points. This corresponds to roughly 1/5 of the NDI scale range which is not trivial.

Whether this is acceptable is partly a clinical judgement. Measurement error sets the boundary for what we call a “true change” — a change that goes beyond what could just be random variation. However, remember the Bland & Altman plot only gives the measurement error of consistency as systematic error is not included in the 19.2 points.

So, if we expect only a small treatment effect, then a large measurement error is problematic because the effect may be lost in the “noise”. However, if we expect a large treatment effect, then we can live with a bigger measurement error because the effect is still visible despite the variability.

Therefore, the short answer is that “it depends”. Statistically we can report the measurement error, but whether it’s too big or acceptable is a question that requires clinical knowledge and context.

2.4 What have you learned about the systematic error?

Answer

The systematic error is the mean difference between test and retest. In our data it is about -5.2 points on the 0–100 NDI scale.

This tells us that, on average, patients scored slightly lower at retest compared to baseline. The size of this shift is relatively small compared to the full scale, but it still matters because it represents a consistent bias rather than random noise.

The bias could reflect a design issue in the test–retest study (e.g. patients were not truly stable, or the time interval allowed real change). However, it could also reflect measurement issues, such as items being phrased in a way that makes responses vary depending on context (e.g. time of day, patient interpretation).

Relative to the full 0–100 scale, this systematic error of about −5% is borderline of what I would consider acceptable.

In summary, the systematic error is -5.2 points. Whether this is acceptable depends on how much bias you are willing to tolerate, but ideally it should be close to zero.

The patients included in the study are from both the primary sector (physiotherapy, chiropractic and GP practices) and the secondary sector (ambulatory hospital units). You decide to calculate the ICC for both groups and find the results shown in Table 3.

Table 3. Reliability of the NDI in two different populations

Population                  ICC(consistency)   95% CI         ICC(agreement)   95% CI
Primary sector patients     0.93               [0.91, 0.95]   0.91             [0.82, 0.95]
Secondary sector patients   0.83               [0.75, 0.88]   0.80             [0.63, 0.88]

Questions

3.1 Why do you see a difference between ICCconsistency and ICCagreement?

Answer

The difference comes from how the two ICCs treat systematic error.

ICCagreement takes into account both random variation and any systematic shift between the two measurements. In our data, the average difference (systematic error) is -5.2 points. This lowers the ICC because the method sees that patients, on average, scored differently at retest than at baseline.

ICCconsistency, on the other hand, ignores the systematic difference. It only looks at whether people keep the same relative ranking across the two measurements. So even if everyone shifts up or down a bit, the consistency ICC will stay higher.
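This can be illustrated numerically. Below is a sketch (made-up data; single-measures two-way formulas in the McGraw & Wong style) in which every patient scores exactly 5 points lower at retest: consistency stays perfect, while agreement is penalised for the systematic shift:

```python
def icc_pair(t1, t2):
    """Single-measures ICC(consistency) and ICC(agreement) for two occasions."""
    n, k = len(t1), 2
    rows = list(zip(t1, t2))
    grand = sum(t1 + t2) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(t1) / n, sum(t2) / n]

    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)  # between subjects
    msc = n * sum((c - grand) ** 2 for c in col_means) / (k - 1)  # between occasions
    sse = sum(
        (rows[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))                               # residual

    icc_c = (msr - mse) / (msr + (k - 1) * mse)
    icc_a = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return icc_c, icc_a

t1 = [10, 20, 30, 40, 50]
t2 = [x - 5 for x in t1]          # uniform 5-point drop at retest
icc_c, icc_a = icc_pair(t1, t2)
print(round(icc_c, 3), round(icc_a, 3))  # -> 1.0 0.952
```

The ranking of patients is preserved exactly, so ICC(consistency) is 1.0; ICC(agreement) drops because the 5-point shift counts as error.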

3.2 Why do you think the ICCs are lower in the secondary sector patients?

Answer

ICCs depend on how heterogeneous the patient population is.

In the primary sector (GPs, physiotherapists, chiropractors), WAD patients cover a wide spectrum of severity - from very mildly affected to severely affected. This variability increases the between-patient variance, which tends to raise ICC values.

In contrast, in the secondary sector (hospital spinal units), the patients are a more selected and homogeneous group, usually representing those with more severe problems. With less variation between patients, the ICC drops.

3.3 Which parameter do you prefer: ICCconsistency or ICCagreement? Justify your choice.

Answer

In most cases, ICCagreement is preferred, because it takes both random variation and systematic error into account. This gives a more realistic picture of reliability when we are using the scale to track actual patient scores.

ICCconsistency can still be useful, but only in special cases:

  • If we only care about the ranking of patients (e.g. whether high scorers stay high and low scorers stay low), not their absolute scores.
  • If we are certain there is no systematic error between measurements (which is quite rare in practice).

In our example, we did see a systematic difference between test and retest. That’s why the consistency ICC is higher, but the agreement ICC is more trustworthy.

Question

4.1 Validation is a continuous process. Give at least two reasons why this is so.

Answer

Validation never really stops; it is something we need to revisit whenever circumstances change. Two main reasons are:

  • New context or population: An instrument that works well in one setting may not perform the same in another. For example, if we apply the NDI to a different patient group or for a new purpose, we need to re-check its validity.
  • Advances in knowledge: Over time, theories evolve and new evidence appears. This allows us to test stronger or more precise hypotheses, or even challenge earlier assumptions about what the instrument is measuring.

Imagine you have had intermittent neck pain for the past 2-3 years. Right now you have neck pain radiating into the left shoulder region. You want to know if the Neck Disability Index works for your problem. The content of the 10 items of the NDI is outlined in Table 4.

Table 4. Content of the NDI
Item Name
n1 Pain intensity
n2 Personal care
n3 Lifting
n4 Reading
n5 Headache
n6 Concentration
n7 Work
n8 Driving car
n9 Sleep
n10 Recreation

(You can view the NDI in full here: Neck Disability Index)

Questions

5.1 Are the included items relevant for you as a neck pain patient? If not, please state why?

Answer

This question is subjective and depends on each person’s daily life and situation. For example, I personally find some items harder to relate to:

  • Recreation: I am not sure exactly what is meant by this; leisure activities can be very different from person to person.
  • Lifting: I rarely experience neck pain when lifting, so this item would feel irrelevant to me.
  • Work: It is unclear whether this refers to a paid job specifically, or to any type of work activity (like housework). That makes the item harder to interpret.

5.2 Are there any missing areas/domains/constructs which would be relevant for a neck pain patient? If yes, please state what is missing and why?

Answer

Yes, the NDI does not cover several areas that many neck pain patients experience. Some important examples are:

Physical symptoms

  • Mechanical dysfunction: stiffness, creaking, or locking of the neck, which often limits daily movement.
  • Pain characteristics: type of pain (burning, stabbing, stinging, etc.) can give valuable clinical information.
  • Radiating symptoms: pain, tingling, or numbness in the arms, or jaw pain, may suggest nerve involvement.

Neurological and vestibular symptoms

  • Dizziness and balance problems: common in cervicogenic conditions, affecting walking and stability.
  • Nausea/vomiting: sometimes linked with vestibular involvement.

Cognitive and emotional impact

  • Fatigue: chronic pain is draining and affects energy.
  • Fear of movement (kinesiophobia): common in neck pain, leading to avoidance and deconditioning.
  • Mood and social effects: depression, irritability, and isolation can develop when activities are reduced.

Functional limitations

  • Work tasks: computer use, prolonged sitting, or overhead activities are often triggers.
  • Recreational/household tasks: sports, gardening, chores like vacuuming or lifting can all be problematic (the NDI only covers a general recreation item).

Lifestyle

  • Sensitivity to light/sound: often linked with neck-related headaches.

In summary, the NDI focuses on some core functional areas but misses several domains that could be very relevant for patients, especially physical symptoms, neurological complaints, emotional impacts, and lifestyle factors.

5.3 After having considered the questions included in the NDI, how do you consider the content validity?

Answer

Content validity depends on how clearly we define what the instrument is supposed to measure. For the NDI, the developers stated the target was “pain-related function”, but this was never sharply defined.

Looking at the items:

  • The first item (n1, pain intensity) asks about pain, which is really a symptom, not function. It also does not delineate which type of pain is meant (e.g. burning, stabbing, radiating).
  • Other common symptoms (e.g. stiffness, dizziness, radiating arm pain) are not included, so the NDI does not give a full picture of the patient’s symptom state.
  • One item, n5 – headache, appears to be a slightly misfitting item. While headaches are often linked with neck problems, they may represent a somewhat different construct than “pain-related function.” This raises the question of whether the item truly belongs in the scale or if it should be treated separately.
  • On the function side, the items capture some daily activities but miss several important ones, meaning the functional domain is only partly covered.

Therefore, while the NDI captures some aspects of pain-related function, the inclusion of symptoms like pain intensity and headache, plus the absence of other important symptoms and functional tasks, makes its content validity questionable — especially if the instrument is used to monitor patients over time.

You also want to measure criterion validity for the NDI.

Question

6.1 Name 2-3 good criteria for measuring function of the neck? Indicate why you think they are good.

ImportantAnswer

Finding a truly valid and reliable external criterion is challenging, because the NDI was not developed from a clear conceptual model. This means we cannot be certain what the NDI is really measuring, which makes the choice of a “gold standard” criterion difficult.

If we assume that the NDI is intended to measure “pain-related function”, then possible external criteria could be:

  • Cervical range of motion: measurable with goniometers or electronic devices. This is a direct indicator of how much the neck can move, which may (or may not) relate to function. This is certainly not a good gold standard.
  • The Global perceived effect (GPE) scale specific to function: a simple patient-reported question about whether their neck-related functioning has improved or worsened. This provides a patient-centered external anchor.
  • Pressure pain threshold (algometry): an objective test of pain sensitivity. While more symptom-oriented, it could be used as a physiological correlate of pain-related limitation.

As you can see, choosing external criteria is tricky, but these examples illustrate how we can link the NDI to both objective tests (like range of motion or algometry) and patient-reported outcomes (like the GPE).

You have decided to include a generic multidimensional outcome measure in addition to the NDI. This is the SF-36 (Short form 36 items) which has been validated in Danish. The SF-36 consists of 8 scales and two summary scales as follows:

Box 4. Internal consistency


Each scale of the SF-36 is briefly described below:

Physical Functioning (PF). Assesses limitations in normal physical activities (lifting, climbing stairs, bending, kneeling, walking moderate distances), and is designed to estimate the severity of the limitation (10 questions).

Role/Physical (RP). Assesses work-function limitations caused by physical health problems. ‘Role’ applies to work or everyday responsibilities (a job, community activity or volunteer work) typical for a specific age (4 questions).

Bodily Pain (BP). Assesses the severity of pain and the extent to which it interferes with daily activities (2 questions).

General health (GH). Assesses physical health status (current and prior health), and has been documented to be a good predictor of health care expenditure (5 questions).

Vitality/Energy (VT). Assesses a subjective feeling of well-being, including energy and fatigue (4 questions).

Social Functioning (SF). Assesses the quantity and quality of interaction with others (social relationships), extending measurement beyond exclusively physical and mental health concepts (2 questions).

Role/Emotional (RE). Assesses ‘role’ (see above for explanation of ‘role’) limitations due to emotional problems (3 questions).

Mental Health/Emotional well-being (MH). Assesses the four major mental health dimensions of anxiety, depression, loss of behavioural or emotional control, and psychological well-being (5 questions).

Summary measures. The SF-36 also provides 2 important summary measures of health-related quality of life: the Physical Component Summary (PCS) and Mental Component Summary (MCS) scales. The strength of both summary measures lies in their ability to distinguish a physical from a mental outcome.

Question

7.1 You want to test construct validity (hypothesis testing) of the NDI. Please describe at least 3 a priori hypotheses which are specific (i.e. have direction, strength and reason).

Answer

Three well-formulated hypotheses could be:

Strong negative correlation with physical functioning

Hypothesis: The NDI will correlate negatively and strongly (r < -0.50) with the Physical Functioning (PF) subscale of the SF-36.

Reason: Higher NDI scores = more disability, while higher SF-36 PF scores = better functioning. Since they measure opposite ends of physical function, we expect a strong negative correlation.

Moderate negative correlation with vitality (energy/fatigue)

Hypothesis: In primary sector patients, the NDI will correlate moderately and negatively (r between −0.30 and −0.50) with the Vitality (VT) subscale.

Reason: Neck disability is expected to influence energy and fatigue, but the relationship is less direct than with physical functioning, so a moderate association is expected.

Moderate-to-strong negative correlation with mental health in secondary sector patients

Hypothesis: In secondary sector patients, the NDI will correlate moderately to strongly and negatively (r < −0.50) with the Mental Component Summary (MCS) score of the SF-36.

Reason: More severely affected patients often experience stronger emotional and mental health impacts from disability, so the relationship with mental health is expected to be more pronounced in this group.
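Once the data are collected, testing each hypothesis reduces to computing the correlation and comparing it with the pre-specified threshold. A minimal sketch with made-up NDI and SF-36 PF scores (a real analysis would use the study data and an appropriate coefficient, e.g. Pearson or Spearman):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical scores: higher NDI (more disability) paired with lower SF-36 PF
ndi = [10, 20, 35, 50, 70]
pf = [90, 80, 60, 45, 20]

r = pearson_r(ndi, pf)
# Hypothesis 1 pre-specified a strong negative correlation, r < -0.50
print(r < -0.50)  # -> True
```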