Exercise on reliability and validity answers

Back to the main page

Please read the Exercise introduction tab first. Then answer the questions in the ‘Questions’ tab.

Imagine you are part of a research project looking at the effect of a multidisciplinary intervention for patients with chronic whiplash associated disorder (WAD). These patients are seen both in:

  • the primary sector (medical doctors, physiotherapists, chiropractors) and
  • the secondary sector (outpatient hospital units)

within the Danish health care system.

To measure pain-related function, you decide to use the Neck Disability Index (NDI). The NDI has 10 items, each with 6 response options scored from 0–5. The total score is converted to a 0–100 scale, where a higher score means more disability.
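The scoring described above can be made concrete with a small sketch. This assumes the common NDI convention that the raw 0–50 sum is doubled to give the 0–100 score; the function name is ours, for illustration only:

```python
def ndi_score(responses):
    """Convert 10 NDI item responses (each scored 0-5) to the 0-100 scale.

    Assumes the common convention: the raw sum (0-50) is doubled.
    """
    if len(responses) != 10:
        raise ValueError("The NDI has exactly 10 items")
    if any(not 0 <= r <= 5 for r in responses):
        raise ValueError("Each item is scored 0-5")
    return sum(responses) * 2  # raw 0-50 sum -> 0-100 scale

# Example: a patient choosing the third response option (score 2) on every item
print(ndi_score([2] * 10))  # -> 40
```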

Because you are not yet sure whether the NDI is the right tool for your study, you want to check how reliable and valid it is in your sample.

You can view the NDI in full here: Neck Disability Index

Start with the questions in tab 1 (Internal consistency) and, when finished, continue to the next tab.

Please write your answers to the questions in a Word document. The exercise, including answers, can be downloaded after the course as an HTML document from the webpage.

You calculate the internal consistency (Cronbach’s alpha) and find the results shown in Table 1.

Table 1. Cronbach’s Alpha at item level

Item    Name             Alpha    95% CI
n1      Pain intensity   0.933    -
n2      Personal care    0.897    -
n3      Lifting          0.896    -
n4      Reading          0.898    -
n5      Headache         0.937    -
n6      Concentration    0.897    -
n7      Work             0.899    -
n8      Driving car      0.897    -
n9      Sleep            0.897    -
n10     Recreation       0.897    -
Total   All items        0.914    [0.899, 0.928]

Note: n = 300. Item rows show Cronbach’s alpha if that item is deleted; the Total row shows alpha for the full 10-item scale.

Questions

1.1 The Cronbach’s alpha for the NDI is presented in Table 1. What does this value mean for the questionnaire?

Answer

Cronbach’s alpha tells us how consistently the items in a questionnaire measure the same underlying construct (pain-related disability).

Our Cronbach’s alpha of 0.91 is very high. Usually, values between 0.7 and 0.9 are considered acceptable internal consistency. As our alpha is marginally above 0.9, some items may be too similar (i.e. overlapping). Whether this is the case should be evaluated by looking at the item wording, taking content validity into account.

Remember, alpha depends on the number of items and on how strongly they correlate: more items tend to push alpha upward, although with only 10 items this effect is limited. The sample size does not change alpha itself, but a fairly large sample (n = 300) makes the estimate precise, which is what we see in the narrow confidence interval here.

The confidence interval is quite narrow (0.90–0.93), which tells us the estimate is precise. It also shows that the “true” alpha almost certainly lies at, or just above, the upper end of the acceptable 0.7–0.9 range.
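To make the alpha calculation concrete, here is a minimal sketch in pure Python (the four respondents and three items are made up for illustration; in practice you would use dedicated software, e.g. the psych package in R, which produced the output above):

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = len(rows[0])                                   # number of items
    item_vars = [variance(col) for col in zip(*rows)]  # variance of each item
    total_var = variance([sum(row) for row in rows])   # variance of total scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Four respondents answering three items (hypothetical data)
scores = [
    [1, 2, 1],
    [2, 3, 3],
    [3, 4, 2],
    [4, 3, 4],
]
print(round(cronbach_alpha(scores), 3))  # -> 0.808
```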

1.2 Look at the alpha for n1 to n10 and explain what you see.

Answer

Alpha for n1 to n10 shows what the total Cronbach’s alpha would be if that specific item were removed from the scale. Most items here are around 0.897–0.899, which means removing them would hardly change the overall alpha of 0.91.

However, n1 (Pain intensity) and n5 (Headache) stand out. If either of these were removed, alpha would increase to about 0.933–0.937. This suggests that these two items don’t quite “fit” as smoothly with the others. Conceptually, you could argue that pain intensity and headache may capture slightly different aspects than the broader construct of “pain-related neck function”.

This doesn’t mean we should automatically throw them out. Instead, it’s a signal that items with unusually high alpha values may need a closer look. The next step would be to explore this further using, for example, a factor analysis, to see if these items load differently from the rest.
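The “alpha if item deleted” column can be reproduced by dropping one item at a time and recomputing alpha. A sketch with made-up data, constructed so that the third item fits poorly with the other two:

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = len(rows[0])
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_if_deleted(rows):
    """Alpha recomputed with each item removed in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != drop] for row in rows])
        for drop in range(k)
    ]

# Items 1 and 2 agree perfectly; item 3 is essentially noise
scores = [[1, 1, 5], [2, 2, 1], [3, 3, 4], [4, 4, 2], [5, 5, 3]]
full = cronbach_alpha(scores)
per_item = alpha_if_deleted(scores)

# Dropping the noisy third item raises alpha well above the full-scale value,
# which is the same pattern n1 and n5 show in Table 1
print(round(full, 3), round(per_item[2], 3))  # -> 0.316 1.0
```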

Your data include baseline (t1) and follow-up (t2) measurements taken 2 weeks apart. You have ensured that the respondents are stable. The summary statistics of the test-retest data are presented in Table 2.

Table 2. NDI test-retest summary statistics

N     Min score   Max score   Mean score t1   Mean score t2   Mean difference   SD of differences   95% prediction interval (±)
300   0           100         48.5            43.3            −5.2              9.8                 19.2

Questions

2.1 Draw a Bland & Altman limits of agreement plot using Table 2.

Answer

The answer can be seen in Figure 1. Notice the numbers on the x- and y-axes, and that some of the dots fall outside the LoA interval, as it is a 95% interval.

2.2 What is the measurement error with 95% prediction intervals and which unit does it have?

Answer

Measurement error (95% prediction/LoA): ± 19.2 NDI points. This value is 1.96 × SD of the (t2 − t1) differences and represents the random measurement error (the half-width of the 95% limits of agreement).

Bias (systematic difference): −5.2 NDI points.

95% limits of agreement (LoA): −5.2 ± 19.2 = [−24.4; 14.0] NDI points. The interpretation is that for a future pair of measurements on the same person, the difference (t2 − t1) is expected to lie between −24.4 and 14.0 points 95% of the time.

Units: All quantities are in NDI points (0–100), so the error and limits are directly interpretable on the original scale.

Notes: The ± 19.2 points reflect random error only (i.e. measurement error of consistency) whereas the bias (−5.2) is reported separately. These statements assume the differences are approximately normally distributed with constant variance across the measurement range.
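The arithmetic behind these numbers can be checked directly from the summary statistics in Table 2:

```python
# Summary statistics taken from Table 2
mean_diff = -5.2   # systematic bias (t2 - t1), NDI points
sd_diff = 9.8      # SD of the differences, NDI points

# Random measurement error: half-width of the 95% limits of agreement
error = 1.96 * sd_diff
lower = mean_diff - error   # lower limit of agreement
upper = mean_diff + error   # upper limit of agreement

print(round(error, 1), round(lower, 1), round(upper, 1))  # -> 19.2 -24.4 14.0
```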

2.3 Are you happy with the size of the measurement error?

Answer

The measurement error is 19.2 points. This corresponds to roughly 1/5 of the NDI scale range which is not trivial.

Whether this is acceptable is partly a clinical judgement. Measurement error sets the boundary for what we call a “true change” — a change that goes beyond what could just be random variation. However, remember the Bland & Altman plot only gives the measurement error of consistency as systematic error is not included in the 19.2 points.

So, if we expect only a small treatment effect, then a large measurement error is problematic because the effect may be lost in the “noise”. However, if we expect a large treatment effect, then we can live with a bigger measurement error because the effect is still visible despite the variability.

Therefore, the short answer is that “it depends”. Statistically we can report the measurement error, but whether it’s too big or acceptable is a question that requires clinical knowledge and context.

2.4 What have you learned about the systematic error?

Answer

The systematic error is the mean difference between test and retest. In our data it is about -5.2 points on the 0–100 NDI scale.

This tells us that, on average, patients scored slightly lower at retest compared to baseline. The size of this shift is relatively small compared to the full scale, but it still matters because it represents a consistent bias rather than random noise.

The bias could reflect a design issue in the test–retest study (e.g. patients were not truly stable, or the time interval allowed real change). However, it could also reflect measurement issues, such as items being phrased in a way that makes responses vary depending on context (e.g. time of day, patient interpretation).

Relative to the full 0–100 scale, this systematic error of about −5% is borderline of what I would consider acceptable.

In summary, the systematic error is -5.2 points. Whether this is acceptable depends on how much bias you are willing to tolerate, but ideally it should be close to zero.

The patients included in the study are from both the primary sector (physiotherapy, chiropractic and GP practices) and the secondary sector (ambulatory hospital units). You decide to calculate the ICC for both groups and find the results shown in Table 3.

Table 3. Reliability of the NDI in two different populations

Population                  ICC(consistency)   95% CI         ICC(agreement)   95% CI
Primary sector patients     0.93               [0.91, 0.95]   0.91             [0.82, 0.95]
Secondary sector patients   0.83               [0.75, 0.88]   0.80             [0.63, 0.88]

Questions

3.1 Why do you see a difference between ICCconsistency and ICCagreement?

Answer

The difference comes from how the two ICCs treat systematic error.

ICCagreement takes into account both random variation and any systematic shift between the two measurements. In our data, the average difference (systematic error) is -5.2 points. This lowers the ICC because the method sees that patients, on average, scored differently at retest than at baseline.

ICCconsistency, on the other hand, ignores the systematic difference. It only looks at whether people keep the same relative ranking across the two measurements. So even if everyone shifts up or down a bit, the consistency ICC will stay higher.
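This can be illustrated numerically. Below is a sketch (made-up data; single-measures two-way formulas in the McGraw & Wong style) in which every patient scores exactly 5 points lower at retest: consistency stays perfect, while agreement is penalised for the systematic shift:

```python
def icc_pair(t1, t2):
    """Single-measures ICC(consistency) and ICC(agreement) for two occasions."""
    n, k = len(t1), 2
    rows = list(zip(t1, t2))
    grand = sum(t1 + t2) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(t1) / n, sum(t2) / n]

    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)  # between subjects
    msc = n * sum((c - grand) ** 2 for c in col_means) / (k - 1)  # between occasions
    sse = sum(
        (rows[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))                               # residual

    icc_c = (msr - mse) / (msr + (k - 1) * mse)
    icc_a = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return icc_c, icc_a

t1 = [10, 20, 30, 40, 50]
t2 = [x - 5 for x in t1]          # uniform 5-point drop at retest
icc_c, icc_a = icc_pair(t1, t2)
print(round(icc_c, 3), round(icc_a, 3))  # -> 1.0 0.952
```

The ranking of patients is preserved exactly, so ICC(consistency) is 1.0; ICC(agreement) drops because the 5-point shift counts as error.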

3.2 Why do you think the ICCs are lower in the secondary sector patients?

Answer

ICCs depend on how heterogeneous the patient population is.

In the primary sector (GPs, physiotherapists, chiropractors), WAD patients cover a wide spectrum of severity - from very mildly affected to severely affected. This variability increases the between-patient variance, which tends to raise ICC values.

In contrast, in the secondary sector (hospital spinal units), the patients are a more selected and homogeneous group, usually representing those with more severe problems. With less variation between patients, the ICC drops.

3.3 Which parameter do you prefer: ICCconsistency or ICCagreement? Justify your choice.

Answer

In most cases, ICCagreement is preferred, because it takes both random variation and systematic error into account. This gives a more realistic picture of reliability when we are using the scale to track actual patient scores.

ICCconsistency can still be useful, but only in special cases:

  • If we only care about the ranking of patients (e.g. whether high scorers stay high and low scorers stay low), not their absolute scores.
  • If we are certain there is no systematic error between measurements (which is quite rare in practice).

In our example, we did see a systematic difference between test and retest. That’s why the consistency ICC is higher, but the agreement ICC is more trustworthy.

Question

4.1 Validation is a continuous process. Give at least two reasons why this is so.

Answer

Validation never really stops; it is something we need to revisit whenever circumstances change. Two main reasons are:

  • New context or population: An instrument that works well in one setting may not perform the same in another. For example, if we apply the NDI to a different patient group or for a new purpose, we need to re-check its validity.
  • Advances in knowledge: Over time, theories evolve and new evidence appears. This allows us to test stronger or more precise hypotheses, or even challenge earlier assumptions about what the instrument is measuring.

Imagine you have had intermittent neck pain for the past 2-3 years. Right now you have neck pain radiating into the left shoulder region. You want to know if the Neck Disability Index works for your problem. The content of the 10 items of the NDI is outlined in Table 4.

Table 4. Content of the NDI
Item Name
n1 Pain intensity
n2 Personal care
n3 Lifting
n4 Reading
n5 Headache
n6 Concentration
n7 Work
n8 Driving car
n9 Sleep
n10 Recreation

(You can view the NDI in full here: Neck Disability Index)

Questions

5.1 Are the included items relevant for you as a neck pain patient? If not, please state why?

Answer

This question is subjective and depends on each person’s daily life and situation. For example, I personally find some items harder to relate to:

  • Recreation: I am not sure exactly what is meant by this; leisure activities can be very different from person to person.
  • Lifting: I rarely experience neck pain when lifting, so this item would feel irrelevant to me.
  • Work: It is unclear whether this refers to a paid job specifically, or to any type of work activity (like housework). That makes the item harder to interpret.

5.2 Are there any missing areas/domains/constructs which would be relevant for a neck pain patient? If yes, please state what is missing and why?

Answer

Yes, the NDI does not cover several areas that many neck pain patients experience. Some important examples are:

Physical symptoms

  • Mechanical dysfunction: stiffness, creaking, or locking of the neck, which often limits daily movement.
  • Pain characteristics: type of pain (burning, stabbing, stinging, etc.) can give valuable clinical information.
  • Radiating symptoms: pain, tingling, or numbness in the arms, or jaw pain, may suggest nerve involvement.

Neurological and vestibular symptoms

  • Dizziness and balance problems: common in cervicogenic conditions, affecting walking and stability.
  • Nausea/vomiting: sometimes linked with vestibular involvement.

Cognitive and emotional impact

  • Fatigue: chronic pain is draining and affects energy.
  • Fear of movement (kinesiophobia): common in neck pain, leading to avoidance and deconditioning.
  • Mood and social effects: depression, irritability, and isolation can develop when activities are reduced.

Functional limitations

  • Work tasks: computer use, prolonged sitting, or overhead activities are often triggers.
  • Recreational/household tasks: sports, gardening, chores like vacuuming or lifting can all be problematic (the NDI only covers a general recreation item).

Lifestyle

  • Sensitivity to light/sound: often linked with neck-related headaches.

In summary, the NDI focuses on some core functional areas but misses several domains that could be very relevant for patients, especially physical symptoms, neurological complaints, emotional impacts, and lifestyle factors.

5.3 After having considered the questions included in the NDI, how do you consider the content validity?

Answer

Content validity depends on how clearly we define what the instrument is supposed to measure. For the NDI, the developers stated the target was “pain-related function”, but this was never sharply defined.

Looking at the items:

  • The first item (n1, pain intensity) asks about pain, which is really a symptom, not function. It also does not delineate which type of pain is meant (e.g. burning, stabbing, radiating).
  • Other common symptoms (e.g. stiffness, dizziness, radiating arm pain) are not included, so the NDI does not give a full picture of the patient’s symptom state.
  • One item, n5 – headache, appears to be a slightly misfitting item. While headaches are often linked with neck problems, they may represent a somewhat different construct than “pain-related function.” This raises the question of whether the item truly belongs in the scale or if it should be treated separately.
  • On the function side, the items capture some daily activities but miss several important ones, meaning the functional domain is only partly covered.

Therefore, while the NDI captures some aspects of pain-related function, the inclusion of symptoms like pain intensity and headache, plus the absence of other important symptoms and functional tasks, makes its content validity questionable — especially if the instrument is used to monitor patients over time.

You also want to measure criterion validity for the NDI.

Question

6.1 Name 2-3 good criteria for measuring function of the neck? Indicate why you think they are good.

ImportantAnswer

Finding a truly valid and reliable external criterion is challenging, because the NDI was not developed from a clear conceptual model. This means we cannot be certain what the NDI is really measuring, which makes the choice of a “gold standard” criterion difficult.

If we assume that the NDI is intended to measure “pain-related function”, then possible external criteria could be:

  • Cervical range of motion: measurable with goniometers or electronic devices. This is a direct indicator of how much the neck can move, which may (or may not) relate to function. This is certainly not a good gold standard.
  • The Global perceived effect (GPE) scale specific to function: a simple patient-reported question about whether their neck-related functioning has improved or worsened. This provides a patient-centered external anchor.
  • Pressure pain threshold (algometry): an objective test of pain sensitivity. While more symptom-oriented, it could be used as a physiological correlate of pain-related limitation.

As you can see, choosing external criteria is tricky, but these examples illustrate how we can link the NDI to both objective tests (like range of motion or algometry) and patient-reported outcomes (like the GPE).

You have decided to include a generic multidimensional outcome measure in addition to the NDI. This is the SF-36 (Short form 36 items) which has been validated in Danish. The SF-36 consists of 8 scales and two summary scales as follows:

Box 4. Internal consistency


Each scale of the SF-36 is briefly described below:

Physical Functioning (PF). Assesses limitations in normal physical activities (lifting, climbing stairs, bending, kneeling, walking moderate distances), and is designed to estimate the severity of the limitation (10 questions).

Role/Physical (RP). Assesses work-function limitations caused by physical health problems. ‘Role’ applies to work or everyday responsibilities (a job, community activity or volunteer work) typical for a specific age (4 questions).

Bodily Pain (BP). Assesses the severity of pain and the extent to which it interferes with daily activities (2 questions).

General health (GH). Assesses physical health status (current and prior health), and has been documented to be a good predictor of health care expenditure (5 questions).

Vitality/Energy (VT). Assesses a subjective feeling of well-being, including energy and fatigue (4 questions).

Social Functioning (SF). Assesses the quantity and quality of interaction with others (social relationships), extending measurement beyond exclusively physical and mental health concepts (2 questions).

Role/Emotional (RE). Assesses ‘role’ (see above for explanation of ‘role’) limitations due to emotional problems (3 questions).

Mental Health/Emotional well-being (MH). Assesses the four major mental health dimensions of anxiety, depression, loss of behavioural or emotional control, and psychological well-being (5 questions).

Summary measures. The SF-36 also provides 2 important summary measures of health-related quality of life: the Physical Component Summary (PCS) and Mental Component Summary (MCS) scales. The strength of both summary measures lies in their ability to distinguish a physical from a mental outcome.

Question

7.1 You want to test construct validity (hypothesis testing) of the NDI. Please describe at least 3 a priori hypotheses which are specific (i.e. have direction, strength and reason).

Answer

Three well-formulated hypotheses could be:

Strong negative correlation with physical functioning

Hypothesis: The NDI will correlate negatively and strongly (r < -0.50) with the Physical Functioning (PF) subscale of the SF-36.

Reason: Higher NDI scores = more disability, while higher SF-36 PF scores = better functioning. Since they measure opposite ends of physical function, we expect a strong negative correlation.

Moderate negative correlation with vitality (energy/fatigue)

Hypothesis: In primary sector patients, the NDI will correlate moderately and negatively (r between −0.30 and −0.50) with the Vitality (VT) subscale.

Reason: Neck disability is expected to influence energy and fatigue, but the relationship is less direct than with physical functioning, so a moderate association is expected.

Moderate-to-strong negative correlation with mental health in secondary sector patients

Hypothesis: In secondary sector patients, the NDI will correlate moderately to strongly and negatively (r < −0.50) with the Mental Component Summary (MCS) score of the SF-36.

Reason: More severely affected patients often experience stronger emotional and mental health impacts from disability, so the relationship with mental health is expected to be more pronounced in this group.
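Once the data are collected, testing each hypothesis reduces to computing the correlation and comparing it with the pre-specified threshold. A minimal sketch with made-up NDI and SF-36 PF scores (a real analysis would use the study data and an appropriate coefficient, e.g. Pearson or Spearman):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical scores: higher NDI (more disability) paired with lower SF-36 PF
ndi = [10, 20, 35, 50, 70]
pf = [90, 80, 60, 45, 20]

r = pearson_r(ndi, pf)
# Hypothesis 1 pre-specified a strong negative correlation, r < -0.50
print(r < -0.50)  # -> True
```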