9.26 Selection bias arises from underlying differences between the treatment and comparison groups, which might cause them to have different outcomes irrespective of the policy. Bias affects all members of a group, on average, in the same direction. For example, with an urban redevelopment initiative, the treatment areas might be more deprived than the comparison areas. The success with which a research design is able to address these systematic differences is called the strength of the design. Strength is a subjective concept and is not a numerical quantity.
9.27 In addition, there are also random differences between individual members of both groups which affect their outcomes independently. For example, some pupils taking a school test might do well just through luck or less well due to "having a bad day", irrespective of underlying ability. These kinds of differences appear as random fluctuations or "noise" in the outcome measure. The power of a design is its ability to detect policy effects in the midst of "noise". Power is a numerical quantity - it is defined as the probability that if the true effect is of a given size, then the design will detect it with a given level of confidence, or at a given "significance level".13 The relationship between power and strength is shown in Table 9.B.
Table 9.B: The relationship between power and strength

|  | Weak design (poor counterfactual or none at all) | Strong design (realistic counterfactual estimate) |
| --- | --- | --- |
| Low power (small number of observations and/or policy effect small relative to noise) | Unlikely to detect a difference between groups or over time. And even if we do, we have no confidence in attributing it to the policy. | Unlikely to detect a difference between groups. But if we do, then we have confidence in attributing it to the policy. |
| High power (large number of observations and/or policy effect large relative to noise) | Very likely to find a significant difference between groups, but this does not mean it can be attributed to the policy. | Likely to detect a difference between groups, and we can be confident in attributing it to the policy. |
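To make this definition of power concrete, the short sketch below (an illustration only, with assumed values for the effect size, noise level and sample size) estimates power by simulation: it repeatedly draws a treatment and a comparison group with a known true effect and counts how often an unpaired t-test detects that effect at the five per cent significance level.

```python
# A minimal sketch, not from the original text: it assumes normally distributed
# noise and arbitrary illustrative values for the effect size, noise and sample
# size. Power is estimated as the share of simulated studies in which an
# unpaired t-test detects the (known) true effect at the five per cent level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.5      # "signal": mean policy effect (illustrative)
noise_sd = 1.0         # "noise": natural variation between individuals
n_per_group = 50       # observations in each of the two groups
trials = 2000          # number of simulated studies

detected = 0
for _ in range(trials):
    comparison = rng.normal(0.0, noise_sd, n_per_group)
    treatment = rng.normal(true_effect, noise_sd, n_per_group)
    _, p_value = stats.ttest_ind(treatment, comparison)
    if p_value < 0.05:
        detected += 1

print(f"Estimated power: {detected / trials:.2f}")  # roughly 0.7 for these values
```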
9.28 Power depends both on the size of the effect on the outcome relative to the natural variation in that outcome (or the "signal-to-noise ratio") and on the number of observations. It also depends on the research design being used. As an illustrative example, Box 9.E is concerned with the power of a simple test of difference between two means (based on an unpaired t-test)14 as might be used to analyse the results of an RCT. It shows the number of observations required to achieve a power of 80 per cent at a significance level of five per cent for a range of signal-to-noise ratios. What is quite striking is that if the size of the policy effect is similar to or greater than the noise, then quite small sample sizes (e.g. 15 treated and 15 controls to give a combined sample of 30) are adequate; but as the relative signal size decreases, the number of observations required to detect it increases dramatically. For example, a signal-to-noise ratio of 1:8 would require a combined sample size of 2000.
Box 9.E: The power of a test of difference between two means

The table in this Box shows the combined sample size (treatment + comparison group) required for an unpaired t-test if it is to have a power of 80 per cent at a significance level of five per cent, for a range of signal-to-noise ratios. The "signal" is the mean treatment effect and the "noise" is the residual standard deviation.
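For readers who wish to reproduce this kind of figure, the sketch below (an illustration rather than a reproduction of the original Box; it assumes the statsmodels package and a two-sided test) performs the standard analytic power calculation for a range of signal-to-noise ratios. The results are of the same order as the figures quoted in paragraph 9.28.

```python
# A minimal sketch, assuming the statsmodels package and a two-sided unpaired
# t-test: the combined sample size needed for 80 per cent power at a five per
# cent significance level, for several signal-to-noise ratios.
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for ratio in (1.0, 0.5, 0.25, 0.125):       # signal-to-noise ratios 1:1 ... 1:8
    n_per_group = analysis.solve_power(effect_size=ratio, alpha=0.05, power=0.8)
    combined = 2 * math.ceil(n_per_group)
    print(f"signal-to-noise {ratio:>5}: about {combined} observations in total")
```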
9.29 Is it possible to predict the signal-to-noise ratio, and hence the required sample size, in advance? The expected noise level may be estimated from historical data if available, but the signal - that is, the predicted policy effect - is trickier. It may be possible to estimate it from the logic model of the intervention, reasoning along the lines of how many people will be affected and what might be a realistic change in their behaviour as a result. It may alternatively be possible to calculate how big an effect would need to be in order for the policy to be considered a success (either in political or cost-benefit terms), and to say that if the actual impact was less than this it would not matter if it was undetected.
9.30 The implication is that impact evaluation is only worth attempting on policies where the expected impact is large enough to stand out from random fluctuations in the system under study. How large is large enough depends on how well modelling is able to explain the differences between individual group members that arise in the absence of the policy. If it is possible to predict accurately what an individual's outcome "should" be, then any impact on that outcome due to the policy is easier to detect. If, however, the drivers of these differences are poorly understood, or are not captured in any model, then the noise level will be higher. Small schemes or minor refinements to practice may still be good value for money and entirely worthwhile on the basis that "every little helps", but their impact cannot then be evaluated, because the ability of research designs to pick out the "little" from the midst of many competing drivers is too limited. In areas of study where the level of noise is large, this can even lead to a pessimistic conclusion that "nothing works".
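As an illustration of this point, the sketch below (synthetic data with an assumed linear relationship; the variable names are purely illustrative and not taken from the text) shows how modelling a known driver of the outcome, such as prior attainment in the school-test example above, shrinks the residual noise against which a policy effect would have to stand out.

```python
# A minimal sketch on synthetic data; the linear relationship and the name
# "prior_attainment" are illustrative assumptions. Modelling a known driver of
# the outcome reduces the residual noise left to obscure any policy effect.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
prior_attainment = rng.normal(0.0, 1.0, n)                  # measured driver of the outcome
outcome = 2.0 * prior_attainment + rng.normal(0.0, 1.0, n)  # no policy effect here, just noise

raw_noise = outcome.std()
slope, intercept = np.polyfit(prior_attainment, outcome, 1)
residuals = outcome - (slope * prior_attainment + intercept)

print(f"Noise ignoring the driver:  {raw_noise:.2f}")        # roughly 2.2
print(f"Noise after modelling it:   {residuals.std():.2f}")  # roughly 1.0
```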
9.31 If the final outcome measure is too noisy, the evaluator may still seek to detect a change in some intermediate outcome identified in the initial logic model (although the task still remains to translate the result into an estimate of final impact - a task which might be approached through reference to, for instance, theory-based evaluative models, see Chapter 5). Examining intermediate outcomes is a useful exercise in its own right, as it can help to understand the mechanism of the intervention. For example, it would be very hard to detect the effect of an advertising campaign promoting healthy eating on ultimate health outcomes, but a survey which showed some behaviour change, for instance higher consumption of fruit and vegetables in those areas subject to the campaign, might provide evidence that the campaign had had some success in communicating its message.
9.32 Even if it is not possible to detect an impact, it might still be possible to answer the question: in a best case scenario, how large might the policy benefit be and yet still have a reasonable chance of going undetected by the study? This could be important if it turns out that, even under such an optimistic scenario, the costs of the policy would outweigh its benefits. This can be done by deriving, from a power calculation, the smallest detectable effect and then comparing the benefit that would be obtained from this impact with the cost of the policy. Notice that the two possible outcomes of this method are not symmetrical: it might find that the policy would not be value for money even if it managed to generate the smallest detectable effect; or it might just be inconclusive, in the sense that the policy might be value for money at some effect size smaller than the smallest detectable.
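The sketch below illustrates the mechanics of this check (all figures in it are hypothetical, and the statsmodels package is assumed): it derives the smallest detectable standardised effect for a given sample size, converts it into a best-case benefit, and compares that with the cost of the policy.

```python
# A minimal sketch, assuming the statsmodels package; every figure below
# (sample size, benefit per unit of effect, policy cost) is hypothetical.
from statsmodels.stats.power import TTestIndPower

n_per_group = 200            # assumed achievable sample size in each group
alpha, power = 0.05, 0.8

# Smallest standardised effect the study could detect with this power.
mde = TTestIndPower().solve_power(nobs1=n_per_group, alpha=alpha, power=power)

benefit_per_unit_effect = 1_000_000   # hypothetical monetised benefit per unit of standardised effect
policy_cost = 500_000                 # hypothetical cost of the policy

best_case_benefit = mde * benefit_per_unit_effect
if best_case_benefit < policy_cost:
    print("Even the smallest detectable effect would not cover the policy's cost.")
else:
    print("Inconclusive: an effect too small to detect might still be value for money.")
```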
____________________________________________________________________
12 This section assumes a basic knowledge of statistics, for example hypothesis testing and the t-test.
13 Significance is a function of the "noise", or variance in the outcome of interest. If the change in an outcome is said to be "significant at a five per cent level", it means that, given the natural variance in that outcome, a change of such a magnitude would only be expected by chance five per cent of the time.
14 An unpaired t-test compares the mean outcomes of two independent groups - here the treatment and comparison groups - to assess whether any difference between them is larger than would be expected from the "noise" alone.