9.54 The discussion above has stressed how the textbook research designs (e.g. DiD, PSM, RDD) may be viewed in a common framework as ways of addressing selection bias. They are not mutually exclusive. While it is true that one design may form the centrepiece of a study, it is often appropriate to combine elements of a number of different approaches. For example, the analyst can form matched groups prior to performing a DiD (and may then find that the parallelism assumption is much better satisfied than for unmatched groups). As a second example, the model for an RDD can usefully be augmented with terms for other variables thought to affect the outcome, if they are available (which will boost its power to detect the policy effect).
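As an illustration of the first example, a minimal sketch of a matched DiD analysis is given below. It is purely indicative: the file name, column names (unit_id, outcome, treated, post) and covariates are hypothetical, and nearest-neighbour matching is used only as one possible matching method.

```python
# Illustrative sketch only: nearest-neighbour matching on pre-policy
# characteristics followed by a DiD regression on the matched sample.
# The file name, column names and covariates are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("policy_data.csv")   # one row per unit per period (post = 0 or 1)

# Step 1: match each treated unit to its nearest untreated unit on
# pre-policy characteristics.
pre = df[df["post"] == 0]
covs = ["baseline_outcome", "deprivation_index"]
treated = pre[pre["treated"] == 1]
control = pre[pre["treated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[covs])
_, idx = nn.kneighbors(treated[covs])
matched_ids = set(treated["unit_id"]) | set(control.iloc[idx.ravel()]["unit_id"])
matched = df[df["unit_id"].isin(matched_ids)]

# Step 2: difference-in-differences on the matched sample; the coefficient
# on treated:post is the estimate of the policy effect.
did = smf.ols("outcome ~ treated * post", data=matched).fit(
    cov_type="cluster", cov_kwds={"groups": matched["unit_id"]})
print(did.summary())
```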
9.55 Once a preliminary analysis has been made, the analyst should think critically about the assumptions involved and about how robust the results would remain if those assumptions were incorrect. This may involve triangulation with data collected through a process evaluation (such as stakeholder interviews) to probe whether the modelling has captured the situation as it really is; running variants of the model under alternative assumptions; and, where possible, performing supporting analyses to test the assumptions directly. And, it almost goes without saying: always plot the data.
9.56 There are a number of threats to the validity of research designs, some of them applying even where the design itself is very strong, as in the case of an RCT (further detail is provided in the supplementary guidance). These threats arise from the fact that the social scientist cannot usually control the experiment to the same degree as would be possible for a clinical researcher, and may be summarised under two headings:
• "Hawthorne effects" - subjects may react (either positively or negatively) to the knowledge that they are being experimented on, and in a way which affects the outcome of interest. This can occur especially if they are aware either of being granted or denied a potentially beneficial treatment. For instance, a participant who is denied access to a training course might react by seeking additional training outside of the trial. In a clinical setting this risk is mitigated by blinding or the use of a placebo, but this is almost impossible in the social policy field.
• Mis-assignment - the actual allocation and receipt of treatment may differ from what the researcher intended, because either the provider or recipient circumvented the planned design, for a variety of reasons.
9.57 Process evaluation can be valuable in determining whether and to what extent either of these has occurred.
9.58 Whenever a policy was targeted at individuals who were outliers in some way (for example, prolific offenders or low educational attainers), a common hazard for the evaluator is regression to the mean. If assignment to the policy was based on a snapshot measure taken shortly before it began (for instance, the number of offences in the last month, or results in a recent school test), then the selection process will to some extent capture the results of temporary fluctuations in an individual's life rather than underlying extremes. After participation, extreme individuals are more likely to return to their underlying level, or "regress to the mean", than to become yet more extreme. The outcome will be seen to improve, but this will be at least partly a "natural" improvement which, if unrecognised, might give a misleading impression of a policy benefit.
9.59 The evaluator can check directly for regression to the mean if historical data are available, by looking for evidence that the outcome of interest has natural variability ("peaks and troughs") and then seeing whether recruitment into a scheme tended to occur close to a peak. Repeating the analysis using different time baselines is a useful sensitivity test for this purpose. Some research designs, such as RCTs and RDDs, are constructed so as to avoid the problem, making this check unnecessary; others, such as matching designs and DiD, are not.
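A minimal sketch of such a check is shown below, assuming monthly historical data on the selection measure. The file and column names are hypothetical, and the recruitment rule (selection on the most recent month) is assumed purely for illustration.

```python
# Illustrative sketch only: checking whether recruitment tended to catch
# individuals at a temporary peak in the selection measure.
# The file and column names are hypothetical; rows are assumed to cover
# the months before each individual's recruitment.
import pandas as pd

hist = pd.read_csv("monthly_offences.csv")        # unit_id, month, offences
hist = hist.sort_values(["unit_id", "month"])

# Baseline actually used for selection: offences in the month before recruitment.
one_month = hist.groupby("unit_id")["offences"].last()
# Alternative, longer baseline: average over the preceding twelve months.
twelve_month = hist.groupby("unit_id")["offences"].apply(lambda s: s.tail(12).mean())

comparison = pd.DataFrame({"one_month": one_month, "twelve_month": twelve_month})
print(comparison.describe())
# If this share is well above a half, recruitment probably coincided with peaks,
# and some "improvement" after the policy would be expected in any case.
print("Share recruited above their own 12-month average:",
      (comparison["one_month"] > comparison["twelve_month"]).mean())
```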
9.60 Examining historical time series data, where available, is valuable for descriptive purposes. It places any changes in the outcome measure that might have been the result of the policy in the context of pre-existing trends (did the trend change after the policy was introduced?) and can be used to test the parallelism assumption for DiD. Indeed, whenever a non-equivalent comparison group is used, the evaluator has considerably more confidence that post-policy changes were caused by the policy if the comparison and treatment groups have tracked one another for a long historical period. A useful trick when visually examining the data is to index the time series to a common baseline.
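The indexing trick can be illustrated with the following sketch, which rescales the treatment and comparison series so that both equal 100 at the policy start date; the file name, column names and date are placeholders.

```python
# Illustrative sketch only: indexing treatment and comparison series to a
# common baseline (policy start = 100) so their pre-policy trends can be
# compared visually. File name, column names and the date are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

ts = pd.read_csv("outcome_timeseries.csv", parse_dates=["period"])
ts = ts.pivot(index="period", columns="group", values="outcome")  # columns: treatment, comparison

policy_start = pd.Timestamp("2009-04-01")
indexed = ts / ts.loc[policy_start] * 100     # each series equals 100 at the policy start

indexed.plot()
plt.axvline(policy_start, linestyle="--")
plt.ylabel("Outcome (policy start = 100)")
plt.show()
```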
9.61 Another judgement the evaluator will wish to make is whether a "matched" comparison group really is matched. With regard to observed characteristics, this may be done by comparing distributions between the two groups. This check should be done even for RCTs, especially when numbers are small, as randomisation does not always provide balanced samples - that is, samples which are similar in terms of the characteristics likely to affect the outcome. With regard to unobserved characteristics, careful consideration based on subject area knowledge will be needed to assess possible non-equivalence.
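One common way of comparing the distributions of observed characteristics is to compute standardised mean differences, as in the illustrative sketch below; the file and covariate names are hypothetical, and the 0.1 threshold mentioned in the comments is only a rule of thumb.

```python
# Illustrative sketch only: standardised mean differences between treatment
# and comparison groups for a set of observed characteristics. The file and
# covariate names are hypothetical; an absolute value above roughly 0.1 is
# often treated as a sign of imbalance, but this is only a rule of thumb.
import numpy as np
import pandas as pd

df = pd.read_csv("matched_sample.csv")
covariates = ["age", "prior_attainment", "deprivation_index"]

def std_mean_diff(x_treat, x_comp):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((x_treat.var() + x_comp.var()) / 2)
    return (x_treat.mean() - x_comp.mean()) / pooled_sd

treat = df[df["treated"] == 1]
comp = df[df["treated"] == 0]
balance = pd.Series({c: std_mean_diff(treat[c], comp[c]) for c in covariates})
print(balance.round(3))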
9.62 A particular case where a "matched" comparison group may fail is when policy allocation was in fact rigorously targeted, but the evaluator does not have access to all the information on which the targeting was based, perhaps for one of the reasons mentioned in paragraph 9.56. In this case, the presence of a reasonably sized region of common support22 should be regarded with the utmost suspicion: it is virtually a sure sign that the selection bias has not been adequately captured, because in a deterministic selection process there should be no common support at all (just as there would not be for an RDD). A DiD analysis on the "matched" groups might provide a remedy (assuming the historical data exist to permit it), since it acknowledges the non-equivalence of the two groups.
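Where a propensity score model has been fitted, the region of common support can be inspected directly by plotting the estimated scores for the two groups, as in the sketch below; the data set, covariates and use of a plain logistic regression are illustrative assumptions.

```python
# Illustrative sketch only: inspecting the region of common support by
# plotting estimated propensity scores for the two groups. The data set,
# covariates and use of a plain logistic regression are assumptions.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("allocation_data.csv")
covariates = ["age", "prior_attainment", "deprivation_index"]

model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
df["pscore"] = model.predict_proba(df[covariates])[:, 1]

# Overlapping histograms: if allocation was in fact deterministic on these
# characteristics, genuine overlap between the groups should be minimal.
for label, group in df.groupby("treated"):
    plt.hist(group["pscore"], bins=30, alpha=0.5, label=f"treated={label}")
plt.xlabel("Estimated propensity score")
plt.legend()
plt.show()
```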
9.63 As with any statistical study, the evaluator should beware of embarking on "fishing expeditions or data mining", especially when many variants of a model are being fitted. If different variants give different conclusions it is vital to be clear about how the assumptions differ and the robustness or otherwise of the model to changing them. A useful technique is to hold back a portion of the data during an initial phase of analysis and then check that these data give consistent results.
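A minimal sketch of this hold-back check is given below; the data set, model specification and 30 per cent split are hypothetical choices made only for illustration.

```python
# Illustrative sketch only: holding back a random portion of the data during
# exploratory modelling and refitting the final specification on the held-back
# portion. The data set, specification and 30 per cent split are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("evaluation_data.csv")
holdout = df.sample(frac=0.3, random_state=42)     # set aside 30 per cent at the outset
working = df.drop(holdout.index)

spec = "outcome ~ treated * post + deprivation_index"
fit_working = smf.ols(spec, data=working).fit()
fit_holdout = smf.ols(spec, data=holdout).fit()

# The policy-effect estimates from the two samples should tell a consistent story.
print("Working sample:  ", fit_working.params["treated:post"])
print("Held-back sample:", fit_holdout.params["treated:post"])
```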
"Constrained designs"
9.64 Much of this chapter has been concerned with the design and analysis of studies where the policy has been designed so as to provide a comparison group. However, an analyst may be asked to evaluate a policy that is not amenable to these approaches: for example, if on practical grounds none of the desired policy allocation methods was possible, if the data are unavailable or of insufficient quality, or if the policy has already been implemented and the opportunity to put a research design in place was missed.
Natural experiments and instrumental variables
9.65 A solution may present itself if it is possible to carry out any of the approaches in this chapter in retrospect. The influence of random shocks or administrative anomalies on policy allocation can sometimes create a so-called "natural experiment", in which comparisons can be made with a naturally occurring comparison group even though none was present by design. Essentially the same theory and analysis considerations then carry through. A more general case is where a so-called instrumental variable can be identified - an external factor which influences the likelihood of being exposed to a policy, but which does not itself affect outcomes other than through that exposure. This can be a very useful way of overcoming selection bias. It is often difficult, however, to find a suitable instrument, and very rare to identify one in advance, so it is not common to use this as part of a planned evaluation strategy. More information on this approach is given in the supplementary guidance.
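For illustration only, the sketch below shows a two-stage least squares (2SLS) estimate computed by hand; the data set and variable names are hypothetical, with distance to the nearest provider assumed to serve as the instrument. In practice a dedicated IV routine would be preferred, since the standard errors from the manual second stage are not valid.

```python
# Illustrative sketch only: a manual two-stage least squares (2SLS) estimate.
# The data set and variable names are hypothetical; distance to the nearest
# provider is assumed to affect participation but not outcomes directly.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("natural_experiment.csv")

# First stage: predict participation from the instrument (plus any controls).
first = smf.ols("participated ~ distance_to_provider + age", data=df).fit()
df["participation_hat"] = first.fittedvalues

# Second stage: regress the outcome on predicted participation.
# (A dedicated IV routine should be used in practice, since the standard
# errors from this manual second stage are not corrected for the first step.)
second = smf.ols("outcome ~ participation_hat + age", data=df).fit()
print(second.params["participation_hat"])
```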
"Before and after" studies
9.66 Sometimes the level of evidence available falls far short of what would generally be regarded as a true impact evaluation. A common example is the single group pre-and post-test design, or simply "before and after" design, in which an outcome is measured before and after intervention takes place but there is no comparison group. This only really has any credibility when the system being studied is so simple that the policy is the only thing that could reasonably be expected to influence the result. Unfortunately, real social systems are seldom that simple. Unless there is a strong justification for ruling out influences other than the policy (not simply a lack of obvious alternative explanations), this design should not be reported as an impact evaluation. The supplementary guidance provides detail on the large number of threats to validity with this design.
Use of process evaluation information
9.67 This chapter has already highlighted the benefits of combined evaluations in which process studies (Chapter 8), which examine the implementation and delivery of a policy or intervention, often using qualitative methods, are integrated with impact evaluation. This is particularly important when quantitative measures of impact are weak, or not available at all. If, as above, there is no comparison group, or worse still no outcome measure at all, then the researcher may be able to draw upon the findings of a process study, action research or case studies. By their nature these types of study do not allow a quantitative measurement of impact, but they may be able to capture a direction of change. Front line staff directly involved in the delivery of the intervention will have a good feel for whether or not it is effective, and why. Care must be taken, however, that the evidence captured reflects the achievement of the wider aims of the policy, and looks beyond the impact immediately perceived by the interviewees.
Reporting of evaluation results
9.68 Whichever approach is used, the evaluation report should be worded to give an accurate and objective reflection of the strength of the evidence. If there remain significant doubts as to the strength of the counterfactual estimate (or if it could not be estimated at all) then the evaluator should avoid using the term "impact" or any other wording that would imply attribution of the outcome to the policy. Only if the evidence points decidedly towards a causal effect of the policy should it be reported in these terms. As usual, any appropriate caveats with regard to the assumptions made and the strength of the available evidence should appear alongside the conclusion.
The guidance in this section of the Magenta Book has been revised since the previous edition to clarify that weak designs, where there is no compelling reason to ascribe the outcome to the policy or to eliminate other potential causes, should in general not be reported as impact evaluations.
9.69 As an example of appropriate reporting, the results of a successful (fictitious) impact evaluation might be stated as follows.
9.70 "The results of the ABC pilot imply that the proportion of pupils achieving five grades A-C at GCSE was increased by 0.7 per cent as a result of the ABC programme. This is after taking into account known differences between participating and non-participating schools, though there remains a possibility that some other differences between the schools could have contributed."
9.71 If a true impact evaluation was not possible, the evaluator should avoid wording like the following:
9.72 "In the year following the nationwide rollout of the XYZ policy, the proportion of pupils achieving five grades A-C at GCSE rose by 1.2 per cent. It is not possible to say for sure whether this was the result of the policy, but the results are encouraging."
9.73 This is bad reporting. There is too much risk of the first sentence being taken out of context. Despite the "caveat", the report seems to want to imply that the XYZ policy caused the improvement. To a casual reader the strength of evidence might seem similar for both the ABC pilot and the XYZ policy, when in reality the former is reasonably robust and the latter is paper-thin. It is true enough that an increase in attainment is better than a decrease, but to regard it as "encouraging" (from the point of view of the XYZ policy) would require it to be set in the wider context of other drivers of change and previous trends.
9.74 If the previous example could be backed up by some qualitative evidence, a more appropriate form of words might be:
9.75 "Although in the year following the nationwide rollout of XYZ policy the proportion of pupils achieving five grades A-C at GCSE rose by 1.2 per cent, this welcome rise was not necessarily caused by the policy. For such a claim to be made with confidence would require an appropriate evaluation that controls for other factors. However, interviews with teachers suggested that the policy had filled a genuine gap for struggling pupils who in previous years might have fallen through the net. It is therefore reasonable to suppose that it has contributed to the 1.2 per cent increase in proportion of grades A-C in the year since it was introduced."
9.76 To conclude, this chapter has described how the evaluator can go beyond merely stating what happened, and report something much more relevant to the policy maker: namely, whether the policy caused it to happen. The rationale for doing the extra work required is that it answers the impact evaluation question, whereas descriptive statistics alone do not. The two types of evidence - descriptions of the situation on the one hand, and impact evaluations on the other - say very different things and need to be reported in correspondingly different ways. The one must not be misrepresented as the other.
___________________________________________________________________
22 The "common support" consists of those members of the treatment and comparison groups who can be matched to each other. It is discussed in more detail in the supplementary guidance.