10.12 One of the most common quantitative synthesis tasks is to reconcile a number of different assessments of impact which may be based on different:
• data sources - for example survey and administrative data;
• groups of affected individuals - for example the first and final waves of recipients to receive an intervention, as in the evaluation of the impact of Pathways to Work; or
• statistical approaches and assumptions - Chapter 9 explained how the validity of the impact assessments depends on key assumptions.
10.13 It is highly unlikely that all the estimates will have equal validity, which means that statistically combining them to give an overall best estimate will generally not be possible. There are two types of validity to consider here: internal and external, as discussed in Box 10.A.
Internal validity (as discussed in Chapter 9, paragraph 9.14) refers to whether the results are a true reflection of the impact on the individuals being studied. In the case of a pilot study, for example, are the estimates a true reflection of the impact on the individuals in the particular areas involved in the pilot during the lifetime of the evaluation? All statistical approaches to impact estimation depend on assumptions. Where different statistical approaches have been followed, it will almost always be because it was not possible to be certain in advance whether the necessary assumptions hold. Where possible, formal tests of the validity of the assumptions should be carried out, for example testing the common trends (parallelism) assumption in a difference-in-difference design; see Chapter 9 for a more detailed discussion.
External validity refers to whether the impact estimated for those directly studied can be extrapolated or generalised to others. As in the Pathways to Work example, the impact of a programme on the first group to go through it is likely to be a poor guide to its effectiveness, because of teething problems. A better guide is likely to be the impact on those who experience it after it has bedded in. More discussion of potential threats to external validity is given in paragraph 10.28. |
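As an illustration of checking the common trends assumption, the sketch below takes a simple descriptive approach rather than a formal test: it measures whether the gap between the treated and comparison groups was drifting before the intervention began. The function name and all figures are hypothetical, and a formal test would also need a standard error for the slope.

```python
def pre_period_gap_slope(periods, treated_means, comparison_means):
    """Least-squares slope of the treated-minus-comparison gap over the
    pre-intervention periods. A slope close to zero is consistent with
    common trends; it does not prove the assumption holds."""
    gaps = [t - c for t, c in zip(treated_means, comparison_means)]
    n = len(periods)
    mean_x = sum(periods) / n
    mean_y = sum(gaps) / n
    sxx = sum((x - mean_x) ** 2 for x in periods)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(periods, gaps))
    return sxy / sxx


# Hypothetical pre-intervention outcome means for four periods.
periods = [1, 2, 3, 4]
comparison = [8.0, 9.0, 10.0, 11.0]

# Case 1: both groups rise by 1.0 per period, so the gap is constant.
parallel_slope = pre_period_gap_slope(periods, [10.0, 11.0, 12.0, 13.0], comparison)

# Case 2: the treated group rises by 2.0 per period, so the gap widens.
diverging_slope = pre_period_gap_slope(periods, [10.0, 12.0, 14.0, 16.0], comparison)

print(parallel_slope, diverging_slope)
```

In the second case the gap was already widening before the intervention, so a difference-in-difference estimate would be likely to attribute some of that existing divergence to the policy.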
10.14 A different type of consideration might be which data source is closest to measuring the relevant outcomes. Administrative data would normally be more accurate than self-reported data where something very specific and objective is being measured. For example, it is well known that survey responses about which welfare benefits claimants receive are not fully reliable. Administrative data sources, in many cases, will have extremely low sampling error, giving far greater precision than is possible with surveys. But very often administrative data and surveys are measuring different things, or there are known limitations about one of the sources. For example, administrative data can provide information about numbers of recorded crimes, but only surveys can provide data on the fear of crime. Chapter 7 discusses surveys and administrative data in more detail.
10.15 On a related point, there is also the question of which results answer most closely the question at hand, which in turn depends on the decision being made. As explained in Table 10.A, impacts can be either average or marginal. Where the decision being made is whether or not to continue with a policy, or to implement a pilot, it is appropriate to use average treatment effects. But where the question is whether to expand or contract a programme, marginal effects are more important. As previously noted, which of these is available is likely to be dictated largely by circumstances rather than by choice. Where it is necessary to make decisions based on average effects when marginal effects would be more appropriate, or vice versa, it may be possible to explore the heterogeneity of treatment effects, either quantitatively (for example, looking at impacts for sub-groups) or qualitatively. The need for this should be considered at the planning stage.
10.16 In some cases, it may be clear that one set of estimates is more likely to be valid than others, and is therefore the appropriate one to use. In other cases, sampling error may explain the differences, allowing the findings to be combined arithmetically. There may be occasions, however, where despite best efforts it is not possible to fully reconcile the different studies, in which case it may be appropriate to report the impact as a range rather than as an exact figure.
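One way to make the choice between combining estimates and reporting a range concrete is sketched below. It assumes inverse-variance (fixed-effect) weighting and Cochran's Q as the check on whether sampling error could explain the differences; neither is prescribed by this guidance, and the function names and figures are hypothetical. The check is only meaningful when the estimates are already judged comparable in validity, and it has low power when there are few estimates.

```python
import math

# Chi-squared critical values at the 5% level, keyed by degrees of freedom.
CHI2_CRIT_5PCT = {1: 3.841, 2: 5.991, 3: 7.815}


def pool_estimates(estimates, std_errors):
    """Inverse-variance weighted estimate, its standard error, and Cochran's Q."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return pooled, pooled_se, q


def reconcile(estimates, std_errors):
    """Pool if the differences are consistent with sampling error,
    otherwise report the range of estimates (supports 2 to 4 estimates)."""
    df = len(estimates) - 1
    if df not in CHI2_CRIT_5PCT:
        raise ValueError("this sketch supports between 2 and 4 estimates")
    pooled, pooled_se, q = pool_estimates(estimates, std_errors)
    if q <= CHI2_CRIT_5PCT[df]:
        return {"kind": "pooled", "value": pooled, "se": pooled_se}
    return {"kind": "range", "low": min(estimates), "high": max(estimates)}


# Hypothetical impact estimates (percentage points) with standard errors.
close = reconcile([4.0, 6.0], [1.0, 1.0])  # Q = 2.0, below 3.841
far = reconcile([2.0, 9.0], [1.0, 1.0])    # Q = 24.5, above 3.841

print(close)
print(far)
```

The first pair is pooled to a single figure; the second differs by more than sampling error plausibly allows, so the sketch falls back to reporting a range.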
Types of impact estimates |
Intention to Treat (ITT) | The impact of the policy on the target group. For example, for a training programme for jobseekers, the net impact on all those eligible, whether they participated or not. | Where participation is voluntary, estimating the impact on the Intention to Treat group avoids most of the problems of selection bias. But where the proportion participating is small, the impact averaged across the whole target group is small and can be very hard to detect. |
Treatment on the Treated | The net impact on those who were actually affected by the intervention - for example, those who took part in a training programme. | This impact is much easier to detect when participation rates are small, but depending on how participants are selected it may be difficult to account for selection bias. |
Which of these is estimated is more likely to depend on which impact evaluation methods are feasible than on which is more desirable. Note that as long as it is known who is treated and who is not, and it is reasonable to assume that there is no impact on the non-treated, it is straightforward to calculate one from the other. For either of these, there are two types of estimate: |
Average Treatment Effect | The average net impact across all those treated, or who were intended to be treated. | This is the most common, and is the preferred estimator for cost-benefit analysis in particular, and for overall decisions about whether to implement a policy. It is less suitable where the decision is about the expansion or contraction of a policy. |
Marginal Treatment Effect, or Local Average Treatment Effect | The impact on those who in some sense are on the margins of participation. | An example of this is in a general Regression Discontinuity Design where the impact estimated is for those whose scores are on the borderline of eligibility. This is the estimator needed to inform decisions about expansion/contraction (in this example, changing the threshold score) but further assumptions are needed to produce an overall cost-benefit analysis. |
In most cases, whether the impact estimate is marginal or overall average will depend on the available evaluation methods rather than on what is desired. |
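The table notes that Intention to Treat and Treatment on the Treated estimates can be calculated from one another when it is known who was treated and the non-treated are unaffected. A minimal sketch of that arithmetic follows: under those assumptions the Intention to Treat impact is the Treatment on the Treated impact multiplied by the participation rate. The function names and figures are hypothetical.

```python
def tot_from_itt(itt, participation_rate):
    """Treatment on the Treated impact implied by an Intention to Treat
    impact, assuming no impact on eligible non-participants."""
    if not 0 < participation_rate <= 1:
        raise ValueError("participation_rate must be in (0, 1]")
    return itt / participation_rate


def itt_from_tot(tot, participation_rate):
    """Intention to Treat impact implied by a Treatment on the Treated
    impact, under the same assumption."""
    if not 0 < participation_rate <= 1:
        raise ValueError("participation_rate must be in (0, 1]")
    return tot * participation_rate


# Hypothetical training programme: a quarter of eligible jobseekers take part,
# and the impact across all those eligible is 2 percentage points.
tot = tot_from_itt(itt=2.0, participation_rate=0.25)
itt = itt_from_tot(tot=tot, participation_rate=0.25)

print(tot, itt)
```

The example also shows why Intention to Treat impacts are hard to detect at low participation: an 8 percentage point impact on participants appears as only 2 percentage points across the whole eligible group.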