Future decisions and roll-out; scaling-up

10.26 Evaluations are often undertaken of pilot programmes.7 This section focuses on the decision whether or not to move from a pilot to a fully implemented national policy or programme.

10.27 For an evaluation to have maximum impact on this decision, it is important to be certain that the results are internally valid and are an accurate reflection of the experience of those who have been affected by the pilot. Deciding whether to move to full implementation also requires external validity, or certainty that the pilot findings can be extrapolated to estimate what would happen in a full implementation. This raises a number of considerations, often referred to as "threats" to external validity, which are summarised below, with examples in Box 10.D.

10.28 Reasons why results may not be generalisable, and so threaten external validity, include:

• that pilot data are not representative of the wider population;

• the state of the economy at the time of the evaluation;

• what other policies and programmes were operating at the same time and in the same areas as a pilot;

• spillover effects - where, for example, a policy implemented in one area has effects in neighbouring areas (which may be positive or negative);

• substitution and displacement effects - where there may be positive impacts on those directly affected by a policy or programme, but negative effects on others;

• general equilibrium effects - the overall impact on outcomes taking into account any indirect or secondary effects;

• scalability - whether sufficient resources exist to implement a policy more widely. This is wider than just finances: for example, a health intervention may require input from doctors who may be in short supply; and

• what are known as Hawthorne effects - where an initial pilot is successful but largely as a result of increased oversight.

10.29 To an extent it is possible to mitigate these risks by careful planning of the evaluation.

Box 10.D: Examples of threats to external validity

One potential threat is that those affected by a pilot are not representative of the wider population. For example, if a policy is only piloted in parts of London, it would be unwise to assume that the observed effects would be the same in other parts of the country. A well-designed pilot study would address this by including a variety of different types of area. Even so, it is unlikely to be an exact representation of the whole population. Where it is possible to quantify how the pilot areas differ from the country as a whole, it may be possible to correct for this bias. This can be particularly valuable if the choice of pilot areas (or participants) is constrained, for example, if there is a greater than average representation of urban areas in the pilot.

As an example, suppose that there are 100 areas in the country, of which 20 are urban and 80 rural. A pilot programme is run in four urban and four rural areas. Weighting the results of urban areas by 0.2 and those for rural areas by 0.8 will ensure that the overall results are, at least in this respect, balanced. This can readily be extended to two or three factors. In reality, there are likely to be more factors than this, and achieving an exact balance will not always be possible. In such cases, it may be possible to estimate overall effects in a regression framework.
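The weighting in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name post_stratified_effect and the per-area effect values are hypothetical and invented for the sketch, and only the area counts (20 urban, 80 rural) and the pilot size (four of each) come from the example above.

```python
from statistics import mean


def post_stratified_effect(stratum_effects, population_counts):
    """Weight each stratum's mean pilot effect by its share of all areas.

    Hypothetical helper for illustration; not taken from any guidance.
    """
    total_areas = sum(population_counts.values())
    weights = {
        name: count / total_areas for name, count in population_counts.items()
    }
    estimate = sum(
        weights[name] * mean(effects) for name, effects in stratum_effects.items()
    )
    return estimate, weights


# Area counts from the example: 20 urban and 80 rural areas in the country.
population_counts = {"urban": 20, "rural": 80}

# Made-up effect estimates for the four urban and four rural pilot areas.
pilot_effects = {
    "urban": [4.0, 6.0, 5.0, 5.0],
    "rural": [1.0, 3.0, 2.0, 2.0],
}

weighted_estimate, weights = post_stratified_effect(pilot_effects, population_counts)

# The simple average treats all eight pilot areas equally, so urban areas
# count for half of the result even though they are 20% of the country.
unweighted_estimate = mean(
    effect for effects in pilot_effects.values() for effect in effects
)

print(f"weights: {weights}")
print(f"weighted estimate: {weighted_estimate:.2f}")
print(f"unweighted pilot average: {unweighted_estimate:.2f}")
```

With these invented figures the weighted estimate is lower than the simple pilot average, because the over-represented urban areas happen to show larger effects. The correction only balances the one factor used to define the weights; it does nothing for differences that are not captured by the urban/rural split.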

A more difficult case to deal with is where the pilot areas (or people, or units) are self-selecting, for example, if local authorities were asked to volunteer to participate. In such cases, the generalisability of the pilot findings to areas that are compelled to participate in a later implementation stage cannot be assumed. This is because the characteristics and contexts of the local authorities that volunteered may have contributed both to them volunteering in the first place and to the impacts observed. These factors may be different in the authorities taking part in the later implementation, and may affect the impacts.

10.30 It is important to recognise that a policy evaluation that shows a positive impact and good value for money does not mean that it was an appropriate policy; similar or better gains may have been realised by alternative policies that have not been evaluated. Decision making is also a balancing of risks. Proceeding with a policy for which the evidence is weak risks wasting the resources necessary to implement it. But not proceeding in such a case risks forgoing genuine gains which would have been made if in fact the programme were effective. In each case, the strength of the evidence on impact needs to be considered alongside the potential gains from an effective programme, the potential losses from an ineffective one, and the desirability or otherwise of any unintended consequences.




________________________________________________________________________
7 In this context a "pilot" refers to a programme or policy introduced on a limited basis - for example limited in time or geographical scope - with the express purpose of producing evaluation evidence to inform a decision on whether or not to proceed to full implementation. For a good discussion of