Designing policies for effective evaluation

9.12 This subsection begins by introducing the theory behind research designs. A key part of successful impact evaluation is ensuring that a group of individuals or areas unaffected by the policy - the untreated - can serve as a comparison group. Such a group can be constructed in numerous ways, and several examples will be considered; these examples could form a basis for discussions between policy makers and analysts at the policy design stage. A separate subsection, below, develops some of the concepts further as they apply to the analysis of the data obtained.

9.13 It is worth noting that the methods of allocating policies described in this sub-section all rely on there being something tangible to allocate. That is, the policy needs to consist of specified interventions such that it is possible to say distinctly that some individuals or areas were exposed to them, and others not (and further, that there is no impact on those who were not exposed). The methods of this chapter are not well suited to evaluating higher-level strategies, which set out aims and principles for action, unless those strategies can be unpacked into their constituent activities. The first task for the evaluator when faced with that kind of evaluation problem is to ascertain how the strategy is to be implemented: what the interventions will look like on the ground, and who will receive them.

Randomness

9.14 Randomness3 plays a central role in establishing the counterfactual to a policy. Randomness in the way policies are administered can balance out unobserved (sometimes, unobservable) differences in characteristics between the treated and untreated groups. The groups are then said to be equivalent - they differ on average only in their exposure or not to the policy. Comparisons between equivalent groups are said to have strong internal validity4: the evaluator can (under particular circumstances) infer that any significant differences between the two groups were caused by the policy, because on average the two groups are similar in all other respects.

9.15 The difficulty with evaluating actual policies is that they tend to target the most problematic or deserving individuals, institutions, locations and so on. That is, policies tend to be non-random intentionally. So even when one group is exposed to the policy and another is not, the two groups will typically be non-equivalent. Drug treatment policies, for instance, target individuals with drug misuse problems, who are likely to be different from other people in quite particular ways (for example they are more likely to be younger, male, unemployed and with an offending history than people who are not drug misusers). Allocation of the policy or intervention is then said to be endogenous to the outcome which is being targeted, because the characteristics which make an individual (or area or business) more likely to receive the intervention are also likely to affect the impact of the intervention on their outcomes. Estimates of the policy effect which do not take this into account will suffer from selection bias, and simple comparisons between the treated and untreated groups are not then valid.
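To make the problem concrete, the short simulation below (illustrative only; all numbers and variable names are invented) contrasts a randomised allocation with an endogenous one. The true effect of the intervention is fixed at +2 units: the randomised comparison recovers it, while the naive comparison under targeting of the neediest cases suggests the policy is harmful.

```python
import numpy as np

# Illustrative simulation (hypothetical numbers): the true effect of the
# intervention on the outcome is +2 units for everyone who receives it.
rng = np.random.default_rng(0)
n = 100_000
need = rng.normal(size=n)  # unobserved characteristic, e.g. severity of need
outcome_untreated = 10 - 3 * need + rng.normal(size=n)  # worse need -> worse outcome
true_effect = 2.0

# Randomised allocation: treatment is independent of need.
random_treat = rng.random(n) < 0.5
y_rand = outcome_untreated + true_effect * random_treat
print("randomised estimate:",
      y_rand[random_treat].mean() - y_rand[~random_treat].mean())  # ~ +2.0

# Endogenous allocation: the policy targets those with the greatest need.
targeted_treat = need > 0.5
y_targ = outcome_untreated + true_effect * targeted_treat
print("naive targeted estimate:",
      y_targ[targeted_treat].mean() - y_targ[~targeted_treat].mean())  # badly biased
```

In this invented example the naive targeted comparison comes out strongly negative, because the treated group started from much worse untreated outcomes: exactly the selection bias described above.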

Research Designs5

The purpose of research designs is to manipulate the implementation of the policy, or to exploit features which it already possesses, in such a way that a counterfactual can be estimated. Manipulating the policy is preferable because randomness can be introduced, or non-randomness addressed, by design. Otherwise, a successful evaluation might need to rely on the required characteristics appearing by accident, and this is by no means guaranteed to be the case. So how should a good comparison group be obtained in practice? There are two approaches which will be considered in turn:

Experiments, or Randomised Controlled Trials (RCTs). The defining feature of this approach is that the assignment of eligible individuals (or areas) to treatment is explicitly randomised, as it were by the flip of a coin.

Quasi-experimental designs (QEDs). These designs do not use explicit randomisation, but address potential non-equivalence of the treated and untreated groups in other ways.

Randomised Controlled Trials (RCTs)

9.16 An RCT is usually regarded as the strongest possible means of evaluating a policy, because of its ability to balance out the differences between the groups. As was pointed out above, policy allocation by its very nature is not usually random, so opportunities to use randomisation in practice are limited. If the policy is by intention "experimental", however, then randomised allocation might be more readily acceptable. In these instances the policy will usually begin with a pilot in a restricted number of areas only.
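As a minimal sketch of what explicit randomisation involves (participant identifiers and the seed below are hypothetical), each eligible individual can be assigned by an auditable coin flip:

```python
import random

# Minimal sketch (hypothetical identifiers): explicit randomisation of
# eligible individuals to treatment or control, "by the flip of a coin".
# Recording the seed in the evaluation protocol makes the allocation auditable.
random.seed(20110401)

eligible = [f"participant_{i:04d}" for i in range(200)]
assignment = {p: ("treatment" if random.random() < 0.5 else "control")
              for p in eligible}

n_treat = sum(arm == "treatment" for arm in assignment.values())
print(f"{n_treat} assigned to treatment, {len(eligible) - n_treat} to control")
```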

9.17 Randomisation can face some practical hurdles in a social research context, mainly rooted in the difficulty of maintaining complete control over the allocation process, and the near impossibility of "blinding"6 for the sorts of interventions being tested in public policy. It may be ruled out because of (sometimes unfounded) concerns over ethical issues7, or because an "experimental" design is at odds with a desire to focus the efforts of the policy in a targeted way. Both these arguments presuppose that the intervention is effective in the first place, which it is the purpose of the evaluation to ascertain (unless strong existing evidence already supports it - in which case the value of a pilot, randomised or otherwise, might be arguable anyway). In the latter case it may still be possible to incorporate randomisation for a limited subgroup of eligible participants. Boxes 9.C and 9.D provide two examples of randomised controlled trials.

Box 9.C: An example of a randomised controlled trial

Evaluation of HM Prison Service Enhanced Thinking Skills programme (Ministry of Justice)

There is considerable international evidence, from various systematic reviews and meta-analyses analysing a large number of offending behaviour/cognitive behavioural programmes, to support the effectiveness of these programmes in reducing re-offending. However, the evidence from research in England and Wales on the effectiveness of these programmes is mixed. This project looked at a shorter-term impact than reconviction to assess the efficacy of the Enhanced Thinking Skills (ETS) programme in the UK.

The main aim of the project was to examine the impact of ETS courses on 'impulsivity' in adult male offenders over the age of 18, and to investigate whether changes in levels of impulsivity were reflected in changes in prison behaviour. Impulsivity, a behaviour targeted for change by ETS courses, was chosen as the main outcome measure as there is research evidence of links between impulsivity and offending (e.g. Mak, 1991, Eysenck and McGurk, 1980).

Further analysis of individual cases was undertaken to investigate evidence of reliable clinical change. A secondary aim was to explore a range of other psychometric measures in the ETS test battery to evaluate the wider effectiveness of ETS courses, and to examine background factors of offenders, and institutional factors, in order to determine which offenders benefit most from ETS programmes, under which conditions. This was to see whether there were improvements to be made in course content, targeting of offenders, and selection of the most appropriate assessment methods.

A Randomised Controlled Trial (RCT) was selected in order to minimise bias in allocation of participants to groups. However, RCTs have rarely been conducted in UK prisons, largely due to ethical concerns about withholding treatment from a control group. These concerns were avoided by adopting a waiting list control design in which all eligible offenders ultimately received treatment. Offenders with a priority need to attend a course were assigned to a parallel cohort group prior to the random allocation, and their data were analysed separately.

However, it is not possible to assess the impact of the ETS course on reoffending through this study as all participants eventually received the intervention (hence there was no control group for reoffending analysis).

The study demonstrated positive results with regard to the (short-term) effectiveness of the ETS programme. More specifically, the study revealed that ETS programmes are effective in reducing both self-reported impulsivity and the incidence of prison security reports in adult male offenders.

Additionally, the analysis of background factors raised a number of issues relating to which offenders benefit from ETS programmes and how others may be assisted to benefit more. This could lead to better targeting of offenders for ETS courses, and adaptation or development of programmes specifically designed to meet different needs. The evaluation also raised questions about the relationship between offence type, impulsivity and effectiveness of ETS courses with different offence groups, which may lead to a greater understanding of particular types of offending and ways to reduce offending.

For more information read the evaluation reports online.8

Box 9.D: An example of a randomised controlled trial

Primary School Free Breakfast Initiative (Welsh Assembly Government)

The Welsh Assembly Government made a commitment to introduce free healthy breakfasts in primary schools in Wales from September 2004. By January 2007 all primary schools had been offered the opportunity to participate, with more than 1,000 schools involved. The coalition Government's 'One Wales' commitment of 2007 was to maintain the programme.

A cluster randomised controlled trial, with an embedded process evaluation, was commissioned in May 2004 to assess the impact of providing free breakfasts in schools on children's eating habits, concentration and behaviour. The cluster randomised design was chosen because randomisation at the individual level was not possible, as the programme was implemented at the whole school, rather than individual pupil, level. The cluster randomised approach is often chosen for settings-based interventions, such as schools or workplaces.
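A hypothetical sketch of the allocation step is given below; school and pupil identifiers are invented, and only the 56/55 split is taken from the study described here.

```python
import random

# Hypothetical sketch of cluster randomisation: whole schools (clusters) are
# assigned to an arm, and every pupil inherits the arm of their school.
# School and pupil identifiers are invented; the 56/55 split mirrors the study.
random.seed(2004)                       # documented seed for reproducibility
schools = [f"school_{i:03d}" for i in range(111)]
random.shuffle(schools)
arm_of_school = {s: ("control" if i < 56 else "intervention")
                 for i, s in enumerate(schools)}

# A pupil's treatment status is determined entirely by their school:
for pupil, school in [("pupil_A", "school_007"), ("pupil_B", "school_042")]:
    print(pupil, "attends", school, "->", arm_of_school[school])
```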

The study recruited 111 primary schools, of which 56 were randomly assigned to the control condition and 55 to the intervention. Data were collected at each school at three time points: baseline, four-month and twelve-month follow-up. In each school, one Year 5 (age nine to ten years) and one Year 6 (age ten to eleven years) class were randomly selected, resulting in a repeated cross-sectional survey of approximately 4,350 students at each data point.

The evaluation team concluded that the results provided partial support for the scheme as a dietary intervention. The 12 month follow-up found that:

41 per cent of pupils in intervention schools that had started a scheme attended at least once a week, with 30 per cent of these attending each school day;

the quality of breakfasts eaten improved among pupils in intervention schools, with consumption of items such as fruit, vegetables and wholemeal bread increasing;

more positive attitudes towards breakfast were found in intervention schools;

there was no significant effect on breakfast skipping, episodic memory or inattention; and

the absence of a decrease in breakfast skipping was suggested to be unsurprising, given the relatively small number of breakfast skippers at baseline. The evaluation team recommended that further work be undertaken in promoting pupil uptake and reach to address the breakfast skipping issue.

There is an existing evidence base suggesting that breakfast consumption influences cognitive functioning and classroom behaviour. The lack of impact on cognitive functioning in this study is likely to reflect the fact that this was analysed at school level, influenced by uptake, rather than tracking change at the individual level.

For more information read the evaluation reports online.9

Quasi-Experimental Designs (QEDs)

9.18 Suppose, however, that randomisation has for whatever reason been rejected. A QED should then be considered. Fundamentally, these designs use one of two approaches (or sometimes, a combination of both):

exploiting natural randomness in the system to obtain a comparison group that is "as good as random", insofar as group membership does not depend on any factors likely to affect the outcome; or

acknowledging that the comparison group is non-equivalent, but obtaining it in a way that allows selection bias to be modelled (typically in some form of regression model).
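The second approach can be illustrated with a small simulation (all data invented): a regression that includes a treatment indicator alongside the observed characteristics that drove selection recovers the true effect, where a simple difference in means does not. Note that this only removes bias from selection on observed characteristics.

```python
import numpy as np

# Illustrative sketch (invented data): when the comparison group is known to be
# non-equivalent, a regression of the outcome on a treatment indicator plus the
# observed characteristics that drove selection can adjust for those observed
# differences. It cannot correct for selection on unobserved characteristics.
rng = np.random.default_rng(1)
n = 50_000
age = rng.normal(35, 10, n)
prior_score = rng.normal(0, 1, n)

# Allocation depends on an observed characteristic (here, prior_score).
treated = (prior_score + rng.normal(0, 0.5, n)) > 0.8
true_effect = 1.5
outcome = 5 + 0.1 * age + 2 * prior_score + true_effect * treated + rng.normal(0, 1, n)

# Regression with a treatment dummy and the selection-relevant covariates.
X = np.column_stack([np.ones(n), treated, age, prior_score])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print("adjusted treatment estimate:", beta[1])   # ~ 1.5
print("naive difference in means:  ",
      outcome[treated].mean() - outcome[~treated].mean())  # inflated by selection
```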

9.19 Some of the options for obtaining a comparison group are shown in Table 9.A. It is worth mentioning that phased introduction is arguably the most robust approach of those listed, and if full randomisation is deemed unsuitable then this approach should always be given serious consideration at the policy design stage.

Pilots

9.20 Designing evaluation for a pilot involves essentially the same considerations as for a larger scale policy, but there are some additional caveats:

If the pilot is on a very small scale, its effects may not scale up as expected. There could be greater enthusiasm among those involved with the initial piloting than would be encountered more widely. The dynamics of administering the intervention could be rather different among a small group than would be the case with more widespread implementation. Therefore, unless the pilot is simply a proof-of-concept it should try to operate through the same administrative structures as will be used in an eventual wider policy.

Piloting can provide the evaluator with a ready-made comparison group in the form of areas similar to those where the pilot took place, but not operating it. However, unless the evaluation uses only administrative data, it will be necessary to carry out data collection in the comparison areas as well. That could be more problematic as staff working in those areas will face an additional burden from taking part in the evaluation, without gaining the potential benefits of early assignment to the new policy. An alternative is to allocate treatment and control groups within a pilot area.

Addressing non-randomness

9.21 Whether the comparison groups in Table 9.A are "as good as random" depends on the details of how they arise, or are constructed, for any particular policy. For example, if a phased introduction is used and the assignment of areas to waves is essentially arbitrary (or indeed, has actually been randomised) then it is reasonable to compare areas that are in the first wave with those that are not. On the other hand, if the highest priority areas are placed in the first wave, then the comparison group must be regarded as non-equivalent, and selection bias is a real possibility. Another issue is that consistency of delivery may change over time, especially if the first wave embraces the new policy more enthusiastically than the later waves.

Table 9.A: Example sources of a comparison group

Phased introduction

The policy is introduced in "waves" rather than simultaneously in all geographical areas. During the period when not all areas are implementing the policy, the areas assigned to the later waves can form a comparison group for the earlier ones. This is similar to piloting but can be more rapid, as there is no presumption of an evaluation being completed on the first wave before the second is launched. It does however require that the impact occurs on a short timescale, relative to the interval between waves, and that the details of the policy do not change between waves. It also assumes that behavioural effects and impacts are not triggered by the policy announcement.

Intermittent application

If the policy involves interventions that are very short term in nature (such as media campaigns, for example) then applying these in intermittent bursts, where different areas receive them at different times, can be used to compare active areas to quiet areas. Once again, the impact needs to occur on a short timescale if this approach is to be used.

Accidental delays

Policies that begin simultaneously nationwide are problematic with regard to area-based studies. But it is worth investigating whether for practical reasons some areas went ahead more rapidly than others. If a frank account of the degree of implementation can be obtained from each area, a comparison group of "slow starters" might emerge. If there is a "postcode lottery", the evaluation can make use of it.

Intensity levels

If simultaneous introduction of the policy is unavoidable, another strategy is to evaluate based on differing modalities or intensities in different areas. Where there is local discretion on how the policy is implemented, it may be possible to classify different areas according to the decisions they made; where some areas receive enhanced funding or run additional interventions, these areas may be compared with those operating only the basic policy. In both cases, however, the impact estimated is for the difference between variants of the policy rather than for the policy as a whole.

Administrative rules

A comparison group may arise as a result of having to "draw a line" to decide who receives an intervention. For example, an offender aged 17 years 11 months may be very similar to one aged 18, but treated completely differently by the criminal justice system.

Targeting

Whenever a policy is intended only for a certain subpopulation (of individuals or areas), those unaffected by it form a potential comparison group. Almost always in this scenario, the comparison group will be non-equivalent.

Non-volunteers

Where participation in a programme is voluntary, those who do not participate can be a source of a potential comparison group. Such a comparison group will always be non-equivalent and controlling for the differences will be challenging.

9.22 So, if the comparison group is not "as good as random", what can be done about it? At the policy design stage, the points to consider are:

how allocation to treatment will occur (whether intentionally or accidentally) and how this might lead to non-equivalence;

what data can be captured on the known characteristics of individual subjects, for use in subsequent analysis; and

whether it is possible to design the policy so that allocation uses an objective rule, based on these known characteristics of those who might be targeted. If it is, then evaluation will be stronger, because the sources of selection bias are all known.

9.23 The topic of modelling selection bias is developed further in the sub-section on data analysis below.

9.24 In relation to the third bullet above, a special case of an objective allocation rule is to form an "assignment score" based on the level of need of each individual. Those above a certain score receive the intervention. An elegant method of analysis is then offered by the regression discontinuity design (RDD; supplementary guidance will provide more detail on RDD). This design is based on examining the boundary between the "only just eligible" and the "not quite eligible". The scores (both of participants and non-participants) need to be captured for future analysis. The main drawback of the RDD is that the results only apply directly to those at the boundary, and may not be an accurate indicator of the effects on individuals with characteristics away from the threshold.
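A minimal sketch of the RDD idea follows (data, scores and the threshold are all invented): separate linear trends are fitted within a bandwidth either side of the cut-off, and the gap between them at the threshold is the estimated effect for those near the boundary.

```python
import numpy as np

# Illustrative sketch of a regression discontinuity design (invented data).
# Units with an assignment score above a threshold receive the intervention;
# fitting separate linear trends on either side and comparing their values at
# the threshold estimates the effect for those near the boundary.
rng = np.random.default_rng(2)
n = 20_000
score = rng.uniform(0, 100, n)          # hypothetical "level of need" score
threshold = 60.0
treated = score >= threshold
true_jump = 4.0
outcome = 20 + 0.2 * score + true_jump * treated + rng.normal(0, 2, n)

# Local linear fit within a bandwidth on each side of the threshold.
bandwidth = 10.0
left = (score >= threshold - bandwidth) & (score < threshold)
right = (score >= threshold) & (score < threshold + bandwidth)

b_left = np.polyfit(score[left], outcome[left], 1)
b_right = np.polyfit(score[right], outcome[right], 1)

# Difference between the two fits evaluated at the threshold.
effect = np.polyval(b_right, threshold) - np.polyval(b_left, threshold)
print("RDD estimate at threshold:", effect)   # ~ 4.0
```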

9.25 Voluntary participation in an intervention is an example of non-randomness that is a particular problem for the evaluator. It is tempting to use individuals who opted not to participate in some scheme (or chose not to complete the course) as a comparison group for those who did, but the fundamental flaw with this approach is that opters-in are very likely to be different from opters-out, and in particular are likely to be better motivated. Motivation might be important if it is a significant determinant of the effectiveness of the intervention (for example educational courses being more effective with motivated students). This "self-selection" is another example of a non-equivalent comparison group, and can be one of the hardest to address. Some possible solutions are:

attempt to control for motivation directly. However, motivation is by its nature difficult to observe, and standard administrative data on prospective participants, such as demographics, are unlikely to capture it. Therefore, specialised surveys may be required in an attempt to elicit participants' reasons for the decision, and this can be a costly exercise. Alternatively it may be possible to find proxies for motivation. For example, studies on schemes to help non-employed people into work10 have found that previous labour market history gives a good indication of motivation, if recorded in sufficient detail;

carry out the analysis on the basis of intention to treat (ITT). The policy group consists of all those offered the intervention, even those who decline, and a comparison group is drawn from individuals who would have been eligible but were not offered (perhaps because they were associated with an institution that did not operate the scheme at the time)11. Impacts estimated on an ITT basis tend to be smaller than those based on an actual treatment group, since the ITT group is diluted by non-participants, and it may not be possible to distinguish the impacts from the "noise" (see below). However this approach can have stronger internal validity and is arguably more policy relevant, since it measures the effect per person of making the policy available, which can actually be controlled (a minimal sketch follows this list); and

examine what happens downstream of the decision to participate. If some individuals who consented were later unable to participate due to unavailability of resource or other administrative reasons (but not due to reneging, which would reintroduce selection bias) then these individuals can provide a comparison group.
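The trade-off described in the second bullet can be illustrated with a small simulation (all values invented): a randomised offer supports an unbiased but diluted ITT estimate, while comparing actual participants with non-participants is contaminated by motivation.

```python
import numpy as np

# Illustrative sketch of intention-to-treat (invented data): everyone *offered*
# the intervention stays in the policy group, whether or not they take it up.
# More motivated people both take up the offer and do better anyway, so an
# "as-treated" comparison is biased, while the ITT estimate of the effect of
# the offer is unbiased but diluted by non-participants.
rng = np.random.default_rng(3)
n = 100_000
motivation = rng.normal(size=n)
offered = rng.random(n) < 0.5                 # offer is randomised
takes_up = offered & (motivation + rng.normal(0, 0.5, n) > 0)
effect_if_treated = 2.0
outcome = 1.0 * motivation + effect_if_treated * takes_up + rng.normal(size=n)

itt = outcome[offered].mean() - outcome[~offered].mean()
as_treated = outcome[takes_up].mean() - outcome[~takes_up].mean()
print("ITT estimate (per person offered):", itt)        # ~ 2.0 * take-up rate
print("biased as-treated comparison:    ", as_treated)  # inflated by motivation
```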




3 Randomness" is used here in its widest sense, of events occurring by chance. "Randomisation", where a chance mechanism is introduced into policy allocation quite deliberately, is an important special case, but is not the only way in which randomness can occur

4 Internal validity and external validity are two terms often used to describe the strength or otherwise of an evaluation design. They can be explained by reference to the evaluation of a programme piloted in a small number of areas. Internal validity concerns whether the impact on the people who took part in those areas has been estimated correctly; external validity concerns whether the same impact would be obtained in other areas, or at another time.

5 This chapter of the Magenta Book uses the term "research designs" to include both experimental and quasi-experimental designs.

6 Blinding" refers to feature of experiments in which neither participants, nor those interacting with them, are aware who is in the treatment group and who in the control group. This is most easily understood in the context of drug trials, where it is necessary to guard against the well-know placebo effect, whereby somebody who believes they are getting an improved treatment can respond positively regardless of whether there is any direct effect. To overcome this, treatment and control group members receive apparently identical treatments, and have no way of knowing which they are receiving. Further, because those monitoring their progress may - consciously or unconsciously - record results differently for those they know to be receiving the alternative treatment, they also need to be 'blind' to the allocation. In social policy experiments, this is extremely difficult to achieve. For example, if the 'treatment' was a course of training, it would be readily apparent to all who was receiving it and who was not.

7 Sometimes, perhaps because it is less common as a means of evaluating social policies, it is supposed that choosing who will benefit from a pilot intervention by random allocation is somehow unfair or unethical. Yet it is no more unfair than allocating treatment on the basis of where somebody lives, which is a much more familiar process.

8 Evaluation of HM Prison Service Enhanced Thinking Skills Programme (McDougall, Perry, Clarbour, Bowles and Worthy, 2009), Ministry of Justice Research Series 3/09, http://www.justice.gov.uk/publications/

9 An Evaluation of the Welsh Assembly Government's Primary School Free Breakfast Initiative (Murphy, Moore, Tapper, Lynch, Raisanen, Clark, Desousa and Moore, November 2007), http://www.wales.gov.uk

10 The econometric evaluation of the New Deal for Lone Parents, Department for Work and Pensions Research Report No. 356, 2006, http://www.dwp.gov.uk/

11 The econometric evaluation of the New Deal for Lone Parents, Department for Work and Pensions Research Report No. 356, 2006, http://www.dwp.gov.uk/