Dec. 1, 2011, 4:24 p.m.
Impact evaluation is the buzzword on every development practitioner’s lips these days. The need to establish and quantify results from foreign aid spending is gaining greater urgency in the current budget climate, and donors have increasingly acknowledged that past efforts to show “what works and why” have fallen short. Randomized controlled trials (RCTs) are often called the “gold standard” because of their ability to establish a direct causal link between a particular program and its impact on the intended beneficiaries.
Yet despite the promise RCTs hold for helping donors and implementers demonstrate results, they are sometimes an inappropriate choice of evaluation method. This is particularly true for many democracy and governance (D&G) programs. Key methodological issues, such as insufficient sample size and spillover effects, hamper efforts to attribute changes in outcomes to a particular program. Perhaps more significantly, the political nature of most D&G programs, as well as their focus on centralized institution building, limits the relevance of RCTs. The challenge for those evaluating D&G programs lies in designing an evaluation framework that approximates the methodological rigor associated with RCTs while accounting for the idiosyncrasies of D&G interventions.
Why RCTs are Considered the Gold Standard
All evaluation methods aim to establish a direct relationship between the program, such as ethics trainings for judges, and the outcome of interest, such as a reduction in corruption. Impact evaluation methods go a step further by establishing a counterfactual, or what would have happened in the absence of the program, through the use of comparison groups. One method of establishing a comparison group is to survey individuals before and after the program to determine how attitudes, knowledge, or some other characteristic changed. A second technique is to compare a group of individuals that received the program (the “treatment group”) with a group that did not receive the program (the “control group”).
Using only one of these methods is not sufficient for claiming that the program caused the outcome, since other factors could have contributed to the changes observed. RCTs combine the before/after and with/without techniques. The rigor in this method – and the reason why RCTs are considered the gold standard – comes from randomly assigning participants to the treatment and control groups, which makes the two groups statistically identical on average, instead of relying on assumptions about the two groups being similar, which can lead to biased results.
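As a rough illustration of how statistically comparable groups are formed, the following Python sketch randomly splits a list of units into treatment and control groups. The court names, the function name assign_groups, and the fixed seed are all hypothetical, chosen only to make the example reproducible; they are not drawn from any actual evaluation.

```python
import random


def assign_groups(units, seed=0):
    """Randomly split units into a treatment and a control group.

    Hypothetical helper for illustration: shuffling before splitting
    means no characteristic of a unit can influence which group it
    lands in, so the groups are comparable on average.
    """
    rng = random.Random(seed)      # fixed seed only so the sketch is reproducible
    shuffled = list(units)         # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Twenty invented district courts standing in for the units of study.
courts = [f"court_{i}" for i in range(1, 21)]
treatment, control = assign_groups(courts, seed=42)

print("Treatment group size:", len(treatment))
print("Control group size:", len(control))
```

With only a handful of units, chance differences between the two groups can still be large, which is one reason sample size matters so much for this method.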
Take an example of a program that introduces a case-management system in an attempt to reduce corruption in the judiciary. A typical evaluation might use the before/after approach and gather survey data on the number of bribes paid in a particular district court before and after the program is implemented. That the resulting change in bribery can be attributed to the case-management system is the assumption on which the evaluation would rest. Yet the change might have occurred without the case-management system, and the use of a with/without comparison will shed light on this. In finding a district court to use as a comparison, it is imperative that this control group be as similar as possible to the treatment group. If several judges attend an ethics training during the evaluation time period, that training could be an intervening factor that affects the number of bribes paid at the district court serving as the control group. Thus the true impact of the case-management system on corruption in the district court will not be observed.
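The arithmetic behind combining the two comparisons can be sketched in a few lines of Python. The bribe counts below are invented purely for illustration and do not come from any real court or study; the function names are likewise hypothetical. The sketch subtracts the change observed in the control court from the change observed in the treatment court, a calculation commonly called difference-in-differences.

```python
def before_after_change(before, after):
    """Change in the outcome within a single group."""
    return after - before


def difference_in_differences(treat_before, treat_after,
                              control_before, control_after):
    """Change in the treatment group minus change in the control group."""
    treat_change = before_after_change(treat_before, treat_after)
    control_change = before_after_change(control_before, control_after)
    return treat_change - control_change


# Hypothetical survey counts of bribes paid (illustrative only).
treat_before, treat_after = 40, 25        # court with the case-management system
control_before, control_after = 38, 33    # comparison court without it

naive_estimate = before_after_change(treat_before, treat_after)
did_estimate = difference_in_differences(
    treat_before, treat_after, control_before, control_after
)

print("Before/after only:", naive_estimate)          # -15
print("Net of the control court's change:", did_estimate)  # -10
```

In this made-up case, a before/after comparison alone would credit the program with a drop of 15 bribes, while the comparison court suggests that 5 of those would have disappeared anyway. The adjusted figure is only as credible as the similarity of the two courts, which is the point of the paragraph above.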
Applying the RCT Method to D&G Programs
The use of the RCT framework resolves two main problems that plague most D&G evaluations, namely the levels-of-analysis problem and the issue of missing baseline data. The levels-of-analysis problem arises when evaluations link programs aimed at meso-level institutions, such as the judiciary, with changes in macro-level indicators of democracy, governance, and corruption. Linking the efforts of a meso-level program to a macro-level outcome rests on the assumption that other factors did not cause the outcome. An RCT design forces one to minimize such assumptions and isolate the effect of the program, versus the effect of other factors, on the outcome. By choosing a meso-level indicator, such as judicial corruption, to measure the outcome, the evaluator can limit the number of relevant intervening factors that might affect the outcome. In addition, because an RCT design compares before/after changes in both a treatment group and a control group, the collection of relevant baseline data, if it does not already exist, is a prerequisite for conducting the evaluation. Many D&G evaluations have relied on collecting only ex-post data, making a true before/after comparison impossible.
Yet it would be difficult to evaluate some “traditional” D&G programs through an RCT design. Consider an institution-building program aimed at reforming the Office of the Inspector General (the treatment group) in a country’s Ministry of Justice. If the purpose of the evaluation is to determine what effect the program had on reducing corruption in that office, there is no similar office (control group) from which to draw a comparison. The lack of a relevant control group and sufficient sample size is the main reason many evaluations cannot employ an RCT design.
Sometimes, treating one group of individuals affects the behavior of the control group, which statistically “contaminates” the control group and makes pure with/without groups impossible to establish. Consider a parliamentary-strengthening program in a small country where members of Parliament constantly interact. Some members are designated as the treatment group and given training on how to better manage public funds. The members chosen for the control group are likely to hear about such training from their colleagues. Such “spillover” effects are difficult to account for in D&G evaluations, not only because of the difficulty in controlling the spread of information, but also because donors rightly place a high value on sharing best practices. In terms of institution building, it is hard to imagine a scenario in which one could both randomize and control spillover.
The complex political nature of D&G programs also makes an RCT design problematic. Political actors often oppose the aims and activities of D&G programs because they challenge the status quo. Donors and implementers need to be flexible and willing to modify short-term aims and activities for the sake of achieving intermediate- and long-term objectives. This can make it difficult to capture all the relevant baseline data needed to conduct a similar before/after analysis of both the treatment and control groups, since new elements may be introduced midway through the program. In addition, D&G programs are often aimed at multiple levels, actors, and target populations. This means that linear lines of causation are much harder to draw, and intervening variables such as changing political situations make isolating the effect of the program difficult.
The Way Forward
The focus on using RCTs to evaluate aid’s impact is understandable, given the dearth of methodologically rigorous evaluations. In some cases, RCTs are an ideal method for evaluating D&G outcomes; Ben Olken’s study of the tools that effectively reduced corruption in World Bank infrastructure projects in Indonesia is one example. The risk, however, is that this focus will privilege RCTs as the only valid type of evaluation design. Donors and practitioners can avoid this outcome while generating better-quality D&G evaluations by adopting the following practices: