McDermott 2002 “Experimental Methods in Political Science”

Experimental Design

  1. Standardization is crucial in experimentation because it ensures that stimuli, procedures, responses, and variables are presented, coded, and analyzed in the same way for every subject. This reduces the likelihood that extraneous factors could influence results in decisive ways. Standardization requires that the same set of experimental procedures, or protocol, be administered to subjects across conditions, with only the independent variable(s) of interest manipulated. Perfect standardization would ensure that differences in outcomes between groups are due to the treatment and not to extraneous environmental factors.

  2. Randomization refers to the assignment of subjects to experimental conditions. Subjects are assigned randomly so that unrelated or spurious factors do not bias the results. Random assignment causes these unrelated differences between groups to cancel out in expectation, leaving no systematic differences between subjects to bias the study. Randomization is most reliably implemented with a random number generator, and it can be checked by regressing treatment assignment on demographic information collected in the study.

  3. Between-subjects versus within-subject: Typically one experimental condition is compared to another experimental condition and/or to a control condition. The control creates a baseline, and treatments are compared against it and against each other. In between-subjects designs, different groups of subjects are assigned to experimental or control conditions and then compared. In within-subject designs (also known as A-B-A designs), an individual serves as both her own control and treatment group: a subject begins with a baseline measure (A), is later administered a treatment (B), and the baseline measure is taken again post-treatment (A) to determine the effect of the treatment on the subject.

  4. Placebo Effects: Fake treatments can cause powerful changes in outcome based on an individual’s belief that the treatment will work. Control conditions are essential to determine the extent of placebo effects.

  5. Experimental Bias: The experimental process can introduce biases through expectancy effects, experimenter bias, and demand characteristics, discussed in greater detail below.

    1. Expectancy Effects: Expectancy effects occur when an experimenter communicates (usually inadvertently) how he or she wants the subject to behave or respond. Results become a self-fulfilling prophecy as experimenters create the results they desire through signals rather than through the theorized experimental manipulations. Ways to avoid this: having multiple experimenters run subjects across all conditions, using a double-blind strategy in which the experimenter does not know each subject’s condition, designing the experiment to avoid experimenter involvement (e.g. computer-based administration), or treating the experimenter as a factor/variable in the statistical analysis (e.g. controlling for enumerator or testing for enumerator effects).

    2. Experimenter Bias: Experimental choices originate from the experimenter’s beliefs and attitudes, and these choices can influence the design of an experiment in a nonrandom way. This may be acceptable if it is carefully considered, but it becomes problematic if the investigator does not adequately examine those beliefs.

    3. Demand Characteristics: Similar to expectancy effects, except for the origin of the cues. The cues emerge from the subject’s interpretation of the experiment rather than from anything the experimenter does directly. Systematic bias can be introduced when the purpose of the experiment is too obvious. This is made worse when subjects attempt to behave in a way that would make the experimenter like them more and thus try to do what they think the experimenter “wants”. Ways to limit the impact of demand effects: using deception so that subjects cannot determine the relevant hypotheses, evaluating demand characteristics in the analysis at the end of the study, and using computer technology to complicate or depersonalize the experiment so that subjects are less likely to discern its true purpose.

Experimental Measures

Experimental measures aim for reliability and validity in measurement. Reliability refers to the extent to which an experiment tests the same thing over and over again. A reliable result is one that is easily replicable. Reliability improves (1) when measures are standardized, (2) when a larger number of measures have been taken, and (3) when factors that might bias the data are controlled in advance.

  1. Self-Reports, Behavioral Measures, and Physiological Measures

    • Self-reports: verbal or written reports of a subject’s responses to stimuli. Can be questionnaires, surveys, or interviews. Qualitative responses can be coded into quantitative categories for analysis.

    • Behavioral measures require experimenters to observe the behavior of subjects and to record the subjects’ responses to stimuli.

    • Physiological measures include data on heart rate, skin response, blood pressure, hormone levels, or brain imaging (MRI, PET) in response to stimuli.

  2. Incentives

    • Most psychological experiments offer little more than course credit as an inducement.

    • However, economists typically offer material incentives (money or lottery); these can be offered for showing up or as part of the experiment (i.e. to incentivize participants to play to win against an opponent in a game theoretic simulation).

Threats to Internal and External Validity

  • Internal validity asks “Did in fact the experimental treatments make a difference in this specific experimental instance?”

  • External validity asks “To what populations, settings, treatment variables, and measurement variables can this effect be generalized?”

  • Typically, psychologists have been more concerned with internal validity and political scientists more concerned with external validity.

Threats to Internal Validity

  1. History: any event that occurs outside the experimenter’s control in the time between the measures on the dependent variable. More of a concern when there’s a lot of time between the measurements on the DV.

  2. Intrasession history: events inside the study itself that are beyond the control of the investigator, may affect the outcome of the study, and thereby introduce confounds.

  3. Maturation: Natural needs, growth, and development of individuals over time.

  4. Performance Effects: Performance can change as a result of experience. Test performance can be affected simply by having taken the test before, so pre-tests and post-tests cannot be assumed to constitute identical assessments. Independent of the manipulation, taking the first test can affect performance on the second test through learning.

  5. Regression toward the mean: Every observed score is a true score plus random error. Subjects with extreme scores are likely to move closer to the mean in subsequent tests, so if subjects are chosen based on extreme values, experimenters are likely to bias their own results by ignoring regression dynamics in the selection process (a minimal simulation appears after this list).

  6. Subject self-selection: Subjects who select into a condition may differ systematically from those who are randomly assigned to a condition.

  7. Mortality: Occurs when subjects are lost to follow-up by the investigator. Subjects that fail to show up for subsequent rounds of an experiment may differ systematically from those who continue to be measured, introducing selection bias. Especially problematic in longitudinal field experiments as folks move and their situations change. The bias is most problematic when some aspect of the experimental treatment has systematically made one group of subjects uncomfortable enough to drop out.

  8. Selection-maturation interaction: Occurs when subjects are placed into an experimental condition in a nonrandom manner and some aspect of the group differs in maturation from the others in a systematic way.

  9. Unreliable measures: Non-random measurement error, shifts in the subject population, etc. cause bias.
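Selecting on extreme scores tends to produce apparent movement toward the mean even without any treatment, which is worth seeing once in simulation. A minimal sketch, assuming normally distributed true scores and errors and a top-10% selection rule (all parameters illustrative):

```python
import numpy as np

# Regression toward the mean: select subjects on extreme first-test scores,
# then observe their retest scores drift back toward the mean with no
# treatment at all. True scores and errors are simulated as standard normal.
rng = np.random.default_rng(0)
n = 100_000
true_score = rng.normal(0, 1, n)           # stable component
test1 = true_score + rng.normal(0, 1, n)   # observed score = true + error
test2 = true_score + rng.normal(0, 1, n)   # fresh error on retest

extreme = test1 > np.quantile(test1, 0.90) # select on extreme first scores
print(f"mean test1 among selected: {test1[extreme].mean():.2f}")
print(f"mean test2 among selected: {test2[extreme].mean():.2f}")  # noticeably lower
```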

Threats to External Validity

  1. Testing interaction effects: The act of testing can increase subjects’ sensitivity to the variables, making it difficult to generalize results.

  2. Unrepresentative subject population: Sears (1986) argues sophomore university students typically differ in important ways from the population of interest, although Roth (1988) found that findings from sophomore students are remarkably robust. Still a lot of concern about subject pools. Best practice is to sample directly from the population of interest, though this can be hard with elites since they are typically either too busy or not interested in participating in experiments. (Typical experimental subjects often lack the experience needed to act “as if” they were professional legislators; yet, legislators themselves are often reluctant to participate in experiments as subjects.)

  3. Hawthorne Effect: People change their behavior merely because they are aware that they are being observed.

  4. Professional subjects: Overly experienced or jaded subjects may be more likely to guess the underlying hypotheses or manipulation in an experiment if they have participated in similar experiments in the past.

  5. Spurious measures: An unexpected aspect of the experiment may induce subjects to give systematically irrelevant responses to particular measures, which are then understood to be experimental effects.

  6. Irrelevant measures: Irrelevant aspects of the experimental condition can produce results that appear to be experimental effects.

Advantages and Disadvantages of Experiments

Advantages

Comparative advantage of experiments is in their high degree of internal validity (when well-designed and executed). Experiments can provide strong support for causal inferences because investigators can control the environment, isolating differences in independent variables of interest so that differences in the dependent variable can be attributed to the experimental manipulations.

  1. Ability to derive causal inferences: Randomization of subjects and control of the environment allows for greater confidence regarding causal inferences about the relationships between the variables of interest.

  2. Experimental control: The experimenter has control over recruitment, treatment, and measurement of the subjects and variables.

  3. Precise measurement: A well-designed and implemented experiment allows the researcher to improve the quality of measurement.

  4. Ability to explore the details of process: Experiments can break down complex relationships to investigate constituent parts in isolation or in greater detail in order to understand which details of a process result in differences under investigation. Ability to interact variables allows the researcher to determine under what conditions the relationships hold.

  5. Relative Economy: Small surveys, convenience samples, and the like offer economical alternatives to large-scale surveys or field experiments. In particular, student samples are cheap, plentiful, and pretty reliable. More representative samples can get expensive, but mTurk is pretty cheap, too.

Disadvantages

Experiments may not be ideal. Most of the disadvantages are around external validity, ethics, or feasibility problems.

  1. Artificial Environment: Many experimental settings are artificial and unrepresentative of the environments in which subjects would normally perform the behavior studied. This is because it can be impossible or unethical to more realistically simulate many environments.

  2. Unrepresentative Subject Pools: Subject pools may be unrepresentative of populations of interest.

  3. External Validity: It is difficult in the laboratory to simulate key real-world conditions that operate on political actors.

    • Engagement can be low due to short-term and weak incentives: in the real world, actors have histories and shadows of the future with each other, they interact around many complex issues over long periods, and they have genuine strategic and material interests, goals, and incentives at stake.

    • Cultural norms, relationships of authority, and the multitask nature of the work itself might invalidate any results that emerge from an experiment that does not, or cannot, fully incorporate these features into the environment or manipulation.

    • Subjects may behave one way in the relative freedom of an experiment, where there are no countervailing pressures acting on them, but quite another when acting within the constrained organizational or bureaucratic environments in which they work at their political jobs.

    • Failure to mimic or incorporate these constraints into experiments, and difficulty in making these constraints realistic, might restrict the applicability of experimental results to the real political world.

    • External validity is only fully established through replication: the same model should be tested on multiple populations using multiple methods to determine external validity.

    • External validity depends more on the realism the experiment creates for the subject (“experimental realism”) than on “mundane realism”, similarity to real-world settings. As long as the experimental situation engages the subject in an authentic way, experimental realism has been achieved.

  4. Experimenter Bias: Experimenter bias, including expectancy effects and demand characteristics, can limit the relevance, generalizability, or accuracy of certain experimental results.

Experimental Ethics

  1. Informed Consent: Provide a disclosure statement to every subject prior to the experiment, describing the procedures, expected gains and risks, and the ability to leave at any time, and give contact information.

  2. Risk/Gain Assessment: Take precautions to limit risk to subjects.

  3. Deception: Deception is subject to ethical debate. Some argue it violates informed consent, others argue that it damages the reputation of experimental scientists and is unjustified, others argue that it can be necessary to ensure that subjects cannot guess the hypotheses under investigation and thus essential in providing unbiased results. Deception increases IRB scrutiny.

  4. Debriefing: After experiment, tell subjects as much as possible about the experiment. Explain any deception used and why it was necessary. Reiterate confidentiality.

Druckman et al 2011 “Experimenting With Politics”, plus lab in the field discussion

Lab Experiments

  • Place subjects in situations that show how people reach decisions as voters, jurors, or legislators.

  • Laboratory experiments can inform the design and effectiveness of governmental institutions.

    In a classic laboratory experiment by Ostrom et al. each subject decided how much to withdraw from a group fund that mimicked a scarce environmental resource. If the subjects overwithdrew, then the group as a whole earned less. Allowing group members to shame those who overwithdrew, or to shame and fine, yielded greater collective benefits than did fines alone.

    The results challenged the long-standing presumption that a group’s ability to produce high-value public goods—such as good air quality for all, despite individual incentives to pollute—requires an external authority to impose punishments for noncompliance. The work stimulated a large body of research into when and how common political factors, such as ethnic heterogeneity in politically salient groups, affect the possibility of effective self-governance in the absence of external coercion.

Survey Experiments

  • Embed experiments in large, and often nationally representative, surveys. These experiments elucidate how variations in the descriptions or presentations of political phenomena affect the perceptions and feelings of diverse citizen populations.

  • Survey experiments are particularly valuable for clarifying voter behavior. For example, Kuklinski et al. used a “list experiment” to elicit the extent to which citizens are willing to admit racial anxiety or animus. Subjects were presented with a list of items and asked, “How many of them upset you?” Some received a three-item list; others received a four-item list where the added item was “a black family moving in next door.” Among white survey respondents in the American South, the four-item group reported that an average of 2.37 items made them upset, compared to 1.95 items in the three-item group. Given that the groups are otherwise identical, the implication is that 42% of southern respondents were upset by the thought of a black neighbor. This finding contrasts with non-southerners, who reported that a similar number of items made them upset regardless of whether they chose from the list of three or four items. It is also telling that just 19% of southern respondents admitted that a black neighbor would upset them when asked the question directly.
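The list-experiment estimator is just a difference in mean item counts between the two randomized lists. A minimal sketch on data simulated to match the reported means (group sizes and the simulation itself are illustrative):

```python
import numpy as np

# List experiment: the prevalence of the sensitive item is the difference
# in mean item counts between the 4-item and 3-item groups.
# Counts are simulated to match the reported means (2.37 vs. 1.95).
rng = np.random.default_rng(1)
y4 = rng.binomial(4, 2.37 / 4, size=500)   # hypothetical 4-item group
y3 = rng.binomial(3, 1.95 / 3, size=500)   # hypothetical 3-item group

est = y4.mean() - y3.mean()
se = np.sqrt(y4.var(ddof=1) / len(y4) + y3.var(ddof=1) / len(y3))
print(f"estimated prevalence: {est:.2f} (SE {se:.2f})")  # ~0.42
```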

    Survey experiments can also provide a window into how people will think if policies are described in different ways. Schuldt et al. provide a compelling example in their study of one of the most debated issues of our time: climate change. The authors randomly assigned some survey respondents to answer a question about whether “global warming” has been happening. Other respondents were asked a version of the question that replaced the words “global warming” with “climate change.” The authors examined how the wording differences affected response patterns among politically relevant sub-populations. For example, 60% of Republican respondents believed climate change to be occurring, whereas only 44% of them believed global warming was taking place. Collectively, such experiments give users of surveys the means to more accurately interpret existing survey responses, and also provide unique data on the extent to which stated attitudes are robust to situational variations.

    For decades, traditional opinion surveys have shown that many citizens cannot recall seemingly basic political facts, such as which political party controls a majority of seats in the U.S. Congress. Academics and members of the press, in turn, drew broad claims about voter incompetence from such data. Experimental research has produced a different view. For example, Lodge et al. studied how citizens’ memories of specific candidate attributes affect their subsequent preferences. After asking respondents to report their opinions on a set of issues, the researchers gave them a fact sheet describing the issue positions of two candidates. After a randomly determined delay of between 1 and 31 days, 80% of respondents failed to recall candidate issue positions. Yet, most respondents expressed strong preferences for the candidate who most closely shared their positions. Common “political fact tests” may thus reveal very little about how voters think.

    In a more recent related study, Prior and Lupia asked 1,200 selected members of a national survey to answer a set of fact-based political questions. They randomly assigned some respondents to a control group that mimicked traditional surveys. In a second group, respondents were paid $1 for every correct answer. Relative to the control group, payment produced a 32% increase in correct answers for respondents who reported following politics “some of the time” (rather than “most of the time” or “not at all”). Thus, opinion surveys may underestimate what voters know because they offer little motivation for respondents to think about the questions during the interview.

Lab-in-Field Experiments

  • The human species did not evolve in universities. Hence it may be risky to take human behavior as measured in university behavioral laboratories as the “real” human behavior and as the starting point of a discourse on how societies should be organized to best fit this “real” nature of humans.

    The first step in testing the validity of results from university lab experiments is to take the lab experiment into the field, that is, into the natural context where people normally make decisions. Taking lab experiments from the university into the varied habitats of human social environments allows for two important insights:
    (1) Lab experiments in the field serve as a robustness check on the results previously produced in the university lab.
    (2) One can learn from lab experiments in the field about the taxonomy and variety of human behavior: depending on social class and structure, measured behavior may differ from what has been measured in the university lab with students and rather artificial lab environments. A better understanding of the taxonomy and distribution of human behavior is fundamental to more realistic modeling of the dynamics of complex social systems.

Field Experiments

  • Researchers integrate random assignment into real political campaigns or attempts to implement policy. These experiments can clarify the relative effectiveness of various tactics and strategies.

  • In recent years, field experiments have gained greater visibility, particularly in the context of voter mobilization. A leading example is that of Gerber and Green, who randomly distributed messages—for example, through personal contact, by phone, or by mail—to potential voters during an election campaign. Compared to voters who received no reminder or a mail or phone reminder, a personal visit boosted turnout. More recently, Gerber et al. performed a study in which some subjects received a message that their neighbors would be informed about whether they turned out to vote. These subjects were much more likely to vote in the election than were subjects who received no message.

    Findings from these voter mobilization experiments have affected the manner in which political parties conduct campaigns and have been used as a model for inquiries about the effectiveness of voter mobilization strategies. For example, researchers in China showed that during a regional election in 2003, door-to-door canvassing to encourage people to vote increased turnout by over 10%. In another study, researchers randomly assigned 49 Indonesian villages to one of two methods for choosing an economic development program. In roughly half of the villages, chosen citizen representatives made the decision. In the other villages, all eligible villagers could vote directly on which program to pursue. The experimental treatment had small effects on the villages’ chosen projects, but villagers who were given the opportunity to vote viewed the chosen projects as more valuable and were far more satisfied with the outcome. Collectively, these efforts reveal effective routes to increasing electoral participation in ways that lend legitimacy to electoral outcomes.

Natural Experiments

  • Attempting to identify and analyze real-world situations in which some process of random or as-if random assignment places cases in alternative categories of the key independent variable. In the social sciences, this approach has been used to study the relationship between lottery winnings and political attitudes, the effect of voting costs on turnout, the impact of quotas for women village councilors on public goods provision in India, and many other topics. In the health sciences, a paradigmatic example comes from John Snow’s nineteenth-century tests of the hypothesis that cholera is waterborne.

Gerber and Green (2012) Field Experiments: Design, Analysis, and Interpretation

Introduction

  1. Problem with research based on statistical interpretations of observational data: dominant methodological practice is to move from raw correlations to more refined correlations.

  2. Observational research is vulnerable to unobserved heterogeneity, and the list of potential confounders is substantial, so one cannot simply control away the problems with observational data.

  3. Experiments are a solution to the problem of unobserved confounders: offers a research strategy that does not require the identification of all potential confounders because these balance out in expectation due to random assignment.

  4. Experiments are a fair test: neither treatment nor control groups should in expectation have an advantage other than treatment.

  5. Types of experiments and their trade-offs:

    • Lab experiments are great for assessing a theoretical claim by testing an implied causal relationship (e.g. game theorists can use lab experiments to manipulate uncertainty and incentives and assess their effects on bargaining between subjects). These are often more abstract, and the laboratory environment reminds participants that they are participating in an experiment. Subjects are often university students or lay people. A practical advantage of lab experiments is that one can more easily administer multiple variations of a treatment to test fine-grained theoretical propositions.

    • Field experiments prioritize realism and unobtrusiveness in an effort to test context-specific hypotheses. They are best for addressing questions of both theoretical and practical (policy-relevant) concern. Results can be biased if subjects are keenly aware that they are being studied and adjust their behavior based on what they perceive the experimenter’s desired outcome to be. Experiments in real-world settings are designed to make generalizations less dependent on assumptions; the “fieldness” of an experiment is gauged along four dimensions: authenticity of treatments, participants, contexts, and outcome measures. When field experiments are not highly naturalistic, their conclusions become more dependent on assumptions. Field interventions can be more cumbersome and risky: implementation is challenging and often requires coordination with policy practitioners who don’t always embrace things like random assignment. Field experiments can sometimes achieve high levels of theoretical nuance by applying a wider array of treatments to a large pool of subjects.

    • Natural Experiments occur when a government or institution randomly assigns treatment between individuals, i.e. the Vietnam draft lottery, random assignment of defendants to judges, lotteries that assign children to charter schools, random assignment of audits in Brazil, visa lotteries, etc. When random assignment procedures are used by a government, this sets the stage for a natural experiment. Extra effort should go into verifying random assignment in natural experiments.

    • Quasi-Experiments occur when there is “near-random assignment” of individuals or groups to treatments, but without true random assignment; they thus carry greater uncertainty about causal inference due to selection bias and unobservable confounds (e.g. systematic differences between candidates who narrowly win elections and those who narrowly lose generate selection bias that regression discontinuity designs cannot fully eliminate). Data are also sparse in the close vicinity of the boundary in regression discontinuity contexts. In quasi-experimental designs, a strong reliance on argumentation and modeling choices increases uncertainty over causal inference.

Causal Inference and Experimentation

  1. Summary:

      1. A causal effect is the difference between two potential outcomes: one in which the subject receives the treatment and one in which the same subject does not receive the treatment.

    2. The fundamental problem of causal inference arises because one cannot simultaneously observe the potential outcomes for a subject under the treatment and control conditions.

    3. Experiments provide unbiased estimates of the average treatment effect among all subjects when the following conditions are met:

      • Random Assignment: Treatments are allocated such that all units have a known probability of being placed into the treatment group. Simple or complete random assignment implies that treatment assignments are statistically independent of the subjects’ potential outcomes. Assumption is satisfied when all treatment assignments are determined by the same random procedure. (Discretion in assignment should be minimized, and subjects that “must be treated” should be excluded from the study due to violations of randomization.)

      • Excludability: Potential outcomes respond solely to receipt of the treatment, not to assignment or any indirect by-products of random assignment. The treatment must be defined clearly so that one can determine whether subjects are responding to the treatment itself or to something else. This assumption is violated if:

        1. different procedures are used to measure outcomes in the treatment and control groups

        2. research activities, other treatments, or third-party interventions other than the treatment of interest differentially affect the treatment and control groups

      • Non-interference: SUTVA. Potential outcomes for observation i only reflect the treatment or control status of observation i. This assumption is jeopardized if:

        1. subjects are aware of the treatments that other subjects receive

        2. treatments may be transmitted from treated to untreated subjects

        3. resources used to treat one set of subjects diminish resources that would otherwise be available to other subjects

  2. Potential Outcomes Framework

    1. Causal effect of the treatment τi is the difference between two potential outcomes:
      $\tau_i \equiv Y_i(1) - Y_i(0)$

    2. Fundamental problem of causal inference: you can only observe Yi(1) or Yi(0), but never both, where Yi is the observed outcome and (1) or (0) are treatment and control conditions, respectively.

    3. di is the observed treatment delivered to each subject.

    4. Potential outcome equation:
      $Y_i = d_iY_i(1) + (1-d_i)Y_i(0)$
      This indicates that Yi(1) is observed for treated subjects and Yi(0) for untreated subjects.

  3. Average Treatment Effect (ATE)

    1. The average treatment effect (ATE) is the sum of the causal effect of the treatment τi divided by N, the number of subjects:
      $ATE=\frac{1}{N}\Sigma^N_{i=1}\tau_i$
      which is equivalent to μY(1) − μY(0), the average value of Y(1) for all subjects minus the average Y(0) for all subjects.

    2. The sample average can vary between samples and is thus characterized as a random variable.

    3. The expected value is the average outcome of a random variable.

  4. Assumptions that must be met for unbiased estimates of the ATE: random assignment (this item), plus the potential-outcomes assumptions of excludability and non-interference (item 7 below)

    1. Random assignment addresses the “missing data” problem by creating two groups of observations that are, in expectation, identical prior to the application of the treatment.

    2. Forms of Random Assignment

      1. Let N be the number of subjects, and m the number assigned to treatment.

      2. Simple random assignment: each subject is allocated to treatment group with probability m/N

      3. Complete random assignment: allocates exactly m units to treatment.

    3. Under simple or complete random assignment, treatment is independent of the subjects’ potential outcomes and thus $Y_i(0), Y_i(1), X \perp D_i$.

  5. The Mechanics of Random Assignment

    1. Determine N, the number of subjects, and m, the number of subjects assigned to treatment

    2. Set a random number seed for reproducibility

    3. Generate a random number for each subject

    4. Sort the subjects by their random numbers

    5. Assign the first m subjects to treatment.
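A minimal sketch of these five steps in Python (N, m, and the seed are illustrative):

```python
import numpy as np

# Complete random assignment: exactly m of N subjects are treated.
N, m = 100, 50                   # subjects and treatment-group size
rng = np.random.default_rng(42)  # fixed seed for reproducibility
u = rng.random(N)                # one random number per subject
order = np.argsort(u)            # sort subjects by their random numbers
treat = np.zeros(N, dtype=int)
treat[order[:m]] = 1             # first m subjects in sorted order are treated
print(treat.sum())               # exactly m
```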

  6. The Threat of Selection Bias when Random Assignment is not used

    • Selection problem: receiving treatment may be systematically related to potential outcomes (i.e. without random assignment, villages with female village heads are likely to be systematically different from those that do not have females in charge)

    • The expected difference between treated and untreated outcomes is equal to the sum of the ATE among the treated and selection bias.

    • Under random assignment, selection bias term is zero and the ATE among treated villages is the same as the ATE among all villages.

    • Without random assignment, the apparent treatment effect is a combination of the ATE and the selection bias.

  7. Two core assumptions about Potential Outcomes

    1. Excludability: To isolate the causal effect of the treatment, the definition of potential outcomes excludes all factors other than the treatment. One must therefore be able to distinguish between the treatment itself and the allocation to treatment or control. The exclusion restriction, or excludability, stipulates that treatment assignment has no effect on outcomes except through the treatment itself.
      Violated if (1) there are asymmetries in measurement (e.g. different people measuring the treatment group than the control group) or (2) random assignment sets in motion causes of Yi other than di (e.g. if the treatment is female village heads, and this signals aid organizations to divert aid toward villages with female heads, then assignment to treatment has generated a confound).
      Bolstered by: double-blindness and parallelism in the administration of the experiment

    2. Non-Interference: AKA Stable Unit Treatment Value Assumption (SUTVA). Yi(d) reflects whether the observation received treatment or not. Must be able to ignore the potential outcomes that would arise if subject i were affected by the treatment of other subjects.
      The problem here is that villages are often interdependent and people talk to each other. Treatment in one village could affect nearby villages through communication, policy diffusion, and budgetary interdependence.
      Can minimize by spreading out temporally or spatially.
      Can also design experiment to detect spillover between units.

Sampling Distributions, Statistical Inference, Hypothesis Testing, and Power

  1. Sampling Distribution: the frequency distribution of a statistic obtained from hypothetical replications of a randomized experiment. More simply, it is the collection of estimates that could have been generated by every possible random assignment.
    Under the central limit theorem, the sampling distribution of the estimated Average Treatment Effect takes the shape of a normal distribution as the number of observations in treatment and control conditions increases.
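The sampling distribution can be made concrete by re-randomizing a fixed set of potential outcomes many times and recording the estimate each time. A minimal sketch on simulated data (all parameters illustrative):

```python
import numpy as np

# Simulate the sampling distribution of the difference-in-means estimator
# by repeating the random assignment on fixed potential outcomes.
rng = np.random.default_rng(0)
N, m = 100, 50
y0 = rng.normal(0, 1, N)  # potential outcomes under control (simulated)
y1 = y0 + 0.5             # assumed constant treatment effect of 0.5

estimates = []
for _ in range(10_000):
    treat = np.zeros(N, dtype=bool)
    treat[rng.choice(N, m, replace=False)] = True
    estimates.append(y1[treat].mean() - y0[~treat].mean())

estimates = np.asarray(estimates)
# Centered on the true ATE, and approximately normal per the CLT.
print(f"mean {estimates.mean():.3f}, SE {estimates.std(ddof=1):.3f}")
```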

  2. Standard error is a measure of statistical uncertainty. The larger the standard error, the more uncertainty surrounds the parameter estimate. To minimize standard error, researchers can:

    • Assign similar numbers of subjects to treatment and control groups.

    • Limit extraneous sources of variability in outcomes.

    • Use blocking to improve precision by grouping observations with similar potential outcomes.

    • Measure potential outcomes using the difference between pre-test and post-test values of the dependent variable, as these measures tend to have lower variability.

    • Measure outcomes as accurately as possible.

  3. Hypothesis Testing: Once the data are collected, sampling variability is one of the foremost concerns in the interpretation of results. Randomized experiments generate empirical results with one of two interpretations: either the treatment exerted a causal effect or the result occurred due to sampling variability.

    • Two types of hypothesis testing: (1) the sharp null hypothesis of no effect, in which the treatment effect is zero for all subjects Yi(1) = Yi(0)∀i, and (2) the null hypothesis of no average effect, in which the average treatment effect is zero μY(1) = μY(0).

    • For small-N experiments, or when a small number of clusters is used, randomization inference (calculating p-values from an inventory of possible random assignments) may be necessary for determining p-values accurately. While it may not be necessary in larger-N studies, the approach applies exactly, without approximations or additional assumptions, to a broad array of settings: studies with a small number of clusters or “fuzzy clustering”, where robust cluster standard errors tend to be downwardly biased; low-N studies; studies in which the randomization must pass a balance test; and studies with multiple comparisons. In the case of one-sided non-compliance, the researcher may have to be content using randomization inference to assess the sharp null hypothesis that the intention-to-treat effect is zero. (A minimal sketch of randomization inference appears after these bullets.)

    • p-values: the probability of obtaining a test statistic at least as large in absolute value as the observed test statistic, given that the null hypothesis is true. It is important not to confuse statistical significance with substantive significance. The right way to think about an experimental result that is substantively significant but statistically insignificant is that it warrants further investigation: as further experiments accumulate, uncertainty gradually decreases and a clearer determination about the true value of the ATE becomes possible. Conversely, don’t be overly impressed by statistical significance in the absence of substantive significance, especially in very large studies. All else equal, the standard error shrinks in proportion to 1/√N, so substantively trivial effects can reach statistical significance in large studies.
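A minimal randomization-inference sketch under the sharp null of no effect for any subject (data simulated; the number of re-randomizations and all parameters are illustrative):

```python
import numpy as np

# Randomization inference: under the sharp null, every subject's outcome is
# unchanged by reassignment, so the null distribution of the test statistic
# comes from re-running the random assignment itself.
rng = np.random.default_rng(0)
N, m = 20, 10
y = rng.normal(0, 1, N)
treat = np.zeros(N, dtype=bool)
treat[rng.choice(N, m, replace=False)] = True
y[treat] += 0.8  # true effect built into the fake data

observed = y[treat].mean() - y[~treat].mean()

null_stats = []
for _ in range(10_000):  # sample from the set of possible assignments
    fake = np.zeros(N, dtype=bool)
    fake[rng.choice(N, m, replace=False)] = True
    null_stats.append(y[fake].mean() - y[~fake].mean())

p = np.mean(np.abs(null_stats) >= abs(observed))  # two-sided p-value
print(f"estimate {observed:.2f}, randomization p = {p:.4f}")
```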

  4. Confidence Intervals: Policy makers don’t really care whether the effect is statistically distinguishable from zero; their objective is to use the results to form a guess about the ATE. Interval estimation generates a probability statement about the range of values within which a parameter is located: over hypothetical replications of the experiment, a 95% confidence interval has a 95% chance of including the true ATE.

  5. Sampling Distributions for Experiments that Use Block or Cluster Random Assignment

    1. Blocked random assignment: subjects are partitioned into subgroups (blocks) and complete random assignment occurs within each block; this can ensure that equal numbers of members of theoretically important subgroups are assigned to each experimental condition (a minimal sketch appears after this list).

      • Advantages: Can increase precision when the blocking variables strongly influence outcomes. This reduces sampling variability. By randomizing within each block, researcher eliminates the possibility of rogue randomizations. Blocking also ensures that certain subgroups are available for separate analysis.

      • Disadvantages: Blocking rarely has negative consequences in practice, but it can lead to misanalyzed data, since the ATE must be estimated block by block and the analysis must follow the randomization procedure.

    2. Cluster random assignment: all subjects in the same cluster are placed as a group into either treatment or control conditions (i.e. a city, village, or classroom is placed into control or treatment).

      • Advantages: Sometimes clusters are the only option for applying treatments (i.e. when dealing with classroom interventions or TV commercials where the cluster is the minimum group size).

      • Disadvantages: May diminish precision when units with similar potential outcomes end up in the same cluster or when clusters of different sizes are compared.
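A minimal sketch of blocked assignment, with cluster assignment for contrast (the block variable, sizes, and cluster counts are illustrative):

```python
import numpy as np

# Blocked random assignment: complete random assignment within each block,
# so every subgroup contributes subjects to both conditions.
rng = np.random.default_rng(0)
blocks = np.repeat(["urban", "rural"], [60, 40])  # illustrative block variable
treat = np.zeros(len(blocks), dtype=int)

for b in np.unique(blocks):
    idx = np.flatnonzero(blocks == b)
    m_b = len(idx) // 2                          # treat half of each block
    treat[rng.choice(idx, m_b, replace=False)] = 1

# Cluster random assignment: randomize whole clusters, not subjects.
villages = np.arange(10)                         # 10 clusters (illustrative)
treated_villages = rng.choice(villages, 5, replace=False)
print(treat.sum(), sorted(treated_villages))
```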

  6. Power refers to the probability that the researcher will be able to reject the null hypothesis of no treatment effect.

    • Involves some guesswork: power analysis requires the researcher to supply values for unknown parameters, especially the size of the true ATE.

    • Power rises as sample size increases.

    • Power also increases with effect size, so strengthening the treatment is another way to remedy insufficient power.
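Under the usual normal approximation, power for a two-sided test of a difference in means has a closed form. A minimal sketch in which the guessed ATE (tau) and outcome standard deviation (sigma) are exactly the unknown parameters noted above:

```python
from scipy.stats import norm

# Approximate power of a two-sided test for a difference in means.
# tau and sigma must be guessed, which is the guesswork noted above.
def power(tau, sigma, n_per_arm, alpha=0.05):
    se = sigma * (2 / n_per_arm) ** 0.5   # SE of the difference in means
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(tau) / se - z)

print(f"{power(tau=0.2, sigma=1.0, n_per_arm=100):.2f}")  # ~0.29
print(f"{power(tau=0.2, sigma=1.0, n_per_arm=500):.2f}")  # ~0.89: power rises with N
```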

Using Covariates in Experimental Design and Analysis

  1. One of the advantages of randomized experiments is that they generate unbiased estimates of the average treatment effect regardless of whether the researcher accounts for other causes of the outcome. Omitted variable bias is eliminated through successful randomization. But covariates are still used for three primary reasons:

    • to rescale the dependent variable so that potential outcomes have less variance, improving the precision with which the treatment effects may be estimated.

    • regression analysis uses covariates to eliminate observed differences between treatment and control groups and to reduce the variability in outcomes. The net effect is usually an improvement in the precision with which the treatment effect is estimated. Can also be used to check for data-handling errors that potentially undermine random assignment of observations to treatment and control groups.

    • can use covariates to determine block randomization. If investigators have intuitions about which covariates influence potential outcomes, they can use covariates to form relatively homogeneous groups, or blocks, each with different expected outcomes. Randomization is conducted separately within each block.

  2. Importantly, the appropriate treatment of covariates is best determined at the design stage, before outcomes are collected, to prevent researchers from attaching post-hoc causal interpretations to covariates’ apparent “effects” or from adjusting the blocking and analysis after seeing imbalances.

  3. Using Covariates to Rescale Outcomes: Administering a pre-test may improve the precision of estimates of the causal effects of a treatment.

    • Administering a “pre-test” in which Xi, a set of covariates assumed to influence potential outcomes, is collected. A key assumption is that the Xi are fixed constants observed prior to assignment to treatment or control. Post-treatment covariates, in contrast, are measured after the experiment begins, are potentially affected by experimental assignment, and may therefore violate the fixed-X assumption.

    • Rather than using a difference-in-means estimator to determine treatment effects, one can use a difference-in-differences estimator, which analyzes change scores between the post-test outcome Yi and the pre-test values Xi (a minimal sketch appears after these bullets). Like the difference-in-means estimator, the difference-in-differences estimator produces unbiased estimates, but it typically does so with higher precision (lower variance).

    • There are trade-offs to collecting pre-test information. Budget constraints can limit how much information researchers can collect, but more importantly if pre-testing changes the way that participants respond to treatment, then things like social desirability can strongly bias the observed responses to treatment. Although pre-tests can improve precision, they are not worth conducting if the pre-test provokes different reactions in treatment and control groups.

    • Advantages: By rescaling the outcome to reflect the change from pre- to post-test, the difference-in-differences estimator produces estimates that are both unbiased and precise.

    • Disadvantages: The pre-test can interact with the treatment, for example by triggering social desirability. In these cases, the pre-test introduces systematic differences between the potential outcomes of the treatment and control groups, biasing results.
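A minimal comparison of the two estimators on simulated data in which the pre-test strongly predicts the post-test (all parameters illustrative):

```python
import numpy as np

# Difference-in-means vs. change-score (difference-in-differences) estimator.
rng = np.random.default_rng(0)
N, tau = 200, 0.5
x = rng.normal(0, 1, N)                      # pre-test measure
treat = np.zeros(N, dtype=bool)
treat[rng.choice(N, N // 2, replace=False)] = True
y = x + rng.normal(0, 0.3, N) + tau * treat  # post-test outcome

dim = y[treat].mean() - y[~treat].mean()
did = (y - x)[treat].mean() - (y - x)[~treat].mean()  # change scores
print(f"difference-in-means {dim:.2f}, difference-in-differences {did:.2f}")
# Both are unbiased across replications, but the change-score estimator has
# lower variance because (y - x) varies much less than y.
```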

  4. Adjusting for Covariates using Regression:

    • Advantages: Improves flexibility by allowing multiple covariates as right-hand-side variables, reducing disturbance variability more effectively than simply rescaling the outcome. Bias is negligible in large samples, where researchers should include any pre-treatment covariates that prior research, pilot testing, or theoretical intuition suggests will predict outcomes.

    • Disadvantages: In small samples (N < 20), controlling for covariates can introduce bias. More serious problems arise when researchers consider experimental outcomes in deciding which covariates to include: the investigator can settle on a regression model that makes the results look especially impressive or interesting, thus introducing bias.

    • Best practice: Specify covariates in a pre-analysis plan to avoid introducing bias by selecting a regression model based on results. Present regression results alongside a difference-in-means estimator to allow the audience to judge the extent to which the inclusion of covariates is consequential.

  5. Covariate Imbalance and the Detection of Administrative Errors:

    • Asymptotic properties aside, random assignment of a finite number of observations produces some degree of imbalance, or correlation between assignment to treatment and one or more covariates. By controlling for covariates, the researcher can reestablish balance.

    • So long as imbalance is solely due to random chance (as opposed to administrative error), and so long as we control for the covariate that is imbalanced, there is no reason to expect imbalance on other covariates or on unmeasured causes of the outcome variable.

    • To assess balance, conduct a regression in which covariates are used to predict assignment to treatment. If the imbalance on a certain covariate is larger than one would expect by chance, investigate the randomization procedures. If the randomization procedure is satisfactory, report the imbalance, and report the results both with and without covariate adjustments.
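A minimal balance-check sketch on simulated data, using statsmodels for the regression of assignment on covariates (covariates and sizes illustrative):

```python
import numpy as np
import statsmodels.api as sm

# Balance check: regress treatment assignment on pre-treatment covariates.
# Under clean randomization, the covariates should not jointly predict
# assignment; a suspiciously small F-test p-value warrants investigating
# the randomization procedure.
rng = np.random.default_rng(0)
N = 500
X = rng.normal(size=(N, 3))     # three pre-treatment covariates (simulated)
D = rng.binomial(1, 0.5, N)     # random assignment

fit = sm.OLS(D, sm.add_constant(X)).fit()
print(f"joint F-test p-value: {fit.f_pvalue:.3f}")
```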

  6. Blocked Randomization and Covariate Adjustment

    • Blocking can produce a relatively small gain in precision over after-the-fact regression adjustment, especially in large samples.

    • The primary advantage of blocking over regression is credibility: blocking commits the researcher, ex ante, to explicit expectations about the relationship between covariates and potential outcomes.

  7. Do not causally interpret covariates in an experimental analysis, as this involves all the threats to inference associated with observational data (and more, since the covariates lack their own theoretically-necessary controls).

Implementation Problem I: One-Sided Noncompliance

  1. Compliance: occurs when the actual treatment coincides with the assigned treatment.

  2. Noncompliance: occurs when subjects who were assigned to receive the treatment do not receive the treatment or when subjects who were assigned to the control group inadvertently receive the treatment.

  3. One-sided noncompliance: occurs when there is noncompliance in only one of the two groups (either units assigned to the treatment group do not receive treatment, or units assigned to control inadvertently receive treatment, but not both).
    The more common case in the social sciences occurs when no subject assigned to the control group is treated, but some of those assigned to treatment go untreated. For example, in door-to-door canvassing, perhaps only 25% of households assigned to the treatment group actually receive the treatment, because people aren’t home when canvassers stop by. Two naïve (and invalid) approaches:

    • (1) Ignore the noncompliance and compare the average outcomes for the full treatment group (N=1000) to the control group (N=1000), as one would under full compliance. Since only 25% of the treatment group received treatment, this method implicitly assumes that the average treatment effect is zero for the untreated portion of the treatment group, yet mathematically those subjects enter the average as if they had been treated. Assuming a non-zero true treatment effect, the inclusion of untreated units in the treatment group will bias the estimated treatment effect toward zero.

    • (2) Compare the average outcome only among subjects who actually received treatment (N=250) to the average outcome of the control group (N=1000). The subjects who are actually treated are a non-random subset of the original treatment group, and groups formed after randomization will not generally have the same expected potential outcomes as the full randomized group. Because compliance is self-selected, the remaining subsample is not representative of the whole study population, and randomization has broken down. In the canvassing example, voters who have moved from the address on the official list of registered voters will never be reached by canvassers, and their turnout rates will be very low. This biases the estimated effect upward, making canvassing appear more effective than it is (because subjects who moved away remain in the control group average but drop out of the treated-only average).

  4. Potential Outcomes and Noncompliance

    • Actual Treatment and Assigned Treatment: the potential outcome di(z) indicates whether subject i is actually treated when treatment assignment is z.

      • di(1) = 0 subject assigned to treatment does not receive treatment

      • di(1) = 1 subject assigned to treatment receives treatment

      • di(0) = 0 subject assigned to control does not receive treatment

      • di(0) = 1 subject assigned to control receives treatment (ruled out here, but enters in next section which covers two-sided noncompliance)

    • Four types of subjects:

      • Compliers are subjects whose potential outcomes meet two conditions: (1) they receive treatment if assigned to the treatment group (di(1) = 1) and (2) they do not receive treatment if assigned to the control group (di(0) = 0).

      • Never-takers are subjects for whom di(1) = 0 and di(0) = 0. These are typically the subjects that generate one-sided noncompliance in social science experiments, so they are the focus in this section.

      • Always-takers are subjects for whom di(1) = 1 and di(0) = 1. In the examples used here for one-sided non-compliance, we assume that there are no always-takers. They will be discussed more in the next section.

      • Defiers are subjects for whom di(1) = 0 and di(0) = 1. These are assholes who we hope don’t exist because in their presence, causal inference cannot be achieved. They take treatment if in control group and refuse treatment if they’re in the treated group. We generally assume they don’t exist, because this is a perverse set of potential outcomes.

  5. Defining Causal Effects for the Case of One-Sided Noncompliance: Two core assumptions are non-interference and excludability

    • The non-interference assumption for experiments that encounter noncompliance
      Assumption of non-interference consists of two parts:

      • Whether a subject is treated depends only on the subject’s own treatment group assignment. Other subjects’ assignments are assumed to have no bearing on whether one receives the treatment.

      • The potential outcomes are affected by (1) the subject’s own assignment and (2) the treatment that the subject receives as a consequence of his assignment.

      This is easily violated in experimental settings. Important to consider whether potential outcomes vary depending on how subjects are allocated to experimental groups or how treatments are actually administered.
      We may need to distinguish between the causal effect of assignment to treatment and the causal effect of having actually received the treatment. The causal effect of having been assigned to the treatment group is called the intent-to-treat effect, which measures the average effect of experimental assignment on outcomes, regardless of the fraction of the treatment group that is actually treated. In experiments with 100% compliance, the ITT is equal to the ATE.
      A researcher is often more interested in estimating the average treatment effect (ATE) than the average effect of assignment to treatment (ITT). To isolate the effect of treatment from the effect of assignment when there is non-compliance, we need another assumption: excludability.

    • The excludability assumption for one-sided noncompliance
      The excludability assumption stipulates that potential outcomes respond to treatments, not treatment assignments. Under the excludability assumption, we write the potential outcomes according to whether the subject received the treatment and disregard the assigned treatment.

  6. Average Treatment Effects, Intent-to-Treat Effects, and Complier Average Causal Effects

    • Experiments that encounter noncompliance do not generate the information necessary to identify the ATE. A more realistic goal is the Complier Average Causal Effect (CACE), which is defined as:
      $CACE \equiv \frac{\Sigma^N_{i=1}(Y_i(1)-Y_i(0))d_i(1)}{\Sigma^N_{i=1}d_i(1)}=E[Y_i(d=1)-Y_i(d=0) \mid d_i(1)=1]$
      where $E[Y_i(d=1)-Y_i(d=0)]$ is the average treatment effect and conditioning on $d_i(1)=1$ restricts it to compliers.
      We need the non-interference and exclusion restriction assumptions to hold in order to consistently estimate the CACE.
  7. Avoiding Common Mistakes

    • Interference may be reduced by keeping the density of treatment to a low level and by measuring the outcome quickly after the treatment is administered, before it has an opportunity to spread.

    • Exclusion restriction violations can be minimized through careful monitoring of experimental conditions, ensuring that the countless factors that might coincide with assigned treatment have negligible effects on potential outcomes.

  8. Statistical Inference

    • Use OLS to estimate the ITT: regressing outcomes on assignment to treatment yields the equivalent of the difference-in-means estimator.

    • Use OLS to estimate the ITTd: The proportion of compliers is estimated by calculating the ITTd. Regress actual treatment (di) on assigned treatment (zi).

    • Use 2SLS to estimate the CACE: Estimate a two-stage least squares regression in which assignment to treatment (zi) instruments for actual treatment (di). Because assignment to treatment is randomized by the researcher, we can assume that it is independent of the error term. Assignment to treatment should also strongly predict the subject actually being treated, thus allowing for consistent estimation.
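With a single binary instrument and no covariates, the 2SLS estimate reduces to the Wald ratio ITT/ITTd. A minimal sketch on simulated data with one-sided noncompliance (complier share, effect size, and N are illustrative):

```python
import numpy as np

# CACE via the Wald ratio: ITT divided by ITT_d, which equals 2SLS with
# one binary instrument and no covariates.
rng = np.random.default_rng(0)
N = 10_000
z = rng.binomial(1, 0.5, N)        # random assignment
complier = rng.random(N) < 0.25    # 25% compliers (illustrative)
d = z * complier                   # one-sided: control group never treated
y = 2.0 * d + rng.normal(0, 1, N)  # true effect of 2.0 among the treated

itt = y[z == 1].mean() - y[z == 0].mean()    # assignment -> outcomes
itt_d = d[z == 1].mean() - d[z == 0].mean()  # assignment -> treatment
print(f"ITT {itt:.2f}, ITT_d {itt_d:.2f}, CACE {itt / itt_d:.2f}")  # ~2.0
```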

  9. Designing Experiments in Anticipation of Non-Compliance

    • Noncompliance not only prevents researchers from estimating the ATE but creates problems for the CACE as well. While 2SLS provides consistent estimates of the CACE, the higher the rate of noncompliance, the lower the relative efficiency of the 2SLS estimator. High noncompliance (a small ITTd) makes assignment a weak instrument, inflating standard errors and lowering the efficiency of estimates. If high noncompliance is anticipated, the sample must be drastically increased, which is expensive. So design your experiments as well as possible to avoid problems with noncompliance.

    • Placebo designs are another way to mitigate statistical uncertainty. In the placebo design, subjects are first recruited to receive a treatment, and then, given compliance, are allocated to either a treatment group or a placebo group. In a voting participation experiment, once canvassers knock on a door and someone answers, the subject is then either allocated to a treatment group that receives information about a campaign or a placebo group that receives a placebo treatment about recycling that has nothing to do with voting and thus should not be expected to influence voting patterns. The CACE can then be estimated by comparing the outcomes between the treatment and placebo groups.

  10. Summary:

    • If you expect or experience noncompliance, the ATE cannot be estimated accurately because noncompliance prevents us from collecting the needed information. The intention-to-treat (ITT) effect can be estimated with a simple difference-of-means estimator, but it is typically of much less interest than the effect of having actually received the treatment. Under noncompliance, the Complier Average Causal Effect (CACE) is the more useful estimand.

    • The estimate of the CACE requires non-interference and exclusion restriction assumptions to be met, but then can be estimated using 2SLS, using assignment to treatment as an instrument for having received treatment. The stronger a predictor assignment to treatment is for having received treatment, the stronger it will be as an instrument and the more precise estimates of the CACE will be.

    • Placebo designs may also be used to estimate the CACE. In this case, subjects are first contacted, and once compliance is established they are randomly allocated either to a true treatment group or to a placebo group, where the placebo is not anticipated to have any effect on the outcome of interest. The difference between the treatment and placebo groups then estimates the CACE.

Implementation Problems II: Two-sided Noncompliance

  1. Two-sided noncompliance occurs when some subjects in the control group are treated and some subjects in the treatment group go untreated, that is, when there are never-takers in the treatment group and always-takers in the control group. This happens when subjects have access to the treatment and discretion about whether to take it, as in encouragement designs.

  2. As with one-sided noncompliance, simply comparing the average outcomes among those who do and do not receive treatment is a non-experimental research strategy that is prone to selection bias.

  3. Noncompliance changes the interpretation of the experimental estimates. Instead of estimating the average treatment effect, the researcher estimates the intent-to-treat effect and/or the complier average causal effect.

    • ITT estimates the effect of assignment to treatment on outcomes, ignoring the rate at which assignment results in actual treatment.

    • The complier average causal effect refers to the ATE among subjects with a particular set of potential outcomes: they are treated if assigned to the treatment group, but not if assigned to the control group. Which subjects are compliers is generally unknown under two-sided noncompliance.

  4. Interpretation of results in the presence of two-sided noncompliance is constrained in two ways:

    • Generalizations are limited by the fact that experimental estimates refer to causal effects among Compliers, not the sample as a whole.

    • Uncertainty about who the compliers are means that researchers must be extremely cautious when generalizing to other interventions, subjects, and settings.

  5. The CACE under two-sided non-compliance: when the following assumptions are met, the CACE can be estimated from the data. The estimator is the ratio of the estimated intent-to-treat effect of random assignment on outcomes to the estimated intent-to-treat effect of random assignment on actual treatment. All else equal, as the proportion of compliers increases, standard errors decline, and the estimator becomes less susceptible to bias from violations of the excludability assumption.
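
In the notation used earlier (zi for assignment, di for treatment received), the estimator just described can be written as:

```latex
\widehat{\mathrm{CACE}}
  = \frac{\widehat{\mathrm{ITT}}}{\widehat{\mathrm{ITT}}_{D}}
  = \frac{\widehat{E}[Y_i \mid z_i = 1] - \widehat{E}[Y_i \mid z_i = 0]}
         {\widehat{E}[d_i \mid z_i = 1] - \widehat{E}[d_i \mid z_i = 0]}
```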

  6. Assumptions necessary for the CACE to be consistent

    • Excludability, non-interference, and independence (described in previous sections).

    • Monotonicity: Two-sided noncompliance requires an additional assumption: monotonicity. Simply put, the monotonicity assumption requires there to be no defiers in the sample. That is, there should be no one in the sample that would take treatment if assigned to the control group but would refuse treatment if assigned to the treatment group. Since this is more perverse than always-takers’ or never-takers’ behaviors, we generally assume that there are no defiers.

Implementation Problems III: Attrition

  1. Random assignment of subjects to treatment and control groups implies that the average outcome in the treatment group is an unbiased estimator of the average Yi(1) in the subject pool, and the average outcome in the control group is an unbiased estimator of the average Yi(0) in the subject pool. If there are no problems in the randomization procedure, the expected value of selection bias is zero, and a simple difference-in-means test is an unbiased estimator of the treatment effect. This all rests on the crucial assumption that the researcher observes outcomes for all experimental subjects, which is violated in the case of attrition.

  2. Attrition occurs when outcome data for some subjects are missing. If this happens at random, attrition is likely to somewhat decrease the precision of estimates (due to lower N and measurement error in the dependent variable). If attrition is systematic, it can pose a serious threat to unbiased inference. When attrition is systematically related to potential outcomes, removing observations from the data set means that the remaining subjects assigned to control or treatment groups no longer constitute random samples. A difference-in-means test is no longer an unbiased estimator due to selection bias.

  3. Reasons for attrition:

    • Subjects may refuse to cooperate with researchers. (They may refuse to fill out a post-treatment questionnaire.)

    • Researchers may lose track of experimental subjects. (Especially in longer-term field experiments, subjects may move or die.)

    • Firms, organizations, or government agencies block researchers’ access to outcomes. (Especially common for sensitive issue areas like corruption.)

    • The outcome variable may be intrinsically unavailable for some subjects. (For example, post-intervention wages are unobservable for subjects who remain unemployed.)

    • Researchers deliberately discard observations. (Excluding from analysis those who failed attention checks or seemed not to be taking the experiment seriously.)

  4. Conditions under which attrition leads to bias

    • “Missingness” is itself a potential outcome: whether a subject’s outcome is reported may depend on the experimental group to which the subject was assigned.

    • Assuming that assignment to treatment is the same as receiving treatment for all subjects, and that reported outcomes are missing for some subjects, we can code the presence or absence of a recorded outcome as a dummy variable (1 for a recorded outcome, 0 for none). We can then examine whether the rate of recorded outcomes differs between the control and treatment groups (a sketch of this check follows the next bullet). If there is a difference between the groups, or if we have strong theoretical reasons to believe that the treatment or control condition would lead to missingness, then we should assume that attrition is non-random and that a difference-in-means test between the two groups on the outcome of interest will suffer from selection bias.

    • Attrition is less of a problem if the data are missing independent of potential outcomes (MIPO). This independence condition implies that whether a subject’s outcomes are missing has no correlation with Yi(0) or Yi(1); missingness is effectively random. We can sometimes induce conditional MIPO (MIPO|X) by conditioning on covariates, but this requires strong theoretical and empirical justification.
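
A minimal sketch of the reporting-rate check described above, assuming hypothetical 0/1 NumPy arrays z (assignment) and r (1 if the subject's outcome was recorded):

```python
import numpy as np

def attrition_gap(z: np.ndarray, r: np.ndarray) -> float:
    """Share of recorded outcomes in treatment minus control; a large gap
    is evidence that missingness is related to treatment assignment."""
    return r[z == 1].mean() - r[z == 0].mean()
```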

  5. Refining the estimand when attrition is not a function of treatment assignment

  6. Placing bounds on the ATE: Inserting extreme values in place of missing values is one approach, although when missingness is high, these bounds can be uncomfortably wide (a sketch follows). One can also bound the ATE of a subgroup, which involves trimming outcomes at the top or bottom of the outcome distribution for the experimental group with the lower rate of missingness; this makes somewhat stronger assumptions and tends to produce more informative bounds. It also leads to somewhat uncertain interpretation, as inference rests on the validity of the monotonicity assumption, and how far one can generalize from the estimated ATE is unclear.
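
A minimal sketch of extreme-value bounds, assuming a hypothetical outcome bounded on [lo, hi], with missing values coded as np.nan:

```python
import numpy as np

def extreme_value_bounds(z, y, lo=0.0, hi=1.0):
    """Bracket the ATE by filling missing outcomes with best/worst cases."""
    y1, y0 = y[z == 1], y[z == 0]
    # Upper bound: missing treated outcomes set to hi, missing control to lo.
    upper = np.nan_to_num(y1, nan=hi).mean() - np.nan_to_num(y0, nan=lo).mean()
    # Lower bound: the reverse substitution.
    lower = np.nan_to_num(y1, nan=lo).mean() - np.nan_to_num(y0, nan=hi).mean()
    return lower, upper
```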

  7. Attrition sucks, so be smart about experimental design so that you minimize it.

    • Follow-up sampling to measure outcomes on subjects who go missing after the first round.

    • Keep good records.

    • Gather several outcome measures, each from a different source, so that missing data on one outcome can be imputed using information from other parallel records.

    • If the outcomes are measured a long time after the intervention, take a baseline, at least one midline, and an endline measure.

    • Start an experimental project by first considering the availability of outcome measures.

Implementation Problems IV: Interference between Experimental Units

  1. The Stable Unit Treatment Value Assumption (SUTVA) is a crucial assumption for causal inference. It requires there to be no interference between treated units—a subject’s potential outcomes should respond to that subject’s treatment alone. Others’ exposure to treatment should not affect the potential outcomes of a subject. Non-interference is one of the core assumptions needed to establish the unbiasedness of the difference-in-means estimator.

  2. Interference between experimental units is a violation of SUTVA.

  3. Examples of social phenomena that cause the treatment of one unit to have repercussions for other units include:

    • Contagion: The effect of being vaccinated on one’s probability of contracting an endemic disease depends on whether others are vaccinated. The causal effect of vaccination is likely to be small if one is surrounded by others who are vaccinated, large if surrounded by unvaccinated.

    • Displacement: Police interventions designed to suppress crime in one location may displace criminal activity to nearby locations. Coca eradication in Colombia affects coca production in Peru; a comparison between “treatment” in Colombia and “control” in Peru will suggest that the intervention in Colombia was far more successful than it really was.

    • Communication: Interventions that convey information about political causes may spread from individuals who receive the treatment to those who are nominally untreated. Because members of the control group have received the treatment, the impact of the treatment will appear to be smaller.

    • Social Comparison: An intervention that offers housing assistance to a treatment group may change the way in which those in the control group evaluate their own housing conditions. Potential outcomes in the control group decrease when the treatment group is treated.

    • Deterrence: News of anti-corruption audits in treated localities may spread to untreated localities, which then respond to the treatment.

    • Persistence and memory: Within-subject experiments (time series) measure the effects of introducing a stimulus on outcomes in subsequent time periods. If a subject recalls past interventions, this may result in a different response to later interventions.

  4. It is crucial to think of ways that SUTVA may be violated and design interventions to minimize spillover.

  5. Identifying the Causal Effects in the Presence of Localized Spillover: We can model some spillovers using multilevel designs. Assume 2 voters in each household. You send a mailer to either 0, 1, or 2 voters in the household. We can investigate the potential outcomes for each of the four conditions: Y00 if no mailer is sent to the household, Y01 if a mailer is sent to the subject but not her roommate, Y10 if a mailer is sent to the roommate but not the subject, and Y11 if a mailer is sent to both individuals. We can then estimate the direct and spillover effects of the intervention (sketched below). This logic extends to schools, workplaces, villages, etc. Non-interference is not violated by localized spillovers if spillover is adequately incorporated into the potential outcomes in the experimental design.
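
A minimal sketch of the two-voter household design, assuming a hypothetical array cond holding each subject's condition label ("00", "01", "10", "11", in the notation above) and an outcome array y:

```python
import numpy as np

def household_effects(cond: np.ndarray, y: np.ndarray) -> dict:
    """Direct and spillover effects relative to the untreated baseline Y00."""
    mean = lambda label: y[cond == label].mean()
    base = mean("00")                     # no mailer sent to the household
    return {
        "direct": mean("01") - base,      # mailer to the subject only
        "spillover": mean("10") - base,   # mailer to the roommate only
        "both": mean("11") - base,        # mailer to both voters
    }
```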

  6. Spatial Spillover: Adjacent schools, villages, neighborhoods, etc. may also be subject to spillovers; spillovers are not necessarily confined to a localized space. The canonical example is health interventions designed to prevent the transmission of disease from one person to another. The intervention is directed at randomly selected locations (e.g., villages), and researchers look not only at the direct effects of the treatment but also at whether untreated units are affected by their proximity to the nearest treated unit or by the number of treated units within a certain radius. This can be addressed by modeling the spatial spillover of treatments: a difference in means weighted by each unit's probability of exposure to spillovers mitigates (though does not necessarily eliminate) bias from spillovers (a sketch follows). Even when we correctly specify the distance that spillover effects travel, our estimates may be severely biased if we fail to take into account the probability that each unit is exposed to spillovers.
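
A minimal sketch of that weighting idea, under an entirely hypothetical exposure mapping (units arranged on a ring, spillover defined as an untreated unit having a treated neighbor). The randomization procedure itself is simulated to recover each unit's probability of spillover exposure, and those probabilities supply inverse-probability weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_treated, sims = 100, 20, 10_000

def draw_treated():
    t = np.zeros(n, dtype=bool)
    t[rng.choice(n, size=n_treated, replace=False)] = True
    return t

def exposed(t):
    """Untreated units with at least one treated neighbor (hypothetical)."""
    return ~t & (np.roll(t, 1) | np.roll(t, -1))

# Simulate the randomization to estimate P(exposed | untreated) per unit.
times_untreated, times_exposed = np.zeros(n), np.zeros(n)
for _ in range(sims):
    t = draw_treated()
    times_untreated += ~t
    times_exposed += exposed(t)
p_exp = times_exposed / times_untreated

# For one realized assignment, compare IPW means among untreated units.
t = draw_treated()
e, u = exposed(t), ~t
y = rng.normal(size=n)                          # placeholder outcome
w = np.where(e, 1 / p_exp, 1 / (1 - p_exp))[u]  # inverse-probability weights
spill = (np.average(y[u][e[u]], weights=w[e[u]])
         - np.average(y[u][~e[u]], weights=w[~e[u]]))
```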

  7. Within-Subjects Design and Time-Series Experiments: Random assignment in a within-subject design refers to when a treatment is administered. The allure of within-subject designs is their capacity to generate precise treatment effect estimates with a single subject. Entities are compared to themselves, which means that background attributes hold constant. It is important to specify a random point at which the intervention occurs in order to rule out the possibility that the timing of the intervention is systematically related to potential outcomes in different periods. The more time periods and interventions, the less likely it is that a randomly assigned pattern of treatment assignment will coincide with over-time trends in potential outcomes. Random assignment does little to shore up the shakiest assumption: the stipulation that treatments have no anticipatory or persistent effects. It is important to theoretically and empirically justify any assumption that a treatment effect will dissipate over time and to incorporate this adequately into the experimental design.

  8. Waitlist Designs (AKA Stepped-Wedge Designs): Waitlist designs play a valuable scientific and diplomatic role. On the scientific side is their ability to track treatment effects among several subjects as they play out over time. On the diplomatic side, these sidestep the problem of withholding treatment from a control group, which development practitioners get twitchy about. Phased designs combine the benefits of within-subject time-series and between-subject cross-sectional designs.

  9. Summary

    • Non-interference requires researchers to specify how potential outcomes respond to all possible random assignments and, in turn, how treatment effects will be defined. There is value in carefully defining the estimand and assessing whether the experimental design identifies the parameter of interest.

    • In order to address concerns about interference empirically, researchers turn to experimental designs that relax the assumption that subject i is unaffected by whether others are treated. These designs share a common feature: they randomly assign varying degrees of secondhand exposure and model it explicitly. In the face of potential interference, the challenge is to craft more flexible designs that are less reliant on potentially fallible modeling assumptions.

    • Statistically modeling potential outcomes when subjects are clustered geographically or along other dimensions such as social network proximity requires careful theoretical and empirical work. Exposure to spillovers tends to vary, even when subjects are randomly assigned to treatment. If subjects have different probabilities of exposure to spillovers and these probabilities are related to potential outcomes, data analysis requires special care. Comparing average outcomes among those exposed to spillovers and those not exposed is prone to bias. A better approach is to use the randomization procedure to simulate the probabilities of exposure to spillovers and then to compare weighted means.

    • Spillover isn’t necessarily bad. Secondhand influence has enormous practical implications! If the effects of a treatment are found to resonate through social networks, their cumulative effects may be many times greater than their effects on those who receive treatment directly.

    • In the presence of spillovers, redefine the estimand: for example, the average potential outcome if a precinct were treated with heightened police patrols minus the average potential outcome if only neighboring precincts were treated. Redefining the estimand in this way means that interference is no longer a source of bias. The experiment then provides an instructive answer to the question of how your crime rate is likely to differ from your neighbors’ if you receive the treatment and they do not.

Heterogeneous Treatment Effects

Rarely is it plausible to assume that every observation responds to an intervention in the same way. Heterogeneous treatment effects allow us to move away from simple average effects in order to investigate variability in treatment effects. This allows us to understand which individuals will be most responsive and under what conditions. This gives insight into why a treatment does or does not work.

Analysis of how treatment effects vary across different values of the covariates is called treatment-by-covariate interaction analysis, or subgroup analysis.

Can also vary treatments and the experimental context in which they are deployed or received.

  1. Limits to what experimental data tell us about treatment effect heterogeneity: In an experiment, m subjects are assigned to treatment and, for each subject, the treatment effect is defined as the difference between the treated and untreated potential outcomes. Treatment effect heterogeneity refers to the variance of the treatment effect across subjects. If we find evidence of treatment effect heterogeneity, our next step is to investigate the conditions under which treatment effects are large or small.

  2. Bounding Var(τi) and Testing for Heterogeneity: We can estimate bounds for Var(τi) by calculating the largest and smallest covariances implied by the data. Another strategy is to test the null hypothesis that Var(τi) = 0 by comparing the observed variances of Yi(1) and Yi(0): this null implies that Var(Yi(0)) = Var(Yi(1)), so observing markedly different variances suggests that treatment effects are heterogeneous (a sketch of this test follows).
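
A minimal sketch of that test, assuming hypothetical arrays z and y. The constant-effects null is imposed by subtracting the estimated ATE from treated outcomes, and the variance gap is compared against its randomization distribution:

```python
import numpy as np

def var_gap(z, y):
    return y[z == 1].var(ddof=1) - y[z == 0].var(ddof=1)

def heterogeneity_pvalue(z, y, sims=2_000, seed=0):
    """Randomization p-value for the null Var(tau_i) = 0 (constant effect)."""
    rng = np.random.default_rng(seed)
    ate = y[z == 1].mean() - y[z == 0].mean()
    y0 = np.where(z == 1, y - ate, y)   # implied Y_i(0) under the null
    obs = abs(var_gap(z, y))
    draws = [abs(var_gap(rng.permutation(z), y0)) for _ in range(sims)]
    return float(np.mean([d >= obs for d in draws]))
```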

  3. Two approaches to the exploration of heterogeneity: covariates and design: If a researcher’s theoretical intuitions or preliminary hypothesis tests suggest the presence of heterogeneous effects, then it may be important to examine the conditions under which the treatment effect varies. There are two ways to do this. The first is treatment by covariate interactions, or variation in ATEs from subgroup to subgroup. The second is introducing additional interventions in order to assess treatment-by-treatment interactions, or variation in ATEs across other randomly assigned treatment conditions.

    • Assessing treatment-by-covariate interactions: The ATE within a subgroup is called the conditional average treatment effect (CATE). When researchers speak of interaction effects between a treatment and one or more covariates, they are referring to the difference between CATEs. We can then use F-tests to determine if the difference between two subgroups is likely to have occurred by chance.

    • Caution is required when interpreting treatment-by-covariate interactions: Even if there were in fact no interactions, the probability that at least one estimated interaction proves significant rises as the number of tests increases. We can use the Bonferroni correction to address the multiple comparison problem, which divides the level of statistical significance by the number of tests run. Another way is to assess the joint significance of all interactions considered together using the F-test to compare nested models.

    • Assessing treatment-by-treatment interactions: The basic limitations of subgroup analysis can be overcome with more elaborate experimental designs that manipulate both the treatment and the personal or contextual factors thought to affect the size of the treatment effect. Given two or more factors, each with two or more experimental conditions, a factorial design allocates subjects at random to every combination of experimental conditions. Factorial design allows researchers to study the way in which the treatment effect of one variable changes depending on the levels of other randomly assigned factors.

  4. Using Regression to Model Treatment Effect Heterogeneity:
    Treatment-by-covariate interactions: When assessing treatment-by-covariate interactions, researchers find themselves on the border between experimental and nonexperimental research. The experimental treatments are randomly assigned, but the covariates with which they interact are not. When CATEs are found to vary depending on the value of a covariate, interpretation remains ambiguous. Treatment-by-covariate interactions may provide useful descriptive information about which types of subjects are most responsive to treatment, but the theoretical question of whether these interactions are causal requires an experimental design that randomly varies what are believed to be the relevant subject attributes or contextual characteristics.
    Treatment-by-treatment interactions: Multi-factor experiments (factorial designs) have the potential to shed light on practical questions like “Which combinations of treatments are most effective?” and on theoretical questions like “Under what conditions are treatment effects large or small?”

  5. Automating the Search for Interactions: Regression is a useful device for estimating interactions–assuming we have a model in mind as we approach the data. In principle, regression could be used to examine interactions between several treatments and covariates, but the more discretion researchers have when adding or dropping variables, the farther they drift from generating results that are reproducible or have known sampling distributions. Two options: First option is to keep things simple–don’t interact more than two things, and only interact things that are theoretically important. One can go a long way interacting one or two substantively interesting things, showing them to be robust and reproducible. Second option is to automate the search using machine learning. Whether the interactions the computer finds are likely to be confirmed by follow-up experiments, and whether the computer is any better at identifying plausible interactions than a researcher, remains an open question. Mike absolutely hates this kind of shit and thinks you should preregister your hypotheses, including subgroup analyses. Ideally, modeling decisions are guided by a planning document that identifies ex ante which interactions are to be tested.

Mediation

  1. Some of the most important discoveries have involved intervening or mediating variables that transmit the influence of an experimental intervention. One of the most famous examples is the discovery that feeding limes to sailors prevents scurvy. Years later, it was discovered that it was specifically the vitamin C in limes that prevented scurvy. We can think of limes as the treatment, vitamin C as the mediator, and getting scurvy or not as the outcome. When an experiment indicates that a treatment influences an outcome, researchers immediately express curiosity about the channels through which the experimental treatment transmits its influence.

  2. Mediation analysis starts with an average causal effect of a treatment Zi on an outcome Yi. The researcher endeavors to determine whether Zi induced a change in a mediating variable Mi, and whether a Zi induced change in Mi produced a change in Yi. The success of this effort is generally judged by whether the mediators account for all of the influence that Zi exerts on Yi. To return to the previous example, the researcher should demonstrate that limes have no effect on scurvy if they do not contain vitamin C. As the example suggests, this is difficult to do and requires some strong assumptions.

  3. A different approach is implicit mediation analysis. Instead of attempting to estimate the channels through which Zi transmits its influence using a statistical model, implicit mediation analysis takes a design-based approach. The researcher conducts an experiment with an array of treatments in order to investigate how adding different ingredients to Zi, or subtracting them from it, alters its effects. This sheds light on causal mechanisms and aids in the search for especially effective interventions.

  4. Regression-based approaches to mediation: A mediating variable is caused by an intended treatment (Zi) and in turn causes the outcome (Yi). In other words, the assigned treatment (Zi) affects the mediator (Mi), and either or both of Zi and Mi affect Yi. Note that regression-based approaches assume constant treatment effects and unrelated disturbances. Without these assumptions, regression will generate biased estimates when both Zi and Mi appear on the right-hand side, typically exaggerating the effect of the mediator and understating the direct effect of the treatment (a small simulation follows). The inclusion of pre-treatment variables as regressors does not bias results, but the inclusion of post-treatment variables (“bad controls”) is likely to bias the estimation of the treatment’s direct effect.
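
A minimal simulation of the “bad controls” problem just described, with entirely hypothetical parameters: an unobserved factor u moves both the mediator m and the outcome y, so regressing y on (z, m) inflates the mediator's coefficient and understates the direct effect of z:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.integers(0, 2, n)                        # randomized treatment
u = rng.normal(size=n)                           # unobserved confounder
m = 1.0 * z + u + rng.normal(size=n)             # mediator moved by z and u
y = 0.5 * z + 1.0 * m + u + rng.normal(size=n)   # true direct effect = 0.5

X = np.column_stack([np.ones(n), z, m])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[1], b[2])  # z's coefficient shrinks toward 0; m's inflates toward 1.5
```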

  5. Why experimental analysis of mediators is challenging: When mediators are manipulated experimentally, prospects for sound inference improve, but basic problems remain due to the impossibility of observing complex potential outcomes. In practice, researchers rarely have the luxury of manipulating mediators directly, which means they must resort to encouragement designs. These are vulnerable to bias through violations of the exclusion restriction, as may occur when multiple mediators link cause and effect.

  6. Ruling out mediators?: One option is to regard mediators as outcome variables, paying special attention to the question of when potential mediators may be dropped from consideration on the grounds that they appear to be unaffected by the randomly assigned treatment.

  7. Implicit mediation analysis: Another option is to vary the treatments in theoretically guided ways so as to manipulate the mediators implicitly. This focuses on experimental comparisons, thus minimizing the risk of bias, although when a series of interventions suggests several mediators, differing opinions about what the causal mechanisms actually are can remain a problem for political science research.

Instructive Examples of Experimental Design

  1. Using Experimental Design to Distinguish between competing theories: (Ashraf et al 2010)

    • Which is more effective: giving away water disinfectant, bed nets, medicines, etc. in developing countries, or making people pay for them? Two arguments: (1) products that are distributed for free are perceived to have little value and go unused, whereas products that people pay for are valued more and used more; (2) charging denies access to the poorest households. The really difficult question lurking within the first argument is whether people appear to use things more when they pay for them (a) because people who value something more are willing to pay more for it, or (b) because the very act of paying more for something makes them value it more.

    • Sunk-cost effect versus screening hypothesis:

      • Sunk cost: The act of having paid for something makes a person use it more. The more you pay, the more you value it, the more you use it.

      • Screening: Willingness to pay is proportionate to an individual’s valuation of a good. People who are willing to pay for something value it and will use it. People who are willing to pay more value it more and will use it more.

    • Randomly offering water disinfectant at a low or high price and seeing which households buy and use it is not enough to distinguish between the sunk-cost and screening hypotheses. The solution to distinguishing between the two theories lies in disentangling willingness to pay (the offer price) from the actual price paid (the transaction price).

    • Three treatments using offer price and transaction price. Price serves two distinct roles: it determines whether a purchase occurs and how much usage occurs given a purchase. Experimenter asks if a household would be willing to buy the product at the offer price and then reveals a transaction price, which may be the same or lower. Each household is randomly assigned an offer price and a transaction price.

    • Where p′ is the high price and p is the low price, Qi(x) is the quantity used given transaction price x, and Yi(x) is the decision whether or not to purchase at price x. There are three treatments under the 2x2 design (one cell is omitted because it gives no information needed to discriminate between the hypotheses and is unrealistic), and three potential outcomes, shown below:

      • Offer price (high), transaction price (high) E[Qi(p′)|Yi(p′)=1]

      • Offer price (high), transaction price (low) E[Qi(p)|Yi(p′)=1]

      • Offer price (low), transaction price (low) E[Qi(p)|Yi(p)=1]

      • OMITTED FROM 2x2, BECAUSE UNNECESSARY: Offer price (low), transaction price (high)

    • The screening hypothesis can be tested by assessing whether usage increases when the offer price rises, holding constant the transaction price.

    • The sunk-cost hypothesis can be tested by assessing whether usage increases when the transaction price rises, holding constant the offer price. (Both contrasts are written out after the next bullet.)

    • Ensure that the exclusion restriction holds using a follow-up question to make sure that subjects remember paying the transaction price rather than the offer price. The researchers also took steps to reaffirm the transaction price with customers in order to increase its cognitive salience.
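
In the notation of the cells above, the two tests reduce to contrasts that each hold one price fixed:

```latex
% Screening: raise the offer price, holding the transaction price at p
E[Q_i(p) \mid Y_i(p') = 1] - E[Q_i(p) \mid Y_i(p) = 1]

% Sunk cost: raise the transaction price, holding the offer price at p'
E[Q_i(p') \mid Y_i(p') = 1] - E[Q_i(p) \mid Y_i(p') = 1]
```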

  2. Oversampling subjects based on their anticipated response to treatment: (Slemrod et al 2001)

    • The Minnesota Department of Revenue wanted to increase tax compliance. Taxpayers with business and farm income have more flexibility in what they report to tax collection agencies than citizens with wage-only income, which is reported by employers directly to the government. Because of this increased discretion, business and farm owners were labelled high-opportunity taxpayers, reflecting their higher rates of non-compliance.

    • Treatment: official letter warning taxpayers of a possible audit

    • Baseline: prior year reported income in tax documents

    • Endline Outcome: year-over-year change in reported income

    • Population and sample:

      • Six groups in two dimensions (2x3 design): Low, medium, and high income; low and high opportunity (where “high opportunity” describes taxpayers who can relatively easily conceal sources of income from tax collection agencies).

      • If interested in the ATE, would treat similar numbers in each group.

      • Concerns about (1) the cost of the experiment and (2) non-interference place limitations on how many mailers could be sent out and how many audits could actually be conducted (we should expect spillovers to the control group if a high percentage of the population were told to expect an audit).

      • More interested in the CATE for the “high-opportunity” and “high-income” subgroups. Simple randomization across households would yield very few observations in the high-opportunity, high-income subgroup. So, oversample the high-high subgroup into treatment, estimate the CATE, and use weighted regression to estimate the ATE.

    • Analysis: Because the sample was weighted, weight the subgroups when estimating the population ATE. Because they were testing the CATE for six groups (six hypotheses), the probability of at least one false positive is $1-0.95^6 \approx 26\%$, so they used a Bonferroni correction, where a significant p-value requires $p<\frac{0.05}{6}\approx 0.008$. (The arithmetic is checked after the next bullet.)

    • Given that there may be some large outliers, the difference-in-means estimator may not have a normal distribution, and the p-value might be meaningless.
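
An arithmetic check of the multiple-comparisons numbers above, plus the reweighting idea; the subgroup shares and CATEs in the last two lines are purely hypothetical:

```python
import numpy as np

alpha, k = 0.05, 6
print(1 - (1 - alpha) ** k)   # ~0.265: chance of at least one false positive
print(alpha / k)              # ~0.0083: Bonferroni-corrected threshold

share = np.array([0.30, 0.25, 0.20, 0.15, 0.07, 0.03])  # hypothetical shares
cate = np.array([0.10, 0.20, 0.00, 0.40, 0.60, 0.90])   # hypothetical CATEs
print(share @ cate)           # population ATE as a share-weighted CATE average
```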

  3. Comprehensive measurement of outcomes: (Simester et al 2009)

    • If you’re interested in a relatively broad outcome (healthy behaviors, civic participation, etc.), then limiting your measurement to a single outcome might be misleading. A low-calorie diet might make folks exercise less; information about local government might make folks more likely to attend a town hall but less likely to vote.

    • Measure an array of conceptually-relevant outcomes over time.

    • Customers of a clothing retailer were randomly assigned to high and low advertising treatments, where the low group received fewer mailed catalogues over the course of the year than the high advertising group. The high advertising group purchased more from the catalogues in the immediate term. But, because the researchers collected data on several outcomes, they discovered that the high advertising group purchased less in the subsequent 8 months and purchased less through the company’s website than the low advertising group. What appeared to be a good strategy based on one outcome was shown to be less profitable when data were collected on several outcomes.

  4. Factorial Design and Special Cases of Non-Interference: (Bertrand and Mullainathan 2004)

    • What is the effect of race on employment prospects?

    • Audit study, but using emailed resumes instead of confederates (decreases potential confounds due to differences between people). Vary (1) the perceived race of candidates’ names (use demographic information and pre-tests to determine names that sound “most black” or “most white”), (2) resume quality. Varying applicant quality independent of race allows a test of whether the effect of race depends on an applicant’s qualifications. If employers are reluctant to hire blacks because they perceive them to be unqualified, highly-qualified resumes should overcome that barrier. Additionally, this allows the researchers to compare substantive effects of applicant race with those of applicant quality.

    • Uses 2x2 factorial design with white and black names; low and high quality with approximately equal numbers of employers assigned to each of the four treatment groups. The factors are by design uncorrelated, allowing for simple difference-in-means comparisons.

    • The difference-in-means estimator has a slightly different interpretation here: the difference in interview rates for white and black applicants represents the estimated effect of race averaged over the distribution of resume quality. The weights were approximately half and half here, but the researchers could have weighted them differently had they a reason to do so.

    • The really cool part about this experiment was that the researchers sent one black and one white resume to each employer, thus allowing the researchers to observe what is damn close to both potential outcomes for each treated unit. (To do this well, would need to have a couple of similarly-qualified resumes to randomly allocate to black and white candidates so that firms would never receive identical resumes with different names.) This all hinges on a strong no-interference assumption: that no matter whether a firm is sent one resume or four, its potential outcomes Yi(d) remain the same. Because there are differences between the resumes, they are not actually observing Yi(0) and Yi(1) for all subjects, but they’re getting a lot closer than we usually do.

Writing a Proposal

  1. Spell out the research hypothesis: describe in a sentence the causal parameter that you intend to estimate. Explain whether and why you expect the causal parameter to have a particular sign or magnitude. If you anticipate heterogeneous treatment effects, indicate which subgroup(s) you expect to show particularly large or small effects. Pre-registering these keeps allegations of data-mining at bay. Don’t go crazy with it, because the number of planned comparisons determines the Bonferroni correction for multiple comparisons. Specifying the research hypothesis forces the researcher to be clear about what is being tested and what the outcomes are.

  2. Describe the treatment in detail: A clear description of the treatment allows the reader to interpret results and allows other researchers to replicate your experiment. Describe the circumstances in which it was employed, who administered the treatment and how, information about canvassers (local or not, age, education, gender, ethnicity…), when the experiment occurred, the issues at stake, the information given, etc.

  3. Describe the criteria by which subjects were included in the experiment: Explain sample restrictions: how subjects came to be included in experiment, criteria that determined eligibility, etc. Describe both population and sampling method.

  4. Explain how subjects were randomly assigned to experimental groups: Fully describe randomization procedure and leave the random numbers in the dataset for later verification. Include statistical code used for generation and the random number seed in the proposal so that the process is automated and reproducible. Discarding bad randomizations in order to improve precision is acceptable but must be justified. When estimating ATE, it may be necessary to weight data if screening caused subjects to have different probabilities of assignment to treatment.

    • Simple or complete random assignment: all subjects have equal probability of being assigned to treatment or control.

    • Blocked assignment: observations are first divided into distinct strata. Within strata, subjects have equal probability of assignment to treatment and control, but this probability may vary between strata. In effect, each block is its own experiment.

    • Clustered assignment: Not assigned as individuals, but as groups. For example, an education experiment may randomly assign classrooms or schools to treatment or control.

    • Must account for blocking or clustering when estimating the ATE; otherwise, inferences will be biased. (A minimal sketch of the three assignment schemes follows.)
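
A minimal sketch of the three assignment schemes, assuming hypothetical NumPy arrays of stratum and cluster labels:

```python
import numpy as np

rng = np.random.default_rng(0)

def complete_assignment(n, n_treated):
    """Complete random assignment: exactly n_treated of n units treated."""
    z = np.zeros(n, dtype=int)
    z[rng.choice(n, size=n_treated, replace=False)] = 1
    return z

def blocked_assignment(strata, share=0.5):
    """Complete random assignment within each stratum (block)."""
    z = np.zeros(len(strata), dtype=int)
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        k = int(round(share * len(idx)))
        z[rng.choice(idx, size=k, replace=False)] = 1
    return z

def clustered_assignment(clusters, n_treated_clusters):
    """Assign whole clusters (e.g., classrooms) to treatment together."""
    ids = np.unique(clusters)
    chosen = rng.choice(ids, size=n_treated_clusters, replace=False)
    return np.isin(clusters, chosen).astype(int)
```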

  5. Summarize the experimental design: Provide tables or figures with descriptive statistics about assignment to experimental conditions: how many subjects were assigned to each condition, tables for each block, and the distribution of subjects across clusters (average cluster size, standard deviation), since variation in cluster size affects sampling variability and can undermine the unbiasedness of difference-in-means estimates.

  6. Check the soundness of the randomization procedure: In expectation, treatment and control groups should have similar background characteristics, or covariate balance. Check whether covariate balance is in line with what one should expect given the use of random assignment. If imbalance is found before the experiment is administered, try re-randomizing. If it is found after, it may be necessary to introduce controls for unbalanced covariates and display results with and without controls. The magnitude of imbalance can be determined using descriptive statistics (for simple or clustered randomization) or weighted means regression (for block randomization). Consider the likelihood that the imbalance would occur by chance (1 in 20 covariates will be statistically significant in expectation by chance alone), but also consider whether it is the result of a flaw in the randomization procedure.

  7. Describe the outcome measures: describe outcomes and manner in which each is measured (describe the items, their correlations with each other, and their correlations with background factors that are expected to predict your outcome of interest).

    • Attrition: Describe how many subjects in each condition have missing outcomes.

    • Minimize measurement asymmetries: indicate whether administrators are blind to the subject’s condition, if there’s anything about question wording, interviewers, or context that might encourage the treatment group to give different responses. Are the interviewers connected in any way to the administration of the treatment itself?

    • Noncompliance: Define compliance and how compliance is measured.

  8. Describe how you plan to analyze the data: Analytic plans help limit the scope of discretion and are helpful in situations where the experimental results are ambiguous. When following a plan, the analyst cannot pick and choose results depending on whether they “look good” or generate statistically significant results. Another nice benefit is that everything is ready to go once you have observations. A simple analytic plan could include:

    1. graph the distribution of outcomes for each group

    2. compute average outcomes and standard deviations for each group

    3. compute the average treatment effect and its standard error

    4. use regression to estimate the ATE after controlling for specific covariates

    5. use randomization inference to test the sharp null of no treatment effect (sketched after this list)

    6. if the outcome is continuous, compare variances across experimental groups or use nonparametric bounds (ch. 9) to assess whether treatment effects are heterogeneous

    7. test interactions between treatment and specific covariates
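
A minimal sketch of step 5, assuming complete random assignment and hypothetical arrays z (assignment) and y (outcome):

```python
import numpy as np

def ri_pvalue(z, y, sims=10_000, seed=0):
    """Two-sided randomization p-value for the sharp null of no effect."""
    rng = np.random.default_rng(seed)
    obs = abs(y[z == 1].mean() - y[z == 0].mean())
    draws = np.empty(sims)
    for i in range(sims):
        zs = rng.permutation(z)              # simulate a fresh assignment
        draws[i] = abs(y[zs == 1].mean() - y[zs == 0].mean())
    return float((draws >= obs).mean())
```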

  9. Archive your data and experimental materials: Create a physical or electronic archive of your experimental materials – all scripts, messages, mailings, etc. for both control and treatment groups. Gather lists of contacted individuals and respondents, etc. Keep it. Anything with information about your subjects should typically be anonymized if made public.

  10. Register your experiment: submit your research proposal so that it becomes a part of a permanent public record. This helps to combat the problems of publication bias, as significant findings are more likely than null results to be published. Provides a public good of showing what did not make it to publication. Will hopefully eventually become the norm.

Protection of Human Subjects

  1. Avoid assigning subjects to experimental conditions that you expect will hurt them

  2. Exposing subjects to significant risk of harm requires their informed consent

  3. Take precautions to protect anonymity and confidentiality

  4. Confer with your IRB as you plan your research

Internal and External Validity of Online Surveys and Student Samples

Student Samples

Druckman and Kam

  • Concerns about the sample come down more to a theoretical than an empirical issue.

    • First, we urge researchers to attend more to the potential moderating effects of the other dimensions of generalizability: context, time, and conceptualization.

    • Second, we encourage the use of dual samples of students and non-students. The discovery of differences should lead to serious consideration of what drives distinctions (i.e., what is the underlying moderating dynamic and can it be modeled?).

    • Third, we hope for more discussion about the pros and cons of alternative modes of experimentation.

      • While we recognize the benefits of using survey and/or field experiments, it is critical to assess the advantages in light of the full range of considerations. For example, the control available in laboratory experiments enables researchers to maximize experimental realism (e.g., by using induced value or simply by more closely monitoring the subjects).

      • Similarly, there is less concern in laboratory settings about compliance × treatment interactions that become problematic in field experiments or spillover effects in survey experiments. In terms of external validity, increased control often affords greater ability to manipulate context and time, which, we have argued, deserve much more attention.

      • Finally, when it comes to the sample, attention should be paid to the nature of any sample and not just student samples. This includes consideration of non-response biases in surveys (see Groves and Peytcheva 2008) and the impact of using “professional” survey respondents that are common in many web-based panels.

  • In an experiment we want external validity for generalizability along multiple dimensions:

    • External validity refers to generalization not only of individuals but also across settings/contexts, times, and operationalization.

    • Not just replication (this study run on different sample – could we find conceptually similar relationships across multiple dimensions)

    • Student subjects don’t intrinsically negate these

  • External Validity:

    • worry: student samples may have a disproportionate share of subjects with low attitude crystallization

    • if the treatment effect is the same across populations (a homogeneous data-generating process), the nature of a particular sample is largely irrelevant for establishing that effect, and one can still obtain an unbiased estimator

    • if heterogeneous (depends on age, race, etc.), then one may worry because the sample doesn’t provide enough variance on the moderator, but random assignment helps

    • Theory is important! Know whether a homogeneous or heterogeneous effect is expected and adjust accordingly

    • what are we interested in – the average citizen vs. leaders of countries (can students proxy?)

    • The external validity of a single experimental study must be assessed in light of an entire research agenda, and in light of the goal of the study (e.g., testing a theory or searching for facts).

    • Assessment of external validity involves multiple dimensions, including the sample, context, time, and conceptual operationalization. There is no reason per se to prioritize the sample as the source of an inferential problem. Indeed, we are more likely to lack variance on context and timing, since these are constants in the experiment.

    • In assessing the external validity of the sample, experimental realism (as opposed to mundane realism) is critical, and there is nothing inherent to the use of student subjects that reduces experimental realism.

    • The nature of the sample—and the use of students—matters in certain cases. However, a necessary condition is: a heterogeneous (or moderated) treatment effect. Then the impact depends on:

    • If the heterogeneous effect is theorized, the sample only matters if there is virtually no variance on the moderator. If there is even scant variance, the treatment effect not only will be correctly estimated but may be estimated with greater confidence. The suitability of a given sample can be assessed (e.g., empirical variance can be analyzed).

    • If the heterogeneous effect is not theorized, it may be misestimated. However, even in this case, evaluating the bias is not straightforward because any sample will be inaccurate (since the “correct” moderated relationship is not being modeled).

  • The range of heterogeneous, non-theorized cases may be much smaller than often thought. Indeed, when it comes to a host of politically relevant variables, student samples do not significantly differ from non-student samples.

  • There are cases where student samples are desirable since they facilitate causal tests or make for more challenging assessments.

  • We have made a strong argument for the increased usage and acceptance of student subjects, suggesting that the burden of proof be shifted from the experimenter to the critic.

Online Surveys

Dunning Natural Experiments in the Social Sciences: A Design-Based Approach

  1. Researchers often ask questions about cause and effect, but confounds pose major obstacles to causal inference. At the core of conventional quantitative methods is the hope that such confounders can be identified, measured, and controlled. This is difficult if not impossible to actually do: there are difficult-to-measure or unobservable confounds, and including too many controls can cause more problems than it solves.

  2. Randomized control experiments present a possible solution, because randomization is one way to eliminate confounding.

  3. However, many causes of interest to social scientists are difficult to manipulate experimentally.

  4. Natural experiments offer one possible solution: social and political processes or research-design innovations create situations that approximate true experiments.

  5. Find observational situations in which causes are randomly, or as-if randomly, assigned among some set of units, such as individuals, towns, or districts. Comparing units exposed to the presence or absence of a cause can then provide credible evidence for causal effects, because random or as-if random assignment helps to eliminate confounding.

  6. Limitations: Not planned, but discovered. Can be difficult to tie to a research agenda unless you’re lucky. Validating as-if random is not straightforward. The popularity of natural experiments can incentivize “conceptual stretching” by researchers, using research designs that only implausibly meet the definitional features of the method. Causes that Nature randomly assigns may not be super important causal variables. Prioritizing causal inference can narrow research agendas to focus on theoretically irrelevant or substantively uninteresting topics.

  7. Varieties of natural experiments, Comparison with “true” (randomized control trial) experiments and observational studies, and Comparison with Quasi-experiments and Matching

    • “true” (randomized control trial) experiments have three key attributes:

      1. The response of experimental subjects assigned to receive a treatment is compared against that of assigned control group.

      2. Assignment is randomized through some automated process (e.g. coin flip, random number generator)

      3. The manipulation of the treatment is under the control of an experimental researcher.

    • “Standard” observational studies sometimes share the first attribute (in that different groups receive different treatments), but do not share the second or third: self-selection is normal, confounds are abundant, and there is no experimental manipulation. Observational research must balance concerns over omitted variable bias against the inclusion of irrelevant or poorly measured variables, which can make inferences even less reliable. Additionally, inferring causation requires a theory of how the data were generated, and observational research tends to lack credibility as a persuasive depiction of the data-generating process.

    • Natural experiments share one crucial attribute with true experiments and partially share a second attribute. First, outcomes are typically compared across subjects exposed to a treatment and those exposed to a control condition (or a different treatment). Second, in partial contrast with true experiments, subjects are often assigned to the treatment not at random, but rather as-if at random (though sometimes true randomization occurs, as in lottery studies). Given that the data come from naturally occurring phenomena that often entail social and political processes, the manipulation of the treatment is not under the control of the analyst; thus, the study is observational.

  8. Contrast with quasi-experiments and matching:

    • Quasi-experiments: For the study to qualify as a natural experiment, the researcher should be able to make a credible claim that the assignment of non-experimental subjects to treatment and control conditions is as good as random. This distinguishes natural experiments from “quasi-experiments,” in which comparisons are also made across treatment and control groups but these studies are characterized by non-random assignment.

    • Matching: assignment to treatment is not (as-if) random. Instead, comparisons are made between “treatment” and “control” observations that are matched on observable confounders. Matching thus seeks to statistically approximate as-if random assignment by conditioning on observable variables and is thus vulnerable to unobserved confounders and the models are driven by important assumptions, much like regression-based techniques.

  9. Natural Experiments as a Design-Based Approach

  10. An Evaluative Framework for Natural Experiments
    How much leverage for causal inference do natural experiments in fact provide? To address this question, it is helpful to discuss three dimensions along which natural experiments may vary. The strongest research designs will perform well in each of these dimensions:

    • The Plausibility of As-If Random: “Randomization” should be supported by the available empirical evidence—for example, by showing equivalence on relevant pre-treatment variables (those whose values were determined before the intervention took place) across treatment and control groups, as would occur on average with true randomization. Qualitative knowledge about the process by which treatment assignment takes place can also play a key role in validating a natural experiment.

    • The Credibility of Statistical Models, which is closely connected with the simplicity and transparency of the data analysis. As-if random assignment implies that both known and unknown confounders are balanced (in expectation) across treatment and control groups, obviating the need to measure and control for confounding variables. If a researcher needs to add a bunch of post-hoc statistical fixes (like additional controls), then the plausibility of as-if randomness should be questioned. To bolster the credibility of the statistical models employed in natural experimental designs, analysts should report unadjusted difference-of-means tests, in addition to any auxiliary analyses.

    • The Substantive Relevance of the Intervention: whether and in what ways the specific contrast between treatment and control provides insight into a wider range of social-scientific, substantive, and/or policy issues that motivate the study.

  11. There may be trade-offs in seeking to design a strong natural experiment (one that performs strongly in each dimension). Different studies may manage the trade-off among these three dimensions in different ways, and which trade-offs are acceptable (or unavoidable) may depend on the question being asked.

  12. Deep substantive knowledge, and a combination of quantitative and qualitative analysis, can help analysts better achieve success along all the three dimensions. Consider the studies of squatters in Argentina. There, substantive knowledge was necessary to recognize the potential to use a natural experiment to study the effect of land titling, and many field interviews were required to probe the plausibility of as-if randomness—that is, to validate the research design. Fieldwork can also enrich analysts’ understanding and interpretation of the causal effects they estimate.

  13. Critiques and Limitations of Natural Experiments

    • Detractors suggest that while these methods may offer reliable evidence of policy impacts at the micro level, the findings from natural experiments and from design-based research more generally are unlikely to aggregate into broader knowledge; the interventions being studied may lack substantive or theoretical relevance.

    • Advocates for true experiments and natural experiments suggest that these methods offer the most reliable route to secure causal inference. Even if some of the causes that analysts study appear trivial, the alternative to using true and natural experiments to make causal inferences is even less promising.

  14. Avoiding Conceptual Stretching

    • Analysts have sometimes claimed to use natural experiments in settings where the definitional criterion of random or as-if random assignment is not plausibly met. This reflects a desire to cover observational studies with the glow of experimental legitimacy. There is a risk of conceptual stretching as researchers rush to call conventional observational studies “natural experiments”. This is not productive.

Dissecting Political Science Experiment Examples

Jensen Findley Nielsen. 2019 “Electoral Institutions and Electoral Cycles in Investment Incentives: A Field Experiment on Over 3,000 U.S. Municipalities.”

  • Field Experiment

  • Research Hypotheses, Treatments, and Outcomes:

    1. Cities with directly elected leaders (mayor-council institutions) are more likely to respond and offer incentives than indirectly elected leaders (council-manager institutions).

      • Treatment: Electoral institutions cannot be randomly assigned, but we plan to perform sub-group analysis to study the effects of each treatment for cities with elected mayors compared with council-manager systems.

      • Outcome: (1) email response (yes or no), (2) size of offer, (3) website hits and activity patterns.

    2. Cities with elected leaders are more likely to respond and offer incentives for projects where credit can be claimed prior to an election.

      • Treatment: Randomly assign the timing of when the investment will be announced: either two months before the next election or one month after. A relocation announcement two months before the election should give politicians sufficient time to claim credit and benefit electorally from attracting investment.

      • Outcome: (1) email response (yes or no), (2) size of offer, (3) website hits and activity patterns.

    3. Country of Origin

      1. Cities will favor U.S. firms over foreign companies

      2. Leaders are less likely to respond positively to Chinese than Japanese investors

      3. Cities with elected (mayoral) leaders are less likely to respond positively to Chinese investors than non-elected (council-manager) cities.

      • Treatment: Randomize treatment of the implied investor’s country of origin (e.g. U.S., China, Japan)

      • Outcome: (1) email response (yes or no), (2) size of offer, (3) website hits and activity patterns.

  • Describe the criteria by which subjects were included in the experiment: Subjects are mayors, city managers, economic development directors, and their agents in 4,000 U.S. cities. The cities and their operating procedures (not individuals) are the key units of analysis. Note: While some coordination between cities and state offices occurs as they discuss incentives, there is very little discussion and coordination among cities themselves, so the risk of contamination, detection, or spillover is relatively low.

  • Explain how subjects were randomly assigned to experimental groups: (Simple or complete random assignment, blocked assignment, clustered assignment…) Pre-treatment blocking on form of city government, size of city, region, and GDP per capita. Randomly assigned inquiries, asking in each about the types of local incentives that could be offered to the client firm on top of any existing state incentives.

  • Summarize the experimental design: Incorporated a real consulting firm and took on a client to represent during the process. Approached all 4,000 U.S. cities with a population of 10,000 or greater. Varied key treatment variables, including the timing of the investment (relative to elections) and the implied country of origin of the investment. All experimental conditions were embedded in email communications sent to city managers, mayors, and economic development directors. Email is not only a realistic means of contacting officials but also minimizes treatment asymmetries.

  • Describe the outcome measures, the manner in which each is measured, and any problems encountered/solutions employed:

    • Attrition:

    • Define compliance and noncompliance:

    • Minimize measurement asymmetries: Used email and websites rather than other forms of communication to contact cities. Outcomes were measured using web forms that limited subjectivity of interpretation. This allowed for strong standardization of treatments and of the measurement of outcomes across subjects.

    • Contamination, detection, and spillover risk: monitored, but assumed to be low due to the competitive nature of incentives. In case this became a problem, the backup design included providing an email link inviting subjects to learn the specific details of the client firm.

  • Describe how data were analyzed: The primary outcome of interest was the data city officials provided in the web form included in the email. The form prompted for the level of tax abatements and the number of years the abatements would be in place, and included an open-ended box for additional information. Analyzed using difference-in-means tests, probit (response rate and incentive offered), and OLS (logged dollars). (An estimation sketch appears after this entry.)
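
A minimal sketch of the blocked random assignment described above, in Python. The data, block definitions, and treatment labels here are hypothetical stand-ins, not the authors’ actual code or categories.

```python
# Blocked random assignment: shuffle a (nearly) balanced set of treatment
# labels within each block. Column names and blocks are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical city-level data with two of the blocking covariates
# (form of government and a city-size bin).
cities = pd.DataFrame({
    "city_id": range(8),
    "gov_form": ["mayor", "mayor", "manager", "manager"] * 2,
    "size_bin": ["small"] * 4 + ["large"] * 4,
})

treatments = ["US", "China", "Japan"]  # implied country-of-origin conditions

cities["treatment"] = None
for _, idx in cities.groupby(["gov_form", "size_bin"]).groups.items():
    reps = -(-len(idx) // len(treatments))    # ceiling division
    labels = (treatments * reps)[: len(idx)]  # balanced label set
    cities.loc[idx, "treatment"] = rng.permutation(labels)

print(cities)
```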
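
And a rough sketch of the estimation strategy listed under data analysis (difference in means, probit for the binary response, OLS on logged dollar offers), assuming a hypothetical data file and variable names:

```python
# Estimation sketch: difference in means, probit, and OLS on logged dollars.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("municipal_responses.csv")  # hypothetical file

# Difference in means: response rates by electoral-timing condition.
print(df.groupby("before_election")["responded"].mean())

# Probit: did the city respond / offer an incentive?
probit = smf.probit("responded ~ before_election + C(gov_form)", data=df).fit()
print(probit.summary())

# OLS on logged offers, conditional on a positive offer.
offers = df[df["offer_usd"] > 0].copy()
offers["log_offer"] = np.log(offers["offer_usd"])
print(smf.ols("log_offer ~ before_election + C(gov_form)", data=offers)
      .fit().summary())
```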

Tomz 2007. “Domestic Audience Costs in International Relations: An Experimental Approach”

  • Survey Experiment

  • Motivation: Goal is to study audience costs directly while avoiding the problem of selection bias. A series of survey experiments tests prevailing theories that leaders suffer “domestic audience costs” if they issue threats or promises and fail to follow through. Citizens would think less of leaders who issued threats and then backed down than of leaders who never threatened at all. This generates costs to backing down that make international commitments more credible. Audience costs cannot be studied effectively using observational research because of strong strategic selection bias: if leaders take the prospect of audience costs into account when making foreign policy decisions, then in situations where citizens would react harshly to backing down, leaders would tend to avoid that path, leaving little opportunity to observe the public backlash.

  • Research Hypothesis: Approval ratings will be higher for a leader who chooses to “stay out” of a situation than they will be for a leader who commits to action and “backs down”. (Null hypothesis: no difference in approval ratings for “stay out” and “back down” conditions.)

  • Describe Treatment in Detail:

    • All participants in the Internet-based survey received an introductory script: “You will read about a situation our country has faced many times in the past and will probably face again. Different leaders have handled the situation in different ways. We will describe one approach U.S. leaders have taken, and ask whether you approve or disapprove.”

    • Participants then read about a foreign crisis in which “a country sent its military to take over a neighboring country.” To prevent idiosyncratic features of the crisis from driving the results, I randomly varied four contextual variables—regime, motive, power, and interests—that have been shown to be consequential in the international relations literature. The country was led by a “dictator” in half the interviews and a “democratically elected government” in the other half. The attacker sometimes had aggressive motives—it invaded “to get more power and resources”—and sometimes invaded “because of a long-standing historical feud.” To vary power, I informed half the participants that the attacker had a “strong military,” such that “it would have taken a major effort for the United States to help push them out,” and told the others that the attacker had a “weak military,” which the United States could have repelled without major effort. Finally, a victory by the attacking country would either “hurt” or “not affect” the safety and economy of the United States. (A sketch of this factorial randomization appears after this entry.)

    • Having read the background information, participants learned how the U.S. president handled the situation. Half the respondents were told: “The U.S. president said the United States would stay out of the conflict. The attacking country continued to invade. In the end, the U.S. president did not send troops, and the attacking country took over its neighbor.” The remaining respondents received a scenario in which the president made a threat but did not carry it out: “The U.S. president said that if the attack continued, the U.S. military would push out the invaders. The attacking country continued to invade. In the end, the U.S. president did not send troops, and the attacking country took over its neighbor.” The language in the experiment was purposefully neutral: it objectively reported the president’s actions, rather than using interpretive phrases such as “backed down,” “wimped out,” or “contradicted himself,” which might have biased the research in favor of finding audience costs.

  • Outcome measured: After displaying bullet points that recapitulated the scenario, I asked: “Do you approve, disapprove, or neither approve nor disapprove of the way the U.S. president handled the situation?” Respondents who approved or disapproved were asked whether they held their view very strongly, or only somewhat strongly. Those who answered “neither” were prompted: “Do you lean toward approving of the way the U.S. president handled the situation, lean toward disapproving, or don’t you lean either way?” The answers to these questions implied seven levels of presidential approval, ranging from very strong disapproval to very strong approval. (A sketch of this seven-point coding appears after this entry.)

  • Criteria by which subjects were included in the experiment: The first experiment was administered to a nationally representative random sample of 1,127 U.S. adults in 2004, recruited via random-digit dialing and a recruitment interview in which subjects were offered WebTV service in exchange for taking a weekly survey. Ongoing incentives increase participation and compliance rates. Demographic differences (sex, age, race, region, marital status, income, education, etc.) between the U.S. population and the survey sample were checked to maximize representativeness.

  • Randomly assigned to experimental groups: Appears to be simple randomization of a demographically representative sample of U.S. households.

  • Experimental design: Survey with a factorial design and Likert-scale outcomes, each described in greater detail above.

  • Randomization checks: Not reported in the text, but a footnote indicates that the author checked for imbalances, used regression to test for differences after accounting for them, and found none. (A generic balance-check sketch appears after this entry.)

  • Implementation problems: Attrition, compliance, and noncompliance are not discussed. Given the short survey and the limited risk to subjects, it is unlikely that attrition was high. No discussion of attention checks or other measures of compliance.

  • Data analysis: Difference-in-means tests between public reactions to (1) the empty threat and staying out, (2) the factorial conditions (regime, power, motive, interests), and (3) factorial conditions related to escalation (threat of force, display of force, with casualties, without casualties); sub-group analyses using voter and non-voter types. (A minimal difference-in-means sketch appears after this entry.)
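
A minimal sketch of the factorial randomization of the contextual variables described above. The level labels are paraphrased from the notes; the implementation itself is an assumption, not Tomz’s actual survey code.

```python
# Independently randomize each factor for each respondent (full factorial).
import random

factors = {
    "regime":    ["dictator", "democratically elected government"],
    "motive":    ["power and resources", "long-standing historical feud"],
    "power":     ["strong military", "weak military"],
    "interests": ["would hurt the U.S.", "would not affect the U.S."],
    "president": ["stay out", "empty threat"],
}

def draw_condition(rng: random.Random) -> dict:
    """Draw one level per factor, independently and with equal probability."""
    return {name: rng.choice(levels) for name, levels in factors.items()}

rng = random.Random(7)
print(draw_condition(rng))  # one respondent's assigned scenario
```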
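
A sketch of collapsing the branching approval questions into the seven-point scale. The exact numeric coding is an assumption based on the question wording, not the article’s replication code.

```python
# Map the branching answers onto a 1-7 approval scale.
from typing import Optional

def approval_score(first: str, strength: Optional[str] = None,
                   lean: Optional[str] = None) -> int:
    """1 = very strong disapproval ... 7 = very strong approval;
    4 = answered 'neither' and leaned neither way."""
    if first == "approve":
        return 7 if strength == "very" else 6
    if first == "disapprove":
        return 1 if strength == "very" else 2
    # first == "neither": use the lean follow-up question
    return {"approve": 5, "neither": 4, "disapprove": 3}[lean]

assert approval_score("approve", strength="very") == 7
assert approval_score("neither", lean="disapprove") == 3
```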
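
A generic balance-check sketch in the spirit of the footnote mentioned above (not the author’s actual code): regress the treatment indicator on pre-treatment demographics and ask whether they jointly predict assignment. File and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey_sample.csv")  # hypothetical file

# Linear probability model of assignment on demographics.
check = smf.ols(
    "empty_threat ~ age + C(sex) + C(race) + C(region) + income", data=df
).fit()

# The regression's overall F-test asks whether any covariate predicts
# assignment; a large p-value is consistent with successful randomization.
print(f"F = {check.fvalue:.2f}, p = {check.f_pvalue:.3f}")
```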
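
And a minimal difference-in-means sketch for the core comparison, approval after “stay out” versus after the empty threat, again with hypothetical names:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("audience_costs.csv")  # hypothetical file

stay = df.loc[df["condition"] == "stay_out", "approval"]
threat = df.loc[df["condition"] == "empty_threat", "approval"]

print("difference in means:", stay.mean() - threat.mean())
print(stats.ttest_ind(stay, threat, equal_var=False))  # Welch's t-test
```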

Karpowitz, Mendelberg, and Shaker. 2012. “Gender Inequality in Deliberative Participation”

  • Lab Experiment

  • Motivation/Research Question: Can men and women have equal levels of voice and authority in deliberation or does deliberation exacerbate gender inequality? Does increasing women’s descriptive representation in deliberation increase their voice and authority? How do (1) decision rules, (2) group composition and (3) the interaction thereof affect gender inequalities?

  • Research Hypothesis:

    • The lower the number of women in a group, the less women participate in and influence it, and the bigger the gender gap in participation and influence.

    • Unanimous rule protects gender minorities just as it protects preference minorities because of the emphasis it places on inclusion and cooperation.

    • The group’s gender composition interacts with its decision rule to exacerbate or erase the inequalities. (1) Minority women will be included more under unanimous than majority rule, and this will decrease the gender gap. (2) Minority men will also be included more under unanimous than under majority rule, but this will enhance rather than close the gender gap. In sum, the effect of unanimity will be roughly equal for both genders, but women will shift from under-representation to equality, whereas men will shift from equality to over-participation.

  • Describe Treatment in Detail: After participants privately filled out a pretreatment questionnaire, they were brought together as a group, where they were instructed to conduct a “full and open discussion” and to choose the “most just” principle of redistribution. We instructed participants to make a collective decision that would apply not only concretely and immediately to themselves and their group but also hypothetically to society at large, so we could generalize beyond the lab situation to the decisions people make about redistribution in politics. The only requirement was that they deliberate for at least five minutes before making a collective decision about the redistribution of income earned during the experiment. All instructions were exactly the same across conditions.
    Voting on rules for the distribution of income occurred by secret ballot, with the decision based on either unanimous or majority rule. Every group included five participants. Before deliberating, participants were given information about several well-known principles of redistribution, including no redistribution at all (everyone keeps what they earn), various poverty thresholds (minimum incomes below which no one would be allowed to fall), or equal redistribution of all income earned by the group (every group member receives the same amount, regardless of performance). These different principles and their implications formed the basis for much of the group discussion. During deliberation, each participant was recorded on a separate audio track, and the full conversation was also recorded on a master track that included all participants.
    After the group deliberation and decision, participants were asked to indicate (privately) the most influential person in the group. Participants then performed several rounds of “work”—correcting as many spelling errors in a block of difficult text as they could find within a two-minute time limit. Participants earned money according to their performance, and these earnings were distributed to group members according to their chosen distribution scheme. At the end of the work period, participants responded to a series of questions on attitudes and beliefs and were debriefed.

  • Outcome Measured: (1) Speech Participation: we divide the number of seconds each individual spoke by the group’s total number of seconds. This is an individual’s Proportion Talk (scaled 0–1), and it allows contrasts across groups with varying discussion lengths. If men and women participated at equal rates in a five-person group, the average individual Proportion Talk for each gender would be 0.20 (in other words, the average male and the average female would each take 20% of the conversation), resulting in a gender gap of 0. A second measure, Talk Time (men’s and women’s average talk time in the group), allows us to examine the gender-homogeneous groups. (A sketch of the Proportion Talk computation appears after this entry.)
    (2) Perceived influence: We measured Influence after discussion by asking each group member to indicate the one person who was “most influential” in the group’s discussion and decisions.

  • Criteria by which subjects were included in the experiment: Non-Hispanic white students and community members.

  • Experimental design: 6x2 between-subjects design, randomly assigning individuals to one of six gender compositions (that is, to a group that ranged from 0 to 5 women) and to one of two decision rule conditions (unanimous or majority rule).

  • Randomly assigned to experimental groups: Stratified by gender: gender compositions were randomly assigned to dates on the schedule of experimental sessions, and subjects who signed up to attend on a given date were assigned to the corresponding gender-composition condition. Each man or woman had the same probability of being assigned to a given gender composition: each person is equally likely to be assigned to a treatment. Randomization of the decision rule was achieved by a roll of dice prior to each session. (A simplified sketch of this session-level assignment appears after this entry.)

  • Randomization checks: Randomization checks and propensity score analyses indicate that individuals were assigned by a random process and groups were equivalent on relevant covariates.

  • Data Analysis: Regression with controls and one-tailed tests; OLS regressions with cluster-robust standard errors and negative binomial regressions. (Most results significant at p < .05 with a one-tailed test.) A sketch of the cluster-robust OLS setup appears after this entry.
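
A sketch of the Proportion Talk measure as defined above: each member’s speaking seconds divided by the group’s total seconds. The data and column names are hypothetical.

```python
import pandas as pd

# One hypothetical five-person group (2 women, 3 men) with talk times.
talk = pd.DataFrame({
    "group_id": [1, 1, 1, 1, 1],
    "gender":   ["F", "F", "M", "M", "M"],
    "seconds":  [80.0, 95.0, 140.0, 120.0, 165.0],
})

# Proportion Talk: individual seconds / group total (scaled 0-1).
talk["proportion_talk"] = (
    talk["seconds"] / talk.groupby("group_id")["seconds"].transform("sum")
)

# Gender gap in this group: mean female minus mean male Proportion Talk.
gap = (talk.loc[talk["gender"] == "F", "proportion_talk"].mean()
       - talk.loc[talk["gender"] == "M", "proportion_talk"].mean())
print(talk)
print(f"gender gap: {gap:.3f}")
```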
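
A simplified sketch of the session-level assignment described above: gender compositions randomized to session dates, with the decision rule drawn per session (a random draw stands in for the authors’ dice roll). Session labels and counts are hypothetical.

```python
import random

rng = random.Random(3)
sessions = [f"session_{i:02d}" for i in range(1, 13)]  # hypothetical sessions

compositions = list(range(6)) * 2  # 0..5 women per five-person group
rng.shuffle(compositions)          # randomize composition to session dates

schedule = {
    s: {"n_women": comp, "rule": rng.choice(["unanimous", "majority"])}
    for s, comp in zip(sessions, compositions)
}
print(schedule)
```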
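
A sketch of the OLS-with-clustered-errors setup: outcomes are measured at the individual level, but treatment operates at the group level, so standard errors are clustered on the deliberating group. Variable names and the exact specification are assumptions, not the authors’ replication code.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("deliberation.csv")  # hypothetical file

# Proportion Talk as a function of gender, group composition, and rule,
# with standard errors clustered by deliberating group.
model = smf.ols(
    "proportion_talk ~ female * (n_women + C(rule))", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["group_id"]})
print(model.summary())
```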

Previous Comp Questions

  • Do you agree or disagree with the following: There are things we worry about too much in social science experiments: Demand effects, Non-representative samples, Sponsor effects … and things we worry about too little: Statistical power, p-hacking, publication bias/file drawers. In answering the question, be sure to show you understand the above issues, perhaps with reference to specific research areas. Also, in giving your response of agree or disagree, be sure to clearly articulate your reasoning. (Spring 2019)

  • One of the major concerns over lab and survey experiments is the lack of external validity. What are the major approaches to addressing concerns over the external validity of lab and survey experiments? Which directions do you think are particularly promising, and which proposed solutions are least promising? Do these proposed solutions to external validity for lab/survey experiments make them better options than field experiments, which are typically cited for their stronger external validity? And why? (Fall 2018)

  • A newspaper has run an exposé on racial disparities in wait times and treatment at the local emergency room. Specifically, the article alleges that African American and Latino patients face delays in seeing a doctor when at the emergency room, are less likely to receive pain treatment, and are less likely to receive referrals for a specialist. The hospital administrator contends that the observations are anecdotal, and that the non-white population in the city is disproportionately small and afflicted with different medical issues that account for different patterns of care. As a social scientist trained in experimental and observational methods, you are tasked with designing a study to investigate the merits of the newspaper’s claim. How are you going to approach this assignment? You can choose a field experiment, a survey experiment, or an observational study, but you have to choose one. What are the strengths and weaknesses of your choice? What makes it the best choice? Please articulate the various components of internal and external validity and discuss the extent to which your study satisfies those criteria. Finally, please provide a discussion of the ethics of your proposed choice. (Spring 2018)

  • Pick a well-established argument from your field that has been primarily studied using observational data and that advances a causal claim (for example, democracies don’t go to war with each other; more educated people are more likely to vote). Briefly explain the typical (observational) research design employed to test this argument. Now, drawing on the logic of experiments, what threats to causal inference does this well-established argument face? Next, outline a rigorous experiment that would allow you to test this proposition and make a credible causal claim. Assume that you face typical constraints of a graduate student working on a dissertation, specifically, assume that you have limited time, a $15,000 grant, an IRB, and your own ethical concerns over research with human subjects. Also discuss the major shortcoming(s) of your proposed experiment. (Fall 2017)

  • Experiments have become a standard tool in political science. And yet there are many different types of experiments, including survey, lab, field, and natural experiments. Please discuss the costs and benefits of each type of experiment with respect to external and internal validity issues, including the core issues of compliance, attrition, and spillover. (Spring 2017)

  • The issues of non-response, social desirability bias, and question order effects commonly arise across many areas of survey research. Discuss an example of each of these three things that you have encountered either in your own research or in the political science literature. For each of the examples, describe how the issue arises (i.e. what causes the problem), what effect it has on the potential conclusions drawn by researchers, and how it can be remedied or at least how its effects can be minimized. (Spring 2016)

  • Deception is a debated topic in experimental methods. Imagine that you are joining a department that has a “no deception” policy in regards to experiments. What do you see as the benefits and costs to such a policy? Would you campaign to change the rule? In your proposal, how much deception would you allow and exactly how would you justify it? (Spring 2015)

  • The primary benefits of social science experiments are due to randomization of subjects to treatment and control conditions. Sometimes even well-intentioned randomization fails, however. Please discuss how one would recognize different types or levels of randomization failure. Then, identify the various threats to internal and external validity associated with randomization failure. Finally, is it possible to “fix” the randomization and, if so, to what extent (and how) can the benefits of full randomization be restored. (Fall 2015)