I use a variant of the BW13 analysis as an introduction to my graduate seminar on quantitative methods of policy research, focusing on budget deficits rather than growth rates. One can correlate either party with budgetary restraint simply by choosing to focus on the White House versus Congress. The BW13 paper is interesting not just because it asks a provocative question, but also because that question entangles partisan preferences with empirical claims that are extremely difficult to resolve unambiguously through observation.
Teasing out causality is central to both policy making and policy research. As Steinberg (2007, PDF) writes:
Central to the aims of public policies, and the political constituencies supporting them, is the hope of having a causal impact on some aspect of the world. It is hoped that welfare-to-work programs will lead to a decline in chronic unemployment; that the international whaling regime will cause threatened species to rebound; and that health education campaigns will reduce HIV transmission. As Pressman and Wildavsky (1973, p. xxi) observed, “Policies imply theories. Whether stated explicitly or not, policies point to a chain of causation between initial conditions and future consequences. If X, then Y.” Accordingly, while causal theories play a role in many areas of social inquiry, they are vital to the practice of policy analysis, where they are used to diagnose problems, project future impacts of new regulations, and evaluate the effectiveness of—and assign responsibility for—past interventions (Chen, 1990; Lin, 1998; Young, 1999). Causal assessment plays an equally important role in the policy process tradition, as researchers identify the causal factors shaping policy agendas, decision-making styles, state–society relations, and the dynamics of stability and change (Baumgartner & Jones, 1993; Rochon & Mazmanian, 1993; Sabatier, 1999).

In short, the question of attribution of cause and effect is a key one in many policy settings. Here I discuss the paper in the context of broader methodological and epistemological issues associated with evaluating cause and effect in policy settings.
1. Risks of Data-Before-Theory
In starting with a correlation rather than a theory of causality, BW13 start off down a potentially treacherous methodological path, but one that they ultimately negotiate fairly well. Here is how BW13 explain the observations which lead to their central question:
[E]conomists have paid virtually no scholarly attention to predictive power running in the opposite direction: Do election outcomes help predict subsequent macroeconomic performance? The answer, which while hardly a secret is not nearly as widely known as it should be, is a resounding yes. Specifically, the U.S. economy performs much better when a Democrat is president than when a Republican is.

The potential problem here is that BW13 have generated a hypothesis -- the political party of the US president differentially influences subsequent US economic growth -- based on an examination of an existing correlation. Given that spurious relationships show up all the time (and are well understood, if not always appreciated -- see, e.g., Yule 1926, Hendry 1980, DeLong and Lang 1992, Ioannidis 2012), researchers have to exercise extreme caution when generating hypotheses from observations rather than from theories of causality.
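The classic illustration of this danger is Yule's "nonsense correlations": two series that each wander over time will often appear strongly related even when they are generated completely independently. A minimal simulation sketch (illustrative only, not the BW13 data):

```python
import random

# Two *independent* random walks routinely show large spurious
# correlations -- the "nonsense correlation" problem Yule (1926) described.
random.seed(2)

def random_walk(n):
    """A driftless Gaussian random walk of length n."""
    x, path = 0.0, []
    for _ in range(n):
        x += random.gauss(0, 1)
        path.append(x)
    return path

def corr(a, b):
    """Pearson correlation of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# 200 pairs of independent 64-step walks (64 to echo BW13's 64 years).
big = 0
for _ in range(200):
    if abs(corr(random_walk(64), random_walk(64))) > 0.5:
        big += 1
print(f"{big} of 200 independent pairs have |r| > 0.5")
```

In a typical run a sizable fraction of the independent pairs show |r| > 0.5, which is why a headline correlation, on its own, is weak evidence of causation.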
BW13 continue:

The superiority of economic performance under Democrats rather than Republicans is nearly ubiquitous; it holds almost regardless of how you define success. By many measures, the performance gap is startlingly large -- so large, in fact, that it strains credulity, given how little influence over the economy most economists (or the Constitution, for that matter) assign to the President of the United States. . .
During the 64 years that make up these 16 [presidential] terms, real GDP growth averaged 3.33% at an annual rate. But the average growth rates under Democratic and Republican presidents were starkly different: 4.35% and 2.54%, respectively. This 1.80 percentage point gap (henceforth, the “D-R gap”) is astoundingly large relative to the sample mean. It implies that over a typical four-year presidency the U.S. economy grew by 18.6% when the president was a Democrat, but only by 10.6% when he was a Republican.
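The four-year figures quoted above follow directly from compounding the average annual rates, which is easy to verify:

```python
# Check BW13's compounding: cumulative growth over a four-year term
# implied by an average annual growth rate.
def four_year_growth(annual_pct):
    """Cumulative growth (%) from compounding an annual rate over 4 years."""
    return ((1 + annual_pct / 100) ** 4 - 1) * 100

dem = four_year_growth(4.35)   # -> ~18.6
rep = four_year_growth(2.54)   # -> ~10.6
print(f"Democratic terms: {dem:.1f}%, Republican terms: {rep:.1f}%")
```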
Of course, in doing policy-related research as a practical matter it is the unpredicted or surprising observation that typically generates research questions deemed important. Why did the train crash? Why is the unemployment rate high? Why did the typhoon disaster occur?
Right away you see that typical questions important to policy and decision making deviate from the textbook model of hypothesis generation in that they start with a correlation or an outcome and work backwards to a theory of causality. This makes attention to methods that much more important to avoid being fooled by randomness, or other pathologies of thinking, such as using data selectively to favor certain theories (e.g., here in PDF).
2. What is "Significant"?
BW13 find a very strong statistical relationship between the political party of the president and the rate of GDP growth, p = 0.01. By contrast, they find much weaker evidence of a relationship between the height of the president and the rate of GDP growth, p = 0.39. What the p-value tells us is the probability of observing a difference at least as large as the one actually seen, on the assumption that both sets of outcomes are drawn from the same underlying distribution -- an assumption that requires us to know the distributions of outcomes from which each comes, a point I'll return to below.
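One concrete way to see what such a p-value measures is a permutation test: shuffle the party labels and ask how often a D-R gap at least as large as the observed one arises when the labels carry no information. The sketch below uses made-up per-term growth rates (chosen to produce a gap near BW13's 1.80 points), not the actual BW13 data:

```python
import random

# Permutation test on *hypothetical* per-term growth rates (16 terms,
# 7 labeled D and 9 labeled R -- an illustrative assignment, not BW13's).
random.seed(0)
growth = [4.8, 3.9, 5.1, 4.2, 3.5, 4.6, 4.3,            # "D" terms
          2.1, 2.9, 1.8, 3.0, 2.6, 2.2, 2.8, 3.1, 2.4]  # "R" terms
labels = ["D"] * 7 + ["R"] * 9

def gap(rates, labs):
    """Mean growth under D minus mean growth under R."""
    d = [x for x, l in zip(rates, labs) if l == "D"]
    r = [x for x, l in zip(rates, labs) if l == "R"]
    return sum(d) / len(d) - sum(r) / len(r)

observed = gap(growth, labels)

# How often does random labeling produce a gap this large?
count, trials = 0, 10_000
for _ in range(trials):
    shuffled = labels[:]
    random.shuffle(shuffled)
    if gap(growth, shuffled) >= observed:
        count += 1
p = count / trials
print(f"observed gap = {observed:.2f} pct. pts., one-sided p = {p:.4f}")
```

Note what this does and does not deliver: a small p says the gap would be surprising under random labeling, not that the labeling caused the gap.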
While the literature is chock full of discussions of the use and abuse of p-values in the interpretation of statistics, the idea that a strong p-value is either necessary or sufficient to denote a causal relationship persists. (Anyone doubting this claim need merely browse the archived discussions on this blog related to trends in extreme weather.)
One proposed method to deal with the challenges of data-before-theory is to implement more stringent levels of statistical significance. For instance, DeLong and Lang 1992 warn of the problem of data-mining by researchers (and on the relationship of p and t statistics, see this recent essay).
Most of us suspect that most empirical researchers engage consciously or unconsciously in data mining. Researchers share a small number of common data sets; they are therefore aware of regularities in the data even if they do not actively search for the "best" specification. There seems to be no practical way of establishing correct standard errors when researchers have prior knowledge of the data . . .

Even if researchers are able to avoid data mining, any observations from a complex system -- like the US economic and political systems -- will be explainable by a large number of plausible theories and relationships. The set of explanations can be said to be overdetermined in the sense that there will be more hypotheses supported by evidence than can plausibly be correct. For instance, BW13 test 27 variables (Table A.3) against GDP growth rates over different time periods. They could plausibly come up with many more variations to test, given that there is no generally accepted theory of economic growth.
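The scale of the problem is easy to quantify. If each of BW13's 27 candidate variables were tested independently at the conventional 0.05 level, the chance of at least one spurious "significant" result would be substantial (the independence assumption is, of course, a simplification):

```python
# With m independent tests at significance level alpha, the chance of at
# least one false positive under the null is 1 - (1 - alpha)**m.
# BW13's Table A.3 tests 27 variables; at the conventional 0.05 level:
alpha, m = 0.05, 27
p_any_false_positive = 1 - (1 - alpha) ** m
print(f"P(at least one spurious 'significant' result) = "
      f"{p_any_false_positive:.2f}")
```

Roughly three chances in four of at least one spurious hit, before any genuine relationship enters the picture.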
One possible reaction is to adjust standard errors by some multiplicative factor that "compensates" for this abuse of classical procedures. Along these lines, we can use our data to ask the question, By what factor would we have to divide reported t-statistics so that one-ninth of unrejected nulls would exhibit a marginal significance level of .9 or more? The answer is about 5.5. The "t-statistic of two" rule of thumb would then suggest that only unadjusted t-statistics of 11 or more should be taken seriously, in which case hypothesis testing -- especially in macroeconomics -- would become largely uninformative. Empirical work would play only a very minor role in determining the theories that economists believe. Some claim that at present empirical work does play a very minor role in determining the theories that economists believe (see McCloskey 1985).
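The arithmetic behind the quoted "t-statistic of 11" is simple, and a normal approximation shows just how punishing such a threshold would be:

```python
import math

# The DeLong-Lang thought experiment in numbers: if reported t-statistics
# must be divided by ~5.5 to compensate for data mining, the usual
# "t of two" rule becomes a "t of eleven" rule.
mining_factor = 5.5
adjusted_threshold = 2 * mining_factor          # -> 11.0

# Under a normal approximation, the two-sided p-value a raw t-statistic
# of 11 corresponds to: 2 * (1 - Phi(11)) = erfc(11 / sqrt(2)).
p_needed = math.erfc(adjusted_threshold / math.sqrt(2))
print(f"unadjusted t required: {adjusted_threshold}")
print(f"equivalent two-sided p-value: {p_needed:.1e}")
```

The required p-value is astronomically small, which is exactly the quoted worry: hypothesis testing at that standard would become largely uninformative.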
When there are multiple statistical tests available, one method for dealing with this situation is the Bonferroni correction, which "compensates" for the multiple tests by dividing the significance threshold by the number of tests performed -- that is, each individual result must clear a p-value smaller by a factor equal to the number of tests. (See Appendix A in this CCSP report in PDF on climate extremes for a nice example of its application in the context of time series of climate extremes.)
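A minimal sketch of the correction, using BW13's count of 27 candidate variables and illustrative (made-up) p-values:

```python
# Minimal Bonferroni sketch: with m tests, compare each p-value to
# alpha / m rather than alpha. The p-values here are illustrative,
# not BW13's actual Table A.3 results (apart from the headline 0.011).
def bonferroni_significant(p_values, alpha=0.05):
    """Return (p, passes_corrected_threshold) for each test."""
    threshold = alpha / len(p_values)
    return [(p, p < threshold) for p in p_values]

p_values = [0.011] + [0.20] * 26      # headline p plus hypothetical filler
threshold = 0.05 / len(p_values)      # -> ~0.00185
results = bonferroni_significant(p_values)
print(f"corrected threshold: {threshold:.5f}")
print(f"p = 0.011 survives correction: {results[0][1]}")
```

On this accounting even a headline p-value of 0.011 fails the corrected threshold of roughly 0.002.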
The methodological approaches recommended by DeLong and Lang 1992 and by the Bonferroni correction both suggest making tests of statistical significance much more rigorous, even to the point of making the tests practically irrelevant. From this perspective, a p result of 0.011 may be qualitatively no different from 0.39, as both may be orders of magnitude away from a more appropriately calculated threshold.
For a small-N study where relationships are many, co-determined, non-stationary and contingent, what BW13 might actually tell us is that conventional methods of econometrics are very limited in their ability to say anything much about causality. This is especially the case when exploring a process (economic growth) that is in general poorly understood in terms of cause and effect.
3. Limits of Statistics in Small-N Studies
The discussion so far has taken us to an uncomfortable place: it may be that conventional economic research methods and statistics offer limited help in the challenge of untangling causality in policy settings. The very idea of statistical methods in such a context is worth unpacking. The idea that our observations of society -- in this case elected presidents and economic performance -- can be said to be samples which come from a distribution (much less, distributions which we might characterize accurately) requires a giant metaphysical leap into alternative universes of counterfactuals. Such a leap leads to questions fundamentally unresolvable using empirical methods. It is no wonder that many policy arguments reduce to competing views on esoteric theories.
In a 2007 paper (here in PDF) in The Policy Studies Journal Steinberg provides a nice overview of the importance of "small-N" studies:
Another great attraction of small-N approaches, both in theoretical and applied settings, is their ability to trace causal mechanisms. The design of intelligent policy interventions requires analyses that move beyond mere patterns of correlation to include reasonably precise characterizations of the mechanisms through which posited causal variables exert their effects. Similarly, credible theories of political behavior and policy processes must not only demonstrate correlations but must establish a logic of association (George & Bennett, 2005, pp. 135–47). Yet it is widely recognized that statistical analysis, for all of its analytic power, is of limited value in tracing causal processes . . .

Getting back to BW13, which is certainly a small-N study, they take a smart approach by using the data to generate alternative hypotheses about the possible chain of causality between the election of a president and subsequent economic growth. However, as discussed frequently on this blog, their analysis is hampered by the fact that no one actually knows where economic growth comes from. BW13 don't either.
BW13 explore several variables that impact economic growth and come to the following conclusions:
Much of the D-R growth gap in the United States comes from business fixed investment and spending on consumer durables. And it comes mostly in the first year of a presidential term. Yet the superior growth record under Democrats is not forecastable by standard techniques, which means it cannot be attributed to superior initial conditions. Nor does it stem from different trend rates of growth at different times, nor to any (measureable) boost to confidence from electing a Democratic president.

Their bottom line? The residual difference in economic growth observed between Republican and Democrat presidents is a "mystery." What BW13 may have actually rediscovered is that we don't know where economic growth actually comes from. Solving that riddle will require going beyond simple statistics.
Democrats would no doubt like to attribute the large D-R growth gap to better macroeconomic policies, but the data do not support such a claim. Fiscal policy reactions seem close to “even” across the two parties, and monetary policy is, if anything, more pro-growth when a Republican is president—even though Federal Reserve chairmen appointed by Democrats outperform Federal Reserve chairmen appointed by Republicans. It seems we must look instead to several variables that are mostly “good luck.” Specifically, Democratic presidents have experienced, on average, better oil shocks than Republicans, a better legacy of (utilization-adjusted) productivity shocks, and more optimistic consumer expectations (as measured by the Michigan ICE).