A recent opinion piece on a seemingly unrelated topic has great methodological importance for research in the "design science" tradition. Design science is essentially engineering research, in which the researcher builds a system -- e.g., a recommender system or a data mining system -- and tests whether it works better than existing systems. It also includes research such as interface design, in which the researcher isolates and tests the efficacy of a single design element, as opposed to a whole system. We'll get back to both kinds of design science below; in a later post, I will discuss the important differences between them. But first, the NYT article.
The New York Times article, titled "Do Clinical Trials Work?" (http://www.nytimes.com/2013/07/14/opinion/sunday/do-clinical-trials-work.html?pagewanted=all), discusses clinical trials of medicines. The purpose of a clinical trial (a so-called Phase 3 trial) is to test the efficacy of a proposed drug as part of the process of gaining approval from the FDA.
The chief concern expressed in this NYT opinion piece is that even after hearing the results of a study -- or indeed, of the totality of studies -- one still doesn't know which drug works best. The article implies that the reason for this is that each clinical trial tests the efficacy of a single drug, often against a placebo. This is not quite right. The reason one doesn't know which drug works best is that each drug is tested in its own separate study, and it's impossible to compare the magnitude of effect in one study against the magnitude of effect in another, due to the myriad confounding factors (e.g., different patient populations). Distinct from this, the reason one doesn't know which COMBINATION of drugs works best is that each drug is tested in isolation, against a placebo. I will elaborate, but just to summarize till here:
PROBLEM 1: Don't know which drug works best; REASON: each drug is tested in a different setting.
PROBLEM 2: Don't know which combination of drugs works best; REASON: each drug is tested in isolation, against a placebo.
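To make these two problems concrete, here is a minimal, purely illustrative sketch in Python. Every number in it is invented: two placebo-controlled trials are simulated, each testing one drug on a different patient population. Each drug "beats" its placebo, yet the two effect estimates are not comparable (PROBLEM 1), and no arm ever tests the two drugs together (PROBLEM 2).

```python
# A toy illustration (all numbers made up): two separate placebo-controlled
# trials, each run on a different patient population, each testing one drug.
import numpy as np

rng = np.random.default_rng(0)

def run_trial(baseline_mean, baseline_sd, drug_effect, n=500):
    """Simulate one placebo-controlled trial; return the estimated effect."""
    placebo = rng.normal(baseline_mean, baseline_sd, n)
    treated = rng.normal(baseline_mean + drug_effect, baseline_sd, n)
    return treated.mean() - placebo.mean()

# Trial 1: drug A, tested on a mildly ill, low-variance population.
effect_a = run_trial(baseline_mean=50, baseline_sd=5, drug_effect=4)
# Trial 2: drug B, tested on a sicker, more variable population.
effect_b = run_trial(baseline_mean=30, baseline_sd=15, drug_effect=6)

print(f"Drug A beats its placebo by ~{effect_a:.1f} points (population 1)")
print(f"Drug B beats its placebo by ~{effect_b:.1f} points (population 2)")
# PROBLEM 1: the two estimates come from different populations and settings,
# so comparing them does not tell us which drug works best.
# PROBLEM 2: no arm anywhere tests A and B together, so we learn nothing
# about the combination.
```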
Finally, the article raises a third problem: one doesn't know the circumstances under which one drug may work better than another. This is attributed to the fact that drug efficacy depends crucially on the presence of ("is moderated by," in academic parlance) genetic factors that most large clinical trials don't measure.
In the information systems setting, the question is: what do we learn from a study that pits system-with-feature-X against a "placebo" system-without-feature-X? Well, we may learn that feature X is helpful. But we don't know whether this feature is better than feature Y, which was studied separately (PROBLEM 1). And more to the point, similar work is being done by dozens or hundreds of other researchers, each studying one or two system features and demonstrating that they work better than... nothing, i.e., the placebo. This leaves us with the question of which combination of features works best (PROBLEM 2).
This issue was recently raised in Norbert Fuhr's acceptance speech for his Salton Award (http://www.is.inf.uni-due.de/bib/pdf/ir/Fuhr_12.pdf), a kind of lifetime achievement award for work done in the field of information retrieval, a.k.a. search engines. As he noted, a study by Armstrong et al. (2009) reported that there has not been any upward trend in the overall performance of (laboratory-based) information retrieval systems over the past decade or so, in spite of an endless stream of papers reporting system features that improve performance. Now, the information retrieval field does not suffer badly from PROBLEM 1, because researchers often use standardized data sets: even when two researchers each study a different feature, they study them on the same set of documents and queries. This would be akin to two separate medical clinical trials, each testing a different drug for a particular condition, ON THE SAME SET OF PATIENTS. Obviously, this is not practicable in the medical setting, where hundreds of studies are being carried out, each in a different hospital, and so on.
But like many fields in engineering, including much work in IS's design science, information retrieval does suffer from PROBLEM 2. What happens is that each year, new design features are suggested, but always in comparison with the same -- call it "placebo" -- baseline, not with respect to a system that includes all previously known good features. The result is that we are left with a sort of inventory of design features, each of which has been shown to be better than nothing, but with no guidance about which combination of features works best. Armstrong et al. further imply that this essentially means that the studies were "cheating": if feature Z only works better than a placebo with "no features," but not better than a decent system that includes previously-known-to-work features, then Z cannot be said to "work" in any meaningful sense. At least, that's their view. And it leaves us with the problem of not knowing which combination of features works best. To remedy this, they suggest that each researcher should compare his/her newly proposed design against the best-performing system known to date. In other words, if I propose a new design feature Z, I should test a system that has all the features that lead to the very best performance overall but does not include Z, against a system that has all those features and ALSO feature Z. Then, if Z adds marginal benefit, we will have learned something.
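To make the contrast concrete, here is a toy sketch (in Python) of the two evaluation protocols. The scoring function and the feature names ("stemming", "bm25", "feedback", "Z") are invented for illustration; the only point of the sketch is that a feature can beat an empty "placebo" baseline while adding nothing on top of the best-known system, because its contribution overlaps with features already present.

```python
# Toy sketch (invented numbers) of the two evaluation protocols discussed above.
def system_score(features):
    """Pretend evaluation score, standing in for, e.g., MAP on a standard test collection."""
    # Each feature taps partially overlapping sources of relevance "signal";
    # overlapping signal is counted only once, so benefits are not additive.
    signal = {"stemming": {1, 2}, "bm25": {2, 3, 4}, "feedback": {4, 5},
              "Z": {2, 4}}  # Z's signal is already covered by bm25 + feedback
    covered = set().union(*(signal[f] for f in features)) if features else set()
    return 0.3 + 0.1 * len(covered)

placebo = []                                   # the bare baseline
best_known = ["stemming", "bm25", "feedback"]  # best-performing known system

# Protocol 1: compare Z against the "placebo" baseline -> Z appears to "work".
print(f"Z vs placebo:    +{system_score(placebo + ['Z']) - system_score(placebo):.2f}")
# Protocol 2 (Armstrong et al.): add Z to the best-known system -> Z adds nothing.
print(f"Z vs best known: +{system_score(best_known + ['Z']) - system_score(best_known):.2f}")
```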
I recently wrote a commentary in ACM SIGIR Forum that diametrically opposes the suggestion that proper design science should test proposed new features as additions to previous-best-performing systems. I argue that the correct remedy to this situation is not to require comparisons against previous best performers, but to engage in more conceptual research. Conceptual research is about using theory to guide the invention and definition of variables, how to measure them, and their relationships with other similarly conceived variables -- NOT ULTIMATE OUTCOMES -- and testing those definitions and relationships in empirical work. This is the ABC of scientific work in all fields, except in engineering fields, where there is a tendency to test any proposal on ultimate performance measures.
Take an example from maritime engineering. Suppose a researcher proposes a new design element for a ship, e.g., a new material that results in a stronger hull. In the engineering-oriented approach of Armstrong et al. -- which is also implied in the NYT article -- the researcher must create (or simulate) a complete ship that includes all the design elements that yield the world's single top-performing ship. Then, starting from that previous top performer, he/she should see whether the stronger hull adds any improvement to the ultimate outcome -- e.g., top speed, or time between refueling, or whatever the ultimate outcome measure may be.
By contrast, in conceptual research, a researcher would propose specific, local variables that the increased hull strength is expected to affect. Indeed, the whole notion of "hull strength" would first have to be conceived as a meaningful variable to think about; it would have to be defined, on its own terms and in terms of its expected relationships with other variables. The researcher would propose the other variables that it directly affects, and would not (only) predict how it might affect ultimate outcomes. In academic parlance, a variable's direct relationship with other variables that are not ultimate outcomes is called the "mechanism" through which the variable affects the ultimate outcome. For example, the researcher might propose that the stronger hull will reduce the ship's wake. A proper test of that hypothesis is a study (or simulation) of whether such a stronger hull indeed reduces the ship's wake. The importance of such research for shipbuilding is the hope that, under some conditions, the reduced wake might improve an ultimate performance measure; but that would be outside the scope of the described research.
In this conceptual world, it is not only unnecessary, but counter-productive, to test instead whether the stronger hull led to improvement in some ultimate performance measure such as top speed. Unnecessary, because we are trying to learn how things work. And counter-productive, because it might very well be that the so-called "previous best" ship would be better if we had REMOVED one of its supposedly great features, and INSTEAD used the stronger hull. It is the nature of scientific work to study direct connections between local variables, and in this effort, it is perfectly correct to use a placebo as the baseline. This is not "cheating" because the aim is not to show that my system is the winner, or to say that ships with stronger hulls will be better on some ultimate performance criterion. Rather, the aim (in this example) is to see whether a stronger hull actually reduces the ship's wake. Other researchers will do something similar, studying different sets of local variables, such as how wake interacts with wind, or what have you. Armed with these separate understandings of how things work, we may be able to predict which combinations of elements work well, under which circumstances. It's not trivial, but we're in a much better position than if we had conducted experiments that only measure ultimate performance. Each piece of conceptual research contributes insights into how things work. Then we might be able to theorize and hypothesize which combinations make sense together. This is called science, and not all engineering fields are steeped in the tradition of conceptual research.
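For what it's worth, here is a minimal sketch of what such a piece of conceptual research might look like in code. The "physics" is entirely made up; the point is only that the quantity being estimated is the direct, local relationship (hull strength to wake), not top speed or any other ultimate outcome.

```python
# Minimal sketch of testing a local, mechanism-level hypothesis
# ("a stronger hull reduces wake") on made-up simulated data.
import numpy as np

rng = np.random.default_rng(1)

n = 200
hull_strength = rng.uniform(0.0, 1.0, n)       # hypothetical, normalized 0-1
other_factors = rng.normal(0.0, 1.0, n)        # sea state, load, etc.
# Assumed data-generating process, for illustration only:
wake = 2.0 - 0.8 * hull_strength + 0.3 * other_factors + rng.normal(0, 0.1, n)

# Estimate the direct hull_strength -> wake relationship (simple least squares).
slope, intercept = np.polyfit(hull_strength, wake, deg=1)
print(f"Estimated effect of hull strength on wake: {slope:.2f}")
# A clearly negative slope supports the local hypothesis; whether a reduced
# wake then improves top speed is a separate question, outside the scope of
# this piece of conceptual research.
```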
To summarize: in conceptual research, we learn how things work, and this will ultimately guide us about which combinations we can expect to work. By contrast, in the horse-racing approach that dominates some engineering fields, we are indeed left with an inventory of features, but with no guidance about how they work, and so no guidance about which combinations may be expected to work best.
In the medical field, actually, I think there is a strong tradition of conceptual research to complement the measurement of ultimate outcomes. Clinical trials are like those engineering studies that try to measure an ultimate performance measure. But in medicine, those same clinical trials often also measure the many layers of causal mechanisms (what led to longer life? reduced tumor size over some number of months; what led to that? increased susceptibility of cancer cells to destruction by X; what led to that? increased Y, supplied by the drug being tested). Thus, in the medical field, as in engineering fields, the ultimate performance measure of a single study has limited meaning. X worked better than nothing, but is it better than alternative Y that was studied elsewhere? Hard to know. Is it best to use X in combination with A, or in combination with B, or neither? Will a regimen of X added to A offer any benefit compared with A alone? We don't know, based solely on the clinical trial's ultimate performance measures. But the answer is not to require clinical trials to compare the addition of X to the so-called previous best performer. Rather, the answer is to focus -- as the medical field does -- also on the mechanisms, the less-than-ultimate performance measures, which explain how things are working. This yields guidance about which combinations of drugs might work best. I am no expert, but I believe that medical research, including clinical trials, does not limit itself to ultimate performance measures. Therefore, I think the situation in the medical world is not as bleak as the article portrays it. Researchers there accumulate knowledge of mechanisms, and this serves as the basis for contemplating which combinations might work well. In engineering design science, I am less sanguine that researchers appreciate the benefit of conceptual research.
To summarize: in design science, as in all science, the most important research is conceptual research that studies mechanisms, i.e., direct relationships between variables. This is the best way to make sustained progress at the level of whole systems, because it guides us about which combinations of features make sense together.