Automation & Reproducibility of Data Analysis

Modern scientific research depends increasingly on quantitative evaluation of hypotheses with large data sets and sophisticated statistical and computational tools. Simultaneously, competitive forces and limited resources in the sciences have unfortunately fostered a publication culture that values confirmation of hypotheses far more than thoughtful negative results. In addition, researchers hesitate to share their data widely for a variety of reasons, including the lack of good systems to ensure that they receive credit and concerns for the privacy of their research subjects. These pressures threaten to damage the reproducibility and trustworthiness of much of the scientific literature.

At CRCS, we believe that progress can be made on many of these challenges by collaboratively developing new kinds of tools for scientific data analysis, and reconsidering the pipeline from hypothesis to experiment to publication. Several different projects are under way to address these issues along various dimensions.

Automating Data Analysis

One of the significant challenges for science is bringing effective and rigorous analysis tools to bear on important problems when methodological experts are not in the loop. Ideally, statisticians and machine learning researchers would be involved in the many applied projects that demand sophisticated methods, but it is not always possible to find such collaborators. This deficit motivates the development of new tools that not only automate analysis in ways that mimic statistical thought processes, but also help scientists explore their data without damaging significance via, e.g., implicit testing of multiple hypotheses.
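
To make the multiple-hypothesis concern concrete, the following is a minimal illustrative sketch in Python. It is not drawn from any CRCS tool. The function names and the example p-values are hypothetical, and the family-wise error formula assumes the tests are independent. It shows how quickly the chance of a false positive grows as more hypotheses are tested, and how two standard corrections (Bonferroni and Holm) compensate.

```python
from typing import List, Sequence


def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across num_tests tests.

    Assumes the tests are independent and every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Bonferroni correction: compare every p-value against alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down correction.

    Visit p-values from smallest to largest, comparing the k-th smallest
    (k = 0, 1, ...) against alpha / (m - k), and stop at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also not rejected
    return reject


if __name__ == "__main__":
    # Hypothetical p-values from four exploratory comparisons.
    example_p_values = [0.001, 0.015, 0.04, 0.30]
    print("FWER for 20 uncorrected tests:", familywise_error_rate(0.05, 20))
    print("Bonferroni:", bonferroni_reject(example_p_values))
    print("Holm:      ", holm_reject(example_p_values))
```

Under these assumptions, a scientist who informally tries twenty comparisons at the 0.05 level is more likely than not to see at least one spurious "significant" result. That is the kind of implicit damage an automated exploration tool would need to track and correct for.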