Shining Light into Black Boxes

Abstract
The publication and open exchange of knowledge and material form the backbone of scientific progress and reproducibility and are obligatory for publicly funded research. Despite increasing reliance on computing in every domain of scientific endeavor, the computer source code critical to understanding and evaluating computer programs is commonly withheld, effectively rendering these programs “black boxes” in the research workflow. Exempting such a ubiquitous category of research tool from basic publication and disclosure standards carries substantial negative consequences. Eliminating this disparity will require concerted policy action by funding agencies and journal publishers, as well as changes in the way research institutions receiving public funds manage their intellectual property (IP).

In publicly funded research outside of computational science, the creation and dissemination of new tools, techniques, and methods require detailed publication and disclosure of the information necessary to satisfy peer review, to permit experimental reproduction, and to allow others to build upon the work. Research tools created using public funds, such as animal models or cell lines, even those intended for commercialization, must fulfill disclosure and publication requirements ([ 1 ][1]).

Disclosure practices among scientist-programmers often do not meet these standards. Computer programs created in the course of research can range from single-command line scripts to multigigabyte code repositories. Many scientist-created programs are ad hoc efforts never intended for distribution or release, but all can be equally critical to research outcomes. Although it is typical to publish general conceptual and functional descriptions of new, major pieces of scientist-created software, it is not uncommon to withhold the program source code and instead release only the binary (executable) version of a program.
Source code is the human-readable form of a programming language and contains the complete set of instructions for how a computer processes input data. In the absence of source code, the inner workings of a program cannot be examined, adapted, or modified.

The consequences of relying on these black boxes in research computation can be far-reaching. Common implementation errors in programs, such as failing to convert units correctly or assigning missing values as zero, can be difficult to detect without access to source code ([ 2 ][3]). Recent retractions, resignations, and canceled clinical drug trials at Duke University involved unreleased and unreproducible code ([ 3 ][4]). Calls for greater focus on reproducibility in scientific research have mounted in recent years ([ 4 ][5], [ 5 ][6]), and the inability to reproduce many published computational results or to perform credible peer review in the absence of program source code has contributed to a perceived “credibility crisis” for research computation ([ 6 ][7], [ 7 ][8]). Source code withholding causes duplication of effort by preventing sharing and reuse of validated computer code ([ 8 ][9]) and is incompatible with the stated goals of science funding agencies and policy advisory bodies ([ 9 ][10]).

How and why this unique disparity in disclosure practices persists within research computation is complex and goes beyond simple protectionism. Contributing factors may include the informal means by which most scientist-programmers attain their programming skills ([ 10 ][11], [ 11 ][12]). It is not uncommon for self-taught programmers to be insecure about publishing “ugly” code: programs that work but do not conform to accepted best practices, are inefficient, or are aesthetically lacking ([ 12 ][13]). Lack of awareness and education around issues of code dissemination among scientist-programmers may also contribute.
Among the small number of programming courses geared toward scientists, issues of code publishing or software licensing are seldom addressed. Systems of attribution and citation, which are frequently relied on as metrics for career evaluation and achievement and which evolved to accommodate publication of traditional scientific methods and techniques, may not adequately ensure authorship credit when source code is adapted by other researchers. Tendencies toward traditional IP protection regimes at institutional technology transfer offices (TTOs) can result in proprietary licensing and distribution schemes that discourage release of source code ([ 13 ][14]).

Public-funding and policy-setting agencies have yet to enumerate clear, comprehensive, and universal policies promoting the publishing and dissemination of computer source code. Some specific funding initiatives evaluate applicants, in part, on software sharing and dissemination plans [e.g., ([ 14 ][15])]. However, such grants are typically for, or specifically include, large software development projects and thus fail to address the large majority of scientist-created code.

Most significant may be the absence of a universal disclosure requirement by the gatekeepers of scientific publishing. Of the 20 most-cited journals in 2010 from all fields of science ([ 15 ][16]), only three ([ 16 ][17]–[ 18 ][18]) (including Science) have editorial policies requiring availability of computer source code upon publication. This stands in stark contrast to near-universal agreement among the 20 on policies regarding availability of data and other enabling materials.

Source code can be made available through a variety of mechanisms. Posting code for download on laboratory Web sites, depositing it in public code repositories, and using publisher facilities for supplemental materials are just a few existing options ([ 6 ][7]).
Because of the complexity and unique characteristics of computer source code, however, preserving the systems of attribution and citation that have evolved to accommodate traditional channels of scientific publishing (e.g., data...