On finding duplication and near-duplication in large software systems

19 November 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 86-95
https://doi.org/10.1109/wcre.1995.514697

Abstract

This paper describes how a program called dup can be used to locate instances of duplication or near-duplication in a software system. Dup reports both textually identical sections of code and sections that are the same textually except for systematic substitution of one set of variable names and constants for another. Further processing locates longer sections of code that are the same except for other small modifications. Experimental results from running dup on millions of lines from two large software systems show dup to be both effective at locating duplication and fast. Applications could include identifying sections of code that should be replaced by procedures, elimination of duplication during reengineering of the system, redocumentation to include references to copies, and debugging.
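
The second kind of match described above, code that is identical except for a systematic substitution of names and constants, can be illustrated with a small sketch. This is not dup's implementation: the tokenizer, the `KEYWORDS` set, and the function names `prev_encode` and `p_match` are assumptions made for illustration. Each identifier or constant is replaced by the distance back to its previous occurrence (0 on first use), so two fragments compare equal exactly when one can be obtained from the other by a consistent renaming.

```python
import re

# Hypothetical, simplified tokenizer: identifiers, integer constants, or any
# single non-space character. A real tool would use a proper lexer.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Tiny illustrative keyword set (an assumption); keywords are never renamed.
KEYWORDS = {"if", "else", "for", "while", "return", "int"}


def tokenize(code):
    """Split a code fragment into a flat list of token strings."""
    return TOKEN.findall(code)


def is_parameter(tok):
    """Identifiers and constants may be substituted; keywords and punctuation may not."""
    return (tok[0].isalnum() or tok[0] == "_") and tok not in KEYWORDS


def prev_encode(tokens):
    """Replace each parameter token by the distance to its previous occurrence.

    The first occurrence of a parameter is encoded as 0. Fixed tokens are
    kept as strings, so they can never be confused with integer distances.
    """
    last_seen = {}
    encoded = []
    for i, tok in enumerate(tokens):
        if is_parameter(tok):
            encoded.append(i - last_seen[tok] if tok in last_seen else 0)
            last_seen[tok] = i
        else:
            encoded.append(tok)
    return encoded


def p_match(a, b):
    """True if fragments a and b are identical up to a consistent renaming."""
    return prev_encode(tokenize(a)) == prev_encode(tokenize(b))


if __name__ == "__main__":
    print(p_match("x = y + 1; y = x;", "a = b + 2; b = a;"))  # consistent renaming
    print(p_match("x = y + 1; y = x;", "a = b + 2; a = b;"))  # inconsistent renaming
```

Comparing encoded token sequences rather than raw text is what lets a single equality test cover both exact copies and renamed copies; scaling this to whole systems requires an index structure over the encoded text, which the sketch does not attempt.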

This publication has 12 references indexed in Scilit:

Parameterized Pattern Matching: Algorithms and Applications
Journal of Computer and System Sciences, 1996
Dotplot: A Program for Exploring Self-Similarity in Millions of Lines of Text and Code
Journal of Computational and Graphical Statistics, 1993
Status report: software reusability
IEEE Software, 1993
Identifying the semantic and textual differences between two versions of a program
Published by Association for Computing Machinery (ACM), 1990
Detecting Plagiarism in Student Pascal Programs
The Computer Journal, 1988
The X window system
ACM Transactions on Graphics, 1986
Measurements of program similarity in identical task environments
ACM SIGPLAN Notices, 1984
Linear Algorithm for Data Compression via String Matching
Journal of the ACM, 1981
A universal algorithm for sequential data compression
IEEE Transactions on Information Theory, 1977
A Space-Economical Suffix Tree Construction Algorithm
Journal of the ACM, 1976