Contributor(s)The Pennsylvania State University CiteSeerX Archives
Full recordShow full item record
AbstractWe present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. The run-ning time for finding all groups of similar files, even for as little as 25 % similarity, is on the order of 500MB to 1GB an hour. The amount of similarity and several other customized parameters can be determined by the user at a post-processing stage, which is very fast. Sif can also be used to very quickly identify all similar files to a query file using a preprocessed index. Application of sif can be found in file management, information collecting (to remove duplicates), program reuse, file synchronization, data compression, and maybe even plagiarism detection. 1.