A Novel Framework to Detect Source Code Plagiarism: Now, Students Have to Work for Real!
Contributor(s)
Equipe Hultech - Laboratoire GREYC - UMR6072 ; Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen (GREYC) ; Centre National de la Recherche Scientifique (CNRS) - Ecole Nationale Supérieure d'Ingénieurs de Caen - Université de Caen Basse-Normandie
Equipe AMACC - Laboratoire GREYC - UMR6072 ; Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen (GREYC) ; Centre National de la Recherche Scientifique (CNRS) - Ecole Nationale Supérieure d'Ingénieurs de Caen - Université de Caen Basse-Normandie
Keywords
[INFO.INFO-TT] Computer Science [cs]/Document and Text Processing
[INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG]
Online Access
https://hal.archives-ouvertes.fr/hal-01067161
https://hal.archives-ouvertes.fr/hal-01067161/document
https://hal.archives-ouvertes.fr/hal-01067161/file/p57-lesner.pdf
Abstract
International audience
Our work focuses on detecting plagiarism within a source code corpus. The case study is helping a human grader find plagiarism in source code written by Computer Science students. Like other approaches, we use the notion of similarity distance. However, in this work we introduce segmentation to split documents into smaller parts, and we propose a document-wise distance based on the cost of permuting segments to transform one document into another. Our framework is laid out as a pipeline in which each stage can be parameterized to build a plagiarism detector that fits the user's needs. The approach makes no assumption about the programming language being analyzed. Furthermore, it provides a summary report of the results to ease the decision-making process, as we consider that only a human user has the final word on whether a given case is plagiarism. We tested our framework on hundreds of real source files written in many programming languages, which allowed us to discover previously undetected frauds.
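The segmentation and segment-permutation distance described in the abstract can be illustrated with a small sketch. This is one possible reading, not the authors' implementation: the function names, the choice of one segment per non-blank line, the use of Python's `difflib.SequenceMatcher` as the segment-level distance, and the brute-force search over pairings are all assumptions made for illustration. The sketch reads no language syntax, in keeping with the abstract's language-agnostic claim.

```python
from difflib import SequenceMatcher
from itertools import permutations


def segments(source: str) -> list[str]:
    """Split a document into segments: one per non-blank line (an assumption)."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def segment_distance(a: str, b: str) -> float:
    """Distance in [0, 1] between two segments; 0.0 means identical text."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def document_distance(doc_a: str, doc_b: str) -> float:
    """Cost of the cheapest one-to-one pairing of segments, normalized to [0, 1].

    Every segment left unpaired costs 1.0. The search is brute force over
    permutations, so it is only practical for a handful of segments.
    """
    seg_a, seg_b = segments(doc_a), segments(doc_b)
    if len(seg_a) > len(seg_b):
        seg_a, seg_b = seg_b, seg_a  # make seg_a the shorter document
    if not seg_b:
        return 0.0  # two empty documents are identical
    best = min(
        sum(segment_distance(a, b) for a, b in zip(seg_a, perm))
        for perm in permutations(seg_b, len(seg_a))
    )
    unmatched = len(seg_b) - len(seg_a)
    return (best + unmatched) / len(seg_b)


original = "int a = 1;\nint b = 2;\nreturn a + b;"
shuffled = "return a + b;\nint a = 1;\nint b = 2;"
unrelated = "print('hello')"

print(document_distance(original, shuffled))
print(document_distance(original, unrelated))
```

Because segments are paired regardless of position, reordering lines does not increase the distance, which is the property a permutation-based measure is meant to capture. A realistic pipeline would replace the brute-force search with a polynomial-time assignment solver.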
Date
2010-03-22
Type
info:eu-repo/semantics/conferenceObject
Identifier
oai:HAL:hal-01067161v1
hal-01067161
DOI : 10.1145/1774088.1774101