Evaluating the performance of LSA for source-code plagiarism detection
Abstract

Latent Semantic Analysis (LSA) is an intelligent information retrieval technique that uses mathematical algorithms to analyze large corpora of text and reveal the underlying semantic information of documents. LSA is a highly parameterized statistical method, and its effectiveness is driven by the setting of its parameters, which are adjusted according to the task to which it is applied. This paper discusses and evaluates the importance of parameterization for LSA-based similarity detection of source-code documents, and the applicability of LSA as a technique for source-code plagiarism detection when its parameters are appropriately tuned. The parameters involve preprocessing techniques, weighting approaches, and parameter tweaking inherent to LSA processing, in particular the choice of the number of dimensions retained when reducing the matrices produced by SVD. The experiments revealed that the best retrieval performance is obtained after removal of in-code comments (Java comment blocks) and application of a combined weighting scheme based on term frequencies, normalized term frequencies, and a cosine-based document normalization. Furthermore, the use of similarity thresholds (instead of mere rankings) requires a higher number of dimensions.
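The pipeline the abstract describes (weight a term-document matrix, truncate its SVD to k dimensions, then compare documents in the reduced space) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy matrix, the log term-frequency weighting, and the choice k = 2 are assumptions standing in for the terms extracted from preprocessed Java source and the combined weighting scheme the authors evaluate.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = source-code documents.
# In practice the terms would be identifiers and keywords extracted from
# code after preprocessing (e.g. comment removal); counts are illustrative.
A = np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 2],
    [2, 0, 1, 0],
    [0, 1, 0, 3],
    [1, 1, 1, 1],
], dtype=float)

# Log term-frequency weighting with cosine document normalization
# (a simple stand-in for the combined weighting scheme in the paper).
W = np.log1p(A)
W /= np.linalg.norm(W, axis=0, keepdims=True)

# SVD and rank-k truncation: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

def cos_sim(x, y):
    """Cosine similarity between two document vectors in LSA space."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

sim_01 = cos_sim(doc_vecs[0], doc_vecs[1])  # documents sharing few terms
sim_02 = cos_sim(doc_vecs[0], doc_vecs[2])  # documents sharing many terms
```

A plagiarism detector would either rank candidate pairs by this similarity or flag pairs whose similarity exceeds a threshold; as the abstract notes, the thresholding variant is the one that benefits from retaining more dimensions.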
Cosma, Georgina and Joy, Mike (2013). Evaluating the performance of LSA for source-code plagiarism detection. Informatica, 36(4), pp. 409-424. ISSN 0350-5596.