Browsing by Author "Sindhu, L"

Dyuthi/Manakin Repository

Now showing items 1-2 of 2

Author Identification in Malayalam using n-grams

Sumam, Mary Idicula; Bindu, Baby Thomas; Sindhu, L (2009)

Abstract:

Author identification is the problem of identifying, from a given set of authors, the author of an anonymous text or of a text whose authorship is in doubt. The works of different authors are strongly distinguished by quantifiable features of the text. This paper describes attempts to identify the most likely author of a Malayalam text from a list of authors. Malayalam is an agglutinative Dravidian language, and few successful tools have been developed to extract syntactic and semantic features of texts in this language. We have made a detailed study of the various stylometric features that can be used to form an author's profile and have found that the frequencies of word collocations can clearly distinguish an author in a highly inflected language such as Malayalam. In our work we extract the word-level and character-level features present in the text to characterize the style of an author. Our first step was to create a profile for each of the candidate authors whose texts were available to us, first from word n-gram frequencies and then from variable-length character n-gram frequencies. The profiles of the authors under consideration were then compared with the features extracted from the anonymous text to suggest the most likely author.
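
The profile-based approach described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the profile size of 500, the n-gram lengths 2 to 4, and the relative-difference dissimilarity measure are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Yield every character n-gram of length n_min..n_max (assumed range)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def word_ngrams(text, n=2):
    """Yield word n-grams (collocations) from whitespace-separated tokens."""
    words = text.split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])


def build_profile(grams, size=500):
    """Map the `size` most frequent n-grams to their relative frequencies."""
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.most_common(size)}


def dissimilarity(profile_a, profile_b):
    """Sum of squared relative frequency differences (an assumed measure)."""
    score = 0.0
    for gram in set(profile_a) | set(profile_b):
        fa = profile_a.get(gram, 0.0)
        fb = profile_b.get(gram, 0.0)
        # fa + fb > 0 because the n-gram occurs in at least one profile.
        score += ((fa - fb) / ((fa + fb) / 2)) ** 2
    return score


def most_likely_author(author_texts, anonymous_text, extract=char_ngrams):
    """Return the candidate whose profile is closest to the anonymous text."""
    unknown = build_profile(extract(anonymous_text))
    profiles = {
        name: build_profile(extract(text))
        for name, text in author_texts.items()
    }
    return min(profiles, key=lambda name: dissimilarity(profiles[name], unknown))
```

Passing `extract=word_ngrams` switches the same comparison to word-level collocations, which mirrors the two feature levels mentioned in the abstract.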

URI:

http://dyuthi.cusat.ac.in/purl/4103

Files in this item: 1

Files	Size
Author Identifi ... alayalam using n-grams.pdf	(388.1Kb)

A Copy detection Method for Malayalam Text Documents using N-grams Model

Sumam, Mary Idicula; Bindu, Baby Thomas; Sindhu, L (February 9, 2013)

Abstract:

In this paper, a method for copy detection in short Malayalam text passages is proposed. Given two passages, one as the source text and the other as the suspected copy, the method determines whether the second passage is a plagiarized version of the source text. An algorithm for plagiarism detection using the n-gram model for word retrieval was developed, and trigrams were found to be the best model for comparing Malayalam text. Based on the probability and resemblance measures calculated from the n-gram comparison, the text is categorized against a threshold. Texts are compared using variable-length n-grams (n = {2, 3, 4}). The experiments show that the trigram model gives acceptable average performance at an affordable cost in terms of complexity.
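
A rough Python sketch of n-gram based copy detection along these lines is shown below. The abstract does not define its probability and resemblance measures, so this sketch substitutes two common set-overlap measures (Jaccard resemblance and containment) and a hypothetical threshold of 0.5; all function names are illustrative assumptions rather than the paper's algorithm.

```python
def word_ngram_set(text, n=3):
    """Return the set of word n-grams in a whitespace-tokenized passage."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(source, suspect, n=3):
    """Jaccard overlap of the two n-gram sets (assumed resemblance measure)."""
    a = word_ngram_set(source, n)
    b = word_ngram_set(suspect, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def containment(source, suspect, n=3):
    """Fraction of the suspect passage's n-grams that occur in the source."""
    a = word_ngram_set(source, n)
    b = word_ngram_set(suspect, n)
    if not b:
        return 0.0
    return len(a & b) / len(b)


def is_copy(source, suspect, n=3, threshold=0.5):
    """Flag the suspect passage when containment reaches the threshold.

    The threshold value is hypothetical; the abstract does not state one.
    """
    return containment(source, suspect, n) >= threshold


def compare_all_lengths(source, suspect, lengths=(2, 3, 4)):
    """Compute resemblance for each n-gram length named in the abstract."""
    return {n: resemblance(source, suspect, n) for n in lengths}
```

Calling `compare_all_lengths(source, suspect)` gives one score per n-gram length, which makes it easy to see how bigrams, trigrams, and 4-grams trade off sensitivity against strictness on a given pair of passages.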

URI:

http://dyuthi.cusat.ac.in/purl/4104