Although words are the basic semantic units of text, phrases and expressions carry additional information that is important for text classification. To capture this information, traditional algorithms extract composite features from word sequences or co-occurrences, such as bigrams and termsets, but they ignore the influence of stop words and punctuation, which produces a large number of weak features. In this paper, we propose a text-structure-based algorithm for extracting composite features. Termsets that cross punctuation marks or stop words in the text are excluded. To eliminate redundancy, we propose a novel discriminative measure with two factors. One factor measures relevance, while the other increases the values of composite features whose class frequencies are much smaller than those of their sub-features. Experiments on three benchmark datasets with both a support vector machine and a naive Bayes classifier illustrate the effectiveness of the approach.
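The exclusion rule for termsets can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that punctuation marks and stop words both act as hard segment boundaries and that a termset is an unordered set of distinct words drawn from a single segment. The names `STOP_WORDS`, `split_segments`, and `extract_termsets` are hypothetical, and the stop-word list is a tiny stand-in for a real one.

```python
import re
from itertools import combinations

# Hypothetical, deliberately small stop-word list (an assumption for illustration).
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "in", "is", "are", "to", "for", "on", "with"}
)


def split_segments(text, stop_words=STOP_WORDS):
    """Split text into runs of content words bounded by punctuation or stop words."""
    segments = []
    # Any run of punctuation characters is treated as a hard boundary.
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for token in chunk.split():
            if token in stop_words:
                # A stop word closes the current segment and is itself discarded.
                if current:
                    segments.append(current)
                current = []
            else:
                current.append(token)
        if current:
            segments.append(current)
    return segments


def extract_termsets(text, size=2, stop_words=STOP_WORDS):
    """Return unordered termsets of `size` distinct words that share one segment."""
    termsets = set()
    for segment in split_segments(text, stop_words):
        # Sorting makes each termset order-independent, e.g. ("bayes", "naive").
        for combo in combinations(sorted(set(segment)), size):
            termsets.add(combo)
    return termsets


example = "Support vector machines are fast, and naive Bayes is simple."
pairs = extract_termsets(example)
```

In this example, "naive" and "Bayes" form a termset because they sit in the same segment, while "fast" and "naive" do not, since a comma and the stop word "and" separate them. The two-factor discriminative measure is not sketched here because the abstract does not give its formula.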