Wednesday, January 13, 2016

Remove or Filter Stopping/Stemming Words Using Java

For better indexing and searching of large chunks of text, we need to filter the unwanted words out of the data so that only the meaningful words are indexed, which improves search performance.

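What are Stopping Words?

Stopping words (commonly called stop words) are the common words, such as articles, pronouns, prepositions, and auxiliary verbs, that hold a sentence together but carry little meaning on their own. For example, in "Where is my car?" the words "where", "is", and "my" are stopping words that are not required for the search.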
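
To make the idea concrete, here is a minimal sketch of stop-word filtering in plain Java. It is not the exude implementation: the class name StopWordSketch, the method removeStopWords, and the tiny word list are all made up for illustration, and a real stop-word list would hold hundreds of entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordSketch {
    // Tiny illustrative list (an assumption); real lists are much longer.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("where", "is", "my", "the", "a", "an", "of", "to"));

    // Returns the tokens of the text that are not in the stop-word list.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<String>();
        // Lower-case, then split on anything that is not a letter or digit.
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("Where is my car?")); // prints [car]
    }
}
```

Only "car" survives, which is the one word worth indexing in that sentence.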

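What are Stemming Words?

Stemming words are words formed by adding a suffix such as "ing", "tion", "ational", "ization", or "ation" to a base word, for example going (go + ing) and standing (stand + ing). Stemming strips these suffixes so that the different forms of a word are reduced to a common stem.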
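
As a rough illustration of suffix stripping, the sketch below removes the suffixes listed above when the remaining stem still contains a vowel. This is a deliberately crude stand-in and not the Porter algorithm, which applies many more rules in several steps. The class name SuffixStripSketch and its methods are hypothetical.

```java
class SuffixStripSketch {
    // Checked in this order so that "ization" is tried before "ation".
    private static final String[] SUFFIXES = {"ization", "ational", "ation", "tion", "ing"};

    // True if the string contains at least one of a, e, i, o, u.
    private static boolean hasVowel(String s) {
        for (int i = 0; i < s.length(); i++) {
            if ("aeiou".indexOf(s.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Strips the first matching suffix whose removal leaves a stem with a vowel.
    static String stripSuffix(String word) {
        String lower = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            if (lower.endsWith(suffix)) {
                String stem = lower.substring(0, lower.length() - suffix.length());
                if (hasVowel(stem)) {
                    return stem;
                }
            }
        }
        return lower; // no suffix applied
    }

    public static void main(String[] args) {
        System.out.println(stripSuffix("going"));         // prints go
        System.out.println(stripSuffix("standing"));      // prints stand
        System.out.println(stripSuffix("normalization")); // prints normal
        System.out.println(stripSuffix("sing"));          // prints sing
    }
}
```

The vowel check is what keeps a short word like "sing" from being cut down to "s". A real stemmer needs many more such conditions.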

I was looking for a library to filter stopping/stemming words and did not find much by googling, so I went through blogs, white papers, and algorithms on stopping/stemming words and started writing a small utility library in Java. It is available in the Maven repository, and the full source is on GitHub.

Exude Library

This is a simple library for removing/filtering stopping and stemming words from text data. It is at a very early stage of development, and more work is needed in later versions.

It is now part of the Maven repository; add the following directly to your pom.

    <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>

Download the latest version of exude.

Features:
1. Filter stopping words from given text/file/link
2. Filter stemming words from given text/file/link
3. Get swear words from given text/file/link

How the Exude library works:
Step 1: Filter the duplicate words from the input data/file.
Step 2: Filter the stopping words from the step 1 output.
Step 3: Filter the stemmer/swear words from the step 2 output using the Porter algorithm, which is used for suffix stripping.

[Figure: exude process sequence flow]
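
The first two steps of this flow can be sketched in a few lines of Java. This is only an illustration of the sequence under my own assumptions, not the library's actual code: the class PipelineSketch, the method process, and the short stop-word list are hypothetical, and step 3 (suffix stripping and swear-word filtering) is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class PipelineSketch {
    // Short illustrative stop-word list (an assumption).
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("is", "a", "and", "to", "the", "its", "can", "be"));

    static Set<String> process(String text) {
        // Step 1: collect distinct lower-cased tokens, keeping first-seen order.
        Set<String> words = new LinkedHashSet<String>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        // Step 2: drop the stop words from the de-duplicated set.
        words.removeAll(STOP_WORDS);
        // Step 3 (suffix stripping / swear-word filtering) is omitted in this sketch.
        return words;
    }

    public static void main(String[] args) {
        // prints [kannada, language, influenced]
        System.out.println(process("Kannada is a language and Kannada is influenced"));
    }
}
```

Removing duplicates first means the stop-word check runs once per distinct word instead of once per occurrence.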
How to use the exude library
Environment and dependent jar files
1. JDK 1.6 or higher
2. Apache Tika jar (used to parse files for data extraction)


Sample code
Sample Text Data

    String inputData = "Kannada is a Southern Dravidian language, and according to Dravidian scholar Sanford Steever, its history can be conventionally divided into three periods; Old Kannada (halegannada) from 450–1200 A.D., Middle Kannada (Nadugannada) from 1200–1700 A.D., and Modern Kannada from 1700 to the present.[20] Kannada is influenced to an appreciable extent by Sanskrit. Influences of other languages such as Prakrit and Pali can also be found in Kannada language.";
    String output = ExudeData.getInstance().filterStoppings(inputData);

Output:

extent southern influenced divided according halegannada kannada language three 450 found modern influences periods pali steever a middle d languages old nadugannada dravidian sanford history scholar appreciable 1700 1200 conventionally sanskrit prakrit present 20 

Sample File Data

    // inputData holds the path of the file to filter instead of raw text
    String inputData = "any file path";
    String output = ExudeData.getInstance().filterStoppings(inputData);
    System.out.println("output : " + output);

Sample Link Data

    // inputData holds the link whose content should be filtered
    String inputData = "https://en.wikipedia.org/wiki/Rama";
    String output = ExudeData.getInstance().filterStoppings(inputData);
    System.out.println("output : " + output);

Get swear words from data/file/link

    String inputData = "enter text with bad words";
    String output = ExudeData.getInstance().getSwearWords(inputData);

Library source code on GitHub
