Text Classification
Text classification with LibLinear
-
Create training instances from .arff file
private static Instances getDataFromFile(String path) throws Exception{
DataSource source = new DataSource(path); Instances data = source.getDataSet(); if (data.classIndex() == -1){ data.setClassIndex(data.numAttributes()-1); //last attribute as class index } return data;
}
Instances trainingData = getDataFromFile(pathToArffFile);
-
Use StringToWordVector to transform your string attributes to number representation:
*Important features of this filter:
-
tf-idf representation
-
stemming
-
lowercase wrods
-
stopwords
-
n-gram representation*
StringToWordVector() filter = new StringToWordVector();
filter.setWordsToKeep(1000000); if(useIdf){ filter.setIDFTransform(true); } filter.setTFTransform(true); filter.setLowerCaseTokens(true); filter.setOutputWordCounts(true); filter.setMinTermFreq(minTermFreq); filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL,StringToWordVector.TAGS_FILTER)); NGramTokenizer t = new NGramTokenizer(); t.setNGramMaxSize(maxGrams); t.setNGramMinSize(minGrams);
filter.setTokenizer(t);
WordsFromFile stopwords = new WordsFromFile(); stopwords.setStopwords(new File(“data/stopwords/stopwords.txt”)); filter.setStopwordsHandler(stopwords); if (useStemmer){ Stemmer s = new /Iterated/LovinsStemmer(); filter.setStemmer(s); } filter.setInputFormat(trainingData);
-
-
Apply the filter to trainingData:
trainingData = Filter.useFilter(trainingData, filter);
-
Create the LibLinear Classifier
-
SVMType 0 below corresponds to the L2-regularized logistic regression
-
Set
setProbabilityEstimates(true)
to print the output probalitiesClassifier cls = null; LibLINEAR liblinear = new LibLINEAR(); liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE)); liblinear.setProbabilityEstimates(true); // liblinear.setBias(1); // default value cls = liblinear; cls.buildClassifier(trainingData);
-
-
Save model
System.out.println("Saving the model..."); ObjectOutputStream oos; oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model")); oos.writeObject(cls); oos.flush(); oos.close();
-
Create testing instances from
.arff
fileInstances trainingData = getDataFromFile(pathToArffFile);
-
Load classifier
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+“mymodel.model”);
-
Use the same StringToWordVector filter as above or create a new one for testingData, but remember to use the trainingData for this command:
filter.setInputFormat(trainingData);
*This will make training and testing instances compatible.
Alternatively you could use InputMappedClassifier
*
-
Apply the filter to testingData:
testingData = Filter.useFilter(testingData, filter);
-
Classify!
1.Get the class value for every instance in the testing set
for (int j = 0; j < testingData.numInstances(); j++) { double res = myCls.classifyInstance(testingData.get(j)); }
res
is a double value that corresponds to the nominal class that is defined in .arff
file. To get the nominal class use : testintData.classAttribute().value((int)res)
2.Get the probability distribution for every instance
for (int j = 0; j < testingData.numInstances(); j++) {
double[] dist = first.distributionForInstance(testInstances.get(j));
}
dist
is a double array that contains the probabilities for every class defined in .arff
file
Note. Classifier should support probability distributions and enable them with: myClassifier.setProbabilityEstimates(true);