Classification
Using Support Vector Machines
Support vector machines are a family of algorithms that attempt to pass a (possibly high-dimensional) hyperplane between two labelled sets of points, such that the distance of the points from the plane is optimal in some sense. SVMs can be used for classification or regression (corresponding to sklearn.svm.SVC and sklearn.svm.SVR, respectively).
Example:
Suppose we work in a 2D space. First, we create some data:
import numpy as np
Now we create x and y:
x0, x1 = np.random.randn(10, 2), np.random.randn(10, 2) + (1, 1)
x = np.vstack((x0, x1))
y = [0] * 10 + [1] * 10
Note that x is composed of two Gaussians: one centered around (0, 0), and one centered around (1, 1).
To build a classifier, we can use:
from sklearn import svm
svm.SVC(kernel='linear').fit(x, y)
Let’s check the prediction for (0, 0):
>>> svm.SVC(kernel='linear').fit(x, y).predict([[0, 0]])
array([0])
The prediction is that the class is 0.
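Since the kernel is linear, the fitted hyperplane can be inspected directly. A minimal sketch using standard SVC attributes (clf is just a name for the fitted model):
clf = svm.SVC(kernel='linear').fit(x, y)
# For a linear kernel the decision boundary is w . x + b = 0
print(clf.coef_)             # w, shape (1, 2)
print(clf.intercept_)        # b
print(clf.support_vectors_)  # the training points that define the margin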
For regression, we can similarly do:
svm.SVR(kernel='linear').fit(x, y)
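Note that y above holds discrete class labels; for a genuine regression the targets should be continuous. A minimal sketch with a made-up continuous target:
y_cont = x[:, 0] + x[:, 1]  # hypothetical continuous target derived from x
reg = svm.SVR(kernel='linear').fit(x, y_cont)
reg.predict([[0, 0]])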
RandomForestClassifier
A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
A simple usage example:
Import:
from sklearn.ensemble import RandomForestClassifier
Define train data and target data:
train = [[1,2,3],[2,5,1],[2,1,7]]
target = [0,1,0]
The values in target represent the labels you want to predict.
Initialize a RandomForestClassifier object and fit it to the training data:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(train, target)
Predict:
test = [2, 2, 3]
# predict expects a 2D array: one row per sample
predicted = rf.predict([test])
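Beyond the hard label, the forest can also report per-class probabilities (the fraction of trees voting for each class). A short sketch using the rf fitted above:
print(rf.predict_proba([test]))  # class probabilities, shape (1, 2)
print(rf.feature_importances_)   # relative importance of each of the 3 features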
Analyzing Classification Reports
Build a text report showing the main classification metrics, including precision, recall, f1-score (the harmonic mean of precision and recall) and support (the number of occurrences of each class in y_true).
Example from the sklearn docs:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
Output:

             precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

 avg / total       0.70      0.60      0.61         5
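The same numbers are available programmatically through precision_recall_fscore_support, which is handy when the metrics feed further computation. A minimal sketch with the arrays above:
from sklearn.metrics import precision_recall_fscore_support
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
print(precision)  # per-class precision: array([0.5, 0. , 1. ])
print(support)    # occurrences of each class in y_true: array([1, 1, 3])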
GradientBoostingClassifier
Gradient Boosting for classification. The Gradient Boosting Classifier is an additive ensemble of a base model whose error is corrected in successive iterations (or stages) by the addition of regression trees that fit the residuals (the error of the previous stage).
Import:
from sklearn.ensemble import GradientBoostingClassifier
Create some toy classification data:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target
Let us split this data into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=0)
Instantiate a GradientBoostingClassifier model using the default parameters.
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
Let us score it on the test set
# We are using the default classification accuracy score
>>> gbc.score(X_test, y_test)
1.0
By default, 100 estimators are built:
>>> gbc.n_estimators
100
This can be controlled by setting n_estimators to a different value at initialization time.
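To see how the number of stages affects test accuracy, staged_predict yields the ensemble's predictions after each boosting iteration. A short sketch using the gbc fitted above:
import numpy as np
# Test accuracy after each of the 100 boosting stages
staged_acc = [np.mean(stage_pred == y_test)
              for stage_pred in gbc.staged_predict(X_test)]
print(staged_acc[0], staged_acc[-1])  # accuracy after the first and last stage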
A Decision Tree
A decision tree is a classifier that uses a sequence of explicit rules (like a > 7) which can be easily understood.
The example below trains a decision tree classifier using three feature vectors of length 3, and then predicts the result for a previously unseen fourth feature vector, the so-called test vector.
from sklearn.tree import DecisionTreeClassifier
# Define training and target set for the classifier
train = [[1,2,3],[2,5,1],[2,1,7]]
target = [10,20,30]
# Initialize the classifier.
# random_state=0 fixes the random seed, so results are reproducible.
dectree = DecisionTreeClassifier(random_state=0)
dectree.fit(train, target)
# Test classifier with other, unknown feature vector
test = [2, 2, 3]
# predict expects a 2D array: one row per sample
predicted = dectree.predict([test])
print(predicted)
The trained tree can be visualized using:
import pydot
from io import StringIO
from sklearn import tree

dotfile = StringIO()
tree.export_graphviz(dectree, out_file=dotfile)
(graph,) = pydot.graph_from_dot_data(dotfile.getvalue())
graph.write_png("dtree.png")
graph.write_pdf("dtree.pdf")
Classification using Logistic Regression
In a Logistic Regression classifier, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. It is implemented in the linear_model module:
from sklearn.linear_model import LogisticRegression
The sklearn LR implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization. For example, let us consider a binary classification on a sample sklearn dataset:
from sklearn.datasets import make_hastie_10_2
X,y = make_hastie_10_2(n_samples=1000)
where X is an n_samples x 10 array and y contains the target labels, -1 or +1.
Use train-test split to divide the input data into training and test sets (70%-30%)
from sklearn.model_selection import train_test_split
#sklearn.cross_validation in older scikit versions
data_train, data_test, labels_train, labels_test = train_test_split(X,y, test_size=0.3)
Using the LR Classifier is similar to other examples
# Initialize Classifier.
LRC = LogisticRegression()
LRC.fit(data_train, labels_train)
# Test classifier with the test data
predicted = LRC.predict(data_test)
Use a confusion matrix to visualise the results:
from sklearn.metrics import confusion_matrix
confusion_matrix(labels_test, predicted)  # argument order is (y_true, y_pred)
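Overall accuracy can be read off the confusion matrix as the trace (correct predictions on the diagonal) divided by the total count. A small sketch:
import numpy as np
cm = confusion_matrix(labels_test, predicted)
accuracy = np.trace(cm) / cm.sum()  # fraction of correctly classified samples
print(accuracy)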