"How to calculate accuracy of a model"

"So after googling for the past 2 hours, you have somehow fitted (or, in layman's terms, trained) your model successfully, and now you would like to know the accuracy score of your fit. Here are a couple of ways to do that in Python."

(How I fit my models every day. Source: Twitter/thesmartjokes)

Introduction

Before jumping right into the implementation, let's first see how accuracy is calculated. As per the definition given in the Udacity course Intro to Machine Learning: "Accuracy is defined as the number of test points that are classified correctly divided by the total number of test points."
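
To make the definition concrete, here is a minimal sketch on five made-up test points. The names `true_labels` and `predicted_labels` and their values are assumptions chosen purely for illustration; they are not part of the toy dataset used in this article.

```python
# Hypothetical true labels and model predictions for five test points
true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 1]

# Count the test points where the prediction matches the true label
correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)

# Accuracy = correctly classified test points / total test points
accuracy = correct / len(true_labels)
print(accuracy)  # 3 of 5 match, so this prints 0.6
```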

Now that we understand how to calculate accuracy, we can create a toy dataset and work on it.

Implementation in Python


import random

random.seed(18)

# total number of data points
n_points = 100

# data points (features): pairs of random values between 0 and 1
X = [(random.random(), random.random()) for _ in range(n_points)]

# data labels: a random 0 or 1 for each data point
y = [round(random.random()) for _ in range(n_points)]

# split into train/test sets
split = int(0.75*n_points) # Train:Test::0.75:0.25
X_train = X[0:split] # features_train
X_test  = X[split:] # features_test
y_train = y[0:split] # labels_train
y_test  = y[split:] # labels_test

# import Gaussian Naive Bayes (GaussianNB)
from sklearn.naive_bayes import GaussianNB

# define classifier
clf = GaussianNB()

# fit the classifier on the training features and their labels
clf.fit(X_train, y_train)


# predict labels for the test dataset
pred = clf.predict(X_test)

The above code in layman's terms:

1. Generate 100 data points, which are our features (each value between 0 and 1) : X
2. Generate a label for each data point, also 100 in number (each value 0 or 1) : y
3. Split the data into 75% for training and 25% for testing : split
4. 75% of the features (X) for training : X_train
5. 25% of the features (X) for testing : X_test
6. 75% of the labels (y) for training : y_train
7. 25% of the labels (y) for testing : y_test
8. Use Gaussian Naive Bayes as our classifier and fit it on the training set : clf
9. Predict the labels for the features of our testing set : pred
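
Steps 3 to 7 can also be delegated to scikit-learn. The following is only a sketch and not the approach used in this article: it assumes `train_test_split` from `sklearn.model_selection`, called with `shuffle=False` so that it mirrors the manual slicing (first 75 points for training, last 25 for testing). By default the function shuffles the data, which would give a different split.

```python
import random
from sklearn.model_selection import train_test_split

# Recreate the same toy dataset as in the article
random.seed(18)
n_points = 100
X = [(random.random(), random.random()) for _ in range(n_points)]
y = [round(random.random()) for _ in range(n_points)]

# shuffle=False keeps the original order, like X[0:split] and X[split:]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)
print(len(X_train), len(X_test))  # 75 25
```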

Note: In machine learning, by convention we use a lowercase 'y' rather than a capital 'Y' for the labels.
We used Gaussian Naive Bayes as our classifier above. Keep in mind that the length of the testing features (`X_test`) is equal to the length of the testing labels (`y_test`), so I have taken the liberty of using either one as the total number of test points.

len(y_test) == len(X_test)
>>> True

Now we will find the accuracy. We can do this in 4 ways.

1. Without using any external library

In this approach, we will first count the total number of correctly predicted labels and then divide that count by the total number of labels (or test data points).

count = len(['matched' for idx, label in enumerate(y_test) if label == pred[idx]])
print((float(count) / len(y_test)))
>>> 0.68

2. Using numpy

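We will use the numpy sum() function to count the correctly predicted labels instead of iterating over them ourselves.

import numpy as np
# compare element-wise against an array; float() on the divisor avoids integer division on Python 2
print(np.sum(pred == np.array(y_test)) / float(len(y_test)))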
>>> 0.68
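
Since accuracy is the fraction of matching labels, the sum-and-divide can arguably be written as a mean over the boolean comparison. A minimal sketch, assuming two made-up label arrays (`y_true` and `y_pred` are hypothetical names and values, not variables from this article):

```python
import numpy as np

# Hypothetical true labels and predictions for five test points
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# The comparison gives a boolean array; its mean is the fraction of True values
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 3 of 5 match, so this prints 0.6
```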

3. Using sklearn.metrics

We will use `sklearn.metrics.accuracy_score`, which is recommended over the previous two approaches. ('Cause admit it, you are lazy)

from sklearn.metrics import accuracy_score
# the documented argument order is accuracy_score(y_true, y_pred)
print(accuracy_score(y_test, pred))
>>> 0.68
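
If you want the number of correct predictions rather than the fraction, `accuracy_score` also accepts a `normalize` parameter. A small sketch on made-up labels (`y_true` and `y_pred` are hypothetical, not the article's variables); whether the count comes back as an int or a float may depend on your scikit-learn version:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for five test points
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

# Default (normalize=True): fraction of correctly classified points
fraction_correct = accuracy_score(y_true, y_pred)

# normalize=False: number of correctly classified points
n_correct = accuracy_score(y_true, y_pred, normalize=False)

print(fraction_correct, n_correct)
```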

4. Using the score method of the classifier

Most scikit-learn classifiers provide this method. It takes the testing features and their labels as parameters.

print(clf.score(X_test, y_test))
>>> 0.68
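
For scikit-learn classifiers, `score` returns the mean accuracy on the given data, so it should agree with calling `accuracy_score` on the predictions yourself. The sketch below checks this on a toy dataset built the same way as in this article; the variable names `via_score` and `via_metric` are only illustrative.

```python
import random
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Toy dataset: random features and random 0/1 labels
random.seed(18)
n_points = 100
X = [(random.random(), random.random()) for _ in range(n_points)]
y = [round(random.random()) for _ in range(n_points)]

# 75% for training, 25% for testing
split = int(0.75 * n_points)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

clf = GaussianNB()
clf.fit(X_train, y_train)

# score() predicts on X_test internally and compares with y_test
via_score = clf.score(X_test, y_test)
via_metric = accuracy_score(y_test, clf.predict(X_test))
print(via_score == via_metric)
```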

Conclusion

So in this article, we learned what accuracy is and how to calculate it. We also learned how to implement it in Python.

References

[1] : https://twitter.com/thesmartjokes/status/843730749404545024
[2] : https://in.udacity.com/course/intro-to-machine-learning--ud120-india
[3] : http://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes
[4] : http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
[5] : https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
[6] : https://mahata.github.io/machine%20learning/2014/12/31/sklearn-accuracy_score/