How to evaluate the performance of your ML/AI model | by Sarah A. Metavalli

An accurate evaluation is the only way to improve performance

Whether you are picking up a new technology, a new language, or a new cuisine, learning by doing is the best way to learn anything. Once you know the basics of a field or application, you can build on that knowledge by practicing. Building models for various applications is the best way to solidify your knowledge of machine learning and artificial intelligence.

Although both fields (or actually subfields, since they overlap) have applications in a wide variety of contexts, the steps in learning how to model are more or less the same, regardless of the target application area.

AI language models such as ChatGPT and Bard are gaining popularity and interest from both tech novices and general audiences because they can be very useful in our daily lives.

Now that more models are being released and introduced, one may ask, “What makes a good AI/ML model, and how can we evaluate the performance of one?”

That is what we are going to cover in this article. We assume that you have already built an AI or ML model and now want to evaluate and, if necessary, improve its performance. Regardless of the type of model you have and your end application, there are steps you can take to do both.

To help us follow the concepts, let’s use the Wine dataset from sklearn (1), apply a support vector classifier (SVC), and test its metrics.

So, let’s jump right in…

First, we import the libraries we’ll be using (don’t worry about what each of them does now, we’ll get to that!).

import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import matplotlib.pyplot as plt

Now, we read our dataset, apply the classifier and evaluate it.

wine_data = datasets.load_wine()
X = wine_data.data
y = wine_data.target

Depending on your stage in the learning process, you may need access to a large amount of data for training, testing, and evaluation. You also cannot use the same data to train and test your model, as that would prevent you from realistically evaluating its performance.

To overcome that challenge, split your data into three smaller random sets and use them for training, testing, and validation.

A good rule of thumb is a 60/20/20 split: use 60% of the data for training, 20% for validation, and 20% for testing. Shuffle your data before partitioning so that each set is representative of the whole.

I know it may sound complicated, but luckily, scikit-learn comes to the rescue with a function that does the splitting for you, train_test_split().

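So, we can take our dataset and split it as follows. Note that train_test_split() returns only two partitions per call; with train_size=0.60 and test_size=0.20, the remaining 20% of the data is left out of both: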

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20, train_size=0.60, random_state=1, stratify=y)
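Since the call above yields only a training and a test set, one possible way to obtain all three partitions of the 60/20/20 rule is to call train_test_split() twice. The following is a minimal sketch, not the article's own method; the helper name three_way_split and its seed argument are hypothetical and introduced only for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split


def three_way_split(X, y, seed=1):
    """Hypothetical helper: roughly 60% train, 20% validation, 20% test."""
    # First hold out 20% of the data as the test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y
    )
    # Then take 25% of the remaining 80% (20% of the total) for validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=seed, stratify=y_rest
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


wine = datasets.load_wine()
X_tr, X_val, X_te, y_tr, y_val, y_te = three_way_split(wine.data, wine.target)
print("train / validation / test sizes:", len(X_tr), len(X_val), len(X_te))
```

The validation set is then available for tuning decisions, while the test set stays untouched until the final evaluation.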

Then use the training part of it as input to the classifier.

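# Scale data: fit the scaler on the training set only, then transform both sets
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
# Apply SVC model
# Note: the model is fit on the unscaled X_train here, so X_train_std and
# X_test_std are computed but not used; pass them to fit() and predict()
# instead to train and evaluate on standardized features.
svc = SVC(kernel="linear", C=10.0, random_state=1)
svc.fit(X_train, Y_train)
# Obtain predictions
Y_pred = svc.predict(X_test)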
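One way to keep the scaling and classification steps tied together is scikit-learn's Pipeline. The sketch below is an alternative arrangement, assuming the same split settings as above; the variable names (scaled_svc, pred_scaled, and so on) are illustrative, and its scores may differ from the ones reported later in this article.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = datasets.load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.20, train_size=0.60,
    random_state=1, stratify=wine.target,
)

# The pipeline fits the scaler on the training data only, then passes the
# standardized features to the classifier.
scaled_svc = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", C=10.0, random_state=1),
)
scaled_svc.fit(X_tr, y_tr)

# predict() and score() apply the same scaling to the test data first.
pred_scaled = scaled_svc.predict(X_te)
test_accuracy = scaled_svc.score(X_te, y_te)
print("Test accuracy with scaling:", round(test_accuracy, 3))
```

Because the scaler lives inside the pipeline, it is not possible to forget to transform the test data or to pass the wrong array to the classifier.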

At this point, we have some results to “evaluate”.

Before starting the evaluation process, we must ask ourselves one essential question about the model we use: What will make this model good?

The answer to this question depends on the model and how you plan to use it. That being said, there are standard evaluation metrics that data scientists use when they want to test the performance of an AI/ML model, including:

Accuracy is the percentage of correct predictions made by the model out of all predictions. In other words, when I run the model, how many of its predictions are correct? This article goes into depth about testing the accuracy of a model.
Precision is the percentage of true positive predictions out of all positive predictions made by the model. Unfortunately, precision and accuracy are often confused. One informal way to picture the difference is to think of accuracy as how close the predictions are to the true values, while precision is how close the correct predictions are to each other. For a classifier, accuracy counts every correct prediction, while precision only asks how many of the predicted positives are truly positive. Both are important for evaluating model performance.
Recall is the ratio of true positive predictions to all actual positive examples in the dataset. The purpose of recall is to find all of the relevant examples within the dataset. In practice, increasing recall often comes at the cost of precision.
F1 score is the harmonic mean of precision and recall, combining both into a single balanced measure of model performance. This video from CodeBasics discusses the relationship between precision, recall, and F1 scores, and how to find the optimal balance of those evaluation metrics.

Video by CodeBasics
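To make the four definitions concrete, here is a small worked example that counts true and false positives and negatives by hand and then compares the results with scikit-learn. The labels are made up for illustration and are not taken from the Wine dataset.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy binary labels (made up for illustration): 1 = positive, 0 = negative.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)                   # correct / all predictions
precision = tp / (tp + fp)                          # correct positives / predicted positives
recall = tp / (tp + fn)                             # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"by hand: acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
print(
    "sklearn: acc=%.3f prec=%.3f rec=%.3f f1=%.3f"
    % (
        accuracy_score(y_true_toy, y_pred_toy),
        precision_score(y_true_toy, y_pred_toy),
        recall_score(y_true_toy, y_pred_toy),
        f1_score(y_true_toy, y_pred_toy),
    )
)
```

Here the model finds only two of the four positives (recall 0.5), but two of its three positive predictions are right (precision about 0.667), which shows how the two metrics can diverge on the same predictions.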

Let us now calculate these metrics for the predicted data. We will start by displaying the confusion matrix, which tabulates the actual classes against the predicted classes.

conf_matrix = confusion_matrix(y_true=Y_test, y_pred=Y_pred)
# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(conf_matrix, cmap=plt.cm.Oranges, alpha=0.3)
# Write each count into its cell (rows are actual classes, columns are predicted classes)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va="center", ha="center", size="xx-large")
plt.xlabel('Predicted Values', fontsize=18)
plt.ylabel('Actual Values', fontsize=18)
plt.show()

In the resulting confusion matrix for our dataset, we can see that in some cases the actual value was “1” while the predicted value was “0”, which means that the classifier is not 100% accurate.

We can calculate the precision, accuracy, recall and f1 score of this classifier using this code.

print('Precision: %.3f' % precision_score(Y_test, Y_pred, average="micro"))
print('Recall: %.3f' % recall_score(Y_test, Y_pred, average="micro"))
print('Accuracy: %.3f' % accuracy_score(Y_test, Y_pred))
print('F1 Score: %.3f' % f1_score(Y_test, Y_pred, average="micro"))

For this particular example, the results for those are:

Precision = 0.889
Recall = 0.889
Accuracy = 0.889
F1 score = 0.889
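The four numbers are identical because of the average="micro" setting: when every sample has exactly one label, micro-averaged precision, recall, and F1 all reduce to accuracy. The sketch below illustrates this on a made-up three-class example and contrasts it with average="macro", which averages the per-class scores instead.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy three-class labels (made up for illustration).
y_true_mc = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred_mc = [0, 0, 0, 1, 1, 2, 2, 2]

acc = accuracy_score(y_true_mc, y_pred_mc)
averaged = {}
for avg in ("micro", "macro"):
    averaged[avg] = {
        "precision": precision_score(y_true_mc, y_pred_mc, average=avg),
        "recall": recall_score(y_true_mc, y_pred_mc, average=avg),
        "f1": f1_score(y_true_mc, y_pred_mc, average=avg),
    }
    scores = ", ".join(f"{k}={v:.3f}" for k, v in averaged[avg].items())
    print(f"{avg}: {scores} (accuracy={acc:.3f})")
```

If per-class behavior matters, for example when one class is rare, macro averaging or the per-class scores (average=None) may be more informative than the micro average.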

Although you can use a variety of approaches to evaluate your models, some evaluation metrics suit certain model types better than others. For example, if the model you are evaluating is (or involves) a regression model, you can also use:

– Mean Squared Error (MSE) is the average of the squared differences between the predicted and actual values.

– Mean Absolute Error (MAE) is the average of the absolute difference between the predicted and actual values.

Those two metrics are closely related, but MAE is simpler (at least mathematically) than MSE. However, MAE does not penalize large errors as strongly as MSE, which emphasizes them by squaring each error.
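The difference in how the two metrics treat large errors is easy to see on a small example. The values below are made up for illustration; the manual formulas are checked against scikit-learn's mean_squared_error and mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up regression targets and two sets of predictions.
y_actual = np.array([3.0, 5.0, 2.0, 7.0])
y_close = np.array([2.0, 5.0, 4.0, 7.0])     # small errors only
y_outlier = np.array([2.0, 5.0, 4.0, 17.0])  # same, plus one large error

regression_scores = {}
for label, y_hat in (("small errors", y_close), ("one large error", y_outlier)):
    mse = np.mean((y_hat - y_actual) ** 2)   # average squared difference
    mae = np.mean(np.abs(y_hat - y_actual))  # average absolute difference
    # The library functions should agree with the manual formulas.
    assert np.isclose(mse, mean_squared_error(y_actual, y_hat))
    assert np.isclose(mae, mean_absolute_error(y_actual, y_hat))
    regression_scores[label] = (mse, mae)
    print(f"{label}: MSE={mse:.2f}, MAE={mae:.2f}")
```

A single error of 10 raises MAE from 0.75 to 3.25, but raises MSE from 1.25 to 26.25, which is why MSE is the more sensitive choice when large misses are especially costly.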

Before discussing hyperparameters, let us first differentiate between hyperparameters and parameters. Parameters are the values a model learns from the training data, and they define how the model solves a problem. In contrast, hyperparameters are settings chosen before training that control the learning process; they are tuned to validate and optimize model performance. Hyperparameters are often chosen by data scientists (or, in some cases, clients) to control the learning process of the model and, therefore, its performance.

There are a variety of hyperparameters that you can tune for your model. Some are generic and apply to many models, such as:

Learning rate: This hyperparameter controls how much the model’s parameters change in response to the error each time they are updated. Choosing the learning rate is a trade-off against training time. If the learning rate is too low, training can be slow. Conversely, if it is too high, training will be faster, but the performance of the model may suffer.
Batch size: The number of training examples processed in each update significantly affects the training time and interacts with the learning rate. Finding a good batch size is a skill that develops as you build more models and gain experience.
Number of epochs: An epoch is one complete pass through the training dataset. The number of epochs to use varies from model to model. In theory, more epochs lead to fewer errors during validation, although too many epochs can cause the model to overfit.
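To show where these three hyperparameters enter a training loop, here is a minimal sketch of mini-batch gradient descent for a one-feature linear model, written with NumPy only. The function train_linear_model, its default values, and the synthetic data are all assumptions made for illustration; real projects would normally rely on a library's optimizer.

```python
import numpy as np


def train_linear_model(X, y, learning_rate=0.1, batch_size=16, n_epochs=50, seed=0):
    """Hypothetical helper: fit y ~ X @ w + b with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):                  # one epoch = one full pass over the data
        order = rng.permutation(n_samples)     # reshuffle the samples every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]  # indices of one mini-batch
            error = X[idx] @ w + b - y[idx]
            # Gradients of the mean squared error on this batch.
            grad_w = 2 * X[idx].T @ error / len(idx)
            grad_b = 2 * error.mean()
            # The learning rate scales how far each update moves the parameters.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic data (made up for illustration): y = 3x + 1 plus a little noise.
data_rng = np.random.default_rng(42)
X_demo = data_rng.uniform(-1, 1, size=(200, 1))
y_demo = 3 * X_demo[:, 0] + 1 + data_rng.normal(0, 0.05, size=200)

w_fit, b_fit = train_linear_model(X_demo, y_demo)
print("learned weight and bias:", w_fit, b_fit)
```

Rerunning the call with a much smaller learning_rate or fewer n_epochs should leave the learned weight and bias further from 3 and 1, which is one way to get a feel for the trade-offs described above.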

In addition to the above hyperparameters, there are also model-specific hyperparameters, such as the regularization strength or the number of hidden layers in a neural network. This 15-minute video by APMonitor explores the various hyperparameters and their differences.

Video by APMonitor

Validating an AI/ML model is not a linear process, but more of an iterative one. You often go through data partitioning, hyperparameter tuning, analysis, and validating results more than once. How many times you repeat that process depends on the analysis of the results. For some models, you may only need to do this once; for others, you may need to do it a couple of times.

If you need to repeat the process, you use the insights from previous evaluations to improve the model architecture, training process, or hyperparameter settings until you are satisfied with the model’s performance.
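Part of this loop can be automated. As one possible approach, scikit-learn's GridSearchCV tries each candidate hyperparameter value with cross-validation on the training data and keeps the best one. The sketch below tunes the regularization strength C of a linear SVC on the Wine data; the candidate values in param_grid and the variable names are arbitrary choices for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = datasets.load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.20, random_state=1, stratify=wine.target
)

# make_pipeline names each step after its class in lowercase, so the SVC's C
# parameter is addressed as "svc__C".
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", random_state=1))
param_grid = {"svc__C": [0.01, 0.1, 1.0, 10.0]}  # arbitrary candidate values

# 5-fold cross-validation on the training data plays the role of the
# validation set; the test set is only used once, at the end.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best C:", search.best_params_["svc__C"])
print("mean cross-validated accuracy:", round(search.best_score_, 3))
final_test_accuracy = search.score(X_te, y_te)
print("test accuracy of the refit model:", round(final_test_accuracy, 3))
```

Keeping the test set out of the search is what makes the final score a fair estimate; if the results are unsatisfying, the grid, the model, or the features can be revised and the search repeated.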

When you start building your own ML and AI models, you’ll quickly realize that choosing and implementing a model is the easy part of the workflow. Testing and evaluation make up most of the development process. Evaluating AI/ML models is iterative and often time-consuming, and it requires careful analysis, experimentation, and fine-tuning to achieve the desired performance.

Fortunately, the more model-building experience you have, the more systematic your process of evaluating model performance becomes. It is a worthwhile skill, because evaluating your model matters for several reasons:

Evaluating our models allows us to measure their performance objectively, which helps us understand their strengths and weaknesses and provides insight into their predictive or decision-making capabilities.
If different models exist that can solve similar problems, then evaluating them helps us to compare their performance and choose the best one for our application.
Evaluation provides insight into the weaknesses of the model, allowing for improvement through analysis of errors and areas where the model performs poorly.
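As an illustration of the second point, two candidate models can be compared on the same data with the same evaluation procedure. The sketch below uses cross_val_score on the Wine dataset; the choice of logistic regression as the second candidate, and all names used here, are assumptions made only for this example.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = datasets.load_wine()

# Two candidate models, each with the same preprocessing step.
candidates = {
    "linear SVC": make_pipeline(
        StandardScaler(), SVC(kernel="linear", C=10.0, random_state=1)
    ),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

cv_means = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy, computed the same way for each model.
    scores = cross_val_score(model, wine.data, wine.target, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Using identical folds and the same metric for every candidate keeps the comparison fair; small differences in the means should be read alongside the spread across folds.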

So, be patient and keep building models; it gets better and more efficient with every model you build. Don’t let the details of the process discourage you. It may sound like a complicated process, but once you understand the steps, it will become second nature to you.

(1) Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0)

How to evaluate the performance of your ML/AI model | by Sarah A. Metavalli | May, 2023
