Authored by
Praveen Srivatsa, Director Asthrasoft Consulting
Machine Learning is the science underlying much of modern Artificial Intelligence. Machine Learning uses algorithms to find patterns in data and turn those patterns into predictions or suggestions. While working with these algorithms is the domain of a data scientist, it is important for business and management teams to understand the basics of how these algorithms work.
There are many types of algorithms in use. Broadly, however, Machine Learning algorithms can be categorized either by input as “Supervised”, “Unsupervised” and “Reinforcement”, or by output as “Regression”, “Classification”, “Clustering” and “Association”. Across all these algorithms, the quality of the outcomes is assessed using “Specificity”, “Sensitivity” and “Precision”. Let's look at each of these to understand machine learning algorithms better.
Algorithm types by input
Supervised algorithms are ones that learn from labelled examples, where the data scientist supplies both the input data and the correct answer for each item. A good example is object recognition, where we train the algorithm with lots of photos labelled as dogs or cats and then ask it to recognize whether a new photo has a dog or a cat. Facial recognition, object identification and text detection are examples of supervised learning.
Unsupervised learning algorithms are the ones where the data carries no known categories or labels, and the algorithm has to find structure on its own. Grouping similar photos together, organizing people by height and grouping music by listener preferences are all examples of unsupervised learning.
Reinforcement learning is a continuous process where neither the input nor the output is known in advance, but each action the algorithm takes is mapped to a reward, and it learns to prefer the actions that earn more reward. Getting an algorithm to understand a game and learn to beat its opponent is a great example of reinforcement learning.
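To make the action-and-reward idea concrete, here is a minimal sketch of one simple reinforcement learning strategy, usually called epsilon-greedy. It is an illustration rather than how any particular game-playing system works. The function name `run_bandit`, the three actions and their reward values are all hypothetical.

```python
import random

def run_bandit(rewards, steps=200, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays best (epsilon-greedy).

    rewards: hypothetical fixed reward per action, hidden from the learner.
    Returns the learner's estimated value of each action.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)
    counts = [0] * len(rewards)

    for step in range(steps):
        if step < len(rewards):
            action = step  # try every action once before anything else
        elif rng.random() < epsilon:
            action = rng.randrange(len(rewards))  # explore: pick at random
        else:
            # exploit: pick the action with the best estimate so far
            action = max(range(len(rewards)), key=lambda a: estimates[a])

        reward = rewards[action]
        counts[action] += 1
        # Update the running average reward observed for this action
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates

estimates = run_bandit(rewards=[0.2, 1.0, 0.5])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("Estimated values:", estimates)
print("Preferred action:", best_action)
```

The learner is never told which action is correct. It only sees the reward that follows each action, and gradually settles on the one that pays best.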
Algorithm types by output
Regression is a type of supervised learning algorithm that returns a numeric outcome. The predicted value of a stock, estimated sales earnings by product and box office collections of a new movie are all examples of regression problems, where the algorithm takes multiple parameters into account and predicts a number. Linear regression is the best-known example. Logistic regression, despite its name, predicts the probability of a class and is normally used for classification.
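As a rough illustration of what a regression algorithm does, the sketch below fits a straight line to a handful of points using ordinary least squares and then predicts a new value. The function name `fit_line` and the advertising and sales figures are made up for the example; real models usually take many input parameters rather than one.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: advertising spend and resulting sales
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 5.0 + intercept  # numeric prediction for a new spend
print(slope, intercept, predicted_sales)
```

The output of the model is a number, which is what distinguishes regression from the other output types described here.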
Classification is another type of supervised learning algorithm that returns a class. Identifying a color, facial recognition and object detection are all examples of classification. In this case the algorithm is trained with sample labelled data for each class, which it then uses to recognize the same classes in new data. Some of these algorithms include “random forest” and “support vector machine” (SVM).
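The sketch below shows the train-then-classify pattern in a few lines. It uses k-nearest neighbours (K-NN) rather than a random forest or SVM, purely because K-NN is short enough to write out in full. The function name `knn_classify` and the weight and height figures are hypothetical.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Label `query` with the majority class of its k closest training points.

    training: list of (features, label) pairs, i.e. the labelled sample data.
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled data: (weight in kg, height in cm) -> animal
training = [
    ((4.0, 25.0), "cat"),
    ((5.0, 23.0), "cat"),
    ((3.5, 24.0), "cat"),
    ((20.0, 55.0), "dog"),
    ((25.0, 60.0), "dog"),
    ((30.0, 65.0), "dog"),
]

small_animal = knn_classify(training, (4.5, 24.0))
large_animal = knn_classify(training, (22.0, 58.0))
print(small_animal, large_animal)
```

The output is always one of the classes seen in the labelled data, never a new one.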
Clustering is a type of unsupervised learning algorithm that groups similar input records together and returns the records organized by group. The number of groups and their granularity can be adjusted. Vehicle movement patterns by area, targeted advertising by age and analyzing job openings by skill are all examples of clustering algorithm implementations. This allows businesses to target specific groups while allowing for new groups to evolve automatically. Examples of clustering algorithms include K-Means and hierarchical clustering. K-NN (nearest neighbour), despite the similar name, is a supervised method that labels a new record by looking at its closest labelled neighbours.
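As an illustration of grouping without labels, here is a minimal one-dimensional K-Means sketch that groups customers by age. The function name `kmeans_1d`, the ages and the starting centroids are assumptions made for the example; production implementations choose starting points more carefully and work on many dimensions.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around `centroids`, then move each centroid to its group's mean."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            # Assign each value to the nearest centroid
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Recompute each centroid as the mean of its group (keep it if the group is empty)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical customer ages; no labels are supplied
ages = [18, 21, 22, 40, 42, 45, 67, 70]
centroids, groups = kmeans_1d(ages, centroids=[20.0, 40.0, 60.0])
print(groups)
print(centroids)
```

Changing the number of starting centroids changes how many groups come back, which is the adjustable count and granularity mentioned above.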
Association is another type of unsupervised learning that finds which of the input parameters influence other parameters. For example, home buyers might be furniture buyers too; people who like a type of music might like a type of movie too; and people who play a sport might visit a type of restaurant.
Other algorithms include “Anomaly Detection”, which continuously identifies anomalies or exceptions and is used for fraud detection, surveillance or sentiment analysis. Another is “Decision Making”, which uses reinforcement learning to emphasise decisions on actions in a given environment and is ideally suited for robotics and gaming.
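Association rules are commonly scored with two numbers: support (how often the two items appear together) and confidence (how often the second item appears when the first one does). The sketch below computes both for the home buyer and furniture example. The function name `rule_metrics` and the customer records are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(transactions)
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    support = len(with_both) / total
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical purchase records, one set of items per customer
purchases = [
    {"home", "furniture"},
    {"home", "furniture", "insurance"},
    {"home"},
    {"furniture"},
    {"insurance"},
]

support, confidence = rule_metrics(purchases, "home", "furniture")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Here two of the three home buyers also bought furniture, so the rule holds for about two thirds of the customers it applies to.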
Understanding the Accuracy of algorithms
With machine learning, the objective of a data scientist is to devise “accurate” machine learning models. However, this can be a very misleading term. To understand this, let us take a look at this table, where we are trying to detect a disease based on a diagnostic test which itself is not 100% reliable.
| | Actually Positive (AP) | Actually Negative (AN) |
|---|---|---|
| Positive Prediction (PP) | True Positives (TP) | False Positives (FP) |
| Negative Prediction (NP) | False Negatives (FN) | True Negatives (TN) |
Accuracy and precision are often used interchangeably, but they are different measures. Accuracy is the share of all predictions that are correct, that is (TP + TN) divided by the total number of cases. Precision is the ratio of true positives (TP) to predicted positives (PP). A high value of either can mislead. If the condition is rare (say a disease that occurs in 1% of people), a test that always predicts negative is 99% accurate yet never finds a single case. Similarly, if the dataset is dominated by one category (say, male patients), a high overall accuracy says little about how well the disease is predicted for a female patient.
Accuracy can be better understood by breaking down the outcome measures into “Sensitivity” and “Specificity”. Sensitivity is the ratio of true positives (TP) to actual positives (AP) and measures how well the test or algorithm finds the cases that are really there. Specificity is the ratio of true negatives (TN) to actual negatives (AN) and measures how well the test or algorithm avoids raising false alarms.
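The measures above can be computed directly from the four cells of the table. The sketch below does so for a made-up screening of 1,000 patients, of whom 10 actually have the disease. The function name `diagnostic_metrics` and all the counts are hypothetical, and the function assumes none of the denominators is zero.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute common measures from the four cells of the table."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # all correct / all cases
        "precision": tp / (tp + fp),                  # TP / predicted positives (PP)
        "sensitivity": tp / (tp + fn),                # TP / actual positives (AP)
        "specificity": tn / (tn + fp),                # TN / actual negatives (AN)
    }

# Hypothetical screening: 1,000 patients, 10 of whom have the disease
metrics = diagnostic_metrics(tp=8, fp=50, fn=2, tn=940)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the test is right almost 95% of the time, yet fewer than 14% of its positive predictions are real cases. This is why a single “accuracy” figure is rarely enough.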
A highly sensitive fire alarm can detect a fire very early. However, this typically also means it has low specificity, so many daily activities like cooking set off the alarm and result in a lot of false alarms. An alarm with a high degree of specificity might do a great job of ringing only when an actual fire is detected, but this typically comes with low sensitivity, and it cannot detect the fire early. The most “accurate” systems are ones with the right balance of sensitivity and specificity.
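The trade-off in the fire alarm example can be sketched by moving a single trigger threshold. The smoke readings below are invented, and the function name `sensitivity_specificity` is hypothetical. The point is only that lowering the threshold raises sensitivity at the cost of specificity, and vice versa.

```python
def sensitivity_specificity(readings, threshold):
    """readings: list of (smoke_level, is_fire); the alarm rings when level >= threshold."""
    tp = sum(1 for level, fire in readings if fire and level >= threshold)
    fn = sum(1 for level, fire in readings if fire and level < threshold)
    tn = sum(1 for level, fire in readings if not fire and level < threshold)
    fp = sum(1 for level, fire in readings if not fire and level >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical smoke levels: everyday activity (False) versus real fires (True)
readings = [
    (1, False), (2, False), (3, False), (4, False), (6, False),
    (5, True), (7, True), (8, True), (9, True),
]

sensitive_alarm = sensitivity_specificity(readings, threshold=3)
specific_alarm = sensitivity_specificity(readings, threshold=7)
print("low threshold (sensitivity, specificity):", sensitive_alarm)
print("high threshold (sensitivity, specificity):", specific_alarm)
```

The low threshold catches every fire but also rings for most everyday activity. The high threshold never rings falsely but misses one of the four fires.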
Be wary of bias
Most machine learning algorithms tend to have some kind of bias. This is mostly because the datasets they are trained on over-represent some groups, which skews the precision. As an example, a dataset made up largely of data from the male population will have a low precision of disease detection for the female population, even though the overall precision remains high. Similarly, a dataset with lots of data from adults will be less precise when used on children. In such cases, breaking down the definition of accuracy into specificity and sensitivity and applying them to the relevant groups separately helps in accounting for the bias more accurately.
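One way to apply a measure to each group separately is sketched below for sensitivity. The function name `sensitivity_by_group` and the patient records are hypothetical, with the male group deliberately over-represented to mirror the example above.

```python
def sensitivity_by_group(records):
    """records: list of (group, actually_positive, predicted_positive).

    Returns the sensitivity (TP / actual positives) for each group.
    """
    counts = {}
    for group, actual, predicted in records:
        if not actual:
            continue  # sensitivity only looks at actual positives
        true_positives, positives = counts.get(group, (0, 0))
        counts[group] = (true_positives + (1 if predicted else 0), positives + 1)
    return {group: tp / total for group, (tp, total) in counts.items()}

# Hypothetical patients who all have the disease; True/False is the model's prediction
records = (
    [("male", True, True)] * 9
    + [("male", True, False)] * 1
    + [("female", True, True)] * 1
    + [("female", True, False)] * 1
)

by_group = sensitivity_by_group(records)
detected = sum(1 for _, actual, predicted in records if actual and predicted)
positives = sum(1 for _, actual, predicted in records if actual)
overall = detected / positives
print("overall:", round(overall, 3))
print("by group:", by_group)
```

The overall figure of about 83% hides the fact that the model finds only half of the cases in the smaller group.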