How to solve Machine Learning Challenges on AWS

In Data Science we often ask ourselves which algorithm or method to use to solve a given Data Science or even “AI” problem. Here at PROTOS Technologie we specialise in cloud solutions, so I also want to show some general ways of solving such problems with the help of AWS.

To get a first idea of how we might solve a Data Science problem, we should start by looking at our data. Does the data contain some kind of “label” that indicates the class of each data point, or does the computer need to find the clusters in “unlabelled” data by itself? The first case is a so-called supervised learning problem, the second an unsupervised one. For the sake of simplicity, we will focus on these two types.

Supervised learning problems can be split into two main types: classification and regression. The difference is simply the prediction type: while regression predicts a continuous variable (e.g. a number), classification outputs a class (e.g. a colour). Here are some typical examples for each type:

regression	find the housing price based on location	predict the temperature
classification	detect if a person is healthy or unhealthy	find the species of an animal
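
To make the difference tangible, here is a minimal sketch in plain Python. The numbers are made up purely for illustration, and the two helper functions (`fit_line`, `nearest_label`) are my own toy implementations rather than anything AWS-specific: the first predicts a continuous value, the second outputs a class.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def nearest_label(samples, labels, query):
    """1-nearest-neighbour: return the label of the closest labelled sample."""
    distances = [abs(s - query) for s in samples]
    return labels[distances.index(min(distances))]


# Regression: made-up living areas (m^2) and prices (thousand EUR).
areas = [50, 70, 90, 110]
prices = [150, 210, 270, 330]
slope, intercept = fit_line(areas, prices)
predicted_price = slope * 80 + intercept  # a continuous number

# Classification: made-up body temperatures with a health label.
temperatures = [36.5, 36.8, 38.9, 39.4]
health = ["healthy", "healthy", "unhealthy", "unhealthy"]
predicted_class = nearest_label(temperatures, health, 39.0)  # a class

print(predicted_price, predicted_class)
```

Both functions learn from labelled examples, which is what makes them supervised; only the type of the output differs.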

When we deal with a lot of data in regression or classification, another term plays an important role: factorization machines. Movie recommendation, as known from Netflix, is a good example of a use case. A Factorization Machine is a general supervised learning algorithm that can be used for both classification and regression tasks. It is an extension of a linear model, designed to economically capture interactions between features in high-dimensional, sparse datasets. Sparse means that most entries of the database or matrix are empty: in a rating database, for example, each user has rated only a few of the thousands of movies available on Netflix, so most entries are null.
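
As a rough illustration of how such a model scores one sparse input, here is a minimal sketch of the standard second-order factorization machine prediction in plain Python. The feature layout (two users and three movies, one-hot encoded) and all weights are hand-picked assumptions, not trained values, and `fm_predict` is a hypothetical helper, not the SageMaker API.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine prediction for one sparse sample.

    x  : dict mapping feature index -> value (only non-zero entries)
    w0 : global bias
    w  : list of linear weights, one per feature
    v  : list of latent vectors, one per feature
    """
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = 0.0
    for f in range(len(v[0])):
        # Sum over all pairs i < j of v[i][f] * v[j][f] * x_i * x_j,
        # computed as 0.5 * ((sum)^2 - sum of squares).
        s = sum(v[i][f] * xi for i, xi in x.items())
        s_sq = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Hypothetical layout: features 0-1 are users, features 2-4 are movies.
w0 = 3.0
w = [0.1, -0.2, 0.3, 0.0, -0.1]
v = [[0.5, 1.0], [0.2, 0.1], [0.3, 0.3], [1.0, 0.5], [0.0, 0.4]]

# "User 0 rates movie 1": only two of the five entries are non-zero.
sample = {0: 1.0, 3: 1.0}
print(fm_predict(sample, w0, w, v))
```

Because only the non-zero entries are stored and visited, the cost of a prediction grows with the number of filled entries rather than with the full width of the matrix, which is why the approach suits sparse rating data.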

Simply put, unsupervised problems are encountered when no labels are given and the computer needs to determine the label of a data point by itself, i.e. find the group a data point belongs to. A good example is the K-Means algorithm, in which each data point is assigned to the cluster whose centre is nearest. (K-Nearest Neighbours, despite the similar name, is a supervised algorithm: it assigns a class based on the labelled nearest neighbours.) Another good example of unsupervised learning could be to find anomalies in pictures.
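
To make the idea of grouping unlabelled points concrete, here is a minimal sketch of k-means clustering, a standard unsupervised algorithm, in plain Python. The points are made up, and starting from the first k points is a simplifying assumption on my side (real implementations usually initialise more carefully).

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain k-means: returns (cluster index per point, list of centroids)."""
    # Naive initialisation (assumption): use the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(coord) / len(members) for coord in zip(*members)]
    return assignments, centroids


# Made-up, unlabelled 2D points forming two visible groups.
points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

Note that no labels go in: the cluster indices that come out are the "labels" the computer found by itself.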

Pictures, however, are an interesting type of data structure. An image is a collection of many pixels, each of which consists of several colour values, e.g. RGB. To process this amount of data, we must compress it somehow in order to find features (e.g. a face) in a picture. Convolutional Neural Networks (CNNs) are often used for this: their convolution and pooling layers map a neighbourhood of many pixels to a single value.
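
The idea of mapping many pixels to one can be sketched in plain Python. The following is only a toy illustration on a made-up 5x5 grayscale image with a hand-picked kernel; a real CNN learns its kernels during training and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; each window of pixels becomes one value.

    As in most deep learning libraries, this is technically a cross-correlation
    (the kernel is not flipped), without padding and with stride 1.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep only the largest value of each non-overlapping size x size block."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Made-up image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to a dark-to-bright step from left to right.
kernel = [[-1, 1], [-1, 1]]

features = convolve2d(image, kernel)  # 4x4 feature map
pooled = max_pool(features)           # 2x2 summary
print(pooled)
```

The 25 input pixels end up as 4 numbers, and the position of the edge is still visible in them.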

AWS High-Level Services

Having read my Machine Learning 101 above, we can now have a look at some awesome AWS products and when it is a good idea to use them:

If you want to ship an application as fast as possible, I recommend using one of the high-level services AWS offers:

Amazon Forecast

Amazon Forecast is a fully managed service for time series forecasting. If you provide historical time series data to Amazon Forecast, it can predict future points in the series. Time series forecasts are useful in different domains such as retail, financial planning, supply chain, and healthcare. You can also use Amazon Forecast to forecast operational metrics for inventory management as well as for human resources and resource planning.
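
Once a forecast has been created, reading predictions from it might look like the following minimal sketch. It assumes a boto3 `forecastquery` client and an existing forecast; the ARN and item id shown in the comments are placeholders, the helper name `query_item_forecast` is my own, and the available quantile keys (here `p50`) depend on how the forecast was configured.

```python
def query_item_forecast(forecastquery_client, forecast_arn, item_id, quantile="p50"):
    """Return (timestamp, value) pairs for one item from an existing forecast."""
    response = forecastquery_client.query_forecast(
        ForecastArn=forecast_arn,
        Filters={"item_id": item_id},
    )
    # Predictions are grouped by quantile, e.g. "p10", "p50", "p90".
    points = response["Forecast"]["Predictions"][quantile]
    return [(point["Timestamp"], point["Value"]) for point in points]


# Usage (requires boto3, AWS credentials and an existing forecast):
#   import boto3
#   client = boto3.client("forecastquery")
#   arn = "arn:aws:forecast:..."  # placeholder, use your own forecast ARN
#   print(query_item_forecast(client, arn, "my_item"))
```

Passing the client in as a parameter keeps the function easy to try out with a stub before any AWS resources exist.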

Amazon Rekognition

With Amazon Rekognition you can easily build an image classification application without knowing a lot about the theory behind it: you simply use its high-level API. Amazon Rekognition can identify objects, people, text, scenes, and activities in images and videos. It can also detect inappropriate content and provides highly accurate facial analysis.
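
To give an impression of how little code the high-level API needs, here is a minimal sketch around the `detect_labels` call of the boto3 Rekognition client. The helper name `top_labels`, the thresholds and the file name in the comments are my own assumptions for illustration.

```python
def top_labels(rekognition_client, image_bytes, min_confidence=80.0, max_labels=10):
    """Return (label name, confidence) pairs detected in an image."""
    response = rekognition_client.detect_labels(
        Image={"Bytes": image_bytes},
        MaxLabels=max_labels,
        MinConfidence=min_confidence,
    )
    return [(label["Name"], label["Confidence"]) for label in response["Labels"]]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("rekognition")
#   with open("photo.jpg", "rb") as f:  # hypothetical local file
#       print(top_labels(client, f.read()))
```

The function takes the client as a parameter, so it can be exercised with a stub object before any AWS account is involved.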

Amazon Comprehend

Amazon Comprehend analyses unstructured text data using NLP (Natural Language Processing). The cloud-based service offers various analysis tools that extract key phrases, recognize the sentiment of a text, or pick out names and places by means of entity recognition. Text processing software often needs sentiment analysis; with Amazon Comprehend you can easily find out whether a comment is positive, neutral, negative or mixed.
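
A sentiment check on a single comment could be sketched as follows, using the `detect_sentiment` call of the boto3 Comprehend client. The helper name `comment_sentiment` and the example comment are my own assumptions.

```python
def comment_sentiment(comprehend_client, text, language_code="en"):
    """Return the dominant sentiment of a text and its confidence score."""
    response = comprehend_client.detect_sentiment(Text=text, LanguageCode=language_code)
    label = response["Sentiment"]  # "POSITIVE", "NEGATIVE", "NEUTRAL" or "MIXED"
    # The score keys are capitalised differently: "Positive", "Negative", ...
    score = response["SentimentScore"][label.capitalize()]
    return label, score


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("comprehend")
#   print(comment_sentiment(client, "I really like this product!"))
```

As with the other sketches, the client is injected so the function can be tried with a stub first.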

Low-Level Services (Amazon SageMaker)

If you need more low-level control and are a more experienced data scientist, you will love Amazon SageMaker. It enables you to set up Jupyter notebooks, import sample notebooks, run your training and deploy your models without bothering about the hardware. You pay for the high-performance training machine only for the duration of the training, which saves you money.
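
Once a model is deployed, calling it from an application might look like this minimal sketch. It assumes an already deployed endpoint that accepts CSV input and a boto3 `sagemaker-runtime` client; the endpoint name in the comments and the helper name `predict_csv` are hypothetical, and the exact format of the returned text depends on the model behind the endpoint.

```python
def predict_csv(runtime_client, endpoint_name, rows):
    """Send feature rows as CSV to a deployed endpoint and return the raw reply."""
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body,
    )
    # The response body is a stream; read and decode it to get the predictions.
    return response["Body"].read().decode("utf-8")


# Usage (requires boto3, AWS credentials and a deployed endpoint):
#   import boto3
#   client = boto3.client("sagemaker-runtime")
#   print(predict_csv(client, "my-endpoint", [[5.1, 3.5, 1.4, 0.2]]))
```

The training and deployment themselves are typically done from a notebook; this sketch only covers the last step of using the result.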

Lastly, I want to point out the most interesting and common built-in algorithms in SageMaker. The following table shows common problems and the built-in SageMaker algorithms that are well suited to solving them:

SageMaker built-in algorithm	Common problem to solve
Factorization Machines	building recommendation tools on a lot of sparse data
K-Means	unsupervised clustering problems
Image Classification	image recognition and classification
XGBoost	general supervised regression or classification problems (a lot of competitions are won with this algorithm)
BlazingText	natural language processing