Apache Spark MLlib & Ease-of-Prototyping With Docker

The core operational capabilities, and how to launch a Spark environment instantly with just one command

Ali I. Metin
The Startup


The Perfect Combination for Fast ML Prototyping

Apache Spark is one of the most mature frameworks you can use for a wide range of Machine Learning applications. Through MLlib it lets you develop ML algorithms in the data scientist's favorite prototyping environment: Jupyter Notebooks.

The Main Differences Between MapReduce/HDFS & Apache Spark

MapReduce needs files to be stored in the Hadoop Distributed File System (HDFS); Spark has no such requirement and is much faster than MapReduce because it keeps intermediate results in memory. Spark instead stores data in Resilient Distributed Datasets (RDDs), and the DataFrame API built on top of them closely resembles pandas DataFrames. This is where Spark really shines, because you can do pandas-like data preprocessing and transformation at cluster scale.

The Main Installation Choices for Using Spark

  1. You can launch an AWS Elastic MapReduce (EMR) cluster and use Zeppelin Notebooks, but this is a paid service and you have to deal with creating an AWS account.
  2. Databricks Ecosystem → It has its own file system and DataFrame syntax. The service was started by the original creators of Spark, but it results in vendor lock-in and it is not free either.
  3. You can also install Spark locally on Ubuntu or Windows, but this process is quite cumbersome.

How to Install Spark on Your Local Environment with Docker

Disclaimer: I am not going to dive into the details of installing Docker on your computer here; plenty of tutorials for that are available online. The command embedded below instantly gives you a Jupyter Notebook environment with all the bells & whistles, ready to go! I cannot overstate how valuable it is to be able to start a development environment with a single command, so I highly recommend that anyone who wants to take their Data Science & Engineering skills to the next level learns, at the very least, how to effectively use Docker Hub images.

docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook

There you go! Your environment is ready. To start using Spark, begin a session; every Spark session starts with the same boilerplate.
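A minimal sketch of that boilerplate, assuming the defaults of the pyspark-notebook image (the application name is arbitrary):

from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session inside the notebook container.
spark = SparkSession.builder.appName("prototyping").getOrCreate()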

Spark MLlib & The Types of Algorithms That Are Available

You can do both supervised and unsupervised machine learning operations with Spark.

Supervised learning works with data that is already labelled, whereas unsupervised learning has no labels and is therefore more akin to seeking patterns in chaos.

  • Supervised Learning Algorithms → Classification, Regression, Gradient Boosting. The algorithms in this category use both a features column and a labels column.
  • Unsupervised Learning Algorithms → K-means Clustering, Singular Value Decomposition, Self-Organizing Maps, Nearest-Neighbour Mapping. With this class of algorithms you only have the features column available and no labels.

Spark Streaming Capabilities

I am planning to write an in-depth article about this in the future, but for the time being I just want to make you aware of Spark Streaming.

Spark Streaming is a great tool for developing real-time online algorithms.

Spark Streaming can ingest data from Kafka producers, Apache Flume, AWS Kinesis Firehose or plain TCP sockets.

One of the most crucial advantages of Spark Streaming is that you can apply map, filter, reduce and join operations to unbounded data streams.
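As a small taste, here is a minimal Structured Streaming sketch, assuming something (for example nc -lk 9999) is writing lines of text to a local TCP socket; it keeps a running word count over the unbounded stream and prints it to the console.

from pyspark.sql import functions as F

# Read an unbounded stream of lines from a local TCP socket.
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Split each line into words and keep a running count per word.
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the full, updated counts to the console after every micro-batch.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()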

Natural Language Processing using Spark MLlib

Here is the basic workflow for most NLP tasks:

  1. Create a “corpus” of documents.
  2. Featurize the words into numeric vectors so that we can apply linear algebra to them (e.g. with word2vec).
  3. Compare documents by comparing these feature vectors.

NLP Capabilities of Spark

TF-IDF: Term Frequency Inverse Document Frequency

To compare two documents, extract the set of words and represent each document as a vector of 1's or 0's depending on whether it contains each word. Representing documents with word-count vectors like this is called the "Bag of Words" model.

Term Frequency: how many times a word repeats within a single document.

Inverse Document Frequency: a measure of how rare the word is across the corpus; words that appear in many documents are down-weighted.

Tokenizers are used to split documents into words; you can also use a regex tokenizer to do bespoke splitting.
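Putting the tokenizer and the TF-IDF pieces together, here is a minimal sketch on two made-up sentences (the column names and the documents are my own illustrative examples, not a real dataset):

from pyspark.ml.feature import Tokenizer, HashingTF, IDF

# A tiny hypothetical corpus with an id and a raw text column.
docs = spark.createDataFrame(
    [(0, "spark makes ml prototyping fast"),
     (1, "spark streaming handles unbounded data streams")],
    ["id", "text"])

# Split each document into words, hash the words into term-frequency vectors,
# then down-weight terms that appear in many documents with IDF.
tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="raw_features").transform(tokens)
idf_model = IDF(inputCol="raw_features", outputCol="features").fit(tf)
tfidf = idf_model.transform(tf)

tfidf.select("id", "features").show(truncate=False)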

  • You can also create your own Spark functions with udf (user-defined functions).
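For example, a small sketch of a user-defined function that counts the tokens per document, reusing the tokens DataFrame from the snippet above:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Wrap a plain Python function as a Spark UDF that returns an integer.
count_tokens = udf(lambda words: len(words), IntegerType())

tokens.withColumn("token_count", count_tokens("words")).show()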

Stop Words: these words are excluded because they appear very frequently and don't really mean anything.

  • StopWordsRemover is used to clean the data for creating input vectors.
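A short sketch of StopWordsRemover applied to the tokenized documents from above:

from pyspark.ml.feature import StopWordsRemover

# Drop common English stop words from the token lists.
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
remover.transform(tokens).select("words", "filtered").show(truncate=False)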

n-grams: a specific way of tokenizing. Let's look at the example below:

for n = 2;
“ali is a computer scientist.” --> the sentence
[“ali is”, “is a”, “a computer”, “computer scientist”] --> n-grams array

The n-grams method is used for analyzing words that often appear next to each other.
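The same example expressed with Spark's NGram transformer (a minimal sketch; the sentence is already tokenized here):

from pyspark.ml.feature import NGram

# The example sentence from above, tokenized into words.
sentence = spark.createDataFrame(
    [(0, ["ali", "is", "a", "computer", "scientist"])], ["id", "words"])

# n = 2 produces the bigram array shown above.
bigrams = NGram(n=2, inputCol="words", outputCol="bigrams")
bigrams.transform(sentence).select("bigrams").show(truncate=False)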

Building Recommender Systems using Spark MLlib

Two of the most commonly used recommender methodologies are:

  1. Content-Based Models: take into account the attributes of the items a customer has preferred and recommend similar items.
  2. Collaborative Filtering (CF) Models: a way of using the wisdom of the crowd; users who agreed in the past are likely to agree again.

In collaborative filtering, a user-item matrix is built and then filled in. Latent means hidden: latent factors are used to predict the missing entries.

  • Alternating Least Squares (ALS) is the main optimization method Spark MLlib uses to learn these latent factors.
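A hedged ALS sketch on a tiny made-up ratings DataFrame; the column names (userId, movieId, rating) and the coldStartStrategy setting are my assumptions for illustration, not a dataset from this article:

from pyspark.ml.recommendation import ALS

# A hypothetical user-item ratings table.
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0),
     (1, 12, 3.0), (2, 11, 4.0), (2, 12, 1.0)],
    ["userId", "movieId", "rating"])

# Learn latent user and item factors; "drop" discards predictions for
# users or items that were never seen during training (the cold-start case).
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)

# Use the learned latent factors to fill in missing entries:
# the top 2 recommended items for every user.
model.recommendForAllUsers(2).show(truncate=False)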

The Cold-Start Problem: when you have no past data about a user's consumption history. If you are a Data Scientist working in a startup environment you will probably have to deal with this problem a lot in greenfield projects.

One good solution to this problem is to envision what kind of data the use case might generate and then use a service like Mockaroo to create sample data to seed your database with.

Mockaroo has options to generate mockup data

Creating Pipelines with Spark

Pipelines are simply a set of processing steps applied over and over again to tasks. In Spark you can bundle these steps into a Pipeline object so you don't have to repeat the same work every time.

An example pipeline for data preprocessing in NLP.
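A sketch of what such a pipeline could look like, bundling the NLP stages used earlier in this article (the stage parameters are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

# Bundle the preprocessing steps so they can be fitted and reused as one object.
nlp_pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    StopWordsRemover(inputCol="words", outputCol="filtered"),
    HashingTF(inputCol="filtered", outputCol="raw_features"),
    IDF(inputCol="raw_features", outputCol="features"),
])

# Fit once, then apply to any DataFrame that has a `text` column
# (here the `docs` DataFrame from the TF-IDF sketch above).
nlp_model = nlp_pipeline.fit(docs)
features = nlp_model.transform(docs)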

Linear Regression Models in Spark MLlib

Vector Assembler joins all the features into one column.

One of the prerequisites for creating linear regression models in Spark is putting your dataset into a specific format that Spark can understand. VectorAssembler is used to aggregate the numerical columns into a single features column; MLlib expects all the features in one column and the label in another.

  • The r-squared value shows how well your linear regression model explains the data; an evaluator for it is available in Spark.
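A minimal sketch with made-up numeric columns; VectorAssembler builds the features column and the fitted model's summary exposes R² and RMSE:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# A small hypothetical dataset with two numeric features and a label.
data = spark.createDataFrame(
    [(1.0, 2.0, 5.1), (2.0, 1.0, 6.9), (3.0, 4.0, 13.2), (4.0, 3.0, 14.8)],
    ["x1", "x2", "label"])

# Aggregate the numeric columns into a single `features` vector column.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembled = assembler.transform(data)

lr_model = LinearRegression(featuresCol="features", labelCol="label").fit(assembled)
print(lr_model.summary.r2, lr_model.summary.rootMeanSquaredError)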

If you want to use categorical variables in your linear regression models, you can convert strings into numeric values fit for regression analysis with StringIndexer, which assigns an index to each distinct string and makes your dataset ready for model building.

The pipeline for processing string values in your Regression Models.
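A sketch of such a pipeline; the `city` column and the OneHotEncoder stage are my illustrative additions on top of the StringIndexer mentioned in the text (Spark 3 encoder syntax assumed):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Hypothetical data with a categorical string column alongside numeric features.
df = spark.createDataFrame(
    [(1.0, 2.0, "london", 5.1), (2.0, 1.0, "leeds", 6.9), (3.0, 4.0, "london", 13.2)],
    ["x1", "x2", "city", "label"])

cat_pipeline = Pipeline(stages=[
    StringIndexer(inputCol="city", outputCol="city_index"),            # string -> numeric index
    OneHotEncoder(inputCols=["city_index"], outputCols=["city_vec"]),  # index -> sparse vector
    VectorAssembler(inputCols=["x1", "x2", "city_vec"], outputCol="features"),
])

prepared = cat_pipeline.fit(df).transform(df)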

One thing to be aware of when creating linear regression models: if your features are highly correlated, the RMSE and R² values can come out suspiciously good; in other words, your model may be overfitting your dataset.

Logistic Regression Models in Spark MLlib

This method is used when you have discrete labels to predict. Despite its name it is not a regression model; it is a classification model for binary outcomes. The sigmoid function is used to turn a linear model into a logistic one.

The Whole School of Derivations from the Core Confusion Matrix

The ROC Curve was developed in World War 2 to determine whether a blip on a radar was actually an enemy plane or an irregularity. As you can infer from this example, it has very strong discriminatory power. The False Positive Rate is plotted on the X-axis against the True Positive Rate on the Y-axis; in other derived terms, the ROC curve plots sensitivity against 1 − specificity.

After the war the same evaluation methods were transferred to industry and used to score and rank very valuable customer churn models built with logistic regression.

The Best Model will have the highest area under the curve[1]

There are many ways to evaluate a model's performance; in Spark you create an "Evaluator" object and pass your predictions DataFrame to it.

Usage of Evaluator Spark Objects
  • For the ROC curve, the larger the area under the curve, the better the model's predictive ability.
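A hedged sketch of a logistic regression model scored with the area under the ROC curve; the four labelled points are made up for illustration:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

# A tiny hypothetical binary-labelled dataset.
binary = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([0.5, 0.3]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0),
     (Vectors.dense([2.5, 3.0]), 1.0)],
    ["features", "label"])

predictions = LogisticRegression().fit(binary).transform(binary)

# Area under the ROC curve: the closer to 1.0, the better the discrimination.
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
print(evaluator.evaluate(predictions))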

Decision Trees & Random Forests in Spark

A Typical Decision Tree for a Medium User

To create a decision tree you have to choose which feature to split on; most commonly, entropy and information gain are used to determine these splits. Tree methods work for both classification and regression purposes. If there is no a priori knowledge about which feature is the most important for building the tree, the usual solution is to randomly grow a forest of decision trees. By looking at this forest we can identify similar trees and their splits, from which the strongest common underlying features are inferred.

Boosted tree models (such as the Gradient Boosting mentioned earlier) follow a slightly different recipe:

  1. Train a weak decision-tree model.
  2. Identify the misclassified examples and give them more weight when training the next model.
  3. Create an ensemble of trees that predicts well as a whole.

Decision trees can be used both to find the optimal class in a classification problem and, by taking the average of all the trees' predictions, to do regression and predict a continuous numeric variable.

  • A BinaryClassificationEvaluator is used to evaluate a decision tree classifier.
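A sketch comparing a single decision tree with a random forest on the binary DataFrame from the logistic regression example, both scored with the same BinaryClassificationEvaluator:

from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

# Fit a single tree and an ensemble of 20 randomly grown trees on the same data.
dt_preds = DecisionTreeClassifier().fit(binary).transform(binary)
rf_preds = RandomForestClassifier(numTrees=20).fit(binary).transform(binary)

print(evaluator.evaluate(dt_preds), evaluator.evaluate(rf_preds))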

K-means Clustering

When you have unlabeled data you do clustering: you look for patterns in the data, which is why domain knowledge is crucial in this unsupervised learning method. The goal is to divide the data points into groups so that the observations within each group are consistent.

You can use clustering to make sense of unorganized data. This method is used for many purposes like criminal forensics, diagnosis of diseases, customer segmentation and pattern identification in general.

K-means iteratively assigns every data point to the nearest of k centroids and recomputes the centroids in every iteration. When the centroids stop moving you have your final clusters.

The Elbow Method is used to choose the value of k: plot the SSE (Sum of Squared Errors) for a range of k values, and the optimal k is the point where the SSE curve bends most sharply.

k = 3 seems to be the sensible choice.

You can't use confusion matrices, so it is hard to evaluate performance; this is one of the pitfalls of clustering. Here domain knowledge comes into play to properly assess the model's results.

One thing to be aware of: you may need to scale your data if there are extreme differences between feature values. StandardScaler is used to scale the data to zero mean and/or unit standard deviation.
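A sketch of scaling the features and fitting k-means with k = 3 on nine made-up 2-D points; since there are no labels, the ClusteringEvaluator's silhouette score is used here as a stand-in metric:

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Nine hypothetical points forming three visible groups.
points = spark.createDataFrame(
    [(Vectors.dense([x, y]),) for x, y in
     [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
      (5.0, 5.0), (5.2, 5.1), (4.9, 5.3),
      (9.0, 0.0), (9.1, 0.2), (8.9, 0.1)]],
    ["features"])

# Scale to unit standard deviation so no feature dominates the distance metric.
scaled = StandardScaler(inputCol="features", outputCol="scaled").fit(points).transform(points)

clusters = KMeans(featuresCol="scaled", k=3, seed=1).fit(scaled).transform(scaled)
print(ClusteringEvaluator(featuresCol="scaled").evaluate(clusters))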

Quirks of Spark

On a final note, here are some oddities that you need to be aware of when using Spark.

  1. The “.collect()” method returns a plain Python list that you can use for plotting; you can't plot Spark objects directly. This is good to know if you want to do bespoke plotting with Bokeh or matplotlib (see the sketch after this list).
  2. VectorAssembler is needed to combine your columns into a single features vector, because Spark otherwise can't infer which columns make up the feature structure it expects.
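A small sketch of the first point, reusing the clusters DataFrame from the k-means example and assuming matplotlib is available in the notebook image:

import matplotlib.pyplot as plt

# Pull the rows down to the driver as plain Python objects; plotting happens locally.
rows = clusters.select("features", "prediction").collect()

xs = [row["features"][0] for row in rows]
ys = [row["features"][1] for row in rows]
plt.scatter(xs, ys, c=[row["prediction"] for row in rows])
plt.show()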

Finally, thanks for reading. Now go forth into the world and create your ML models!

References:

  1. For the example ROC Curve: https://ww2.mathworks.cn/help/stats/perfcurve.html
