Data Science – Machine Learning

The following are my personal notes, which prepared me to take the AWS Certified Machine Learning – Specialty (MLS-C01) exam.

Data Processing

Encode Values

Encoding categorical variables (pandas supports dtype="category"):

  • Ordinal: preserve relative order (S < M < L < LG).
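    • Use map() in pandas to apply a one-to-one mapping.
    • sklearn.preprocessing.LabelEncoder is intended for target labels and assigns integers in sorted (alphabetical) order, so it does not preserve a custom order. Use sklearn.preprocessing.OrdinalEncoder with an explicit categories list when the order matters.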
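The one-to-one mapping for ordinal values can be sketched as follows. The column name and the sample rows are made up for illustration; only the order S < M < L < LG comes from the notes above.

```python
import pandas as pd

# Hypothetical T-shirt sizes; the order S < M < L < LG is taken from the notes.
df = pd.DataFrame({"size": ["M", "S", "LG", "L", "S"]})

# Explicit one-to-one mapping that preserves the relative order.
size_order = {"S": 0, "M": 1, "L": 2, "LG": 3}
df["size_encoded"] = df["size"].map(size_order)

print(df)
```

Writing the mapping by hand keeps the order under our control instead of relying on whatever order an encoder happens to pick.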

Encoding nominal variables:

  • Encoding nominal values with integers is wrong: machine learning algorithms can pick up an order relationship that does not exist.
  • Nominal: no specific information about relative order ([“Banana”, “Apple”, “Orange”, “Apple”]).
    • Use sklearn.preprocessing.OneHotEncoder or pandas.get_dummies().
  • If OneHotEncoder creates too many additional features, we can group the values into hierarchy structures. For example, a ZIP column could be rolled up into regions -> states -> city.
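A minimal sketch of one-hot encoding the fruit example with pandas.get_dummies(). The column name "fruit" is an assumption made for the example.

```python
import pandas as pd

# The nominal values from the notes; "fruit" is a hypothetical column name.
fruits = pd.DataFrame({"fruit": ["Banana", "Apple", "Orange", "Apple"]})

# One new 0/1 column per distinct value, so no artificial order is introduced.
one_hot = pd.get_dummies(fruits, columns=["fruit"], dtype=int)

print(one_hot)
```

Each row has exactly one column set to 1, which is why the number of features grows with the number of distinct values.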

Handling Missing Values

Missing values are very common. Most machine learning algorithms can’t handle missing values automatically, and data with too many missing values can’t be used for model training. To identify missing values:

  • df.isnull().sum() counts the missing values per column.
  • df.isnull().sum(axis=1) counts the missing values per row.

Handling missing values:

  • df.dropna() drops every row that contains an “NA” (not available) value.
  • df.dropna(axis=1) drops every column that contains an “NA” value in any row.
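The counting and dropping calls above can be tried on a small frame. The data below is invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in "age" and one in "city".
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Oslo", "Rome", None, "Lima"],
    "income": [50.0, 62.0, 58.0, 71.0],
})

missing_per_column = df.isnull().sum()       # one count per column
missing_per_row = df.isnull().sum(axis=1)    # one count per row

rows_kept = df.dropna()            # keeps only rows without any missing value
columns_kept = df.dropna(axis=1)   # keeps only columns without any missing value

print(missing_per_column)
print(rows_kept.shape, list(columns_kept.columns))
```

In this toy frame two missing cells are enough to remove half of the rows or two of the three columns, which shows how quickly dropping can discard data.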

Dropping too many rows is risky: the smaller sample promotes over-fitting and can introduce bias. Losing important columns, on the other hand, can lead to under-fitting because important features are missing. Why are these values missing? Ask:

  • What mechanism caused the missing values?
  • Are the missing values random?
  • Are there any missing rows or columns I am not aware of?

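If values are missing at random, we can use imputation techniques. If the missing values are in a single column, fill them with the column mean, median or most frequent value. Imputing missing values:

  • Mean: sklearn.impute.SimpleImputer(strategy="mean"), which replaces the older sklearn.preprocessing.Imputer.
  • MICE (Multiple Imputation by Chained Equations): sklearn.impute.IterativeImputer (named MICEImputer in the v0.20 development version). It is still experimental and must be enabled with from sklearn.experimental import enable_iterative_imputer.
  • Python “fancyimpute” package: KNN (K-Nearest Neighbors) imputation, SoftImpute, MICE, etc.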
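A minimal sketch of mean imputation, assuming a recent scikit-learn where the imputer is sklearn.impute.SimpleImputer. The array is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with one missing value in each column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Switching strategy to "median" or "most_frequent" gives the other two single-column options mentioned above.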

If values are not missing at random, try to understand the mechanism behind the missing values. Often, you can find much better ways to fill them. The majority of machine learning algorithms can’t work with missing values, so they must be either imputed or dropped before the data is fed to the algorithm.

Feature Engineering

Features are the raw ingredients of a model. Often new features need to be created.

  • sklearn.feature_extraction builds numerical features from text, images and other formats.

New features may have better predictive power. Feature engineering is often more art than science, and different business problems often require completely different features. Use intuition: humans have good intuition, especially when it is supported by business domain knowledge. Usually, generate many features and then apply dimensionality reduction techniques if there are too many. Consider transformations:

  • Polynomial: x^{2} – squaring
  • Combinations of attributes: x \times y – attribute product

Filtering and Scaling

  1. Filter selection: apply filters to complex data structures. For example, remove the color from images or part of an audio spectrum.
  2. Scaling: reduce the range of a feature to speed up computation. Many machine learning algorithms are sensitive to features with widely different ranges, which may even cause the optimization to fail. Each column is usually scaled independently, with the goal of bringing all features onto the same scale. Not all ML algorithms are sensitive to scale; decision trees and random forests, for example, are not. Scaling transformations in sklearn:
    • Mean/Variance Standardization – column – sklearn.preprocessing.StandardScaler – widely used – removes the mean and divides by the standard deviation. Features will be on the same scale: mean = 0 and standard deviation = 1.
    • MinMax Scaling – column – sklearn.preprocessing.MinMaxScaler – all values are scaled to between 0 and 1. Robust to features with very small standard deviations.
    • Maxabs Scaling – column – sklearn.preprocessing.MaxAbsScaler – divides every element by the maximum absolute value of that feature – doesn’t destroy sparsity because the observations are not centered.
    • Robust Scaling – column – sklearn.preprocessing.RobustScaler – subtracts the median and divides by the range between the 25th and 75th quantiles – robust to outliers because outliers have minimal impact on the median and quantiles.
    • Normalizer – row – sklearn.preprocessing.Normalizer – rescales each row to unit L1, L2 or max norm – widely used in text analysis.
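The column-wise scalers listed above can be compared on a small matrix. The numbers are invented, and the second column deliberately contains an outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical data: two features on very different ranges, one outlier (10000).
X = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 10000.0],
])

standardized = StandardScaler().fit_transform(X)  # mean 0, std 1 per column
min_maxed = MinMaxScaler().fit_transform(X)       # values in [0, 1] per column
robust = RobustScaler().fit_transform(X)          # median 0, scaled by the IQR

print(standardized.round(2))
print(min_maxed.round(2))
print(robust.round(2))
```

In practice the scaler is fitted on the training data only and then reused to transform validation and test data.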


Feature transformation is the act of creating new feature(s) by applying mathematical functions to existing ones. Consider polynomial transformations for single or all features: sklearn.preprocessing.PolynomialFeatures. Be aware that these features can easily lead to over-fitting. Consider other non-linear transformations such as log or sigmoid. Transformed features are very sensitive to extrapolation: predictions outside the range of the training data can be unreliable, a symptom of over-fitting. The radial basis function is another commonly used transformation. Radial basis functions are widely used in support vector machines.
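A minimal sketch of polynomial feature generation with sklearn.preprocessing.PolynomialFeatures. The single sample (x = 2, y = 3) is chosen only so that the squares and the product are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two features: x = 2, y = 3.
X = np.array([[2.0, 3.0]])

# degree=2 adds the squares and the pairwise product; the bias column is left out.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Column order: x, y, x^2, x*y, y^2
print(X_poly)  # [[2. 3. 4. 6. 9.]]
```

Two input features already become five, so higher degrees on wide data sets grow quickly and raise the risk of over-fitting.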

Text-Based Features

Text is raw data that must be preprocessed: it needs to be cleaned and converted into numerical values, because ML algorithms only accept numerical values for training and prediction. The bag-of-words model represents a document as a vector of numbers, one number for each distinct word. It doesn’t keep the order of the words. The number of distinct words can be large, so the vectors get large as well; usually a sparse matrix implementation is used. Bag-of-words can be extended into the bag-of-n-grams model.

  • Count vectorizer – sklearn.feature_extraction.text.CountVectorizer – includes lowercasing and tokenization on white space and punctuation.
  • TfidfVectorizer – sklearn.feature_extraction.text.TfidfVectorizer – term frequency times inverse document frequency; the impact of common words such as “the” and “an” is greatly reduced.
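Both vectorizers can be sketched on a tiny corpus. The three documents are made up; note that “the” appears in every document while “cat” does not.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus of three short documents.
docs = ["the cat sat", "the dog sat", "the cat ran"]

# Bag-of-words: one column per distinct word, stored as a sparse matrix.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
vocab = sorted(count_vec.vocabulary_)

# TF-IDF: words that occur in every document receive a lower weight.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

first_doc = tfidf.toarray()[0]
weight_the = first_doc[tfidf_vec.vocabulary_["the"]]
weight_cat = first_doc[tfidf_vec.vocabulary_["cat"]]

print(vocab)
print(counts.toarray())
print(weight_the < weight_cat)
```

Passing ngram_range=(1, 2) to either vectorizer would extend the bag-of-words into a bag-of-n-grams.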

Model Training

Supervised Learning

Neural Networks

The concept of neural networks emerged in the 1950s. The simplest neural network is the perceptron: a single-layer network that combines many input features with weights and an intercept and passes the result through an activation function. The activation is usually non-linear and depends on the problem we are trying to solve. The sigmoid function is a natural choice.
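The forward pass of such a unit can be sketched in a few lines of plain Python. The weights, intercept and inputs are arbitrary illustration values, not trained parameters, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(features, weights, intercept):
    """Weighted sum of the inputs plus the intercept, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + intercept
    return sigmoid(z)


# Hypothetical inputs: 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
output = perceptron([1.0, 2.0], weights=[0.5, -0.25], intercept=0.0)
print(output)  # 0.5
```

Training would adjust the weights and the intercept; this sketch only shows how a prediction is computed once they are known.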

The perceptron is a very simple network. It has just one layer, whose input is, for example, a weighted sum of all features. A neural network usually contains multiple layers, and each layer contains many nodes. The structure is therefore complicated, difficult to interpret and time consuming to train. People have created many frameworks, such as MXNet, TensorFlow, Caffe and PyTorch, to design, train, evaluate and implement neural networks. A huge amount of high-quality data is very useful. Popular neural networks are implemented in these frameworks, and all of them can be accessed from Python.

Convolutional neural networks are especially useful in image analysis. The input is an image or a sequence of images. Weighted filters are convolved with the image to create the next image. A pooling layer is a size-reduction technique: for example, max-pooling or average-pooling each 2×2 block reduces a 4×4 matrix to a 2×2 matrix. The resulting tensor is flattened into a vector that feeds a fully connected layer, which in turn links to the output. In this case, the output is the category that describes the image.
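The pooling step can be illustrated with NumPy alone. The 4×4 matrix below is a made-up feature map; deep learning frameworks provide their own pooling layers, so this is only a sketch of the arithmetic.

```python
import numpy as np

# Hypothetical 4x4 feature map.
feature_map = np.array([
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 10, 13, 14],
    [11, 12, 15, 16],
])

# Regroup into a 2x2 grid of non-overlapping 2x2 blocks.
blocks = feature_map.reshape(2, 2, 2, 2).swapaxes(1, 2)

max_pooled = blocks.max(axis=(2, 3))    # largest value of each block
avg_pooled = blocks.mean(axis=(2, 3))   # average value of each block

print(max_pooled)  # [[ 4  8] [12 16]]
print(avg_pooled)  # [[ 2.5  6.5] [10.5 14.5]]
```

Either way the 16 inputs shrink to 4 values, which is the size reduction the pooling layer is there for.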

Another type of neural network is the recurrent neural network, in which the order of the input sequence carries meaning. This is especially useful for language modeling or time-series features. For example, single characters mean nothing on their own, but they mean something as a word. In this network the output of one step is fed back in as input to the next step.

K-Nearest Neighbors

A new observation is classified based on how close it is to its nearest neighbors. First, a distance needs to be calculated using the Euclidean, Manhattan or any other L-norm. Second, a value of k needs to be provided: the number of closest neighbors to consult. K-nearest neighbors classifies the new observation based on these two parameters. A small k makes the classification local; the larger the k, the more global it becomes. The optimal k depends on the business problem. A good starting point is k = \frac{\sqrt{N}}{2}, where N is the number of samples. Usually, we try a few values of k to decide which performs best.

K-nearest neighbors is non-parametric, which means that no model is fitted and there are no equations to evaluate. The only inputs are the positions of the training data points, the new observation and a distance measure. The distance from the new observation to every training data point must be calculated, and the k smallest are selected.

K-nearest neighbors is also called instance-based or lazy learning because the model has to keep the entire data set, which can become expensive. A parametric model, on the other hand, contains only a few parameters and a function once it is trained, to which a new observation can be applied.

K-nearest neighbors suffers from the curse of dimensionality: for a fixed-size training set, points become increasingly isolated as the number of dimensions grows. We can use sklearn.neighbors.KNeighborsClassifier.
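A minimal sketch with sklearn.neighbors.KNeighborsClassifier. The two clusters of training points are invented, and k = 3 is an arbitrary choice for such a tiny data set.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: class 0 near (1, 1), class 1 near (8, 8).
X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
           [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]]
y_train = [0, 0, 0, 1, 1, 1]

# The two parameters from the notes: the distance measure and k.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)  # "fitting" only stores and indexes the training points

predictions = knn.predict([[1.2, 1.5], [8.7, 8.2]])
print(predictions)  # [0 1]
```

Trying several values of n_neighbors with cross-validation is the usual way to pick k in practice.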

Support Vector Machines


In a linearly separable data set, where there is an observable separation between two classes, the optimal hyperplane maximizes the margin that separates the two classes. It lies exactly in the middle, which minimizes the classification error.

Linear support vector machines are very popular. The simplest application maximizes the margin: the distance between the decision boundary and the support vectors (the data points at the boundary). To implement linear support vector machines we can use sklearn.svm.SVC with kernel="linear".


If the two classes are not linearly separable, we can use non-linear support vector machines. This requires a data transformation, which is realized with the kernel trick. We need to choose a similarity function called a “kernel”. With this function the learning task is implicitly mapped to a higher-dimensional space, where the two classes may become linearly separable. The computation requires more resources and is therefore more expensive. To implement non-linear support vector machines we can also use sklearn.svm.SVC.
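The difference between a linear and a kernelized SVM can be sketched on data that is not linearly separable. The concentric-circles data set from sklearn.datasets.make_circles is an assumption chosen for illustration, as are the parameter values.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: one class forms an inner circle, the other an outer ring.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A straight line cannot separate a circle from the ring around it.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# The RBF kernel implicitly maps the points to a space where they separate.
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"rbf kernel accuracy:    {rbf_acc:.2f}")
```

The accuracies here are measured on the training data only, which is enough to show the separability effect but not to judge generalization.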

Decision Trees

In the decision tree method, the algorithm decides which feature to split on at the root and at every following level, and which threshold to use for each split. It learns this from the training data we provide.

Entropy is a relative measure of disorder in data. Disorder is present when a group contains a mix of labels, that is, when it is not pure. Entropy is “0” when all labels belong to one group. Entropy is “1” when, for a binary split, the labels are divided 50/50.

In decision trees, nodes are split on the feature that has the largest information gain (IG) between the parent and its child nodes.
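Entropy and information gain can be worked through in plain Python. The helper names and the toy labels are hypothetical; the formulas are the standard Shannon entropy (base 2) and the parent entropy minus the size-weighted entropy of the children.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: sum over classes of p * log2(1 / p)."""
    total = len(labels)
    return sum((count / total) * math.log2(total / count)
               for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [0, 0, 1, 1]                                       # 50/50 mix, entropy 1
perfect_split = information_gain(parent, [[0, 0], [1, 1]])  # both children pure
useless_split = information_gain(parent, [[0, 1], [0, 1]])  # children still mixed

print(entropy([1, 1, 1, 1]), entropy(parent))  # 0.0 1.0
print(perfect_split, useless_split)            # 1.0 0.0
```

A tree builder evaluates candidate splits this way and keeps the one with the largest gain.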

Decision trees are easy to interpret, flexible about data types and have less need for feature transformation, but they are susceptible to over-fitting. We must “prune” the decision tree to reduce potential over-fitting. To implement a decision tree we can use sklearn.tree.DecisionTreeClassifier.
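A minimal sketch with sklearn.tree.DecisionTreeClassifier on the iris data set that ships with scikit-learn (an assumption made for the example). Limiting max_depth is used here as one simple way to restrain the tree; it is not the only pruning option.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Restricted tree: at most two levels of splits, easier to read, less over-fit.
small_tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=2, random_state=0
).fit(X, y)

print("full tree depth: ", full_tree.get_depth())
print("small tree depth:", small_tree.get_depth())
```

Comparing both trees on held-out data would show whether the extra depth actually helps or only memorizes the training set.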

Random Forests

Ensemble methods are used to minimize over-fitting. This technique relies on training multiple models, each with randomly selected features, and then using majority voting or averaging for prediction. When the models are decision trees, this is known as the random forest algorithm. This method is more expensive to train and run. To implement a random forest we can use sklearn.ensemble.RandomForestClassifier.
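A minimal random forest sketch, with RandomForestClassifier imported from sklearn.ensemble. The iris data set, the 70/30 split and the parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 100 trees; each split considers a random subset of the features.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
)
forest.fit(X_train, y_train)

# Predictions combine the votes of all trees.
test_accuracy = forest.score(X_test, y_test)
print(f"trees: {len(forest.estimators_)}, test accuracy: {test_accuracy:.2f}")
```

More trees generally stabilize the prediction at the cost of training and inference time, which is the expense noted above.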