Python Machine Learning
Machine Learning
Types of Machine Learning algorithms
- Supervised
- Unsupervised
ML - Supervised Learning
- We have a dataset that has the "right answer".
- It has one or more "features" (or independent variables, or just variables) (X) and one or more results (y).
- We would like to create a function that, given a new set of values for X, will predict the value(s) of y.
- Regression
- Classification
Regression problem
- Predict continuous valued output.
- Housing prices: area (x) vs price (y)
- How to predict the price based on the area?
- Try to fit a straight line, maybe a 2nd degree polynomial, or an exact match on the known points using a very high degree polynomial?
- Size, weight, color, nutrition value of various crops based on rainfall, sun, etc.
- Life expectancy at birth or when a certain disease is found in a person.
Classification problem
- Predict a discrete valued output (yes/no) or (A, B, C, D)
- Breast Cancer: Tumor size (x) vs being malignant or benign (y) - two distinct possibilities. Given a tumor (and its size) what is the probability that it is malignant?
- The tumor size is a "feature". In other problems we might have many more features.
- e.g. We might know both the tumor size and the age of the patient.
ML - Unsupervised Learning
- We don't have the right answer for the dataset.
- Clustering
- News items
- DNA sequences: how much certain genes are expressed
- Social network analysis
- Market segmentation
- Astronomical data analysis
- Cocktail party algorithm (separating two voices as recorded by two microphones) (noise cancellation)
Linear regression with sklearn
Using generated data
- examples/ml/basic_linear_regression.ipynb
- examples/ml/use_basic_linear_expression.ipynb
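The training notebook is not reproduced here; the following is a minimal sketch of what it might contain, assuming generated data and the file name linear.joblib that the loader script below expects:

import numpy as np
from sklearn.linear_model import LinearRegression
from joblib import dump

# Generate some noisy, roughly linear data
rng = np.random.default_rng(42)
x = rng.uniform(0, 100, size=50)
y = 3 * x + 7 + rng.normal(0, 5, size=50)

# scikit-learn expects X as a 2D array of shape (n_samples, n_features)
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # should be close to 3 and 7

# Save the trained model so another process can load and use it
dump(model, 'linear.joblib')

The following script then loads the saved model and predicts y for the X values given on the command line: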
from joblib import load
import sys

# Expect one or more X values on the command line
if len(sys.argv) < 2:
    exit(f"Usage: {sys.argv[0]} Xes")

# Each command-line value becomes a one-element feature vector
input_values = []
for val in sys.argv[1:]:
    input_values.append([float(val)])

# Load the previously saved model and predict y for the given X values
model = load('linear.joblib')
print(model.predict(input_values))
Split data set
- train_test_split
- In supervised learning you receive a dataset of N elements (N rows); in each row you have X features (columns) + 1 or more results y (also columns).
- You can divide the rows into two parts: training and testing.
- You use the training part to train your model and you use the testing part to check how well your model can predict other values.
- train_test_split() of scikit-learn can do this.
- examples/ml/basic_linear_regression_more_data.ipynb
- Fix the seed by setting random_state to any fixed non-negative integer.
- stratify splitting for classification of imbalanced datasets (see the sketch below).
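A minimal sketch of train_test_split with a fixed random_state and stratified splitting; the toy data is an assumption:

import numpy as np
from sklearn.model_selection import train_test_split

# 10 samples with 2 features each and an imbalanced binary label (seven 0s, three 1s)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# random_state makes the split reproducible,
# stratify=y keeps the 0/1 ratio roughly the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train, y_test)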
Food-truck linear regression
- examples/ml/food-truck.csv from the first exercise of the Machine learning course of Andrew Ng
- examples/ml/food-truck.ipynb
Basic Classification example
- examples/ml/basic_classification.ipynb
Kaggle
Kaggle - USA housing listing
- examples/ml/usa-housing-listings.ipynb
Kaggle - Iris
- iris
- examples/ml/iris.ipynb
Machine Learning 2
Number of features
- Can be large.
- Infinite number of features?
Linear regression
Housing prices (size in feet => price in USD)
- m - number of examples in the dataset
- X's - input variables, features
- y's - output variables, target variables
- (X, y) - single training example
- (Xi, yi) - i-th training example
- Training set => Learning Algorithm => h (hypothesis)
- h is a function that converts X to estimated y: y = h(X). As it is a linear function we can also write h(x) = ax + b (a and b could be theta 1 and theta 0).
- Linear regression with one variable (aka Univariate Linear regression).
Cost function
- Squared error function: J(a, b) = (sum of (h(xi) - yi)^2) / 2m, where h(x) = ax + b.
- It is probably the most commonly used cost function for linear regression problems because it seems to work the best in most cases.
- We would like to find a and b so that J(a, b) is minimal (see the sketch below).
- If we assume b=0 then we are looking at min(J(a, 0)), which is a 2D function.
- In the general case though min(J(a, b)) is a 3D function for which we need to find the minimum.
- Contour plots (contour figures)
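A minimal sketch of the squared error cost, computed with NumPy on an assumed toy dataset for a couple of (a, b) guesses:

import numpy as np

def cost(a, b, x, y):
    """J(a, b) = sum((a*x + b - y)^2) / (2*m)"""
    m = len(x)
    predictions = a * x + b
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data that lies roughly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

print(cost(2.0, 1.0, x, y))   # small: (a, b) is close to the real relationship
print(cost(0.5, 0.0, x, y))   # much larger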
Gradient descent
- Gradient descent is a generic algorithm to find a local minimum of a function.
- Start at a random location.
- Make a small step downhill.
- Stop when around you everything is higher than where you are.
- The problem is that depending on the starting point this can lead us to a different local(!) minimum.
- Learning rate (alpha) - the size of the steps we take on every iteration.
- Derivative term - (a function of a and b).
- If the learning rate is too large, the algorithm might diverge.
- If the learning rate is too small, it might take a lot of steps to converge.
- Gradient descent can converge even if the learning rate is fixed, because the closer we get to the local minimum, the smaller the derivative of the cost function is (closer to 0), and thus the product of the learning rate and the derivative is smaller, so the step we take is smaller.
- The above cost function of Linear regression is a convex function, so there is only one local minimum which is also the global minimum.
- "Batch" Gradient Descent - means that at every step we use all the training examples (see the sketch below).
- There are other versions of Gradient descent.
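A minimal sketch of batch gradient descent for univariate linear regression; the toy data, the learning rate, and the number of iterations are assumptions:

import numpy as np

# Toy data roughly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
m = len(x)

a, b = 0.0, 0.0   # start from an arbitrary point
alpha = 0.01      # learning rate

for _ in range(10000):
    predictions = a * x + b
    # Partial derivatives of J(a, b) = sum((a*x + b - y)^2) / (2*m)
    grad_a = np.sum((predictions - y) * x) / m
    grad_b = np.sum(predictions - y) / m
    # "Batch": every step uses all m training examples
    a -= alpha * grad_a
    b -= alpha * grad_b

print(a, b)   # should be close to 2 and 1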
Matrices
- Dimension of matrix = number of rows x number of columns (4x3)
- Addition of two matrices of the same dimension - element wise (same for subtraction)
- "Scalar Multiplication" - Multiplication of a matrix by a scalar (multiply each element by the scalar) (also scalar division)
- Matrix Vector Multiplication: R(3,2) x R(2) = R(3) - multiply a matrix of 3 rows and 2 columns by a vector of 2 elements (2 rows): for each row of the matrix, multiply element-wise and then sum the results. A 3x2 matrix multiplied by a 2x1 matrix gives a 3x1 matrix. In general: R(m, n) x R(n) = R(m).
| 1 3 |   | 1 |   | 16 |
| 4 0 | x | 5 | = |  4 |
| 2 1 |           |  7 |
- Matrix Matrix Multiplication: R(m,n) x R(n,k) = R(m, k)
| 1 3 2 |   | 1 3 |   | 11 10 |
| 4 0 1 | x | 0 1 | = |  9 14 |
            | 5 2 |
- Matrix multiplication is not commutative, that is A x B is not the same as B x A.
- Matrix multiplication is associative, that is (A x B) x C is the same as A x (B x C).
- Identity Matrix I or I(nxn) is a square matrix in which everything is 0 except the diagonal that is filled with the number 1.
- A x I = I x A = A
- Matrix Inverse: A x A^-1 = I. Only square matrices have an inverse, but not all square matrices have one (e.g. the all-0s matrix does not have one).
- The matrices that don't have an inverse are somehow close to the all-0 matrix. They are also called "singular" or "degenerate" matrices.
- Matrix Transpose - (1st row becomes the 1st column; 2nd row becomes the 2nd column, etc.) See the NumPy sketch below.
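A minimal NumPy sketch of the operations above, using the same example matrices:

import numpy as np

A = np.array([[1, 3], [4, 0], [2, 1]])   # 3x2
v = np.array([1, 5])                     # vector of 2 elements
print(A @ v)                             # [16  4  7] - matrix-vector multiplication

B = np.array([[1, 3, 2], [4, 0, 1]])     # 2x3
C = np.array([[1, 3], [0, 1], [5, 2]])   # 3x2
print(B @ C)                             # [[11 10] [ 9 14]] - matrix-matrix multiplication

I = np.eye(2)                            # 2x2 identity matrix
M = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.allclose(M @ I, M))             # True
print(M @ np.linalg.inv(M))              # close to the identity matrix
print(M.T)                               # transpose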
Machine Learning - Multiple features
- n - number of features (number of columns in the table)
- the last column might be called y (the result)
- m - number of samples (number of rows)
- x(i) - row i, vector of values of a sample
- x(i, j) - the value of row i column j
- Also called "Multivariate linear regression"
- Gradient descent for Multiple features
Feature Scaling
- If one feature has numbers in the range of 0-2000 and the other feature has numbers in the range of 0-5, then the inequality can make it much harder for the gradient descent to reach the minimum. It is better to have all the features in the same range of numbers. We can normalize the values by, let's say, dividing each number by the max value of that feature. We might prefer that each feature be in the range of -1 <= value <= 1. This is not a hard rule though.
- Mean normalization - replace x(i) with x(i) - mu(i) where mu(i) is the mean (average) of that feature. This way the feature will have 0 mean. Also: (x(i) - mu(i)) / std(i) where mu(i) is the mean and std(i) is the standard deviation. (See the sketch below.)
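A minimal sketch of mean normalization, done both by hand and with scikit-learn's StandardScaler; the toy feature values are assumptions:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: house size (over 1000) and number of rooms (0-5)
X = np.array([[2104, 3],
              [1600, 3],
              [2400, 4],
              [1416, 2],
              [3000, 5]], dtype=float)

# Manual mean normalization: (x - mean) / std, column by column
print((X - X.mean(axis=0)) / X.std(axis=0))

# The same idea with scikit-learn
print(StandardScaler().fit_transform(X))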
Gradient Descent - Learning Rate
- Draw the graph of the value of the cost function as a function of the number of iterations in gradient descent.
- It should have a downwards slope, but after a while its descent might slow down. (It is hard to tell how many iterations it will take.)
- Declare convergence when the decrease of the cost function in one iteration is smaller than some small number (e.g. 1/1000 or some epsilon), but it might be difficult to choose this number.
- If it is increasing then probably the learning rate is too big and it will never converge. (The fix is to use a smaller learning rate.)
Features
- We can define new features based on other features. (e.g. multiply two features by each other to get a new feature)
Polynomial Regression
- When we allow for a function like a + bx + cx^2 + dx^4 ... (given a single feature x)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Generate the input data using random numbers
size = 20
x = np.random.randint(1, 100, size=size)
error = np.random.rand(size)
#error = np.zeros(size)
y = x * x + error
#print(error)
#print(x)
#print(y)
X = x.reshape((-1, 1))
#print(X)
transformer = PolynomialFeatures(degree=2, include_bias=False)
transformer.fit(X)
X = transformer.transform(X)
# X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
#print(X)
model = LinearRegression().fit(X, y)
r_sq = model.score(X, y)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('coefficients:', model.coef_)
Normal Equation
- An analytical way to find the best function: theta = numpy.linalg.pinv(X.T @ X) @ X.T @ y (see the sketch below)
- Gradient Descent vs. Normal Equation
- The latter might work faster, but only if the number of features is small. n = 10,000 might be the limit, depending on the computing power.
- Noninvertibility
- Redundant features: if two features are linearly dependent then the matrix is noninvertible (e.g. area in square meters and in square feet)
- Too many features (m <= n) - delete some features or use regularization
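A minimal sketch of the normal equation with NumPy; the toy dataset is an assumption:

import numpy as np

# Toy dataset: y = 2*x1 + x2 + 1 (two features)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])
y = np.array([5.0, 5.5, 8.5, 12.0, 13.5])

# Add a column of ones for the intercept term
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Normal equation: theta = pinv(X^T X) X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # [intercept, coefficient of x1, coefficient of x2] - close to [1, 2, 1]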
Multiple features
- Function of more than one X
- examples/ml/multi_feature_linear_regression.ipynb
Logistic regression (for classification)
- Email: spam or not spam
- Tumor: malignant or benign
- Online Transaction: Fraudulent or not?
Binary classification:
y can be either 0 or 1,
- 0 = Negative class
- 1 = Positive class
Multi-class classification problem when y can have more than 2 distinct values
- Linear regression using a threshold value
- Decision boundary
- If we used the squared error cost function with the Sigmoid-based hypothesis, the cost would be a non-convex function, so Gradient Descent isn't guaranteed to reach the global minimum. Instead we use a log()-based cost function (the "Logistic regression cost function"), which is convex. (See the sketch below.)
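A minimal sketch of logistic regression with scikit-learn; the toy tumor-size data is an assumption:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: small tumors are benign (0), large ones are malignant (1)
X = np.array([[1.0], [1.5], [2.0], [2.5], [3.0], [4.0], [4.5], [5.0], [5.5], [6.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Predicted class and the probability of being malignant for new tumor sizes
new_sizes = np.array([[2.2], [4.8]])
print(model.predict(new_sizes))        # expected: [0 1]
print(model.predict_proba(new_sizes))  # probabilities of class 0 and class 1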
Optimization algorithms
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
The other 3 algorithms have the advantage of not needing to pick an alpha (learning rate), and they are often faster than Gradient descent. However they are more complex to implement.
Multi-feature Classification (Iris)
- iris
multi_feature_classification_iris.ipynb
Kaggle - Melbourne housing listing
- examples/ml/melbourne-housing-snapshot.ipynb
Machine Learning Resources
Regression Analysis
Ways to measure the correctness of a model
Classification Analysis
- Accuracy
- Precision
- Recall
- F1 Score
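A minimal sketch of these four metrics with scikit-learn; the true labels and the predictions are assumptions:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))    # fraction of correct predictions
print("Precision:", precision_score(y_true, y_pred))   # of the predicted 1s, how many are really 1
print("Recall:   ", recall_score(y_true, y_pred))      # of the real 1s, how many did we find
print("F1 Score: ", f1_score(y_true, y_pred))          # harmonic mean of precision and recall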
Unbiased evaluation of a model
- Assessment
- Validation
We need fresh data that has not been seen by the model before.
Splitting data
- Training set - for training, fitting the model, finding optimal coefficients.
- Validation set - for evaluation, hyperparameter tuning, performance assessment.
- Test set - unbiased evaluation of the model.
Also to notice:
- Underfitting
- Overfitting
Model selection and validation
- scikit-learn model_selection
- Cross validation (e.g. K-fold validation)
- Learning curves
- Hyperparameter tuning
K-fold validation
- Divide the data into k (5-10) subsets (folds).
- Do the training and testing k times.
- Each time use one fold as the test set and all the other folds as the train set (see the sketch below).
- KFold()
- StratifiedKFold()
- LeaveOneOut()
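A minimal sketch of K-fold and stratified K-fold cross-validation with scikit-learn, using the bundled Iris dataset as a stand-in:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Plain K-fold with 5 folds; shuffle because the Iris rows are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=kfold))

# Stratified K-fold keeps the class ratios the same in every fold
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=skfold))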
Learning Curves
- The relation between the training-set size and the score (see the sketch below).
- Find the optimal training size for the best score in a reasonable time/dataset size.
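A minimal sketch of learning_curve from scikit-learn, again using the Iris dataset as a stand-in:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)

# Train on growing portions of the data and measure the cross-validated score
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(train_sizes)                 # how many samples were used at each step
print(test_scores.mean(axis=1))    # the average validation score for each training size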
Hyperparameter tuning (optimization)
- To determine the best model parameters.
- GridSearchCV()
- RandomizedSearchCV()
- validation_curve()
The k-Nearest Neighbors (kNN)
import requests
import os
import shutil
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingRegressor
def get_files():
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
names_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names"
filenames = []
for url in (data_url, names_url):
filename = url.split('/')[-1]
filenames.append(filename)
if not os.path.exists(filename):
with requests.get(url, stream=True) as response:
with open(filename, 'wb') as fh:
shutil.copyfileobj(response.raw, fh)
return filenames
if __name__ == "__main__":
data_file, names_file = get_files()
columns = ["Sex", "Length", "Diameter", "Height", "Whole weight", "Shucked weight", "Viscera weight", "Shell weight", "Rings"]
df = pd.read_csv(data_file, names=columns)
#print(df.head())
df = df.drop("Sex", axis=1)
#print(df.head())
#df["Rings"].hist(bins=15)
#plt.show()
#correlation_matrix = df.corr()
#print(correlation_matrix["Rings"])
X = df.drop("Rings", axis=1)
X = X.values
y = df["Rings"]
y = y.values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn_model = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
train_predictions = knn_model.predict(X_train)
train_mse = mean_squared_error(y_train, train_predictions)
train_rmse = sqrt(train_mse)
print(train_rmse) # 1.67
test_predictions = knn_model.predict(X_test)
test_mse = mean_squared_error(y_test, test_predictions)
test_rmse = sqrt(test_mse)
print(test_rmse) # 2.36
# That is the number of years as errors between the prediction and the actual value
# This looks like overfitting
# cmap = sns.cubehelix_palette(as_cmap=True)
# f, ax = plt.subplots()
# # Length and Diameter, the two columns with strong correlation
# points = ax.scatter(X_test[:, 0], X_test[:, 1], c=test_predictions, s=50, cmap=cmap)
# f.colorbar(points)
# plt.show()
# cmap = sns.cubehelix_palette(as_cmap=True)
# f, ax = plt.subplots()
# # Length and Diameter, the two columns with strong correlation
# points = ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=50, cmap=cmap)
# f.colorbar(points)
# plt.show()
# Tuning Hyperparameters
# What should be the value of k? k = 1 means you depend too much on a potentially outlying neighbour.
# If k is all the neighbours then for every prediction you will get the same answer.
# Look for the best value for k in the range of 1-50
# parameters = {"n_neighbors": range(1, 50)}
# gridsearch = GridSearchCV(KNeighborsRegressor(), parameters)
# gscv = gridsearch.fit(X_train, y_train)
# # print(gscv)
# print(gridsearch.best_params_) # {'n_neighbors': 17}
# train_preds_grid = gridsearch.predict(X_train)
# train_mse = mean_squared_error(y_train, train_preds_grid)
# train_rmse = sqrt(train_mse)
# test_preds_grid = gridsearch.predict(X_test)
# test_mse = mean_squared_error(y_test, test_preds_grid)
# test_rmse = sqrt(test_mse)
# print(train_rmse)
# print(test_rmse)
# Weighted Average of Neighbors Based on Distance
parameters = {
"n_neighbors": range(1, 50),
"weights": ["uniform", "distance"],
}
gridsearch = GridSearchCV(KNeighborsRegressor(), parameters)
gridsearch.fit(X_train, y_train)
# print(gridsearch.best_params_) # {'n_neighbors': 17}
# train_preds_grid = gridsearch.predict(X_train)
# train_mse = mean_squared_error(y_train, train_preds_grid)
# train_rmse = sqrt(train_mse)
# test_preds_grid = gridsearch.predict(X_test)
# test_mse = mean_squared_error(y_test, test_preds_grid)
# test_rmse = sqrt(test_mse)
# print(train_rmse)
# print(test_rmse)
best_k = gridsearch.best_params_["n_neighbors"]
best_weights = gridsearch.best_params_["weights"]
bagged_knn = KNeighborsRegressor(
n_neighbors=best_k, weights=best_weights
)
bagging_model = BaggingRegressor(bagged_knn, n_estimators=100)
bagging_model.fit(X_train, y_train)
train_preds_grid = bagging_model.predict(X_train)
train_mse = mean_squared_error(y_train, train_preds_grid)
train_rmse = sqrt(train_mse)
test_preds_grid = bagging_model.predict(X_test)
test_mse = mean_squared_error(y_test, test_preds_grid)
test_rmse = sqrt(test_mse)
print(train_rmse)
print(test_rmse)
K-Means Clustering
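The section only names the topic; the following is a minimal scikit-learn sketch of K-Means on generated blobs, where the data and the number of clusters are assumptions:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 points in 2D around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Ask K-Means to find 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)    # the 3 cluster centers it found
print(kmeans.labels_[:10])        # the cluster assigned to the first 10 points
print(kmeans.predict([[0, 0]]))   # which cluster a new point belongs to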
Boston housing prices
from sklearn.datasets import load_boston   # note: load_boston was removed from scikit-learn in version 1.2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
x, y = load_boston(return_X_y=True)
print(x.shape)
print(y.shape)
#print(x)
#print(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
linear_model = LinearRegression().fit(x_train, y_train)
print(f"LinearRegression score train: {linear_model.score(x_train, y_train)}")
print(f"LinearRegression score test: {linear_model.score(x_test, y_test)}")
gradient_model = GradientBoostingRegressor(random_state=0).fit(x_train, y_train)
print(f"GradientBoostingRegressor score train: {gradient_model.score(x_train, y_train)}")
print(f"GradientBoostingRegressor score test: {gradient_model.score(x_test, y_test)}")
forest_model = RandomForestRegressor(random_state=0).fit(x_train, y_train)
print(f"RandomForestRegressor score train: {forest_model.score(x_train, y_train)}")
print(f"RandomForestRegressor score test: {forest_model.score(x_test, y_test)}")
Decision Tree
- Measure the Mean Absolute Error of both the training and testing set (see the sketch below)
from sklearn.metrics import mean_absolute_error
- too shallow: underfitting
- too deep: overfitting
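A minimal sketch, using the bundled diabetes dataset as a stand-in, of how max_depth drives underfitting and overfitting as measured by the Mean Absolute Error:

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare a too-shallow, a middling, and an unlimited-depth tree
for depth in (1, 4, None):   # None lets the tree grow until the leaves are pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    test_mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"max_depth={depth}: train MAE {train_mae:.1f}, test MAE {test_mae:.1f}")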
Random Forest
Exercise: remove the outliers from the food-truck data and calculate the smallest profitable city.
Resnet 50
import os
import sys
import cv2
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50
if len(sys.argv) < 2:
exit(f"{sys.argv[0]} IMAGEs")
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
# https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
resnet50_weights = 'resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5'
#rn50 = ResNet50(include_top=False, weights=None, pooling='avg')
#rn50 = ResNet50(include_top=False, weights=resnet50_weights, pooling='avg', input_shape=(512, 128, 1)) #(256, 256, 3))
rn50 = ResNet50(weights=resnet50_weights) #(256, 256, 3))
#rn50 = ResNet50(include_top=False, weights=None, input_shape=(640, 480, 3), pooling='avg')
#rn50 = ResNet50(include_top=False, weights=None)
#rn50 = ResNet50(include_top=False, weights="imagenet", pooling="avg")
#rn50.load_weights(resnet50_weights)
exit()   # stop here while experimenting with the weight loading above; the code below is not reached
target_size = (640, 480)
for path in sys.argv[1:]:
print(path)
im = cv2.imread(path)
print(im.shape)
im = cv2.resize(im, target_size)
print(im.shape)
#old = im.copy()
#im[0][0][0] = 0
#im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB) / 255.
#print(np.array_equal(old, im))
im = im[np.newaxis, ...]
act = rn50.predict(im)
print(act)
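For reference, a minimal sketch of the more common ResNet50 usage with the built-in imagenet weights and the 224x224 input the classifier head expects; this is not part of the original example and the image file name is a placeholder:

import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# Full ResNet50 classifier; downloads the imagenet weights on first use
model = ResNet50(weights='imagenet')

img = image.load_img('cat.jpg', target_size=(224, 224))   # hypothetical image file
x = image.img_to_array(img)
x = preprocess_input(x[np.newaxis, ...])

predictions = model.predict(x)
print(decode_predictions(predictions, top=3)[0])   # (class id, class name, probability) triples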