Python Modules

This post will review the Python module mechanism. In a future post, we'll build upon it to review Python packages and how to create them.

Modules are useful for code reuse, and keeping the source files tidy and clean.

A module is basically a .py file; inside it we add all the functions and classes that the module comprises.

Consider the following code, we’ll use it as our module:

# module.py

def func1():
    print("this is func 1")
    
def func2():
    print("this is func 2")
    
class Car:
    max_speed = 180
    gears = 6
    
    def __init__(self, max_speed, gears):
        self.max_speed = max_speed
        self.gears = gears

Now, for us to use this as a module, we'll save the code in a .py file; let's call it module.py.

When writing our main Python script, we'll just need to import module:

import module

this_car = module.Car(200, 5)
print(this_car)
module.func1()
module.func2()

And we’ll get:

<module.Car object at 0x0000024A476E0290>
this is func 1
this is func 2

To get a list of all the classes, functions and other names a module holds, we can use the built-in dir function.

dir(module)
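
For the example module above, the output includes the names we defined alongside the standard attributes Python adds to every module – roughly (the exact dunder list can vary between Python versions):

['Car', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__',
 '__name__', '__package__', '__spec__', 'func1', 'func2']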

Remove a conda environment

This is a small appendix to the previous posts regarding anaconda environments.

Anaconda is a scientific computing / data science suite. It combines the power of Python and R as underlying programming languages, along with visualization tools, IDEs, package managers and a powerful environment manager that allows us to have different environments, each containing its own set of packages without affecting the other environments.

So, after creating an environment – be it from an environment YAML file or by manually defining the package list – here is how to remove it with conda.

First, if the environment is currently active, we need to deactivate it:

conda deactivate

Then we can remove the environment, parametrically called ENV_NAME here:

conda remove --name ENV_NAME --all

  1. --name: command line argument telling conda we will explicitly declare the environment name
  2. --all: meaning we want to remove all packages assigned to that environment
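
Afterwards, we can verify the environment was removed by listing the remaining environments:

conda env list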

k-Nearest Neighbours with Scikit-learn

The method

k-Nearest Neighbours, or kNN, is a supervised learning algorithm, used to classify (meaning – predict the class of) a data point based on the training set.

The dataset

To tackle kNN classification, let's consider the following imaginary task. We have a bank dataset whose features consist of income, balance, monthly loan payment and a binary variable stating whether the account has defaulted on its loan.

The dataset is completely imaginary, created by me using random values. The rules used when creating the dataset were:

  1. If loan > income – default
  2. if balance < 0 – default
  3. if income < 0 – default

Please note that this dataset is compact and includes only three independent variables and one dependent variable, for easier visualization of the algorithm and its results.

Let's visualize the dataset using matplotlib.
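
As a rough sketch, a 3D scatter like the one described below can be produced along these lines, assuming the dataset has already been loaded into a DataFrame named df (as done in the next section) with columns Balance, Income, Loan and a 0/1 Default column:

import matplotlib.pyplot as plt

# Color each account by its Default value: blue = did not default, red = defaulted
colors = df['Default'].map({0: 'blue', 1: 'red'})

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(df['Balance'], df['Income'], df['Loan'], c=colors)
ax.set_xlabel('Balance')
ax.set_ylabel('Income')
ax.set_zlabel('Loan')
plt.show()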

The red dots depict defaulted accounts, and the blue ones depict accounts that did not default on their loan. As can be seen, and as described in the rules used to create the dataset, negative income leads to default. The rest of the rules are a bit trickier to see from this viewing angle. Let's proceed to use scikit-learn to compute the kNN classifier based on this dataset. Such a classifier could be useful, for example, in a bank, for red-flagging accounts that are predicted to default on their loan.

Using kNN

First, we'll load the dataset with pandas and split it into the feature columns (x) and the target column (y) for scikit-learn to work with.

import numpy as np
import pandas as pd

df = pd.read_csv('loan_default.csv')

x = df[['Balance', 'Income', 'Loan']]
y = df[['Default']]

Now, we'll divide the dataset into a training set and a testing set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4)

print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
Train set: (800, 3) (800, 1)
Test set: (200, 3) (200, 1)

As can be seen, we're using 20% of our dataset for the testing set and 80% for the training set. Next, we'll want to normalize the training data (the independent variables) so that it has zero mean and unit variance. This is not mandatory, but it is very good practice.

from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

scaler = preprocessing.StandardScaler().fit(X_train.astype(float))
x_train_norm = scaler.transform(X_train.astype(float))

Next, the model calculation, with a 4-neighbor setting.

k = 4
neigh = KNeighborsClassifier(n_neighbors = k).fit(x_train_norm,y_train.squeeze())

Next, for calculating yhat (the predicted classification for the test set), we'll need to normalize the test set as well, using the same scaler that was fitted on the training data, so both sets are scaled consistently:

x_test_norm = scaler.transform(X_test.astype(float))

Next, to the yhat calculation:

yhat = neigh.predict(x_test_norm)

Accuracy analysis

Let’s examine how to check accuracy of the trained model.

There are several metrics for checking the accuracy of the model.

  • Accuracy score – very straightforward: the number of correctly classified samples divided by the total number of classified samples (including all falsely classified ones).
  • Jaccard score – basically the Jaccard index, or Jaccard Similarity Coefficient, taken from set theory. If we take two sets – one being the correct labels of our test set (y true) and the other being the predictions made using our model – the Jaccard score is the size of the intersection of the two sets divided by the size of their union. Maximal Jaccard similarity is 1.
from sklearn import metrics

print("Train set Accuracy: ", 
      metrics.accuracy_score(y_train, neigh.predict(x_train_norm)))
print("Test set Accuracy: ", 
      metrics.accuracy_score(y_test, yhat))

Choosing k

k is a parameter we choose; it is up to us to select the k that maximizes accuracy. For that, we should retrain the model with different k's and empirically find the one that maximizes accuracy, as sketched below.
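
Here is a minimal sketch of such a scan, reusing the normalized train/test arrays from above and using test-set accuracy as the selection metric:

import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

k_values = range(1, 11)
accuracies = []

for k in k_values:
    # Retrain the classifier for every candidate k and record its test accuracy
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train_norm, y_train.squeeze())
    accuracies.append(metrics.accuracy_score(y_test, model.predict(x_test_norm)))

best_k = k_values[int(np.argmax(accuracies))]
print("Best k:", best_k, "with test accuracy:", max(accuracies))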

Visualizing the trained model

Like the previous plot, which showed our 3-dimensional independent variables classified into the two classes (default / not default) and visualized with a different color for each class, let's see the trained and tested dataset visualized.

In this plot, the cyan points are the non-default training data and the magenta points are the default training data; the blue and red points are the non-default and default test data, respectively.
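
The plotting code isn't shown above; the following is a rough sketch of how such a plot could be produced, assuming Default is encoded as 0/1 and coloring the test points by the kNN predictions (one possible choice):

import pandas as pd
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# Training points: cyan = non-default, magenta = default
train_colors = y_train.squeeze().map({0: 'cyan', 1: 'magenta'})
ax.scatter(X_train['Balance'], X_train['Income'], X_train['Loan'], c=train_colors)

# Test points, colored by the kNN prediction: blue = non-default, red = default
test_colors = pd.Series(yhat).map({0: 'blue', 1: 'red'})
ax.scatter(X_test['Balance'], X_test['Income'], X_test['Loan'], c=test_colors)

ax.set_xlabel('Balance')
ax.set_ylabel('Income')
ax.set_zlabel('Loan')
plt.show()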

Linear regression in Scikit-learn

A basic machine learning method will consist of the following steps:

  1. Data acquisition
  2. Data handling – formatting, filtering, etc.
  3. Split data to train set and test set
  4. Train ML model using train set
  5. Test ML model using test set
  6. Analyze model accuracy
  7. Deploy

In this post, we'll explore this ML flow implemented in Python using the scikit-learn library for selected ML methods.

Linear regression

Linear regression, be it simple or multiple regression, is the process of estimating a continuous dependent variable using one (simple) or more (multiple) independent variables.

Linear regression makes a few assumptions as to the data collected:

  1. Homogeneity of variance (homoscedasticity) – the size of the error is relatively constant across the dataset.
  2. Independence of observations – the observations in the dataset were sampled using statistically valid methods, and there are no inter-dependencies between the independent variables. In multiple regression it is quite possible that two supposedly independent variables are correlated; in that case we should analyze the dataset as a preprocessing step, check whether such a correlation appears, and if so keep only one of those variables.
  3. Normality – the data adheres to a normal distribution.
  4. Linearity – the best fit through the data is a straight line rather than a curve.

Basically, we use the train set to calculate a linear equation that fits the data 'best', and using that line equation we can estimate the value at a point outside the training set. What is 'best'? A line that minimizes the error over the train set, where the error can be the summed absolute error, the summed squared error, the root of the summed squared error, or whatever other metric fits. There are pros and cons to each of these choices.

import numpy as np
from sklearn import linear_model

regression = linear_model.LinearRegression()

# train is the training-set DataFrame
train_x = np.asanyarray(train[['independent variable']])
train_y = np.asanyarray(train[['dependent variable']])
regression.fit(train_x, train_y)

print(regression.coef_)
print(regression.intercept_)

Here, we:

  1. Import the linear_model module out of scikit-learn library
  2. Initialize a LinearRegression object
  3. Create an array of the independent variable for the train set (X)
  4. Create an array of the dependent variable for the train set (y)
  5. Train the regression model, or, estimate the line equation
  6. Output the regression coefficients and intercept

For analyzing accuracy, we'll use the following code:

from sklearn.metrics import r2_score

test_x = np.asanyarray(test[['independent variable']])
test_y = np.asanyarray(test[['dependent variable']])
test_yhat = regression.predict(test_x)

print("MSE: %.2f" % np.mean(np.absolute(test_yhat - test_y)))
print("Residual: %.2f" % np.mean((test_yhat - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y , test_yhat))

Here we use the r2_score function from scikit-learn, generate the test-set arrays, and compute the regression model's predictions for the x values (the independent variable). We call these values yhat; this is standard notation in the world of ML and statistics for estimated values, as opposed to ground truth.

We then calculate the MAE (mean absolute error) and MSE (mean squared error) of the prediction compared to the test y values, along with the R2 score.

For multiple linear regression, instead of calling np.asanyarray() on a single field (the one independent variable in simple linear regression), we use the same function to create a matrix in which each column is a different independent variable; the y array (the dependent variable) remains the same. We just need to make sure the column order is kept consistent.

import numpy as np
from sklearn import linear_model

regression = linear_model.LinearRegression()

train_x = np.asanyarray(train[['ind_1', 'ind_2', 'ind_3']])
train_y = np.asanyarray(train['dep'])

regression.fit(train_x, train_y)

print(regression.coef_)
print(regression.intercept_)

As seen in this snippet, ind_1, …, ind_3 are three independent variables, and we use np.asanyarray() to transform them into the columns of a matrix.

The rest of the process is identical to simple linear regression.
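
As a sketch, using the same hypothetical field names (ind_1, ind_2, ind_3, dep) and a test-set DataFrame called test, the accuracy check mirrors the simple case:

from sklearn.metrics import r2_score

# Build the test matrix with the same column order used for training
test_x = np.asanyarray(test[['ind_1', 'ind_2', 'ind_3']])
test_y = np.asanyarray(test['dep'])
test_yhat = regression.predict(test_x)

print("MAE: %.2f" % np.mean(np.absolute(test_yhat - test_y)))
print("MSE: %.2f" % np.mean((test_yhat - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, test_yhat))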

Basic use of Pandas

Pandas is a very useful tool when handling large datasets in Python, for many different applications.

It is an open source solution for data analysis, used heavily in data science and machine learning. It handles extremely large datasets and is optimized for good performance in Python.

We will start by importing pandas and using it under the pd alias for better readability.

import pandas as pd

Then, let's use pandas to load a table-like CSV file. For this example, we'll download an open dataset describing European Union immigration trends in 2022. The dataset is available from Kaggle at: https://www.kaggle.com/datasets/umarzafar/immigration-trends-and-population-in-europe/

I have no affiliation with this dataset; it was selected for simplicity, and credit for the dataset belongs to its author.

Next, we'll use pandas to read the dataset. We'll define a DataFrame object called df and initialize it with the dataset in the file 'EU Immigrants.csv':

df = pd.read_csv("EU Immigrants.csv")

Next, let's look at the head of the dataset using the head() function. This head consists of the header row and the first five records of the dataset (by default):

df.head()

Entering another number as an argument, let’s call it N, will output the first N records in the dataset.
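
For example, to look at the first ten records:

df.head(10)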

Let’s consider additional dataframe constructors.

Let’s say we want to create a new dataframe, define the fields ourselves and add records.

custom_df = pd.DataFrame([{'field_A': 10, 'field_B': 100}])

custom_df.loc[len(custom_df.index)] = [100, 1000]

In the first line, we define a new DataFrame called custom_df and define its structure as having two columns (or fields) – field_A and field_B – together with one initial record. In the second line, we append an additional record by assigning a list of values to the next index position.

Another alternative for appending to a DataFrame is to define the next row as an additional DataFrame, and then concatenate the two together, as displayed in the following code:

another_df = pd.DataFrame([{'field_A': 1e6, 'field_B': 1e8}])
custom_df = pd.concat([custom_df, another_df], ignore_index=True)

In the second line we concatenate an additional row holding values for these two fields. If we were to define field names for another_df that differ from the fields originally defined for custom_df, the concatenation would still work, but we would get additional columns in the table that are not defined for the previous rows.

The describe() function, another built-in pandas function, outputs some useful statistics describing the dataset, like the number of records (rows), mean, std, min, max and percentiles for each numeric column (field) in the dataset.
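
For our immigration dataset, this is simply:

df.describe()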

For easier access to the fields, let's rename the first three columns. We'll make use of these three fields afterwards, and shorter titles improve readability as well. We'll define a new DataFrame called rdf, which will be the renamed DataFrame.

# rename field names so they are smaller
rdf = df.rename(columns={"EU COUNTRIES": "Country"})

rdf = rdf.rename(columns={"TOTAL IMMIGRANTS(IN THOUSANDS)":
                          "Immigrants (1000s)"})

rdf = rdf.rename(columns={"IMMIGRANTS WITH NATIONALITY(THOUSANDS)": 
                          "Immigrants with nationality (1000s)"})

Next, let's consider a scenario in which we do not need all of the original fields of the dataset, but only a subset of them. For example, we'll define a new DataFrame containing only the first three fields.

This can be done easily using DataFrame column selection, indexing with a list of the required field names. We define a new DataFrame called cdf, which only takes the following fields out of the DataFrame rdf:

  1. Country
  2. Immigrants (1000s)
  3. Immigrants with nationality (1000s)
cdf = rdf[['Country',
           'Immigrants (1000s)',
           'Immigrants with nationality (1000s)']]
cdf.head(9)

Now, let’s explore some visual elements supported by pandas via matplotlib. First, we import pyplot from matplotlib.

import matplotlib.pyplot as plt

Next, let’s see a histogram of the two numerical fields.

viz = cdf[['Immigrants (1000s)',
           'Immigrants with nationality (1000s)']]

viz.hist()

plt.show()

In this example, we create a new DataFrame containing the two fields we want to generate histograms for. This DataFrame is called viz, and simply calling viz.hist() will generate a histogram plot for each field.

Next, let’s see how to use data stored in a dataframe as a numpy array.

import numpy as np
immigrants = np.asanyarray(rdf[["Immigrants (1000s)"]])

We take the first numeric column, and transform it to a numpy array.
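
From here, the data behaves like any other numpy array. For example (assuming the column holds only numeric values):

print(immigrants.mean())  # average number of immigrants per country, in thousands
print(immigrants.max())   # largest number of immigrants, in thousands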

Introduction to Regression

Regression is the process of predicting a continuous value.

In regression there are two types of variables – a dependent variable y, and an independent variable x. The dependent variable can be seen as the target, the state or the value we wish to estimate, using our independent variables. Our independent variables can be seen as the causes of this state.

A regression model relates y (dependent variable) to a function of x (independent variables). Using a regression model, our dependent variable has to be continuous, whereas the independent variables can be discrete.

There are two basic regression models:

  1. Simple regression
  2. Multiple regression

Simple Regression – one independent variable is used to predict a dependent variable. Simple regression can be either linear or non-linear. For example, imagine we have a dataset containing people's heights and weights. Suppose we create a regression model linking weight to height based on that dataset, meaning that given a specific weight (independent variable) we can return a predicted height (dependent variable). This model will be somewhat lacking, as the connection between height and weight is not simple and should definitely include additional variables – but for this example it will suffice.

Multiple Regression – more than one independent variable is used to predict a dependent variable. The regression model can be linear or non-linear here as well. To explain multiple regression we can take an example similar to the one for simple regression, but add another independent variable to the model, like age. Now, the regression model will predict height based on weight and age.
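
As a small illustration of the idea, here is a toy multiple regression in scikit-learn, with numbers invented purely for this example (weight in kg and age in years predicting height in cm):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: each row is [weight, age], the target is height
X = np.array([[60, 25], [72, 31], [80, 45], [55, 19], [90, 38]])
y = np.array([165, 175, 178, 160, 185])

model = LinearRegression().fit(X, y)
print(model.predict([[70, 30]]))  # predicted height for a 70 kg, 30 year old person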

There are many regression algorithms out there, and we'll cover some of them in future posts:

  • Ordinal regression
  • Linear regression
  • Polynomial regression
  • Poisson regression
  • Bayesian linear regression
  • Fast forest quantile regression
  • Decision forest regression
  • Neural network regression
  • K nearest neighbors

Setting up Anaconda with a yml environment file

Previously, I covered how to start a new Anaconda environment using the Anaconda prompt. This included listing all available environments, creating a new one, and activating it.

Now, I'll tackle how to deploy a new environment (similar to before), but using a yml (YAML) file that contains all the libraries we want to install for that environment, along with their respective dependencies.

This is especially useful when we want to deploy the same environment multiple times – for example, when running on different computers – or when keeping a backup of an environment definition; installing individual libraries one by one is a tedious job. A single YAML configuration file captures the environment's library constraints once, and saves us the hassle of defining dependencies for each computer separately.

Let’s see an example of such a file

env.yml

name: my_environment
channels:
 - defaults
dependencies:
 - numpy=1.21.0
 - matplotlib=3.8.0
 - pandas=2.1.0

In this very simple example, we define an environment named my_environment in an environment YAML file called env.yml. The source locations for downloading remote libraries are defined under the channels section; in this case, we're only using the default channel. The next section is dependencies, which contains a list of all required libraries and their respective versions (in other words, the environment's dependencies). In this case we're only installing numpy 1.21.0, matplotlib 3.8.0 and pandas 2.1.0, but of course the list can be expanded as needed. As Anaconda installs these libraries, it will also download each library's own dependencies.

To use that yml file, we need to run the following command in the Anaconda prompt:

conda env create -f env.yml
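
Once created, the environment can be activated and its installed packages listed like any other conda environment:

conda activate my_environment
conda list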

Handling error in CUDA API calls

When running CUDA code, it is necessary to go through a long sequence of CUDA API calls in order for our kernel to perform the necessary job correctly.

A typical invocation of a CUDA kernel will look like this:

  1. Transfer memory from main program (CPU) to CUDA memory (GPU) – input
  2. Setting parallelization properties (block size and grid size)
  3. Invoking kernel
  4. Checking for errors on kernel invocation
  5. Device synchronization, to check for errors after all threads finished kernel run
  6. Transfer memory from CUDA memory back to main program (CPU) – output

Steps 1 and 6 may repeat several times, depending on data needed for the kernel run.

Steps 1, 4, 5 and 6 return a status code, typed cudaError_t.

One should check every returned cudaError_t to see whether the invoked action completed successfully. In some cases, like when working with very large datasets in a memory-constrained environment, we might get an error transferring memory from host to device (or vice versa, of course); in other cases we might encounter an error when invoking the kernel; and it is also quite possible to catch errors when all threads return from the kernel invocation.

Let’s consider the following code –

status = cudaMemcpy(someDataDevicePtr, someDataHostPtr, someDataSize * sizeof(float), cudaMemcpyHostToDevice);
if (status != cudaSuccess)
{
    // perform cleanups and write the error to a log as needed

    return false;
}

Here, we add an if statement to check the returned value from the CUDA API call. If the code contains many, many such calls, the code swells up to enormous proportions.

Let’s consider a different approach using a wrapper function.

void SafeCall(cudaError_t code, 
    const char* file, 
    int line, 
    const char* errorInfo = "")
{
    if (code != cudaSuccess)
    {
        const std::string errorString(cudaGetErrorString(code));

        // write errorString, errorInfo, file and line to the log,
        // then perform cleanups / propagate the error as needed
    }
}

This function takes a cudaError_t (error or status code), checks whether it represents an error, and if so takes care of cleanups and of propagating the error to the calling code / writing it to the log.

We will call this function with the following arguments –

SafeCall(
    cudaMemcpy(someDataDevicePtr, 
        someDataHostPtr, 
        someDataSize * sizeof(float), 
        cudaMemcpyHostToDevice), 
    __FILE__, __LINE__, "error transferring data from host to device"
);

Now, we have the exact same handling for all CUDA API calls, taking care of error cases, writing proper error messages in log if necessary and handling cleanups when needed. We will not miss checking a returned value this way, and validate that all calls were actually successful.

This is also elegant and, depending on the length of the error strings, can be more compact than repeating an if-statement after every call.

Note that __FILE__ and __LINE__ are standard pre-defined macros and are cross platform.

C++ Memory Management (5) – make_unique

In the last post, I instantiated a unique pointer with the following construct (for example, just a pointer to a single int variable):

std::unique_ptr<int> a(new int);

In the original implementation of unique_ptr, from C++11, this was the de-facto way to initialize a unique_ptr.

Starting with C++14, a new variadic template function called std::make_unique was introduced in order to fix some caveats of this construct.

What’s wrong with using ‘new’ in unique pointer initialization?

  1. Breaks the ‘no new/delete in code’ best practice rule. Before make_unique, this rule was ‘no new outside unique_ptr constructor’.
  2. May cause memory leaks. Why? Let's explore.

Expression Evaluation Order

Prior to C++17, the evaluation order of function arguments is unspecified, and the evaluation of different arguments may even interleave. Look at the following example.

void Func(float a, float b);
float CalcArg1();
float CalcArg2();

int main()
{
    Func(CalcArg1(), CalcArg2());

    return 0;
}

Func is given two floating point variables as arguments.

Let's assume two other functions exist that generate valid floats for use with that function.

When invoking Func inside main(), we have no way of knowing which of the functions CalcArg1() or CalcArg2() is going to be evaluated first. This is regardless of their position as arguments.

Now, let’s take a similar example using a unique pointer instantiated inside a function’s argument list.

#include <memory>

void Func(std::unique_ptr<float> a, float b);
float CalcArg2();

int main()
{
    Func(std::unique_ptr<float>(new float(10)), CalcArg2());

    return 0;
}

In this case, as we don’t know the order in which each function is evaluated when calling Func, we might get the following evaluation order:

  1. new float(10)
  2. CalcArg2()
  3. std::unique_ptr<float> constructor

So the memory allocated by new is not yet wrapped inside the std::unique_ptr, and it will not be deleted if CalcArg2() throws an exception.

We get a memory leak.

In this case it’s just one floating point, but in case of arrays, or big objects, this is a big problem.

This can be fixed by (A) using named variables – constructing the unique pointer in its own statement before invoking the function, and handling a possible exception accordingly – or (B) using the std::make_unique function to initialize the unique pointer.

In both cases, if CalcArg2() throws an exception, stack unwinding will call the destructor of the already-constructed std::unique_ptr in argument 1, and we are safe.

C++ Memory Management (4) – Unique pointer in Modern C++

Modern C++ helps a lot with the memory management model by allowing the developer to relinquish responsibility for the de-allocation of heap-allocated memory.

While this idea is not new by any means, it only became truly possible with the introduction of move semantics in C++11 which allowed for std::unique_ptr.

And, of course, as it is C++, there’s still the option to manage your memory allocations and deallocations using the older methods specified in the previous articles.

std::unique_ptr

A unique_ptr is a templated wrapper around a regular pointer that automatically invokes the matching delete as the program leaves the scope in which the unique_ptr was declared.

To achieve this, a unique_ptr cannot be copied, as copying would negate its uniqueness. It can only be moved – a process in which the underlying pointer is moved from the original unique_ptr to another instance of a matching unique_ptr, leaving the original unique_ptr empty. To further understand this process, I will need to delve into the specifics of move semantics, which will probably happen in a later post.

A small example –

#include <memory>
#include <string>

class Record
{
    std::string firstName;
    std::string lastName;
    long id;

    public:
    Record() = default;
};

int main()
{
    std::unique_ptr<Record> p(new Record);
    return 0;
}

First, to use unique pointers, we need to include the memory header as it holds all definitions for standard library smart pointers.

Here the class Record symbolizes some kind of record keeping for person identification. In the main function, a unique pointer p of type unique_ptr<Record> is initialized, and the object's memory is allocated by the new expression right in the unique pointer's constructor.

Clearly, the delete statement is missing here.

As said before, no delete is required. When p goes out of scope, a delete will be automatically invoked by the unique pointer's destructor, and that delete will in turn call Record's destructor.

This is super handy.

Keeping track of all your memory allocations as a C++ developer is a lot of mental baggage. The small but ever-present probability that you'll forget a delete somewhere in your code base is a high risk for memory leaks.

There are other tools in modern C++ to help lower the probability of memory leaks, for example –

  1. std::vector instead of dynamically allocated arrays. std::vector de-allocates automatically as it goes out of scope, much like unique_ptr in a sense, but it is not as lean as unique_ptr and is useful mostly for arrays.
  2. RAII (Resource Acquisition Is Initialization) – a fancy term for acquiring a resource in the constructor and disposing of it in the destructor. I will show concrete examples in a later post.

Notice that in the example above, I am invoking new in the unique pointer's constructor. This is a simple example, just to show the connection between smart pointers and raw pointers. It is more common and better practice to initialize unique pointers with the std::make_unique function.