One of the core principles of software engineering is “encapsulation”. However, this principle is often overlooked when deploying a machine-learned model. A machine-learned model is a composite of two things: transformations (such as one-hot encoding, imputing missing values, etc.) and scoring (e.g., linear regression). Yet the serialized model often contains only the scoring part, while the transformations are implemented at the application layer.

This post shows how both transformations and scoring can be serialized together using sklearn-pandas and baikal. The diagram below represents the architecture of an ensemble model to predict house prices. The dataset I used for this model…
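A minimal sketch of the idea, assuming a toy house-price DataFrame with made-up columns (lot_size, neighborhood, price); the full ensemble in the post also uses baikal, which is not shown here:

```python
# A minimal sketch, not the post's ensemble model: bundle transformations and
# scoring into one serializable object with sklearn-pandas and a Pipeline.
# Column names (lot_size, neighborhood, price) are made up for illustration.
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

# Transformations: scale the numeric column, one-hot encode the categorical one.
mapper = DataFrameMapper([
    (["lot_size"], StandardScaler()),
    ("neighborhood", LabelBinarizer()),
])

# Scoring: a plain linear regression, wired after the mapper in one Pipeline.
model = Pipeline([
    ("features", mapper),
    ("regressor", LinearRegression()),
])

df = pd.DataFrame({
    "lot_size": [5000, 7500, 6200, 4100],
    "neighborhood": ["a", "b", "a", "b"],
    "price": [120_000, 180_000, 150_000, 110_000],
})
model.fit(df[["lot_size", "neighborhood"]], df["price"])

# A single artifact now encapsulates both transformations and scoring.
joblib.dump(model, "house_price_model.joblib")
```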


Assume you have a table where each row contains various features about customers. These features can be static (e.g., account creation time, gender, etc.) and/or dynamic (e.g., “number of searches”, “last visited on”, etc.). One of the challenges in maintaining such a table is that different processes update these features at different times. For instance, some of the dynamic features can be real-time whereas others might be coming from Hadoop and might have a 24-hour latency. From a debugging perspective, having a table that gets updated by different processes can be a nightmare as there is no way to travel…


PyTorch is one of the most widely used libraries for deep learning, but it is also one of the more difficult libraries to understand due to the many side effects that one object can have on another. For instance, calling the “step” method of an optimizer updates the module object’s parameters. While trying to wrap my head around PyTorch objects and how they interact with each other, I found this Coursera course to be very helpful. It shows how different abstractions (such as Dataset, DataLoader, Module, Optim, etc.) interact with each other. This post has a similar motivation.

I start with the explicit…
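As a rough sketch (not the course's or the post's exact example) of how these abstractions fit together on a toy regression problem:

```python
# A rough sketch of how the core PyTorch abstractions interact:
# Dataset -> DataLoader -> Module -> loss -> Optimizer. Toy data only.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dataset: wraps tensors and exposes __len__/__getitem__.
X = torch.randn(100, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(100)
dataset = TensorDataset(X, y.unsqueeze(1))

# DataLoader: iterates over the Dataset in shuffled mini-batches.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Module: holds the learnable parameters.
model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()

# Optimizer: holds references to the Module's parameters, so optimizer.step()
# mutates the Module as a side effect -- the coupling mentioned above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()          # clear gradients stored on the parameters
        loss = loss_fn(model(xb), yb)  # forward pass through the Module
        loss.backward()                # autograd writes gradients into .grad
        optimizer.step()               # updates the Module's parameters in place
```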


One of the first challenges in machine learning on structured data is “Feature Engineering.” It involves deciding whether to treat a variable as numerical or categorical, and then choosing among various transformations of the data such as log transformation, one-hot encoding, target encoding, etc. These decisions are often not straightforward and require exploring the data for unique counts, missing values, distributions, and so on. Now imagine doing this for tens, or even worse, hundreds of columns. Very quickly, this can become an extremely repetitive and tedious task to manage.

I recently found myself in such a situation. A semi-automated…
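As a hypothetical illustration of the kind of profiling involved (this is not the semi-automated tool described in the post; the helper and its thresholds are made up):

```python
# A hypothetical helper, not the post's tool: profile each column and suggest
# a rough first-pass treatment (numerical vs. categorical). Thresholds are arbitrary.
import pandas as pd

def profile_columns(df: pd.DataFrame, max_categories: int = 20) -> pd.DataFrame:
    """Summarize each column (dtype, unique count, missing fraction) with a suggestion."""
    rows = []
    for col in df.columns:
        series = df[col]
        n_unique = series.nunique(dropna=True)
        missing_frac = series.isna().mean()
        if pd.api.types.is_numeric_dtype(series) and n_unique > max_categories:
            suggestion = "numerical (consider scaling / log transform)"
        elif n_unique <= max_categories:
            suggestion = "categorical (consider one-hot or target encoding)"
        else:
            suggestion = "high-cardinality (consider hashing or target encoding)"
        rows.append({
            "column": col,
            "dtype": str(series.dtype),
            "n_unique": n_unique,
            "missing_frac": round(missing_frac, 3),
            "suggestion": suggestion,
        })
    return pd.DataFrame(rows)

# Example usage on a tiny toy frame.
df = pd.DataFrame({"age": [25, 31, None, 47], "city": ["sf", "ny", "sf", None]})
print(profile_columns(df, max_categories=2))
```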


From self-driving cars to medical assistants, businesses have found a multitude of ways to leverage machine learning and artificial intelligence to develop smart and responsive products. However, I am often intrigued to notice that the application of machine learning at the infrastructure layer is almost non-existent. For instance, while we have models to predict weather patterns, predicting hardware requirements for the next purchase cycle still happens in Excel sheets. Or how often have you seen machine learning used to determine optimal data models? AI² is my vision of an adaptive infrastructure that uses artificial intelligence (and machine learning, data science…


While analyzing experiment data, I encountered an interesting brain teaser. I wanted to use the bootstrap method, and for that I needed to sample my data with replacement. After some iterations, I found the perfect way to do it. The trick is to use UNNEST with the sequence operation to duplicate the data. Below is an example.

In the example below, I assume we have two different treatment groups (A and B), and we need 3 samples of the data. In each sample, we need 40% of the users, i.e., 2 of the 5 users from each treatment group.

The most important step is UNNEST("sequence"(0…
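Since the query above is truncated, here is a sketch of how the trick might look, shown as a Presto query string in Python. The table and column names (users, user_id, treatment) are assumptions, and this version picks 2 distinct users per group within each replicate:

```python
# A sketch of the UNNEST + sequence duplication trick, not the post's exact
# query. Table/column names (users, user_id, treatment) are assumptions.
BOOTSTRAP_QUERY = """
WITH replicated AS (
    -- Duplicate every row once per bootstrap sample (3 samples: ids 0, 1, 2).
    SELECT u.*, s.sample_id
    FROM users u
    CROSS JOIN UNNEST(sequence(0, 2)) AS s(sample_id)
),
ranked AS (
    -- Shuffle rows independently within each (sample, treatment group).
    SELECT *,
           row_number() OVER (
               PARTITION BY sample_id, treatment
               ORDER BY random()
           ) AS rn
    FROM replicated
)
-- Keep 2 of the 5 users (40%) from each treatment group in every sample.
SELECT sample_id, treatment, user_id
FROM ranked
WHERE rn <= 2
"""

print(BOOTSTRAP_QUERY)  # e.g., paste into a Presto client or pass to a connector
```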


The importance of “sampling” cannot be overstated. The conclusions we draw from the data, as well as the quality of the machine-learned model, depend significantly on how we sample the data. However, there are many different ways to sample the data, and expressing these different ways of sampling in SQL can often be tricky. Below are examples of a few sampling techniques that can be easily expressed using the Presto query engine.

1.0 Random Sampling

In random sampling, the probability of selecting any given row is the same. In other words, all rows are equally weighted. …
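As a sketch of what simple random sampling can look like in Presto (the table name events and the 10% rate are assumptions, not necessarily the exact query from the post):

```python
# A sketch of simple random sampling in Presto, shown as query strings.
# The table name `events` and the 10% sampling rate are assumptions.
RANDOM_SAMPLE_QUERY = """
-- Each row is kept independently with probability 0.10, so every row
-- has the same chance of being selected.
SELECT *
FROM events
WHERE rand() <= 0.10
"""

# Presto's TABLESAMPLE BERNOULLI behaves similarly.
TABLESAMPLE_QUERY = """
SELECT *
FROM events TABLESAMPLE BERNOULLI (10)
"""

print(RANDOM_SAMPLE_QUERY)
print(TABLESAMPLE_QUERY)
```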


There are many metrics for evaluating a regression model, but they often seem cryptic. Below is an attempt to help build intuition for two commonly used metrics: mean/median absolute error and R² (the coefficient of determination).

Average Accuracy of the Model (Mean/Median Absolute Error)

Let’s assume you have a model that can predict house prices. Naturally, you won’t trust it unless you evaluate it and establish some confidence in the expected error. So, you feed in features (such as the number of rooms, lot size, etc.) for a certain house and compare the predicted price (say 130K) to its actual price (say 120K). In this particular case, we can…
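As a quick, made-up illustration of how these two metrics behave (the numbers below are not from the post's dataset):

```python
# Toy numbers, not the post's dataset: compute the two metrics discussed.
from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score

actual    = [120_000, 250_000, 310_000, 180_000]   # observed house prices
predicted = [130_000, 240_000, 330_000, 170_000]   # the model's predictions

print(mean_absolute_error(actual, predicted))    # average absolute error: 12500.0
print(median_absolute_error(actual, predicted))  # median absolute error: 10000.0
print(r2_score(actual, predicted))               # fraction of variance explained, ~0.97 here
```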


While recently working on a recommendation engine, I stumbled across an interesting problem. I wanted to put some automated checks in place so that if a newly generated list of recommendations differs significantly from the previous one, the system raises some kind of flag (such as an email alert). The challenge, however, was how to quantify the similarity or difference between the new and the old recommendation lists. The output of the recommendation engine is a sorted list of items. Thus, the problem was to quantify the similarity between two ranked lists.

As it turns out, the problem of comparing two…
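One common way to compare two rankings of the same items (not necessarily the measure this post settles on) is Kendall's tau; the item names below are made up, and lists that do not share the same items would need a different treatment:

```python
# Kendall's tau as one possible similarity measure for two rankings of the
# same items (not necessarily the post's chosen metric). Item names are made up.
from scipy.stats import kendalltau

old_list = ["item_a", "item_b", "item_c", "item_d", "item_e"]
new_list = ["item_b", "item_a", "item_c", "item_e", "item_d"]

# Express the new list as ranks relative to the old ordering.
old_ranks = list(range(len(old_list)))
new_ranks = [new_list.index(item) for item in old_list]

tau, p_value = kendalltau(old_ranks, new_ranks)
print(tau)  # 1.0 means identical order, -1.0 means fully reversed
```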

Ritesh Agrawal

Senior Machine Learning Engineer, Varo Money; contributor and maintainer of the sklearn-pandas library
