Leveraging Annotation Library (PigeonXT) for Feature Engineering
One of the first challenges in machine learning on structured data is “feature engineering.” It involves deciding whether to treat each variable as numerical or categorical, and then choosing among transformations such as log transformation, one-hot encoding, or target encoding. These decisions are often not straightforward and require exploring the data for unique counts, missing values, distributions, and so on. Now imagine doing this for tens, or even worse hundreds, of columns. Very quickly, this becomes an extremely repetitive and tedious task.
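To make the two kinds of decisions concrete, here is a minimal sketch of one numerical and one categorical treatment using pandas and NumPy. The toy data and the column names ("price", "zone") are made up for illustration and are not taken from the dataset discussed in this post.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "price": [100.0, 1000.0, 10000.0],
    "zone": ["A", "B", "A"],
})

# Numerical treatment: log1p compresses a long right tail and is defined at zero.
df["price_log"] = np.log1p(df["price"])

# Categorical treatment: one-hot encoding adds one indicator column per level.
encoded = pd.get_dummies(df, columns=["zone"])
print(encoded.columns.tolist())
```

Which treatment fits a given column depends on the kind of exploration described above, which is why the choice is hard to automate fully.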
I recently found myself in such a situation. A semi-automated solution took care of some of the columns, but there were many more for which I needed to decide how to handle them. After going through a few columns manually, I realized the process could be made much simpler by leveraging the PigeonXT library. The library is meant for generating labeled data points within a Jupyter notebook, and it serves this case perfectly. The adjacent video demonstrates the outcome. It goes through various columns of the Iowa housing dataset. For each column, I look at the given data type and the number of unique values, and determine whether to treat the variable as numerical or categorical.
You can find the complete code over here. I have a custom display function (“display_series”) that takes a pandas Series as input and renders various aspects of it. From the form, you can then select whether to treat the variable as numerical or categorical, and further select the kind of transformation to apply. The program iterates through all the columns and returns the selection for each column.
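The linked code is the reference for the actual implementation. As a rough idea of what such a display function might look like, here is a hypothetical stand-in written with pandas only. The helper name summarize_series, the top_n parameter, and the choice of summary fields are my assumptions for this sketch, not the original code.

```python
import pandas as pd


def summarize_series(series: pd.Series, top_n: int = 5) -> dict:
    """Collect facts that help decide between numerical and categorical."""
    return {
        "name": series.name,
        "dtype": str(series.dtype),
        "n_rows": int(series.shape[0]),
        "n_unique": int(series.nunique(dropna=True)),
        "n_missing": int(series.isna().sum()),
        # Most frequent values, useful for spotting codes stored as numbers.
        "top_values": series.value_counts(dropna=True).head(top_n).to_dict(),
    }


def display_series(series: pd.Series) -> None:
    """Hypothetical stand-in for the display function described in the post."""
    for key, value in summarize_series(series).items():
        print(f"{key}: {value}")


# Made-up example column with one missing value.
example = pd.Series([1, 2, 2, None, 3, 2], name="rooms")
display_series(example)
```

A function like this could be supplied as the display callback of the annotation widget, so that each column's summary appears next to the selection form.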
Originally published at http://ragrawal.wordpress.com on July 9, 2020.