Introduction

This is the first article in a series on feature engineering methods. Among the many practical aspects of Machine Learning, feature engineering is at the same time one of the most important and one of the least well-defined. It can be considered an art, where there are no strict rules and creativity is the key.

Feature engineering is about creating a better representation of the information for a machine learning model. Even when using non-linear algorithms, not all interactions (relations) between the variables in a dataset can be modeled if raw data is used. This creates a need for manual inspection, processing and manipulation of the data.

A question arises here - what about deep learning? It is supposed to minimize the need for manual processing and to learn a proper data representation by itself. For data such as images, speech or text, where no other ‘metadata’ is given, deep learning will perform better. In the case of tabular data, nothing beats Gradient Boosted Trees methods such as XGBoost or LightGBM. Machine learning competitions prove this - in almost every winning solution for tabular data a tree-based model is the best, whereas deep learning models usually cannot achieve such good results (but they blend with trees very well ;)).

The basis of feature engineering is domain knowledge. This is why the approach to feature engineering should be different for every dataset, depending on the problem to be solved. Still, there are some methods that can be used widely, or at least tried to check whether they improve the model. A great amount of practical information can be found in HJ van Veen's presentation. Some of the methods below were implemented based on the slides and descriptions from that presentation.

The methods described in this article use the KaggleDays dataset as an example and are based on the presentation above.

Dataset

The data comes from reddit; it is a set of questions and answers to those questions. The goal of the competition was to predict the number of upvotes on an answer. This dataset is especially interesting because it contains both text and standard features.

Loading the data can be done with:

import pandas as pd

X = pd.read_csv('../input/train.csv', sep="\t", index_col='id')

Columns:

['question_id',
 'subreddit',
 'question_utc',
 'question_text',
 'question_score',
 'answer_utc',
 'answer_text',
 'answer_score']

We are given a question_id, each corresponding to a specific question, whose text is found in question_text. Each question_id occurs many times, as each row contains a different answer to that question, given in answer_text. The datetime for both the question and the answer is provided in the _utc columns. There is also information about the subreddit in which the question was posted. The number of upvotes for a question is question_score and for an answer it is answer_score. The target variable is answer_score.

Categorical and numerical features

Machine learning models can deal only with numbers. Numerical (continuous, quantitative) variables - variables which may take any value within a finite or infinite interval - are naturally represented as numbers, so they can be used in a model directly. The situation is different for categorical variables, which are discrete and, as the name suggests, represent different categories. Raw categorical variables are usually in the form of strings and need to be transformed before being fed to the model.

A good example of a categorical variable is subreddit, which contains 41 unique categories, five of which are:

['AskReddit', 'Jokes', 'politics', 'explainlikeimfive', 'gaming']

Let’s take a look at the most popular categories with X.subreddit.value_counts()[:5]:

AskReddit    275667
politics     123003
news          42271
worldnews     40016
gaming        32117
Name: subreddit, dtype: int64

An example of a numerical variable is question_score, which can be explored with X.question_score.describe():

mean        770.891169
std        3094.752794
min           1.000000
25%           2.000000
50%          11.000000
75%         112.000000
max       48834.000000
Name: question_score, dtype: float64

Categorical features encoding

There are two basic methods of encoding. The first is One-Hot encoding, which can be done with pandas.get_dummies. The result for a variable with K categories is a binary matrix of K columns, where a 1 in the i-th column indicates that the observation belongs to the i-th category.

The second basic method is Label encoding, where categories are simply transformed into numbers. This functionality is provided by pandas.factorize, or by cat.codes for a pandas column of category type. Using this method, the original dimensionality is preserved.
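As a quick illustration (a minimal sketch, assuming the dataset has been loaded into X as above), both methods can be applied to the subreddit column:

# One-Hot encoding: one binary column per category
subreddit_onehot = pd.get_dummies(X['subreddit'], prefix='subreddit')

# Label encoding: each category replaced by an integer code
subreddit_labels, uniques = pd.factorize(X['subreddit'])
# or, for a column of 'category' dtype:
# X['subreddit'].astype('category').cat.codes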

There are also less standard methods of encoding, which are worth trying out for a possible improvement in model accuracy. Three of them will be described here:

  • Count encoding
  • Labelcount encoding
  • Target encoding

Prerequisites

Libraries needed for the functions:

import gc

import numpy as np
import pandas as pd

In this case, the data should be split into training and validation subsets by question_id, imitating the split between the Kaggle train and test sets.

question_ids = X.question_id.unique()
question_ids_train = set(pd.Series(question_ids).sample(frac=0.8))
question_ids_valid = set(question_ids).difference(question_ids_train)

X_train = X[X.question_id.isin(question_ids_train)]
X_valid = X[X.question_id.isin(question_ids_valid)]

In principle, statistics should always be computed on the training subset and then merged onto the validation and test sets, or computed for each set individually. In practice, in competitions, when the test set is already given, computing statistics on the concatenated data sometimes gives better results. For real-world use, this should never be done!

Count encoding

Count encoding is based on replacing categories with their counts computed on the train set. This method is sensitive to outliers, so the result can be normalized or transformed, for example with a log transformation. Unknown categories (not seen in the train set) can be replaced with 1.

Although not very likely, counts may be the same for some of the categories, which results in a collision - two different categories encoded as the same value. Whether this leads to a degradation or an improvement in model quality is impossible to say in advance, but in principle such behavior is not desirable.

def count_encode(X, categorical_features, normalize=False):
    print('Count encoding: {}'.format(categorical_features))
    X_ = pd.DataFrame()
    for cat_feature in categorical_features:
        # replace each category with its count in X
        X_[cat_feature] = X[cat_feature].astype(
            'object').map(X[cat_feature].value_counts())
        if normalize:
            # divide by the count of the most frequent category
            X_[cat_feature] = X_[cat_feature] / np.max(X_[cat_feature])
    X_ = X_.add_suffix('_count_encoded')
    if normalize:
        X_ = X_.astype(np.float32)
        X_ = X_.add_suffix('_normalized')
    else:
        X_ = X_.astype(np.uint32)
    return X_

Let’s encode the subreddit column:

train_count_subreddit = count_encode(X_train, ['subreddit'])

and check the result. Top 5 from the original column:

AskReddit    221941
politics      98233
news          33559
worldnews     32010
gaming        25567
Name: subreddit, dtype: int64

encoded:

221941    221941
98233      98233
33559      33559
32010      32010
25567      25567
Name: subreddit_count_encoded, dtype: int64

The subreddit categories were simply replaced with their counts. The counts can also be divided by the count of the most frequent category to obtain normalized values:

1.000000    221941
0.442609     98233
0.151207     33559
0.144228     32010
0.115197     25567
Name: subreddit_count_encoded_normalized, dtype: int64
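To use these counts on the validation set, the counts computed on the training set can be mapped onto the validation column. Below is a minimal sketch (not part of the count_encode function above), with unknown categories replaced by 1 and an optional log transform, as mentioned earlier:

# counts computed on the training set only
train_counts = X_train['subreddit'].value_counts()

# categories unseen in training become NaN and are replaced with 1
valid_counts = X_valid['subreddit'].map(train_counts).fillna(1)

# optional log transform to reduce the influence of very frequent categories
valid_counts_log = np.log1p(valid_counts)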

LabelCount encoding

The second method described here is LabelCount encoding, which ranks the categories by their counts in the train set. Because it ranks the values, either ascending or descending order can be used. LabelCount has certain advantages in comparison to standard count encoding - it is not sensitive to outliers and does not give the same encoding to different values.

def labelcount_encode(X, categorical_features, ascending=False):
    print('LabelCount encoding: {}'.format(categorical_features))
    X_ = pd.DataFrame()
    for cat_feature in categorical_features:
        cat_feature_value_counts = X[cat_feature].value_counts()
        value_counts_list = cat_feature_value_counts.index.tolist()
        if ascending:
            # for ascending ordering
            value_counts_range = list(
                reversed(range(len(cat_feature_value_counts))))
        else:
            # for descending ordering
            value_counts_range = list(range(len(cat_feature_value_counts)))
        labelcount_dict = dict(zip(value_counts_list, value_counts_range))
        X_[cat_feature] = X[cat_feature].map(
            labelcount_dict)
    X_ = X_.add_suffix('_labelcount_encoded')
    if ascending:
        X_ = X_.add_suffix('_ascending')
    else:
        X_ = X_.add_suffix('_descending')
    X_ = X_.astype(np.uint32)
    return X_

Encoding:

train_lc_subreddit = labelcount_encode(X_train, ['subreddit'])

By default, I am using descending ordering, which gives the following result for the top 5 categories in the subreddit column:

0    221941
1     98233
2     33559
3     32010
4     25567
Name: subreddit_labelcount_encoded_descending, dtype: int64

The AskReddit category is the most frequent and is thus transformed to the value 0, signifying the first place in the ranking.

If ascending order is used, the following results are obtained:

40    221941
39     98233
38     33559
37     32010
36     25567
Name: subreddit_labelcount_encoded_ascending, dtype: int64

Target encoding

Last comes the trickiest method - Target encoding. It is based on encoding the values of a categorical variable with the mean of the target variable per category. A statistic (here - the mean) of the target variable can be computed for every group in the train set and afterwards merged onto the validation and test sets to capture the relationship between a group and the target.

To give a more explicit example - for each subreddit we can compute the mean of answer_score, which gives us a general idea of how many upvotes we can expect when posting in a certain subreddit.

When using the target variable, it is very important not to leak any information into the validation set. Every such feature should be computed on the training set and only then merged or concatenated with the validation and test subsets. Even though the target variable is present in the validation set, it cannot be used for any such computation, or an overly optimistic estimate of the validation error will be obtained.

If KFold is used, features based on the target should be computed in-fold. If a single split is performed, they should be computed after splitting the data into train and validation sets.
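As an example, here is a minimal sketch of in-fold (out-of-fold) computation using scikit-learn's KFold; the helper name and usage are illustrative, not part of the functions described in this article:

from sklearn.model_selection import KFold

def target_encode_oof(X, cat_feature, target_feature, n_splits=5):
    # hypothetical helper: each row is encoded with a mean computed on the
    # other folds, so its own target value never contributes to its encoding
    oof = pd.Series(np.nan, index=X.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fit_idx, enc_idx in kf.split(X):
        means = X.iloc[fit_idx].groupby(cat_feature)[target_feature].mean()
        oof.iloc[enc_idx] = X.iloc[enc_idx][cat_feature].map(means).values
    return oof

# illustrative usage on the training set
# X_train['subreddit_target_oof'] = target_encode_oof(
#     X_train, 'subreddit', 'answer_score')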

What is more, smoothing can be added to avoid setting certain categories to 0. Adding random noise is another way of avoiding possible overfitting.
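One common form of smoothing is to blend each per-category mean with the global mean, weighted by the category count. The helper below is one possible variant of this idea, shown as a sketch only; it is not the method used in the target_encode function that follows:

def smoothed_target_mean(X, cat_feature, target_feature, alpha=10.0):
    # hypothetical helper: blend the per-category mean with the global mean,
    # weighted by the category count, so rare categories are pulled towards
    # the global mean instead of taking extreme values
    global_mean = X[target_feature].mean()
    stats = X.groupby(cat_feature)[target_feature].agg(['mean', 'count'])
    return (stats['count'] * stats['mean'] + alpha * global_mean) / (
        stats['count'] + alpha)

# illustrative usage, analogous to the mapping done in target_encode below
# means = smoothed_target_mean(X_train, 'subreddit', 'answer_score')
# X_valid['subreddit_target_smoothed'] = X_valid['subreddit'].map(means)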

When done properly, target encoding is often the best encoding for both linear and non-linear models.

def target_encode(X, X_valid, categorical_features, X_test=None,
                  target_feature='target'):
    print('Target Encoding: {}'.format(categorical_features))
    X_ = pd.DataFrame()
    X_valid_ = pd.DataFrame()
    if X_test is not None:
        X_test_ = pd.DataFrame()
    for cat_feature in categorical_features:
        # target mean per category, computed on the training set only
        group_target_mean = X.groupby([cat_feature])[target_feature].mean()
        X_[cat_feature] = X[cat_feature].map(group_target_mean)
        X_valid_[cat_feature] = X_valid[cat_feature].map(group_target_mean)
        if X_test is not None:
            X_test_[cat_feature] = X_test[cat_feature].map(group_target_mean)
    X_ = X_.astype(np.float32)
    X_ = X_.add_suffix('_target_encoded')
    X_valid_ = X_valid_.astype(np.float32)
    X_valid_ = X_valid_.add_suffix('_target_encoded')
    if X_test is not None:
        X_test_ = X_test_.astype(np.float32)
        X_test_ = X_test_.add_suffix('_target_encoded')
        return X_, X_valid_, X_test_
    return X_, X_valid_

Encoding:

train_tm_subreddit, valid_tm_subreddit = target_encode(
    X_train, X_valid, categorical_features=['subreddit'],
    target_feature='answer_score')

If we take a look at the encoded values, there are significant differences in the average number of upvotes between subreddits:

23.406061    220014
13.082699     98176
19.020845     33916
17.521887     31869
18.235424     25520
21.535477     24692
18.640282     20416
23.688890     20009
3.159401      18695
Name: subreddit_target_encoded, dtype: int64
AskReddit              220014
politics                98176
news                    33916
worldnews               31869
gaming                  25520
todayilearned           24692
funny                   20416
videos                  20009
teenagers               18695
Name: subreddit, dtype: int64

An answer in AskReddit gets on average 23.4 upvotes, whereas in politics only 13.1 and in the teenagers subreddit only 3.2. Such features can be very powerful, as they enable us to explicitly encode some of the target information in the feature set.

Getting encoded values for categories

If we would like to merge the obtained values onto the validation or test set without modifying the functions, it can be done in the following way:

encoded = train_lc_subreddit.subreddit_labelcount_encoded_descending.value_counts().index.values
raw = X_train.subreddit.value_counts().index.values
encoding_dict = dict(zip(raw, encoded))

X_valid['subreddit_labelcount_encoded_descending'] = X_valid['subreddit'].map(
    encoding_dict)

References:

  • HJ van Veen's presentation: source of the encoding methods
  • Datarevenue: the functions in their original form were developed as part of a feature engineering database project for Datarevenue.