Scikit-learn feature preview: output Pandas DataFrames in transformers
Discover how the new set_output feature can be used with pipelines to configure each transformer to return Pandas DataFrames.
Building machine learning models in Python is not what it was a few years ago, mainly thanks to scikit-learn. This library takes care of the implementation of all the optimization algorithms that lie underneath the different models, making the fitting procedure both efficient and convenient. Besides all the different model families, scikit-learn also contains a huge range of utilities to preprocess our variables, deal with missing data, select the most relevant features, and so on. Most of these utilities are known as transformers, and they have nothing to do with the more recent Deep Learning architecture that revolutionized the NLP scene. Like other sklearn estimators, these utilities have a fit method, which learns the model parameters, and a transform method, which applies the transformation to the data (see 6. Dataset transformations).
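As a minimal sketch of this API (using StandardScaler as an arbitrary example, unrelated to the rest of the post):

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler()
scaler.fit(data)               # learns the column mean and standard deviation
print(scaler.transform(data))  # applies the standardization to the data
# fit_transform(data) performs both steps in a single call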
In this post we are going to test one of scikit-learn's newest features, not yet available in the current stable release of the library (version 1.1.2). To test new features, the first thing we need to do is install the development version of the library. You can find detailed instructions here. The best option is to go with the nightly builds, but at the time of writing there is a slight problem with these builds, so I decided to compile the library from source. Once the environment is created and activated, let us check that we are indeed working with the dev version:
import sklearn
sklearn.show_versions()
System:
    python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0]
executable: /home/atorres/miniconda3/envs/sklearn-env/bin/python
   machine: Linux-5.4.0-128-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.dev0
          pip: 22.3
   setuptools: 65.5.0
        numpy: 1.23.4
        scipy: 1.9.2
       Cython: 0.29.32
       pandas: 1.5.1
   matplotlib: 3.6.0
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: False

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/atorres/miniconda3/envs/sklearn-env/lib/libopenblasp-r0.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell
    num_threads: 16
The new feature, set_output, allows transformers to output Pandas DataFrames when the input is also a DataFrame. In previous versions the output was always a numpy array, even when the input was a DataFrame. Let us look at a specific example with the diamonds dataset. First, we will load the data and separate the input variables from the target variable (price):
import seaborn as sns
import pandas as pd
X = sns.load_dataset('diamonds')
y = X.pop("price")
Then, let us transform each group of features separately:
- Numerical columns: apply standard scaler
- Categorical columns: create one-hot encoding
- Ordinal column ("cut"): use an ordinal encoder instead, which maps the values "Fair", "Good", "Very Good", … to consecutive integers 0, 1, 2, …
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
cat_order = [["Fair", "Good", "Very Good", "Premium", "Ideal"]]
pre = ColumnTransformer(
transformers=[
("num", StandardScaler(), make_column_selector(dtype_include="number")),
("cat", OneHotEncoder(drop="first", sparse_output=False), ["color", "clarity"]),
("ord", OrdinalEncoder(categories=cat_order), ["cut"]),
]
)
X_pre = pre.fit_transform(X)
The output in the current version is a numpy array with 20 columns:
print(X_pre.shape)
print(type(X_pre))
(53940, 20) <class 'numpy.ndarray'>
Getting the names of the columns for this numpy array is not a trivial task, since it depends on:
- The order of the transformations.
- The type of the transformations: some are "expansive" (they create several columns from a single one, like one-hot encoding), some are "reductive" (they reduce the number of columns, like PCA) and some do not change the number of columns at all (for instance, StandardScaler).
- The parameters of the transformations: for instance, you will get a different number of columns with drop="first" in OneHotEncoder than without it (see the short illustration below).
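A quick illustration of the last point, on a hypothetical toy column with three categories:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

toy = np.array([["a"], ["b"], ["c"]])
# full encoding: one output column per category
print(OneHotEncoder(sparse_output=False).fit_transform(toy).shape)                # (3, 3)
# drop="first" removes the first category, so one column fewer
print(OneHotEncoder(drop="first", sparse_output=False).fit_transform(toy).shape)  # (3, 2)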
Fortunately, this was simplified in version 1.0, which introduced the method get_feature_names_out:
pre.get_feature_names_out()
array(['num__carat', 'num__depth', 'num__table', 'num__x', 'num__y',
       'num__z', 'cat__color_E', 'cat__color_F', 'cat__color_G',
       'cat__color_H', 'cat__color_I', 'cat__color_J', 'cat__clarity_IF',
       'cat__clarity_SI1', 'cat__clarity_SI2', 'cat__clarity_VS1',
       'cat__clarity_VS2', 'cat__clarity_VVS1', 'cat__clarity_VVS2',
       'ord__cut'], dtype=object)
The transformed output can be cast to a DataFrame with:
X_df = pd.DataFrame(X_pre, columns=pre.get_feature_names_out())
The upcoming scikit-learn version 1.2 introduces a more convenient way of doing this operation with the method set_output:
X_df = pre.set_output(transform="pandas").fit_transform(X)
print(X_df.shape)
print(type(X_df))
(53940, 20) <class 'pandas.core.frame.DataFrame'>
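Note that the columns of the resulting DataFrame carry the feature names we saw earlier:

print(X_df.columns[:3].tolist())  # ['num__carat', 'num__depth', 'num__table']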
So far we have seen set_output working on a ColumnTransformer; the same method is available on any of scikit-learn's transformers. However, the real benefit of this new feature is that it can also be used with pipelines, allowing each transformer in the pipeline to work with DataFrames.
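As a side note, version 1.2 also adds a global configuration flag, so that every transformer returns DataFrames without having to call set_output on each estimator:

from sklearn import set_config

set_config(transform_output="pandas")  # from now on, all transformers output DataFrames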
Example
To demonstrate this, let us apply the previous preprocessing and then fit a simple linear regression model, using sklearn's Pipeline class and the new set_output:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
pipeline = Pipeline([("pre", pre), ("model", LinearRegression())]).set_output(transform="pandas")
pipeline.fit(X, y)
Pipeline(steps=[('pre',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7face55b0c70>),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                sparse_output=False),
                                                  ['color', 'clarity']),
                                                 ('ord',
                                                  OrdinalEncoder(categories=[['Fair',
                                                                              'Good',
                                                                              'Very Good',
                                                                              'Premium',
                                                                              'Ideal']]),
                                                  ['cut'])])),
                ('model', LinearRegression())])
Now, we can access the names of the features that go into the model with feature_names_in_:
print(pipeline["model"].feature_names_in_)
['num__carat' 'num__depth' 'num__table' 'num__x' 'num__y' 'num__z'
 'cat__color_E' 'cat__color_F' 'cat__color_G' 'cat__color_H'
 'cat__color_I' 'cat__color_J' 'cat__clarity_IF' 'cat__clarity_SI1'
 'cat__clarity_SI2' 'cat__clarity_VS1' 'cat__clarity_VS2'
 'cat__clarity_VVS1' 'cat__clarity_VVS2' 'ord__cut']
Note that it doesn’t matter whether we have a single transformation in the pipeline or several: scikit-learn will keep track of the column names through all of them. This can be useful, for instance, to plot the values of the coefficients. To make the plot a little nicer, we will strip from each feature name the prefix up to and including the "__":
# strip the "num__"/"cat__"/"ord__" prefix from each feature name
col_names = pd.Series(pipeline["model"].feature_names_in_).str.replace(".*__", "", regex=True)
# pair each coefficient with its cleaned feature name
coef = pd.Series(pipeline["model"].coef_, index=col_names)
coef.sort_values().plot(kind="barh");
Although this is a very simple model, it seems that both the clarity and the carats are important factors for predicting the price of diamonds.