It is relatively old now, but on version 0.25, pandas introduced NamedAgg. It is mostly a convenience, it's not huge, but for me, it's life-changing.
The as keyword from SQL is a wonderful one. You can create a complex function on a database column, and then give it an explicit name. This name can be used later in the pipeline as a reference, or even just as-is for convenience when reading the table.
pandas, even though superior to SQL in so many ways, really lacked this until fairly recently. You either do a renaming stage, after receiving multi-index columns or feed the agg function with a complex dictionary structure. With NamedAgg, it becomes as easy as the as keyword, and in my mind, even more elegant.

In [84]:
import pandas as pd

Dataset

Let's load the wine dataset and play with it a bit.

In [85]:
from sklearn.datasets import load_wine
wine = load_wine()
df = pd.DataFrame(wine['data'], columns=wine['feature_names']).loc[:, ['alcohol','color_intensity', 'hue']]
df['class'] = ['class_' + str(x) for x in wine['target']]
df.sample(6)
Out[85]:
alcohol color_intensity hue class
124 11.87 2.80 0.75 class_1
86 12.16 2.45 1.33 class_1
163 12.96 5.28 0.68 class_2
4 13.24 4.32 1.04 class_0
166 13.45 10.68 0.85 class_2
137 12.53 5.00 0.82 class_2

Group By

Pandas .groupby always had a lot of flexability, but it was not perfect. Until lately. Let's start with the basics. If you just want one aggregation function, and it happens to be a very basic one, just call it.

The easy stuff

In [86]:
df.groupby('class').sum()
Out[86]:
alcohol color_intensity hue
class
class_0 810.94 326.170000 62.660
class_1 871.79 219.150000 74.996
class_2 631.38 355.019999 32.770

And you can also call it for only one column, and get back a Series.

In [87]:
df.groupby('class').alcohol.sum()
Out[87]:
class
class_0    810.94
class_1    871.79
class_2    631.38
Name: alcohol, dtype: float64

If you want more basic aggregations on this column, it's also not a problem.

In [88]:
df.groupby('class').alcohol.agg(['sum','count'])
Out[88]:
sum count
class
class_0 810.94 59
class_1 871.79 71
class_2 631.38 48

It's getting more complex

Now let's say you need two aggregation functions, and on more than one column, but not all columns. It's becoming more complex, and the code gets less elegant.

In [89]:
df.groupby('class')[['alcohol','hue']].agg(['sum','count'])
Out[89]:
alcohol hue
sum count sum count
class
class_0 810.94 59 62.660 59
class_1 871.79 71 74.996 71
class_2 631.38 48 32.770 48

Notice how pandas created a MultiIndex on the columns. A MultiIndex has many advantages for efficiency, but when we are exploring data on reasonably sized datasets, they are harder to work with. I'd rather have a flat index. Flat indices are much easier to work with as they have very explicit means of accessing. Also, maybe I don't need all functions on all columns? Perhaps I only need the sum of alcohol and the count of hue?

Replacing the as keyword from SQL

SQL provides an important keyword when using group by. The as keyword essentially lets you call whichever aggregation function you want. If I want two different aggregation functions, on two separate columns, simply write the functions and give them a new name.

SELECT class, sum(alcohol) as sum_alcohol, count(hue) as count_hue
from df
group by class

It used to be hard replicating that. You needed a very confusing dictionary structure that looked like this

df.groupby('class').agg({
    'alcohol': {
        'sum_alcohol': 'sum'
    },
    'hue': {
        'count_hue': 'count'
    }
})

It would have worked, but first of all, it's pretty not very elegant with all those nested dicts, and it would still return a MultiIndex on the columns.
The solution comes with pd.NamedAgg. The way to use it is still with the agg function, which now accepts **kwargs. Just provide it with a NamedAgg using the following template.

Language Syntax
SQL f(column) as my_column name
pandas my_column_name=pd.NamedAgg(column, f)

Let's see.

You need pd.NamedAgg

In [90]:
df.groupby('class').agg(
    sum_alcohol = pd.NamedAgg('alcohol', 'sum'),
    count_hue = pd.NamedAgg('hue', 'count')
)
Out[90]:
sum_alcohol count_hue
class
class_0 810.94 59
class_1 871.79 71
class_2 631.38 48

This is clear, concise, and tells a story that is easy to follow with the analysis intent. Also, it is easy to use it with comment switches. If I want to turn some aggregation on and off, comment out its line.

Complex agg function

The agg method is very robust. One of its coolest behaviors is its ability to accept lambda functions.

In [91]:
df.groupby('class').agg({'alcohol':['sum', 'count'], 'hue': ['min', lambda x: x.prod() ** (1 / len(x))]})
Out[91]:
alcohol hue
sum count min <lambda_0>
class
class_0 810.94 59 0.82 1.055648
class_1 871.79 71 0.69 1.037434
class_2 631.38 48 0.48 0.673688

But as you can observe, we remain with a MultiIndex on columns and a column named <lambda_0>. Renaming this, and would be hell. Imagine reading your own code 6 months from now, when you suddenly meet this line: df.rename(columns={'<lambda_0>':'geomean'}). Blah 🤢. pandas will give it a readable name if you use def function(x): but, that may sometimes have the overhead of writing small unnecessary functions. NamedAgg takes care of all this hassle. It can easily be fed lambda functions with names given on the agg method.

In [92]:
df.groupby('class').agg(
    sum_alcohol = pd.NamedAgg('alcohol', 'sum'),
    geomean_of_hue = pd.NamedAgg('hue', lambda x: x.prod() ** (1 / len(x)))
)
Out[92]:
sum_alcohol geomean_of_hue
class
class_0 810.94 1.055648
class_1 871.79 1.037434
class_2 631.38 0.673688

You don't really need pd.NamedAgg

In reality, if you feed agg with many **kwargs, all have a parameter name and tuple as a value, maybe there's no need to write pd.NamedAgg every time?
Indeed, NamedAgg more than being an API innovation, it's a framework innovation. It is its own class, but this class is more of a blueprint.

In [93]:
tmp_agg = pd.NamedAgg('tmp_col', 'somefunc')
print(tmp_agg)
print(type(tmp_agg))
NamedAgg(column='tmp_col', aggfunc='somefunc')
<class 'pandas.core.groupby.generic.NamedAgg'>

You can see that it can be instantiated with a non-existing function, on a non-existing column name, without a dataframe. It will only try to work inside an aggregation. The agg method is smart enough. When given many **kwargs as tuples, it knows it should instantiate them as NamedAggs. And this is where it ends, with this beautiful groupby aggregation, that finally, truly replaces the as keyword from SQL.

In [94]:
df.groupby('class').agg(
    sum_alcohol = ('alcohol', 'sum'),
    geomean_of_hue = ('hue', lambda x: x.prod() ** (1 / len(x)))
)
Out[94]:
sum_alcohol geomean_of_hue
class
class_0 810.94 1.055648
class_1 871.79 1.037434
class_2 631.38 0.673688

tl;dr

Syntax

Language Syntax
SQL f(column) as my_column_name
old pandas {{column: {my_column_name: f}}
pandas my_column_name = pd.NamedAgg(column, f)
better pandas my_column_name = (column, f)

Example

SQL pandas
SELECT class, sum(alcohol) as sum_alcohol, count(hue) as count_hue
from df
group by class
df.groupby('class').agg(
sum_alcohol = ('alcohol', 'sum'),
count_hue = ('hue', 'count)
)


Comments

comments powered by Disqus