It is relatively old now, but on version 0.25, pandas introduced NamedAgg. It is mostly a convenience, it's not huge, but for me, it's life-changing.
The as keyword from SQL is a wonderful one. You can create a complex function on a database column, and then give it an explicit name. This name can be used later in the pipeline as a reference, or even just as-is for convenience when reading the table.
pandas, even though superior to SQL in so many ways, really lacked this until fairly recently. You either do a renaming stage, after receiving multi-index columns or feed the agg function with a complex dictionary structure. With NamedAgg, it becomes as easy as the as keyword, and in my mind, even more elegant.

In [84]:

import pandas as pd

Dataset¶

Let's load the wine dataset and play with it a bit.

In [85]:

from sklearn.datasets import load_wine
wine = load_wine()
df = pd.DataFrame(wine['data'], columns=wine['feature_names']).loc[:, ['alcohol','color_intensity', 'hue']]
df['class'] = ['class_' + str(x) for x in wine['target']]
df.sample(6)

Out[85]:

	alcohol	color_intensity	hue	class
124	11.87	2.80	0.75	class_1
86	12.16	2.45	1.33	class_1
163	12.96	5.28	0.68	class_2
4	13.24	4.32	1.04	class_0
166	13.45	10.68	0.85	class_2
137	12.53	5.00	0.82	class_2

Group By¶

Pandas .groupby always had a lot of flexability, but it was not perfect. Until lately. Let's start with the basics. If you just want one aggregation function, and it happens to be a very basic one, just call it.

The easy stuff¶

In [86]:

df.groupby('class').sum()

Out[86]:

	alcohol	color_intensity	hue
class
class_0	810.94	326.170000	62.660
class_1	871.79	219.150000	74.996
class_2	631.38	355.019999	32.770

And you can also call it for only one column, and get back a Series.

In [87]:

df.groupby('class').alcohol.sum()

Out[87]:

class
class_0    810.94
class_1    871.79
class_2    631.38
Name: alcohol, dtype: float64

If you want more basic aggregations on this column, it's also not a problem.

In [88]:

df.groupby('class').alcohol.agg(['sum','count'])

Out[88]:

	sum	count
class
class_0	810.94	59
class_1	871.79	71
class_2	631.38	48

It's getting more complex¶

Now let's say you need two aggregation functions, and on more than one column, but not all columns. It's becoming more complex, and the code gets less elegant.

In [89]:

df.groupby('class')[['alcohol','hue']].agg(['sum','count'])

Out[89]:

	alcohol		hue
	sum	count	sum	count
class
class_0	810.94	59	62.660	59
class_1	871.79	71	74.996	71
class_2	631.38	48	32.770	48

Notice how pandas created a MultiIndex on the columns. A MultiIndex has many advantages for efficiency, but when we are exploring data on reasonably sized datasets, they are harder to work with. I'd rather have a flat index. Flat indices are much easier to work with as they have very explicit means of accessing. Also, maybe I don't need all functions on all columns? Perhaps I only need the sum of alcohol and the count of hue?

Replacing the `as` keyword from SQL¶

SQL provides an important keyword when using group by. The as keyword essentially lets you call whichever aggregation function you want. If I want two different aggregation functions, on two separate columns, simply write the functions and give them a new name.

SELECT class, sum(alcohol) as sum_alcohol, count(hue) as count_hue
from df
group by class

It used to be hard replicating that. You needed a very confusing dictionary structure that looked like this

df.groupby('class').agg({
    'alcohol': {
        'sum_alcohol': 'sum'
    },
    'hue': {
        'count_hue': 'count'
    }
})

It would have worked, but first of all, it's pretty not very elegant with all those nested dicts, and it would still return a MultiIndex on the columns.
The solution comes with pd.NamedAgg. The way to use it is still with the agg function, which now accepts **kwargs. Just provide it with a NamedAgg using the following template.

Language	Syntax
SQL	`f(column) as my_column name`
pandas	`my_column_name=pd.NamedAgg(column, f)`

Let's see.

You need `pd.NamedAgg`¶

In [90]:

df.groupby('class').agg(
    sum_alcohol = pd.NamedAgg('alcohol', 'sum'),
    count_hue = pd.NamedAgg('hue', 'count')
)

Out[90]:

	sum_alcohol	count_hue
class
class_0	810.94	59
class_1	871.79	71
class_2	631.38	48

This is clear, concise, and tells a story that is easy to follow with the analysis intent. Also, it is easy to use it with comment switches. If I want to turn some aggregation on and off, comment out its line.

Complex `agg` function¶

The agg method is very robust. One of its coolest behaviors is its ability to accept lambda functions.

In [91]:

df.groupby('class').agg({'alcohol':['sum', 'count'], 'hue': ['min', lambda x: x.prod() ** (1 / len(x))]})

Out[91]:

	alcohol		hue
	sum	count	min	<lambda_0>
class
class_0	810.94	59	0.82	1.055648
class_1	871.79	71	0.69	1.037434
class_2	631.38	48	0.48	0.673688

But as you can observe, we remain with a MultiIndex on columns and a column named <lambda_0>. Renaming this, and would be hell. Imagine reading your own code 6 months from now, when you suddenly meet this line: df.rename(columns={'<lambda_0>':'geomean'}). Blah 🤢. pandas will give it a readable name if you use def function(x): but, that may sometimes have the overhead of writing small unnecessary functions. NamedAgg takes care of all this hassle. It can easily be fed lambda functions with names given on the agg method.

In [92]:

df.groupby('class').agg(
    sum_alcohol = pd.NamedAgg('alcohol', 'sum'),
    geomean_of_hue = pd.NamedAgg('hue', lambda x: x.prod() ** (1 / len(x)))
)

Out[92]:

	sum_alcohol	geomean_of_hue
class
class_0	810.94	1.055648
class_1	871.79	1.037434
class_2	631.38	0.673688

You don't really need `pd.NamedAgg`¶

In reality, if you feed agg with many **kwargs, all have a parameter name and tuple as a value, maybe there's no need to write pd.NamedAgg every time?
Indeed, NamedAgg more than being an API innovation, it's a framework innovation. It is its own class, but this class is more of a blueprint.

In [93]:

tmp_agg = pd.NamedAgg('tmp_col', 'somefunc')
print(tmp_agg)
print(type(tmp_agg))

NamedAgg(column='tmp_col', aggfunc='somefunc')
<class 'pandas.core.groupby.generic.NamedAgg'>

You can see that it can be instantiated with a non-existing function, on a non-existing column name, without a dataframe. It will only try to work inside an aggregation. The agg method is smart enough. When given many **kwargs as tuples, it knows it should instantiate them as NamedAggs. And this is where it ends, with this beautiful groupby aggregation, that finally, truly replaces the as keyword from SQL.

In [94]:

df.groupby('class').agg(
    sum_alcohol = ('alcohol', 'sum'),
    geomean_of_hue = ('hue', lambda x: x.prod() ** (1 / len(x)))
)

Out[94]:

	sum_alcohol	geomean_of_hue
class
class_0	810.94	1.055648
class_1	871.79	1.037434
class_2	631.38	0.673688

tl;dr¶

Syntax¶

Language	Syntax
SQL	`f(column) as my_column_name`
old pandas	`{{column: {my_column_name: f}}`
pandas	`my_column_name = pd.NamedAgg(column, f)`
better pandas	`my_column_name = (column, f)`

Example¶

SQL	pandas
SELECT class, sum(alcohol) as sum_alcohol, count(hue) as count_hue from df group by class	df.groupby('class').agg( sum_alcohol = ('alcohol', 'sum'), count_hue = ('hue', 'count) )

DeanLa

Why is Nobody Talking about Pandas NamedAgg?

Table of Contents

Dataset¶

Group By¶

The easy stuff¶

It's getting more complex¶

Replacing the `as` keyword from SQL¶

You need `pd.NamedAgg`¶

Complex `agg` function¶

You don't really need `pd.NamedAgg`¶

tl;dr¶

Syntax¶

Example¶

Comments

DeanLa

Table of Contents

Dataset¶

Group By¶

The easy stuff¶

It's getting more complex¶

Replacing the as keyword from SQL¶

You need pd.NamedAgg¶

Complex agg function¶

You don't really need pd.NamedAgg¶

tl;dr¶

Syntax¶

Example¶

Comments

Replacing the `as` keyword from SQL¶

You need `pd.NamedAgg`¶

Complex `agg` function¶

You don't really need `pd.NamedAgg`¶