It is relatively old now, but on version 0.25, pandas
introduced NamedAgg
. It is mostly a convenience, it's not huge, but for me, it's life-changing.
The as
keyword from SQL is a wonderful one. You can create a complex function on a database column, and then give it an explicit name. This name can be used later in the pipeline as a reference, or even just as-is for convenience when reading the table.
pandas
, even though superior to SQL in so many ways, really lacked this until fairly recently. You either do a renaming stage, after receiving multi-index columns or feed the agg
function with a complex dictionary structure. With NamedAgg
, it becomes as easy as the as
keyword, and in my mind, even more elegant.
import pandas as pd
Dataset¶
Let's load the wine dataset and play with it a bit.
from sklearn.datasets import load_wine
wine = load_wine()
df = pd.DataFrame(wine['data'], columns=wine['feature_names']).loc[:, ['alcohol','color_intensity', 'hue']]
df['class'] = ['class_' + str(x) for x in wine['target']]
df.sample(6)
Group By¶
Pandas .groupby
always had a lot of flexability, but it was not perfect. Until lately. Let's start with the basics.
If you just want one aggregation function, and it happens to be a very basic one, just call it.
The easy stuff¶
df.groupby('class').sum()
And you can also call it for only one column, and get back a Series
.
df.groupby('class').alcohol.sum()
If you want more basic aggregations on this column, it's also not a problem.
df.groupby('class').alcohol.agg(['sum','count'])
It's getting more complex¶
Now let's say you need two aggregation functions, and on more than one column, but not all columns. It's becoming more complex, and the code gets less elegant.
df.groupby('class')[['alcohol','hue']].agg(['sum','count'])
Notice how pandas created a MultiIndex
on the columns. A MultiIndex
has many advantages for efficiency, but when we are exploring data on reasonably sized datasets, they are harder to work with. I'd rather have a flat index. Flat indices are much easier to work with as they have very explicit means of accessing.
Also, maybe I don't need all functions on all columns? Perhaps I only need the sum of alcohol and the count of hue?
Replacing the as
keyword from SQL¶
SQL provides an important keyword when using group by
. The as
keyword essentially lets you call whichever aggregation function you want. If I want two different aggregation functions, on two separate columns, simply write the functions and give them a new name.
SELECT class, sum(alcohol) as sum_alcohol, count(hue) as count_hue
from df
group by class
It used to be hard replicating that. You needed a very confusing dictionary structure that looked like this
df.groupby('class').agg({
'alcohol': {
'sum_alcohol': 'sum'
},
'hue': {
'count_hue': 'count'
}
})
It would have worked, but first of all, it's pretty not very elegant with all those nested dicts, and it would still return a MultiIndex
on the columns.
The solution comes with pd.NamedAgg
. The way to use it is still with the agg function, which now accepts **kwargs
. Just provide it with a NamedAgg
using the following template.
Language | Syntax |
---|---|
SQL | f(column) as my_column name |
pandas | my_column_name=pd.NamedAgg(column, f) |
Let's see.
You need pd.NamedAgg
¶
df.groupby('class').agg(
sum_alcohol = pd.NamedAgg('alcohol', 'sum'),
count_hue = pd.NamedAgg('hue', 'count')
)
This is clear, concise, and tells a story that is easy to follow with the analysis intent. Also, it is easy to use it with comment switches. If I want to turn some aggregation on and off, comment out its line.
Complex agg
function¶
The agg
method is very robust. One of its coolest behaviors is its ability to accept lambda functions.
df.groupby('class').agg({'alcohol':['sum', 'count'], 'hue': ['min', lambda x: x.prod() ** (1 / len(x))]})
But as you can observe, we remain with a MultiIndex
on columns and a column named <lambda_0>
. Renaming this, and would be hell. Imagine reading your own code 6 months from now, when you suddenly meet this line: df.rename(columns={'<lambda_0>':'geomean'})
. Blah 🤢.
pandas
will give it a readable name if you use def function(x):
but, that may sometimes have the overhead of writing small unnecessary functions.
NamedAgg
takes care of all this hassle. It can easily be fed lambda functions with names given on the agg
method.
df.groupby('class').agg(
sum_alcohol = pd.NamedAgg('alcohol', 'sum'),
geomean_of_hue = pd.NamedAgg('hue', lambda x: x.prod() ** (1 / len(x)))
)
You don't really need pd.NamedAgg
¶
In reality, if you feed agg
with many **kwargs
, all have a parameter name and tuple
as a value, maybe there's no need to write pd.NamedAgg
every time?
Indeed, NamedAgg
more than being an API innovation, it's a framework innovation. It is its own class, but this class is more of a blueprint.
tmp_agg = pd.NamedAgg('tmp_col', 'somefunc')
print(tmp_agg)
print(type(tmp_agg))
You can see that it can be instantiated with a non-existing function, on a non-existing column name, without a dataframe. It will only try to work inside an aggregation.
The agg
method is smart enough. When given many **kwargs
as tuples, it knows it should instantiate them as NamedAgg
s. And this is where it ends, with this beautiful groupby
aggregation, that finally, truly replaces the as
keyword from SQL.
df.groupby('class').agg(
sum_alcohol = ('alcohol', 'sum'),
geomean_of_hue = ('hue', lambda x: x.prod() ** (1 / len(x)))
)
tl;dr¶
Syntax¶
Language | Syntax |
---|---|
SQL | f(column) as my_column_name |
old pandas | {{column: {my_column_name: f}} |
pandas | my_column_name = pd.NamedAgg(column, f) |
better pandas | my_column_name = (column, f) |
Example¶
SQL | pandas |
---|---|
SELECT class, sum(alcohol) as sum_alcohol, count(hue) as count_hue |
df.groupby('class').agg( |
Comments
comments powered by Disqus