The following post was first presented as a talk for the IE@DS community. It will also be presented at the PyData meetup in December.
All the resources for this post, including a runnable notebook, can be found in the GitHub repo.
Let's begin
This notebook aims to show some nice ways modern Pandas makes your life easier. It is not about efficiency; I'm pretty sure using Pandas' built-in methods will be more efficient than reinventing pandas anyway. The main goal is to make the code easier to read and, more important, easier to write.
import sys
import os
sys.path.append(os.path.dirname(os.getcwd()))
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import blog
%matplotlib inline
import my_utils
blog.set_blog_style()
First Hacks!
Reading the data and a few housekeeping tasks. This is the first place we can make our code more readable.
df_io = pd.read_csv('./data.csv', index_col=0, parse_dates=['date_'])
df_io.head()
df = df_io.copy().sort_values('date_').set_index('date_').drop(columns='val_updated')
df.head()
Beautiful pipes!
One-line method chaining is hard to read and prone to human error; chaining with each method on its own line makes it a lot more readable.
df_io\
.copy()\
.sort_values('date_')\
.set_index('date_')\
.drop(columns='val_updated')\
.head()
But it has a problem: you can't comment out lines, or even add comments, in between.
# This block will result in an error
df_io\
.copy()\ # This is an inline comment
# This is a regular comment
.sort_values('date_')\
# .set_index('date_')\
.drop(columns='val_updated')\
.head()
Even an unnoticeable space character (say, a trailing space after one of the backslashes) may break everything
# This block will result in an error
df_io\
.copy()\
.sort_values('date_')\
.set_index('date_')\
.drop(columns='val_updated')\
.head()
The Penny Drops
I like those "penny dropping" moments, when you realize you already knew everything being presented, yet it's presented in a way you never thought of.
# We can split these values inside parentheses
users = (134856, 195373, 295817, 294003, 262166, 121066, 129678, 307120,
         258759, 277922, 220794, 192312, 318486, 314631, 306448, 297059,
         206892, 169046, 181703, 146200, 199876, 247904, 250884, 282989,
         234280, 202520, 138064, 133577, 301053, 242157)
# Penny Drop: We can also Split here
df = (df_io
.copy() # This is an inline comment
# This is a regular comment
.sort_values('date_')
.set_index('date_')
.drop(columns='val_updated')
)
df.head()
Map with dict
A dict maps each key to a value, $f(\text{key}) = \text{value}$, therefore you can call .map with it. In this example I want to turn int key codes into letters.
# .map accepts any function
df.event_type.map(lambda x: x + 3).head()
# A dict also works as a mapper
df['event_type'] = df.event_type.map({
1:'A',
5:'B',
7:'C'
})
df.head()
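One gotcha to keep in mind: when you map with a dict, keys that are missing from the dict become NaN. A minimal sketch on a toy series:
s = pd.Series([1, 5, 9])
s.map({1: 'A', 5: 'B'})      # 9 is not in the dict -> NaN
s.replace({1: 'A', 5: 'B'})  # .replace keeps unmapped values as they are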
Time Series
Resample
Task: How many events happen each hour?
The Old Way
bad = df.copy()
bad['day'] = bad.index.date
bad['hour'] = bad.index.hour
(bad
.groupby(['day','hour'])
.count()
)
- Many lines of code
- Unneeded columns
- The index is not a timestamp anymore
- Missing rows (did you notice? see the sketch below)
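To see those missing rows concretely, here is a minimal sketch on toy data (not our dataset): nothing happens during the 18:00 hour, and the groupby result silently skips it, while resample keeps it as an explicit zero.
toy = pd.DataFrame({'event_type': ['A', 'B']},
                   index=pd.to_datetime(['2018-10-09 17:05',
                                         '2018-10-09 19:20']))
toy.groupby(toy.index.hour).count()  # rows for hours 17 and 19 only
toy.resample('H').count()            # hours 17, 18 and 19 - 18 has count 0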
A Better Way
df.resample('H').count() # H is for Hour
But it's even better on non-round intervals
rs = df.resample('10T').count()
# T is for minute, and pandas understands 10T; it will also understand 11T if you wonder
rs.head()
Slice Easily
Pandas will automatically parse strings into timestamps, and it will understand what you want it to do.
# Take only the timestamps in the 21:00 hour
rs.loc['2018-10-09 21',:]
# Take all timestamps before 18:31
rs.loc[:'2018-10-09 18:31',:]
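Partial-string slicing works at coarser resolutions too, and string endpoints are inclusive. Two more sketches on the same frame:
# The whole day at once
rs.loc['2018-10-09']
# An explicit range - both endpoints are included
rs.loc['2018-10-09 18:00':'2018-10-09 19:00']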
Time Windows: Rolling, Expanding, EWM
If your DataFrame is indexed on a datetime index (which ours is), you have many options for window functions: again, sort of "groupby" operations that in the old days we used to implement on our own.
fig, ax = plt.subplots()
rs.plot(ax=ax,linestyle='--')
(rs
.rolling(6)
.mean()
.rename(columns = {'event_type':'rolling mean'})
.plot(ax=ax)
)
rs.expanding(6).mean().rename(columns = {'event_type':'expanding mean'}).plot(ax=ax)
rs.ewm(6).mean().rename(columns = {'event_type':'ewm mean'}).plot(ax=ax)
plt.show()
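Because the index is a DatetimeIndex, rolling also accepts a time offset instead of a row count. A small sketch on the same rs frame:
# A 60-minute window defined by time rather than by a number of rows,
# so it stays correct even when some intervals are missing
rs.rolling('60T').mean().head()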
With Apply
Intuitively, windows are like GroupBy, so you can apply anything you want after the grouping, e.g. a geometric mean.
fig, ax = plt.subplots()
rs.plot(ax=ax,linestyle='--')
(rs
.rolling(6)
.apply(lambda x: np.power(np.prod(x), 1/len(x)), raw=True)
.rename(columns = {'event_type':'Rolling Geometric Mean'})
.plot(ax=ax)
)
plt.show()
Combine with GroupBy 🤯
Pandas has no problem combining groupby and resample. It's as simple as groupby([col1, col2]). In our specific case, we want to count events in an interval per event type.
per_event = (df
.groupby('event_type')
.resample('15T')
.apply('count')
.rename(columns={'event_type':'amount'})
)
per_event.head()
Stack, Unstack
Unstack
In this case, working with a wide format indexed on intervals, with event types as columns, will make a lot more sense.
The Old Way
Pivot table in modern pandas is more robust than it used to be. Still, it requires you to specify everything.
pt = pd.pivot_table(per_event, values='amount', columns='event_type', index='date_')
pt.head()
A Better Way
When you have just one column of values, unstack does the same easily.
pt = per_event.unstack('event_type')
pt.columns = pt.columns.droplevel() # Unstack creates a multiindex on columns
pt.head()
Stack
And some extra tricks:
pt.stack().head()
This looks kind of like what we expected, but:
- It's a series, not a DataFrame
- The levels of the index are reversed
- The main sort is on the date, yet it used to be on the event type
stack_back = (pt
.stack()
.to_frame('amount') # Turn Series to DF without calling the DF constructor
.swaplevel() # Swaps the levels of the index
.sort_index() # Sort by index, makes so much sense yet I used to do .reset_index().sort_values()
)
stack_back.head()
stack_back.equals(per_event)
Sorting
By Values
per_event.sort_values(by=['amount']).head(10)
By Index
per_event.sort_index().head(7)
per_event.sort_index(level=1).head(7)
By Both (New in 0.23)
If the index has a name, you can just treat it as a column.
per_event.sort_values(['amount','event_type'], ascending=(False, True)).head(10)
Clip
Let's say we know from domain knowledge that an event takes place a minimum of 5 and a maximum of 12 times at each timestamp, and we would like to enforce that. In real-world examples, we often want to turn negative numbers into zeroes, or cap truly big numbers at some known max.
The Old Way
Iterate over the columns and change the values that meet a condition.
cl = pt.copy()
lb = 5
ub = 12
# Needed: a loop of 3 lines
for col in ['A','B','C']:
cl['clipped_{}'.format(col)] = cl[col]
cl.loc[cl[col] < lb,'clipped_{}'.format(col)] = lb
cl.loc[cl[col] > ub,'clipped_{}'.format(col)] = ub
my_utils.plot_clipped(cl) # my_utils can be found in the GitHub repo
A Better Way
.clip(lb, ub)
cl = pt.copy()
# Beautiful one-liner
cl[['clipped_A','clipped_B','clipped_C']] = cl.clip(5,12)
my_utils.plot_clipped(cl) # my_utils can be found in the GitHub repo
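clip also takes a single bound, so the "negative numbers to zeroes" case from above is a one-liner. A toy sketch:
pd.Series([-3, 0, 4]).clip(lower=0)  # -> 0, 0, 4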
Reindex
Now I have 3 event types from 17:00 to 23:00. Let's imagine I know that I actually have 5 event types, and that the full period was from 16:00 to 00:00.
etypes = list('ABCDZ') # New columns
# Define a date range - Pandas will automatically make this into an index
idx = pd.date_range(start='2018-10-09 16:00:00',end='2018-10-09 23:59:00',freq='15T')
type(idx)
pt.reindex(idx, columns=etypes, fill_value=0).head()
# Let's put this in a function - this will help us later
def get_all_types_and_timestamps(df, min_date='2018-10-09 16:00:00',
max_date='2018-10-09 23:59:00', etypes=list('ABCDZ')):
ret = df.copy()
time_idx = pd.date_range(start=min_date,end=max_date,freq='15T')
# Indices work like sets. This is good practice so we don't override our intended index
idx = ret.index.union(time_idx)
etypes = df.columns.union(set(etypes))
ret = ret.reindex(idx, columns=etypes, fill_value=0)
return ret
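A quick sanity check that the function reproduces the manual reindex above:
get_all_types_and_timestamps(pt).head()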
Method Chaining
Assign
Assign is for creating new columns on the dataframe, instead of df[new_col] = function(df[old_col]). Both are one-liners, but .assign doesn't break the flow.
pt.assign(mean_all = pt.mean(axis=1)).head()
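assign also accepts callables, and since pandas 0.23 (on Python 3.6+) later keyword arguments may refer to columns created by earlier ones in the same call. A sketch with a made-up mean_all_pct column:
(pt
 .assign(mean_all=lambda df: df.mean(axis=1),
         # later arguments can use columns created by earlier ones
         mean_all_pct=lambda df: df.mean_all / df.mean_all.max())
 .head()
)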
Pipe
Think R's %>% (or rather, avoid thinking about R). .pipe is a method that accepts a function. By default, pipe assumes the first argument of that function is a dataframe, and it passes the current dataframe down the pipeline. The function should also return a dataframe if you want the pipe to continue, yet it can return any other value if you put it in the last step.
This is incredibly valuable because it takes you one step away from "sql", where you do things "in reverse":
$f(g(h(df)))$ becomes df.pipe(h).pipe(g).pipe(f)
def do_something(df, col='A', n = 200):
ret = df.copy()
# A dataframe is mutable; if you don't copy it first, this is prone to many errors.
# I always copy when I enter a function, even if I'm sure it shouldn't change anything.
ret[col] = ret[col] + n
return ret
do_something(do_something(do_something(pt), 'B', 100), 'C',500).head()
(pt
.pipe(do_something)
.pipe(do_something, col='B', n=100)
.pipe(do_something, col='C', n=500)
.head(5))
You can always do this with multiple lines of df = do_something(df), but I think this method is more elegant.
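If the function expects the dataframe somewhere other than the first argument, .pipe also accepts a (callable, keyword) tuple naming the argument the dataframe should be passed into. A sketch with a hypothetical helper:
# Hypothetical helper whose dataframe is the *second* argument
def describe_shape(title, df):
    print(title, df.shape)
    return df

pt.pipe((describe_shape, 'df'), 'shape check').head()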
Beautiful Code Tells a Story
Your code is not just about making the computer do things. It's about telling a story of what you wish to happen. Sometimes other people will want to read your code; most of the time it's you, 3 months in the future, who will read it. Some say good code documents itself. I'm not that extreme, yet storytelling with code may save you from many lines of unnecessary comments.
The next and final block tells the story in one block, or rather one pipe. It's elegant, and it tells a story. If you build utility functions and pipe them while following meaningful naming, they help tell a story. If you assign columns with meaningful names, they tell a story. You drop, you apply, you read, you groupby and you resample - they all tell a story.
(Well... maybe they could have gone with better naming for resample.)
df = (pd
.read_csv ('./data.csv', index_col=0, parse_dates=['date_'])
.assign (event_type=lambda df: df.event_type.map({1: 'A', 5: 'B', 7: 'C'}))
.sort_values ('date_')
.set_index ('date_')
.drop (columns='val_updated')
.groupby ('event_type')
.resample ('15T')
.apply ('count')
.rename (columns={'event_type': 'amount'})
.unstack ('event_type')
.pipe (my_utils.remove_multi_index)
.pipe (get_all_types_and_timestamps) # Remember this from before?
.assign (mean_event=lambda df: df.mean(axis=1))
.loc [:, ['mean_event']]
.pipe (my_utils.make_sliding_time_windows, steps_back=6)
# Imagine last pipe is an api. Not sure how it's implemented, but the name is meaningful
.dropna ()
)
df.head()
Resources
- Modern Pandas by Tom Augspurger
- Basic Time Series Manipulation with Pandas by Laura Fedoruk
- Pandas Docs. You don't have to go over everything thoroughly; just randomly open a page in the docs and you're sure to learn a new thing.