ENH: Initial pass at implementing DataFrame.asof, GH 2941 #10266

bwillers · 2015-06-04T02:37:01Z

This can almost certainly be made quicker, still digging into the
internals to understand the various underlying indexers.

There are 4 distinct logics you can apply to how to deal with the
missings (as opposed to the Series asof, wheres theres just two,
either skipna or dont skipna). The default is to be equivalent to
df.apply(lambda s: s.asof(where)).

Likely not ready to actually be merged, but at a stage where I could
use some feedback on both the logic and implementation.

jreback · 2015-06-04T12:12:41Z

pls rebase on master and repush had an issue with some builds

bwillers · 2015-06-05T22:44:04Z

Sorry about that, must've been some changes between my rebase and the travis build. Anyway, it's green now.

Implements DataFrame.asof with various possible logics for skipping missing elements. Default case is equivalent to df.apply(lambda s: s.asof(where))

bwillers · 2015-06-12T11:21:59Z

Reworked this using take_2d_multi, thoughts?

shoyer · 2015-06-12T18:51:51Z

Is there a good reason why this is restricted to timestamps? I suspect not, and it would be nice to generalize things a bit.

I recently cleaned up the related Index.asof method -- your method here depends on Index.asof_locs, which could also definitely use another look. That would be greatly appreciated and would make me more confident in your new feature. Note, for example, that Index.asof_locs currently has no tests.

shoyer · 2015-06-12T18:53:45Z

pandas/core/frame.py

+                return Series(index=self.columns, data=np.nan)
+
+            s = self.iloc[loc, :].copy()
+            s.name = None


The name of this series should probably be the where argument. This will be necessary for the consistency check I describe below.

shoyer · 2015-06-12T19:00:14Z

pandas/core/frame.py

+        else:
+            raise ValueError("skipna must be one of percolumn, none, any, all.")
+
+        if not hasattr(where, '__iter__'):


might be better simpler just to call self.asof([where]).iloc[0] -- though you might need to disable the "setting item with copy warning" in that case.

jreback · 2015-06-12T19:00:58Z

.asof is really just an implementation of reindexing. In fact, this is simply a one-sided method='nearest' (the right side). Any reason why we are not simply doing that? E.g. this should just be a generalized method to Index, and then the front end is just simple wrappers around .reindex(method='asof').

shoyer · 2015-06-12T19:08:09Z

@jreback Indeed, this does almost work already with .reindex(method='pad') (or maybe it's method='backfill', I can't keep those straight). The difference here is that asof will drop missing values, though perhaps that's better done explicitly with .dropna(). Also, this version lets you do asof across multiple columns simultaneously.

jreback · 2015-06-12T19:10:02Z

no I get that this handles multiple versions, and maybe that's a reason to special case somewhat, however, this should be an internal type of method (meaning implemented in the Index) as its simply a reindexer

bwillers · 2015-06-12T21:34:55Z

Thanks for the feedback. The motivation for df.asof was largely that there already is a series.asof. But as @shoyer points out, there's really no reason this needs to be specialized to timestamps (any sortable index should work).

Making it a part of index functionality as @jreback suggests, seems a natural choice to make the behaviour available to both frame and series, and also to non timeseries indexes.

There's really only two fundamental cases: either skip missings everywhere .reindex(method='ffill'), or skip them no where (not currently available), the other options can be achieve by the latter and fillna.

I'll take a stab at implementing a method='asof' for reindex to handle the latter case, and let's abandon this PR. Might also make sense to deprecate series.asof in favor of series.reindex(method='ffill'). I'll confirm that it passes the tests for asof and open a seperate issue.

shoyer · 2015-06-12T21:45:13Z

@bwillers I think you have it backwards -- .reindex(method='ffill') will not skip any missing values in the existing series/frame.

Here's my mapping from this method to reindex:

.asof(where, skipna='percolumn') -> .reindex(where, method='ffill').fillna(method='ffill')
.asof(where, skipna='none') -> .reindex(where, method='ffill')
.asof(where, skipna='all') -> .dropna(how='all').reindex(where, method='ffill')
.asof(where, skipna='any') -> .dropna(how='any').reindex(where, method='ffill')

Which case would .reindex(method='asof') cover? Honestly, this feels more like a documentation issue to me :).

bwillers · 2015-06-12T22:23:59Z

@shoyer Ah, well I feel quite the idiot right now. For some reason I was under the impression that .reindex(where, model='ffill') is ignored rows with missing values when it tries to find what to pull forward. Demonstrably however, this is not so:

print(df)
            a   b
2008-01-02  1   1
2008-01-09  2 NaN
2008-01-16  3   3

print(df.reindex(new_dates, method='ffill'))
             a   b
2008-01-01 NaN NaN
2008-01-08   1   1
2008-01-15   2 NaN
2008-01-22   3   3

I was expecting that bottom right NaN to be a 1 - I must have somehow conflated method='ffill' in reindex and fillna.

So since all the df.asof varieties can be achieved using the existing .reindex(where='ffill') approach, all this PR does is avoid an intermediate frame, but that's probably not worth cluttering the API for.

Sorry for the cycles wasted on my poor understanding.

shoyer · 2015-06-12T22:26:12Z

@bwillers No worries! We would still appreciate documentation updates and/or cleanup/deprecation for Series.asof

bwillers force-pushed the dataframe_asof branch from b9c96d1 to f05b126 Compare June 4, 2015 22:58

ENH: Initial pass at implementing DataFrame.asof, GH 2941

5c92545

Implements DataFrame.asof with various possible logics for skipping missing elements. Default case is equivalent to df.apply(lambda s: s.asof(where))

bwillers force-pushed the dataframe_asof branch from f05b126 to 5c92545 Compare June 12, 2015 03:26

shoyer reviewed Jun 12, 2015
View reviewed changes

shoyer added the Indexing Related to indexing on series/frames, not to indexes themselves label Jun 12, 2015

shoyer added this to the 0.17.0 milestone Jun 12, 2015

shoyer reviewed Jun 12, 2015
View reviewed changes

bwillers closed this Jun 12, 2015

bwillers deleted the dataframe_asof branch June 12, 2015 23:39

This was referenced Jun 13, 2015

DataFrame.asof #2941

Closed

CLN: Series.asof #10343

Closed

DEPR: deprecating series asof GH10343 #10345

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Initial pass at implementing DataFrame.asof, GH 2941 #10266

ENH: Initial pass at implementing DataFrame.asof, GH 2941 #10266

bwillers commented Jun 4, 2015

jreback commented Jun 4, 2015

bwillers commented Jun 5, 2015

bwillers commented Jun 12, 2015

shoyer commented Jun 12, 2015

shoyer Jun 12, 2015

shoyer Jun 12, 2015

jreback commented Jun 12, 2015

shoyer commented Jun 12, 2015

jreback commented Jun 12, 2015

bwillers commented Jun 12, 2015

shoyer commented Jun 12, 2015

bwillers commented Jun 12, 2015

shoyer commented Jun 12, 2015

ENH: Initial pass at implementing DataFrame.asof, GH 2941 #10266

ENH: Initial pass at implementing DataFrame.asof, GH 2941 #10266

Conversation

bwillers commented Jun 4, 2015

jreback commented Jun 4, 2015

bwillers commented Jun 5, 2015

bwillers commented Jun 12, 2015

shoyer commented Jun 12, 2015

shoyer Jun 12, 2015

Choose a reason for hiding this comment

shoyer Jun 12, 2015

Choose a reason for hiding this comment

jreback commented Jun 12, 2015

shoyer commented Jun 12, 2015

jreback commented Jun 12, 2015

bwillers commented Jun 12, 2015

shoyer commented Jun 12, 2015

bwillers commented Jun 12, 2015

shoyer commented Jun 12, 2015