Skip to content

ENH: Initial pass at implementing DataFrame.asof, GH 2941 #10266

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

bwillers
Copy link
Contributor

@bwillers bwillers commented Jun 4, 2015

Fixes #2941

This can almost certainly be made quicker, still digging into the
internals to understand the various underlying indexers.

There are 4 distinct logics you can apply to how to deal with the
missings (as opposed to the Series asof, wheres theres just two,
either skipna or dont skipna). The default is to be equivalent to
df.apply(lambda s: s.asof(where)).

Likely not ready to actually be merged, but at a stage where I could
use some feedback on both the logic and implementation.

@jreback
Copy link
Contributor

jreback commented Jun 4, 2015

pls rebase on master and repush had an issue with some builds

@bwillers
Copy link
Contributor Author

bwillers commented Jun 5, 2015

Sorry about that, must've been some changes between my rebase and the travis build. Anyway, it's green now.

Implements DataFrame.asof with various possible logics for
skipping missing elements. Default case is equivalent to
	df.apply(lambda s: s.asof(where))
@bwillers
Copy link
Contributor Author

Reworked this using take_2d_multi, thoughts?

@shoyer
Copy link
Member

shoyer commented Jun 12, 2015

Is there a good reason why this is restricted to timestamps? I suspect not, and it would be nice to generalize things a bit.

I recently cleaned up the related Index.asof method -- your method here depends on Index.asof_locs, which could also definitely use another look. That would be greatly appreciated and would make me more confident in your new feature. Note, for example, that Index.asof_locs currently has no tests.

return Series(index=self.columns, data=np.nan)

s = self.iloc[loc, :].copy()
s.name = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this series should probably be the where argument. This will be necessary for the consistency check I describe below.

@shoyer shoyer added the Indexing Related to indexing on series/frames, not to indexes themselves label Jun 12, 2015
@shoyer shoyer added this to the 0.17.0 milestone Jun 12, 2015
else:
raise ValueError("skipna must be one of percolumn, none, any, all.")

if not hasattr(where, '__iter__'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be better simpler just to call self.asof([where]).iloc[0] -- though you might need to disable the "setting item with copy warning" in that case.

@jreback
Copy link
Contributor

jreback commented Jun 12, 2015

.asof is really just an implementation of reindexing. In fact, this is simply a one-sided method='nearest' (the right side). Any reason why we are not simply doing that? E.g. this should just be a generalized method to Index, and then the front end is just simple wrappers around .reindex(method='asof').

@shoyer
Copy link
Member

shoyer commented Jun 12, 2015

@jreback Indeed, this does almost work already with .reindex(method='pad') (or maybe it's method='backfill', I can't keep those straight). The difference here is that asof will drop missing values, though perhaps that's better done explicitly with .dropna(). Also, this version lets you do asof across multiple columns simultaneously.

@jreback
Copy link
Contributor

jreback commented Jun 12, 2015

no I get that this handles multiple versions, and maybe that's a reason to special case somewhat, however, this should be an internal type of method (meaning implemented in the Index) as its simply a reindexer

@bwillers
Copy link
Contributor Author

Thanks for the feedback. The motivation for df.asof was largely that there already is a series.asof. But as @shoyer points out, there's really no reason this needs to be specialized to timestamps (any sortable index should work).

Making it a part of index functionality as @jreback suggests, seems a natural choice to make the behaviour available to both frame and series, and also to non timeseries indexes.

There's really only two fundamental cases: either skip missings everywhere .reindex(method='ffill'), or skip them no where (not currently available), the other options can be achieve by the latter and fillna.

I'll take a stab at implementing a method='asof' for reindex to handle the latter case, and let's abandon this PR. Might also make sense to deprecate series.asof in favor of series.reindex(method='ffill'). I'll confirm that it passes the tests for asof and open a seperate issue.

@shoyer
Copy link
Member

shoyer commented Jun 12, 2015

@bwillers I think you have it backwards -- .reindex(method='ffill') will not skip any missing values in the existing series/frame.

Here's my mapping from this method to reindex:

  1. .asof(where, skipna='percolumn') -> .reindex(where, method='ffill').fillna(method='ffill')
  2. .asof(where, skipna='none') -> .reindex(where, method='ffill')
  3. .asof(where, skipna='all') -> .dropna(how='all').reindex(where, method='ffill')
  4. .asof(where, skipna='any') -> .dropna(how='any').reindex(where, method='ffill')

Which case would .reindex(method='asof') cover? Honestly, this feels more like a documentation issue to me :).

@bwillers
Copy link
Contributor Author

@shoyer Ah, well I feel quite the idiot right now. For some reason I was under the impression that .reindex(where, model='ffill') is ignored rows with missing values when it tries to find what to pull forward. Demonstrably however, this is not so:

print(df)
            a   b
2008-01-02  1   1
2008-01-09  2 NaN
2008-01-16  3   3

print(df.reindex(new_dates, method='ffill'))
             a   b
2008-01-01 NaN NaN
2008-01-08   1   1
2008-01-15   2 NaN
2008-01-22   3   3

I was expecting that bottom right NaN to be a 1 - I must have somehow conflated method='ffill' in reindex and fillna.

So since all the df.asof varieties can be achieved using the existing .reindex(where='ffill') approach, all this PR does is avoid an intermediate frame, but that's probably not worth cluttering the API for.

Sorry for the cycles wasted on my poor understanding.

@bwillers bwillers closed this Jun 12, 2015
@shoyer
Copy link
Member

shoyer commented Jun 12, 2015

@bwillers No worries! We would still appreciate documentation updates and/or cleanup/deprecation for Series.asof

@bwillers bwillers deleted the dataframe_asof branch June 12, 2015 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants