Skip to content

PERF: fastpath DataFrame constructor from BlockManager #33357

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 9, 2020

Conversation

jbrockmendel
Copy link
Member

When trying to make fast_apply unnecessary I found that there is non-trivial overhead in the constructors that we can avoid. Moreover, since users shouldn't be passing BlockManagers around anyway, we might be able to get that case out of the DataFrame constructor entirely.

In [2]: ser = pd.Series(range(5))                                                                                                                                        
In [3]: df = ser.to_frame()                                                                                                                                              
In [4]: mgr = df._mgr

In [5]: %timeit pd.DataFrame(mgr)                                                                                                                                       
2.08 µs ± 51.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]: %timeit pd.DataFrame._from_mgr(mgr)                                                                                                                             
1.26 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

@simonjayhawkins
Copy link
Member

Moreover, since users shouldn't be passing BlockManagers around anyway, we might be able to get that case out of the DataFrame constructor entirely.

makes sense but would changing obj._constructor -> type(obj)._from_mgr create issues with subclassing. I recall a recent discussion that _constructor should be type(self)

@simonjayhawkins simonjayhawkins added Constructors Series/DataFrame/Index/pd.array Constructors Performance Memory or execution speed performance labels Apr 7, 2020
@jbrockmendel
Copy link
Member Author

makes sense but would changing obj._constructor -> type(obj)._from_mgr create issues with subclassing.

Fair point. If _constructor is something other than type(self) this would be change behavior. Will update.

I recall a recent discussion that _constructor should be type(self)

This sounds familiar. I'd be +1.

@@ -230,6 +230,15 @@ def _init_mgr(cls, mgr, axes=None, dtype=None, copy=False):
mgr = mgr.astype(dtype=dtype)
return mgr

@classmethod
def _from_mgr(cls, mgr: BlockManager):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the return type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be NDFrame or FrameOrSeries?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think

Suggested change
def _from_mgr(cls, mgr: BlockManager):
def _from_mgr(cls: Type[FrameOrSeries], mgr: BlockManager) -> FrameOrSeries:

though not checked

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we need to change from classmethod in order to get _constructor, unless we want to do something like

constructor = cls._constructor.fget(cls)

which seems sketchy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does obj._constructor -> obj._constructor._from_mgr not work with the classmethod. obj._constructor returns a class type not an instance?

"""
Fastpath constructor for if we have only a BlockManager.
"""
obj = object.__new__(cls)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might consider calling finalize here (have to pass the method as an arg)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one of the use cases i have in mind is in groupby _chop where we are potentially creating many many Series/DataFrame objects, hoping we can avoid the extra call

@jorisvandenbossche
Copy link
Member

Moreover, since users shouldn't be passing BlockManagers around anyway, we might be able to get that case out of the DataFrame constructor entirely.

This is however used by downstream libraries (eg pyarrow).

I recall a recent discussion that _constructor should be type(self)

That is currently not required to be the case (and for geopandas it also is not the case). After that earlier discussion, this issue was opened about that: #32638

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we can simply change away from using _constructor(that at least needs more investigation to check the impact on subclasses), or assume _constructor is a class (#32638)

But couldn't we provide a fast path in the DataFrame constructor that does more or less the same?
Basically check that data is a BlockManager and that all other keywords are None? (that are some more checks of course, so would need to check if that doesn't undo the gain)

@simonjayhawkins
Copy link
Member

@jorisvandenbossche thanks for the link. you wrote

we also don't require _constructor to be a class, it can be a constructor function

so #33357 (comment) is probably not applicable.

@jorisvandenbossche
Copy link
Member

The only requirement we have is that _constructor returns a callable, so obj._constructor._from_mgr will not necessarily work.

@jbrockmendel
Copy link
Member Author

Basically check that data is a BlockManager and that all other keywords are None? (that are some more checks of course, so would need to check if that doesn't undo the gain)

if this or something like it isnt accepted, then that may be worthwhile. if we do get this fastpath, then id prefer not to make the already-complicated constructor more complicated.

The only requirement we have is that _constructor returns a callable, so obj._constructor._from_mgr will not necessarily work.

How about

def _from_mgr(self, mgr):
    constr = self._constructor
    if constr is type(self):
         [the current implementation]
    else:
        return constr(mgr)

@jorisvandenbossche
Copy link
Member

How about

Then I personally think having the shortcut in the main constructor is the simpler solution

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am certainly +1 on optimizing the construction of a DataFrame from a BlockManager, but as mentioned before, I would rather do this in the main DataFrame constructor instead of adding _from_mgr (although I am generally fan of more specialized constructors):

  • It will be a less invasive change, as we don't need to change _constructor into _from_mgr in all the places where you did it now
  • Related to the above, it is also simpler: we just use _constructor, and don't need to remember to use _from_mgr instead of _constructor in certain cases (it is also simpler to understand, as it is not obvious apart from looking at its implementation that _from_mgr is still using _constructor for subclass-compatibility)
  • Although it would add an extra check to DataFrame.__init__, it is not expanding the scope of the init, since it already accepts a BlockManager as argument right now
  • Doing this in the main constructor ensures subclasses also benefit from the performance improvement.

@jbrockmendel
Copy link
Member Author

On further reflection, I'm on board with Joris's suggestion. If at some point we add restrictions to _constructor that allow _from_mgr to be re-simplified, ill revisit at that point.

@jorisvandenbossche jorisvandenbossche changed the title PERF: fastpath constructor _from_mgr PERF: fastpath DataFrame constructor from BlockManager Apr 9, 2020
NDFrame.__init__(self, data)
self.name = name
return

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any places where we are actually setting fastpath=True now ? we should deprecate this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series.take, DataFrame._box_col_values. I generally agree it would be nice to be rid of

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And several more places as well.
But that's an independent discussion I would say

@jorisvandenbossche jorisvandenbossche merged commit fa28ede into pandas-dev:master Apr 9, 2020
@jorisvandenbossche
Copy link
Member

Thanks Brock!

@jbrockmendel jbrockmendel deleted the ref-from_mgr branch April 9, 2020 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants