-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: fastpath DataFrame constructor from BlockManager #33357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: fastpath DataFrame constructor from BlockManager #33357
Conversation
makes sense but would changing obj._constructor -> type(obj)._from_mgr create issues with subclassing. I recall a recent discussion that _constructor should be type(self) |
Fair point. If _constructor is something other than
This sounds familiar. I'd be +1. |
pandas/core/generic.py
Outdated
@@ -230,6 +230,15 @@ def _init_mgr(cls, mgr, axes=None, dtype=None, copy=False): | |||
mgr = mgr.astype(dtype=dtype) | |||
return mgr | |||
|
|||
@classmethod | |||
def _from_mgr(cls, mgr: BlockManager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the return type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be NDFrame or FrameOrSeries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
def _from_mgr(cls, mgr: BlockManager): | |
def _from_mgr(cls: Type[FrameOrSeries], mgr: BlockManager) -> FrameOrSeries: |
though not checked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like we need to change from classmethod in order to get _constructor
, unless we want to do something like
constructor = cls._constructor.fget(cls)
which seems sketchy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does obj._constructor -> obj._constructor._from_mgr not work with the classmethod. obj._constructor returns a class type not an instance?
pandas/core/generic.py
Outdated
""" | ||
Fastpath constructor for if we have only a BlockManager. | ||
""" | ||
obj = object.__new__(cls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might consider calling finalize here (have to pass the method as an arg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one of the use cases i have in mind is in groupby _chop
where we are potentially creating many many Series/DataFrame objects, hoping we can avoid the extra call
This is however used by downstream libraries (eg pyarrow).
That is currently not required to be the case (and for geopandas it also is not the case). After that earlier discussion, this issue was opened about that: #32638 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure we can simply change away from using _constructor
(that at least needs more investigation to check the impact on subclasses), or assume _constructor
is a class (#32638)
But couldn't we provide a fast path in the DataFrame constructor that does more or less the same?
Basically check that data
is a BlockManager and that all other keywords are None? (that are some more checks of course, so would need to check if that doesn't undo the gain)
@jorisvandenbossche thanks for the link. you wrote
so #33357 (comment) is probably not applicable. |
The only requirement we have is that |
if this or something like it isnt accepted, then that may be worthwhile. if we do get this fastpath, then id prefer not to make the already-complicated constructor more complicated.
How about
|
Then I personally think having the shortcut in the main constructor is the simpler solution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am certainly +1 on optimizing the construction of a DataFrame from a BlockManager, but as mentioned before, I would rather do this in the main DataFrame constructor instead of adding _from_mgr
(although I am generally fan of more specialized constructors):
- It will be a less invasive change, as we don't need to change
_constructor
into_from_mgr
in all the places where you did it now - Related to the above, it is also simpler: we just use
_constructor
, and don't need to remember to use_from_mgr
instead of_constructor
in certain cases (it is also simpler to understand, as it is not obvious apart from looking at its implementation that_from_mgr
is still using_constructor
for subclass-compatibility) - Although it would add an extra check to
DataFrame.__init__
, it is not expanding the scope of the init, since it already accepts a BlockManager as argument right now - Doing this in the main constructor ensures subclasses also benefit from the performance improvement.
On further reflection, I'm on board with Joris's suggestion. If at some point we add restrictions to _constructor that allow _from_mgr to be re-simplified, ill revisit at that point. |
NDFrame.__init__(self, data) | ||
self.name = name | ||
return | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there any places where we are actually setting fastpath=True
now ? we should deprecate this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series.take, DataFrame._box_col_values. I generally agree it would be nice to be rid of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And several more places as well.
But that's an independent discussion I would say
Thanks Brock! |
When trying to make
fast_apply
unnecessary I found that there is non-trivial overhead in the constructors that we can avoid. Moreover, since users shouldn't be passing BlockManagers around anyway, we might be able to get that case out of the DataFrame constructor entirely.