-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: de-duplicate DataFrame/SparseDataFrame arithmetic code #23414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @jbrockmendel! Thanks for submitting the PR.
|
Codecov Report
@@ Coverage Diff @@
## master #23414 +/- ##
=========================================
Coverage ? 92.23%
=========================================
Files ? 161
Lines ? 51324
Branches ? 0
=========================================
Hits ? 47338
Misses ? 3986
Partials ? 0
Continue to review full report at Codecov.
|
columns=self.columns, | ||
default_fill_value=fill_value) | ||
|
||
return res.__finalize__(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would this call finalize? why would the DataFrame one not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea, that's why I opened #23028.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, we basically need to always call __finalize__
after every op (binary or unary). It prob isn't done consistenly. So I would do it for both here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do this, but am not wild about it. Current __finalize__/_metadata
handling is quarter-assed, and I'd prefer to leave it that way until it is specifically addressed than make it half-assed. Still though, will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this actually makes things less consistent within DataFrame (though slightly more consistent between DataFrame - SparseDataFrame) since lots of ops dont go through this path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just call the super function here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DataFrame constructor doesn't take a fill_value
argument. We would end up special-casing within the DataFrame method in a way equivalent to (but messier than and less performant than) overriding in the appropriate subclass.
|
||
fill_value = self._get_op_result_fill_value(other, func, axis) | ||
|
||
res = self._constructor(result, index=self.index, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you passing columns to the constructor here, whereas in DataFrame you are setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very specifically not changing the existing behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, let's do that then. in reality you could actually call the superclass (maybe should just do that), if you have the ability to pass a fill_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too much special casing on the dispatch function, pls try to unify these.
result = self._constructor(result, index=self.index, copy=False) | ||
# Pin columns instead of passing to constructor for compat with | ||
# non-unique columns case | ||
result.columns = self.columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to call __finalize__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This maintains the current behavior.
At some point we can/should make a concerted effort to be internally-consistent about calling __finalize__
, but that is a large, orthogonal undertaking. Calling it here without doing it elsewhere would make things more inconsistent.
The two versions of this method are:
The only way I can see to combine these would be:
The point of implementing this method is because it is the part of the apply_index/apply_columns/etc that cant be shared. With this in place, then we have a real shot at de-duplicating the rest of those methods. |
can merge master and i'll look again. |
The actually-useful step is the one after this: using _wrap_dispatched_op to make SparseDataFrame._combine_frmae, _combine_match_index, _combine_match_columns, and _combine_const unncessary. But for reasons I'm not clear on, doing so breaks a bunch of tests. Since SparseDataFrame may be on its way out anyway, closing. |
Not quite all the way de-duplicated, but this is as much as I could do without it getting convoluted.