Skip to content

REF: de-duplicate DataFrame/SparseDataFrame arithmetic code #23414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 18 commits into from

Conversation

jbrockmendel
Copy link
Member

Not quite all the way de-duplicated, but this is as much as I could do without it getting convoluted.

@pep8speaks
Copy link

Hello @jbrockmendel! Thanks for submitting the PR.

@codecov
Copy link

codecov bot commented Oct 30, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@592fd64). Click here to learn what that means.
The diff coverage is 97.22%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #23414   +/-   ##
=========================================
  Coverage          ?   92.23%           
=========================================
  Files             ?      161           
  Lines             ?    51324           
  Branches          ?        0           
=========================================
  Hits              ?    47338           
  Misses            ?     3986           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.62% <97.22%> (?)
#single 42.3% <22.22%> (?)
Impacted Files Coverage Δ
pandas/core/sparse/frame.py 94.64% <100%> (ø)
pandas/core/frame.py 97.03% <100%> (ø)
pandas/core/ops.py 94.24% <80%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 592fd64...7ecb03f. Read the comment docs.

@sinhrks sinhrks added Sparse Sparse Data Type Clean labels Oct 30, 2018
columns=self.columns,
default_fill_value=fill_value)

return res.__finalize__(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would this call finalize? why would the DataFrame one not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea, that's why I opened #23028.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, we basically need to always call __finalize__ after every op (binary or unary). It prob isn't done consistenly. So I would do it for both here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do this, but am not wild about it. Current __finalize__/_metadata handling is quarter-assed, and I'd prefer to leave it that way until it is specifically addressed than make it half-assed. Still though, will update.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this actually makes things less consistent within DataFrame (though slightly more consistent between DataFrame - SparseDataFrame) since lots of ops dont go through this path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just call the super function here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DataFrame constructor doesn't take a fill_value argument. We would end up special-casing within the DataFrame method in a way equivalent to (but messier than and less performant than) overriding in the appropriate subclass.


fill_value = self._get_op_result_fill_value(other, func, axis)

res = self._constructor(result, index=self.index,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you passing columns to the constructor here, whereas in DataFrame you are setting?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very specifically not changing the existing behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, let's do that then. in reality you could actually call the superclass (maybe should just do that), if you have the ability to pass a fill_value

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too much special casing on the dispatch function, pls try to unify these.

result = self._constructor(result, index=self.index, copy=False)
# Pin columns instead of passing to constructor for compat with
# non-unique columns case
result.columns = self.columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to call __finalize__

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maintains the current behavior.

At some point we can/should make a concerted effort to be internally-consistent about calling __finalize__, but that is a large, orthogonal undertaking. Calling it here without doing it elsewhere would make things more inconsistent.

@jbrockmendel
Copy link
Member Author

this is too much special casing on the dispatch function, pls try to unify these.

The two versions of this method are:

    def _wrap_dispatched_op(self, result, other, func, axis=None):
        result = self._constructor(result, index=self.index, copy=False)
        result.columns = self.columns
        return result

    def _wrap_dispatched_op(self, result, other, func, axis=None):
        fill_value = self._get_op_result_fill_value(other, func, axis)

        res = self._constructor(result, index=self.index,
                                columns=self.columns,
                                default_fill_value=fill_value)

        return res.__finalize__(self)

The only way I can see to combine these would be:

    def _wrap_dispatched_op(self, result, other, func, axis=None):
        if isinstance(self, ABCSparseDataFrame):
            [the version currently in SparseDataFrame]
        else:
            [the version currently in DataFrame]

The point of implementing this method is because it is the part of the apply_index/apply_columns/etc that cant be shared. With this in place, then we have a real shot at de-duplicating the rest of those methods.

@jreback
Copy link
Contributor

jreback commented Nov 11, 2018

can merge master and i'll look again.

@jbrockmendel
Copy link
Member Author

The actually-useful step is the one after this: using _wrap_dispatched_op to make SparseDataFrame._combine_frmae, _combine_match_index, _combine_match_columns, and _combine_const unncessary. But for reasons I'm not clear on, doing so breaks a bunch of tests. Since SparseDataFrame may be on its way out anyway, closing.

@jbrockmendel jbrockmendel deleted the arith2 branch April 5, 2020 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants