Skip to content

REF: standardize usage in DataFrame vs SparseDataFrame ops #28027

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Sep 17, 2019

Conversation

jbrockmendel
Copy link
Member

I think that after this we're not far from being able to use the base class versions of _combine_frame, _combine_match_index, _combine_match_columns, and _combine_const. That'll be a good day.

The only actual behavior changed is in SparseDataFrame._combine_match_columns, which is changed to match the other methods. See #28025

@jorisvandenbossche
Copy link
Member

The only actual behavior changed is in SparseDataFrame._combine_match_columns, which is changed to match the other methods. See #28025

As mentioned in the issue: we could also leave the SparseDataFrame code alone as is (it's deprecated, so I don't think we should change behaviour, and I would personally also not put effort in cleaning it up), but still refactor the DataFrame ops (which would mean less sharing of code, for now, but I don't think that is a problem since the sparse code will be removed anyway, so then there will be no duplication anymore)

@jbrockmendel
Copy link
Member Author

(which would mean less sharing of code, for now, but I don't think that is a problem since the sparse code will be removed anyway, so then there will be no duplication anymore)

If I understand the suggestion correctly, this would involve copy/pasting all of the ops-based code that is currently shared. Or do you have something else in mind?

@jorisvandenbossche
Copy link
Member

If I understand the suggestion correctly, this would involve copy/pasting all of the ops-based code that is currently shared. Or do you have something else in mind?

Yes, copy/paste all ops-based code that sparse needs and is currently shared with the other frame ops to the sparse module (and then we basically don't touch that code anymore until it is removed), and then you are free to refactor the ops code without needing to care about the deprecated sparse (and eg don't need to care about the small differences like #28025)

(note this is just talk without looking at the actual code, I don't know if it is practically feasible. I just wouldn't care in this specific case about temporarily duplicating some code if that allows to refactor more easily the code that we are going to keep).

@jbrockmendel
Copy link
Member Author

Thanks for clarifying. Does the status quo for SparseDataFrame include bugfixes until it is removed? If so, that would push towards continuing to share code.

@jorisvandenbossche
Copy link
Member

Does the status quo for SparseDataFrame include bugfixes until it is removed?

Well, if somebody wants to do a PR for a bug fix, that is of course welcome (and non-shared code can still receive bugfixes). But if you ask me if I find it worth your time to do bug fixes for SparseDataFrame, then I would say no. Even if it would be benefiting from a generic bug fix that you do in the ops code, I don't think it is a problem that sparse wouldn't automatically get that bug fix anymore

@jbrockmendel
Copy link
Member Author

I think I understand the suggestion, thanks. Would you be OK with moving forward with the continue-sharing approach in this PR and in the next pass I'll try to use the approach you've described?

@jorisvandenbossche
Copy link
Member

Would you be OK with moving forward with the continue-sharing approach in this PR and in the next pass I'll try to use the approach you've described?

If we would go with a non-sharing appraoch, then I personally think we should do it already now and remove the sparse changes here. But that is still an "if" of course.
(I understood correctly that this PR was changing behaviour right?)

@jbrockmendel
Copy link
Member Author

(I understood correctly that this PR was changing behaviour right?)

This PR changes behavior, but we have not established that the behavior in question was intentional.

then I personally think we should

I understand your preference. My question is if the strength of that preference would make you -1 on moving forward with the less-preferred option (this) (conditional on resolving the behavior change)

@jorisvandenbossche
Copy link
Member

My question is if the strength of that preference would make you -1 on moving forward with the less-preferred option (this) (conditional on resolving the behavior change)

Well, that depends on the potential behaviour change .. :)

we have not established that the behavior in question was intentional.

In the issue, I asked about a code example, to be able to see what the change is in an example (and not in a theoretical explanation of the code that I can only understand by reading and understanding the code). But that is actually another reason to stop the sharing, then we wouldn't need to need to solve this question, we could just ignore the potential difference with SparseDataFrame, and use our mental energy for discussing more interesting things :-)

To be clear, I don't really know what effort it would take to not share the code (I would think that, in this PR, it would mean reverting the changes in the sparse code, and copying (the original version of) some of the methods in that non-sparse code that you changed to the sparse code; but I didn't try it of course), so I do maybe a bit too lightly about it

@jbrockmendel
Copy link
Member Author

@jreback if you're around, thoughts? after more experimentation, im increasing convinced that a) the small behavior change is probably slightly more correct, b) it really doesn't matter, and c) it is possible to rip out a lot of the sparse ops code after implementing _construct_result

@jreback
Copy link
Contributor

jreback commented Aug 23, 2019

@jreback if you're around, thoughts? after more experimentation, im increasing convinced that a) the small behavior change is probably slightly more correct, b) it really doesn't matter, and c) it is possible to rip out a lot of the sparse ops code after implementing _construct_result

agree on all points here; +1 on consolidating code and then simplifying ops. trying to isolate sparse via copy paste i think is just a mess and not worth any effort.

@jorisvandenbossche
Copy link
Member

im increasing convinced that a) the small behavior change is probably slightly more correct,

Then again, can you try to explain (show) the behaviour change?

@jbrockmendel
Copy link
Member Author

can you try to explain (show) the behaviour change?

df = pd.DataFrame([0])
sdf = df.to_sparse(fill_value=1)

# unchanged
>>> sdf.add(sdf[0], axis=0).default_fill_value
2.0

>>> (sdf + sdf[0]).default_fill_value
2.0  #  <-- PR
1    # <-- master

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Aug 23, 2019

Essentially, binops propagate the default fill value more consistently?

It can probably be called a bug fix (sorry about the close and reopen).

@TomAugspurger
Copy link
Contributor

OK, looking at this again, I think the different default_fill_value for ops between two dataframes vs. a dataframe and a series is a bug.

@jbrockmendel can you merge master to fix the merge conflicts?

@jorisvandenbossche any objections to declaring #28027 (comment) a bug?

@jbrockmendel
Copy link
Member Author

can you merge master to fix the merge conflicts?

done

@jreback jreback added the Sparse Sparse Data Type label Sep 2, 2019
# non-unique columns case
out.columns = self.columns
return out
# TODO: finalize? we do for SparseDataFrame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finalize doesn't do really anything for frames, prob can't hurt though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

especially in the frame+frame case it isn't obvious that is correct; I'd prefer to revisit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle we should finalize I think, but since it was not done right now, fine to leave it for later?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Any other comments @jreback?

@jbrockmendel
Copy link
Member Author

gentle ping. This and #28268 are pseudo-blockers for implementing blockwise arithmetic.

(psuedo-blocker as in not actual blockers, but will make the implementation much easier)

@jbrockmendel
Copy link
Member Author

@jreback if we have to choose one more to get in more the weekend productivity streak ends, this should be it

@TomAugspurger
Copy link
Contributor

@jreback I think this is blocking #28414. Thoughts on merging?

@jreback jreback added this to the 1.0 milestone Sep 12, 2019
@jreback
Copy link
Contributor

jreback commented Sep 12, 2019

yeah this was fine (got lost in the queue). maybe rebase @jbrockmendel to be sure.

@jorisvandenbossche
Copy link
Member

Did we decide on removing SparseDataFrame?
In which case this PR might not be needed (and the one it is blocking maybe also not), but at the same time then I also don't care about changing behaviour of SparseDataFrame ;) (or fixing bugs, as it turned out to be)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 12, 2019 via email

@jbrockmendel
Copy link
Member Author

This should be merged regardless of the decision about removing SparseDataFrame.

  1. The edits here are worthwhile for DataFrame alone
  2. if we are going to remove SparseDataFrame that will take a while, during which this will block other ops work

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 12, 2019 via email

@jreback jreback merged commit 416931e into pandas-dev:master Sep 17, 2019
@jreback
Copy link
Contributor

jreback commented Sep 17, 2019

thanks @jbrockmendel

@TomAugspurger you will likely need to rebase on the removals PR

@jbrockmendel jbrockmendel deleted the blockops2 branch September 17, 2019 14:03
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants