Skip to content

Add missing __finalize__ calls in indexers/iterators #46101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Apr 26, 2022

Conversation

theOehrly
Copy link
Contributor

This PR intends to add missing __finalize__ calls to indexers and locators (this section of the docs: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#indexing-iteration). I went over all the methods listed in that section and in my opinion, everything should be covered after this PR.

This is generally in reference to #28283, although the methods covered here are not listed there. #19850 is also somewhat relevant.

The missing calls are only related to cases where a Series object was returned from indexing/iterating over a DataFrame. Though, I was somewhat generous with adding new test cases. Some newly added tests cases would have already passed before adding these calls, but I thought that it was still worth adding them.

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This broke existing tests

@theOehrly
Copy link
Contributor Author

This broke existing tests

Should be good now. I had to un-xfail some more tests that were previously marked as not implemented.
I just need to fix the small typing validation issue if there are no other things after review.

@theOehrly
Copy link
Contributor Author

@phofl All tests are passing now. Can you review again, please?

@jreback jreback added the metadata _metadata, .attrs label Feb 26, 2022
@jreback jreback added this to the 1.5 milestone Feb 26, 2022
@jreback
Copy link
Contributor

jreback commented Feb 26, 2022

looks pretty good @theOehrly if you can add a release note (I think we have a few already that are about finalize so if you can group with those).

@jreback
Copy link
Contributor

jreback commented Feb 26, 2022

cc @phofl if you can review when you have a chance

@theOehrly
Copy link
Contributor Author

@jreback That one failure seems unrelated? At least I'm failing to make sense of it, and it did pass previously.
Correct me if I'm wrong, though.

@phofl
Copy link
Member

phofl commented Feb 28, 2022

Looks unrelated, yes

@@ -23,6 +23,7 @@ Fixed regressions

Bug fixes
~~~~~~~~~
- Fixed metadata propagation in :meth:`NDFrame.xs`, :meth:`DataFrame.iterrows`, :meth:`DataFrame.loc` and :meth:`DataFrame.iloc` (:issue:`28283`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of loc and iloc:

Is this true for all cases? This has lots of different branches

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is*, and sorry for the delayed response. I wanted to validate this again, more extensively. I've made some slight changes to the code, for testing purposes only, so that every single call to loc and iloc adds attrs to the original object and validates that they still exist on the returned object. After that, I ran the entire test suite with these additional checks built in. Assuming that this gives 100% coverage of all branches of loc and iloc, there were only three failures, all of which were related to _get_item_cache.
Basically, if attrs/_metadata/flags change after an object was cached, the next time a sliced object is returned using _get_item_cache, it will have the old attrs/_metadata/flags. I'm not sure if we'd want to add a finalize call to _get_item_cache though? If attrs/_metadata/flags are considered constants for each object, then this might not be a problem. This only came up because my extended testing modified attrs between multiple loc/iloc calls, but was never caused by a test itself.

@github-actions
Copy link
Contributor

github-actions bot commented Apr 3, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 3, 2022
@theOehrly
Copy link
Contributor Author

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Still interested in working on this, not stale.

@jreback how to proceed here?

@jreback
Copy link
Contributor

jreback commented Apr 3, 2022

you can merge master and update to any comments

@theOehrly
Copy link
Contributor Author

you can merge master and update to any comments

I was waiting for a response to my comments on the review above. I guess there still are open questions that should be addressed.

@jreback
Copy link
Contributor

jreback commented Apr 3, 2022

can look but pls merge master

@theOehrly theOehrly force-pushed the finalize_indexers_iterators branch from e605849 to e2645bc Compare April 3, 2022 16:40
@theOehrly
Copy link
Contributor Author

@jreback main is merged and tests are fixed again

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to run the performance benchmarks for indexing before/after for this change.

@@ -23,7 +23,7 @@ Fixed regressions

Bug fixes
~~~~~~~~~
-
- Fixed metadata propagation in :meth:`NDFrame.xs`, :meth:`DataFrame.iterrows`, :meth:`DataFrame.loc` and :meth:`DataFrame.iloc` (:issue:`28283`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move to 1.5

@theOehrly
Copy link
Contributor Author

theOehrly commented Apr 12, 2022

@jreback Results of the performance benchmark for indexing (only tests that got worse):

Benchmarks that have got worse:

       before           after         ratio
     [0141d921]       [f75a5dd8]
                      <finalize_indexers_iterators>
+      69.2±0.8μs       85.7±0.5μs     1.24  indexing.NonNumericSeriesIndexing.time_getitem_pos_slice('string', 'nonunique_monotonic_inc')
+       477±100μs         593±80μs     1.24  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int16Engine'>, <class 'numpy.int16'>), 'monotonic_decr', False, 100000)
+       462±100μs         635±40μs     1.38  indexing_engines.NumericEngineIndexing.time_get_loc_near_middle((<class 'pandas._libs.index.UInt8Engine'>, <class 'numpy.uint8'>), 'monotonic_decr', False, 100000)

Or the full output here: https://pastebin.com/raw/3sVMufVc

@theOehrly
Copy link
Contributor Author

@jreback any more feedback on this?

@lithomas1 lithomas1 removed the Stale label Apr 22, 2022
@lithomas1 lithomas1 requested a review from jreback April 22, 2022 14:58
@jreback jreback merged commit 17d2428 into pandas-dev:main Apr 26, 2022
@jreback
Copy link
Contributor

jreback commented Apr 26, 2022

thanks @theOehrly this is very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metadata _metadata, .attrs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants