BUG/df.agg-with-df-with-missing-values-results-in-IndexError #58864
Ready for review.
doc/source/whatsnew/v3.0.0.rst
Outdated
@@ -39,6 +39,7 @@ Other enhancements
- Users can globally disable any ``PerformanceWarning`` by setting the option ``mode.performance_warnings`` to ``False`` (:issue:`56920`)
- :meth:`Styler.format_index_names` can now be used to format the index and column names (:issue:`48936` and :issue:`47489`)
- :class:`.errors.DtypeWarning` improved to include column names when mixed data types are detected (:issue:`58174`)
- :meth:`DataFrame.agg` now correctly handles missing values without raising an IndexError (:issue:`58810`)
This should be in the bug fix section
Moved.
pandas/core/apply.py
Outdated
col_idx_order = list(Index(s.index).get_indexer(fun))
col_idx_order = [i for i in col_idx_order if 0 <= i < len(s)]
if col_idx_order:
    s = s.iloc[col_idx_order]
Instead, I think you can filter col_idx_order where it's equal to -1. See the get_indexer docstring.
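As a quick illustration of the behavior the reviewer is pointing to, here is a minimal sketch of how get_indexer reports labels that are not in the index. The index labels and the requested names are made up for the example; only Index.get_indexer itself comes from the discussion.

```python
import pandas as pd

# Hypothetical index of aggregation names already present in a result.
idx = pd.Index(["mean", "max"])

# Ask for the positions of two labels; "min" is not in the index.
positions = idx.get_indexer(["max", "min"])

# Labels without a match are reported as -1, so they can be filtered out.
print(positions)                    # [ 1 -1]
print(positions[positions != -1])   # [1]
```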
Yeah, makes sense.
pandas/core/apply.py
Outdated
col_idx_order = list(Index(s.index).get_indexer(fun))
col_idx_order = [i for i in col_idx_order if i != -1]
if col_idx_order:
    s = s.iloc[col_idx_order]
Suggested change:
- col_idx_order = list(Index(s.index).get_indexer(fun))
- col_idx_order = [i for i in col_idx_order if i != -1]
- if col_idx_order:
-     s = s.iloc[col_idx_order]
+ col_idx_order = Index(s.index).get_indexer(fun)
+ col_idx_order = col_idx_order[col_idx_order != -1]
+ s = s.iloc[col_idx_order]
That won't produce the expected behavior.
Take the "A" example in the docstring. Without the condition, the result will be:
foo NaN
aab NaN
bar NaN
dat NaN
Which is wrong.
This happens because:
- col_idx_order is determined by Index(s.index).get_indexer(fun).
- Since s only has one value, with index ["mean"], and fun = ["max"], there is no match, so col_idx_order = [-1].
- The code s = s.iloc[col_idx_order] results in an empty Series, because -1 indicates no match and is filtered out, producing the wrong behaviour.
Or is this not how it is supposed to work?
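To make the failure mode above concrete, here is a minimal sketch assuming a one-row Series like the one described (the value 2.5 is made up for illustration). It shows that an unfiltered -1 is read by iloc as "last position", while filtering it out with no guard leaves nothing to select.

```python
import pandas as pd
from pandas import Index

# Assumed stand-in for the intermediate result: one value labelled "mean".
s = pd.Series([2.5], index=["mean"])
fun = ["max"]  # the requested name is not in s.index

col_idx_order = Index(s.index).get_indexer(fun)
print(col_idx_order)  # [-1]

# Passing -1 straight to iloc silently picks the last row instead of failing.
unfiltered = s.iloc[col_idx_order]
print(unfiltered.index.tolist())  # ['mean']

# Filtering out -1 with no guard selects nothing at all.
filtered = s.iloc[col_idx_order[col_idx_order != -1]]
print(filtered.empty)  # True
```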
Ah OK, then you can add back the "if not col_idx_order.empty:" condition.
AttributeError: 'numpy.ndarray' object has no attribute 'empty'
The best way I found was to convert it to a list and check that.
I guess we could also apply a boolean mask directly to the NumPy array returned by get_indexer, keeping only the valid indices:
col_idx_order = Index(s.index).get_indexer(fun)
valid_idx = col_idx_order != -1
if valid_idx.any():
    s = s.iloc[col_idx_order[valid_idx]]
Let me know what you think.
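The boolean-mask approach can be tried in isolation with a small sketch. The helper name select_matching and the sample data are hypothetical and not part of pandas/core/apply.py; only the four lines inside the function mirror the proposal.

```python
import pandas as pd


def select_matching(s: pd.Series, fun: list) -> pd.Series:
    """Hypothetical wrapper around the proposed snippet, for experimentation only."""
    col_idx_order = pd.Index(s.index).get_indexer(fun)
    valid_idx = col_idx_order != -1  # boolean mask: True where a label matched
    if valid_idx.any():
        # Reorder using only the positions that actually matched.
        s = s.iloc[col_idx_order[valid_idx]]
    return s


# All requested names match: the Series is reordered to follow `fun`.
both = select_matching(pd.Series([1.0, 2.0], index=["mean", "max"]), ["max", "mean"])

# Nothing matches: the guard leaves the Series untouched instead of emptying it.
none = select_matching(pd.Series([1.0], index=["mean"]), ["max"])
```

The .any() check plays the role that .empty would on a pandas object, since a NumPy array has no .empty attribute.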
Sure that solution works. Thanks.
pandas/core/apply.py
Outdated
# assign the new user-provided "named aggregation" as index names, and reindex
# it based on the whole user-provided names.
s.index = reordered_indexes[idx : idx + len(fun)]
if len(s) > 0:
Suggested change:
- if len(s) > 0:
+ if not s.empty:
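For a Series the two spellings in this suggestion test the same thing, as the following sketch with made-up data shows. Note that .empty looks only at length, so a Series holding nothing but NaN still counts as non-empty.

```python
import pandas as pd

no_rows = pd.Series([], dtype="float64")
only_nan = pd.Series([float("nan")], index=["mean"])

# `not s.empty` and `len(s) > 0` agree in both cases.
print(no_rows.empty, len(no_rows) > 0)    # True False
print(only_nan.empty, len(only_nan) > 0)  # False True
```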
Thanks @abeltavares