-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: AttributeError raised by map inside groupby reduction #29160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
new_values = self._data.astype(dtype, copy=copy) | ||
# use _eadata since _data links to _index_data which can be | ||
# overwritten in groupby-reduction | ||
new_values = self._eadata.astype(dtype, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this is needed in astype
, while the comment is about groupby?
And it's not possible to prevent that _data gets written in the first place? (or is it done in a try/except that only fails later .. ). Adding yet another data like attribute doesn't seem ideal ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it's not possible to prevent that _data gets written in the first place?
Take a look at how we patch _index_data
in libreduction. I spent some time trying to find a way around that and didn't find any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make the _index_data
in a settable property that on setting also updates the underlying data of the EA? (to keep both synced)
(didn't look in detail, so this might be nonsense)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_index_data
is set using object.__setattr__
, so I don't think that would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_index_data is set using object.setattr, so I don't think that would work.
Based on a quick test, it seems that setattr still goes through a property's setter (now, if the actual idea can help, I would need to take a further look)
Do you have an example test that hits this (a failing test if you don't do the above change) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example test that hits this (a failing test if you don't do the above change) ?
If you make the groupby change to stop catching AttributeError but don't change the index code, there is a test (dont know which off the top of my head) that raises an AttributeError in DatetimeIndex.map
because index._data
is an ndarray instead of a DTA.
pandas/core/groupby/generic.py
Outdated
@@ -261,7 +261,7 @@ def aggregate(self, func=None, *args, **kwargs): | |||
|
|||
try: | |||
return self._python_agg_general(func, *args, **kwargs) | |||
except AssertionError: | |||
except (AssertionError, AttributeError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what case is this catching? can you comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its catching it just to re-raise instead of doing fallback/ignore
Thanks! It is That applies a function in groupby that uses |
Yes, this is a side-effect of the _index_data workaround and we're going to surface and fix this bugs as we suppress fewer exceptions in the groupby code. |
Hmm, you could also formulate this as "we're going to give bugs to the users"? I looked for a moment into the idea of "properly" setting the data (still a hack of course), seems to pass the groupby tests: diff --git a/pandas/_libs/reduction.pyx b/pandas/_libs/reduction.pyx
index 7ed131e1c..10c113edc 100644
--- a/pandas/_libs/reduction.pyx
+++ b/pandas/_libs/reduction.pyx
@@ -409,7 +409,7 @@ cdef class SeriesGrouper:
cached_typ = self.typ(vslider.buf, index=cached_ityp,
name=name)
else:
- object.__setattr__(cached_ityp, '_data', islider.buf)
+ object.__setattr__(cached_ityp, '_index_data', islider.buf)
cached_ityp._engine.clear_mapping()
object.__setattr__(
cached_typ._data._block, 'values', vslider.buf)
diff --git a/pandas/core/groupby/generic.py b/pandas/core/groupby/generic.py
index c766fcaa4..023ace965 100644
--- a/pandas/core/groupby/generic.py
+++ b/pandas/core/groupby/generic.py
@@ -262,7 +262,7 @@ class SeriesGroupBy(GroupBy):
try:
return self._python_agg_general(func, *args, **kwargs)
- except (AssertionError, TypeError):
+ except (AssertionError, TypeError, AttributeError):
raise
except Exception:
result = self._aggregate_named(func, *args, **kwargs)
diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index 4c15e4b26..81a0b0cef 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -533,6 +533,16 @@ class Index(IndexOpsMixin, PandasObject):
setattr(result, k, v)
return result._reset_identity()
+ @property
+ def _index_data(self):
+ return self._data
+
+ @_index_data.setter
+ def _index_data(self, data):
+ self._data = data
+
@cache_readonly
def _constructor(self):
return type(self)
diff --git a/pandas/core/indexes/datetimelike.py b/pandas/core/indexes/datetimelike.py
index bf89bbbdf..a03702da0 100644
--- a/pandas/core/indexes/datetimelike.py
+++ b/pandas/core/indexes/datetimelike.py
@@ -98,6 +98,17 @@ class DatetimeIndexOpsMixin(ExtensionOpsMixin):
__iter__ = ea_passthrough(DatetimeLikeArrayMixin.__iter__)
mean = ea_passthrough(DatetimeLikeArrayMixin.mean)
+ @property
+ def _index_data(self):
+ return self._data._data
+
+ @_index_data.setter
+ def _index_data(self, data):
+ self._data._data = data
+
@property
def freq(self):
"""
diff --git a/pandas/core/indexes/period.py b/pandas/core/indexes/period.py
index a20290e77..8988307a4 100644
--- a/pandas/core/indexes/period.py
+++ b/pandas/core/indexes/period.py
@@ -297,7 +297,7 @@ class PeriodIndex(DatetimeIndexOpsMixin, Int64Index, PeriodDelegateMixin):
result = object.__new__(cls)
result._data = values
# For groupby perf. See note in indexes/base about _index_data
- result._index_data = values._data
result.name = name
result._reset_identity()
return result
diff --git a/pandas/core/indexes/timedeltas.py b/pandas/core/indexes/timedeltas.py
index 2324b8cf7..b10a9c63f 100644
--- a/pandas/core/indexes/timedeltas.py
+++ b/pandas/core/indexes/timedeltas.py
@@ -272,7 +272,7 @@ class TimedeltaIndex(
result._data = tdarr
result.name = name
# For groupby perf. See note in indexes/base about _index_data
- result._index_data = tdarr._data
result._reset_identity()
return result |
I could, but don't. I formulate it as "there are bugs present, and this makes it easier to find and fix them"
Comments around this hack suggest it is there for performance reasons. If the new hack you're suggesting doesn't involve a huge performance hit, then great. But getting rid of all hack-ness in/around _index_data is a bigger task that is outside the scope of this PR. |
@@ -262,7 +262,7 @@ def aggregate(self, func=None, *args, **kwargs): | |||
|
|||
try: | |||
return self._python_agg_general(func, *args, **kwargs) | |||
except (AssertionError, TypeError): | |||
except (AssertionError, TypeError, AttributeError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to remove AttributeError from the clause below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you indicate where the errors come from here as much as possible
It's not necessarily a user bug. Trying to see where is our disagreement: Assume we didn't have the You say: "by raising the error, it makes it easier to find bugs". So maybe to summarize my reasoning: as long as we have this |
Gets the last AttributeError following #29100.