Skip to content

BUG: AttributeError raised by map inside groupby reduction #29160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

jbrockmendel
Copy link
Member

Gets the last AttributeError following #29100.

new_values = self._data.astype(dtype, copy=copy)
# use _eadata since _data links to _index_data which can be
# overwritten in groupby-reduction
new_values = self._eadata.astype(dtype, copy=copy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why this is needed in astype, while the comment is about groupby?

And it's not possible to prevent that _data gets written in the first place? (or is it done in a try/except that only fails later .. ). Adding yet another data like attribute doesn't seem ideal ..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it's not possible to prevent that _data gets written in the first place?

Take a look at how we patch _index_data in libreduction. I spent some time trying to find a way around that and didn't find any.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make the _index_data in a settable property that on setting also updates the underlying data of the EA? (to keep both synced)

(didn't look in detail, so this might be nonsense)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_index_data is set using object.__setattr__, so I don't think that would work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_index_data is set using object.setattr, so I don't think that would work.

Based on a quick test, it seems that setattr still goes through a property's setter (now, if the actual idea can help, I would need to take a further look)

Do you have an example test that hits this (a failing test if you don't do the above change) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example test that hits this (a failing test if you don't do the above change) ?

If you make the groupby change to stop catching AttributeError but don't change the index code, there is a test (dont know which off the top of my head) that raises an AttributeError in DatetimeIndex.map because index._data is an ndarray instead of a DTA.

@@ -261,7 +261,7 @@ def aggregate(self, func=None, *args, **kwargs):

try:
return self._python_agg_general(func, *args, **kwargs)
except AssertionError:
except (AssertionError, AttributeError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what case is this catching? can you comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its catching it just to re-raise instead of doing fallback/ignore

@jorisvandenbossche
Copy link
Member

If you make the groupby change to stop catching AttributeError but don't change the index code, there is a test (dont know which off the top of my head) that raises an AttributeError in DatetimeIndex.map because index._data is an ndarray instead of a DTA.

Thanks! It is pandas/tests/groupby/test_groupby.py::test_groupby_reindex_inside_function.

That applies a function in groupby that uses index.map(..), and it is the DatetimeIndex.map function that calls astype. So that is the reason you need to fix astype to get the test passing in this PR.
But not sure that this is a proper fix then, as the fact that it is astype you need to fix is thus only accidental because the test was using that (the test could also use something else as astype, and it might then fail in another place?)

@jbrockmendel
Copy link
Member Author

the test could also use something else as astype, and it might then fail in another place?

Yes, this is a side-effect of the _index_data workaround and we're going to surface and fix this bugs as we suppress fewer exceptions in the groupby code.

@jorisvandenbossche
Copy link
Member

we're going to surface and fix this bugs

Hmm, you could also formulate this as "we're going to give bugs to the users"?
In my mind, it is currently not a "bug" if the function that you pass as a user to agg raises for some reason an AttributeError in the fast cython path, but works in the fallback slow path. Right now, we ensure that we at least try to slow path if the fast path does not work. If that is needed for certain functions, I don't think that is a bug.


I looked for a moment into the idea of "properly" setting the data (still a hack of course), seems to pass the groupby tests:

diff --git a/pandas/_libs/reduction.pyx b/pandas/_libs/reduction.pyx
index 7ed131e1c..10c113edc 100644
--- a/pandas/_libs/reduction.pyx
+++ b/pandas/_libs/reduction.pyx
@@ -409,7 +409,7 @@ cdef class SeriesGrouper:
                         cached_typ = self.typ(vslider.buf, index=cached_ityp,
                                               name=name)
                     else:
-                        object.__setattr__(cached_ityp, '_data', islider.buf)
+                        object.__setattr__(cached_ityp, '_index_data', islider.buf)
                         cached_ityp._engine.clear_mapping()
                         object.__setattr__(
                             cached_typ._data._block, 'values', vslider.buf)
diff --git a/pandas/core/groupby/generic.py b/pandas/core/groupby/generic.py
index c766fcaa4..023ace965 100644
--- a/pandas/core/groupby/generic.py
+++ b/pandas/core/groupby/generic.py
@@ -262,7 +262,7 @@ class SeriesGroupBy(GroupBy):
 
             try:
                 return self._python_agg_general(func, *args, **kwargs)
-            except (AssertionError, TypeError):
+            except (AssertionError, TypeError, AttributeError):
                 raise
             except Exception:
                 result = self._aggregate_named(func, *args, **kwargs)
diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index 4c15e4b26..81a0b0cef 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -533,6 +533,16 @@ class Index(IndexOpsMixin, PandasObject):
             setattr(result, k, v)
         return result._reset_identity()
 
+    @property
+    def _index_data(self):
+        return self._data
+    
+    @_index_data.setter
+    def _index_data(self, data):
+        self._data = data
+
     @cache_readonly
     def _constructor(self):
         return type(self)
diff --git a/pandas/core/indexes/datetimelike.py b/pandas/core/indexes/datetimelike.py
index bf89bbbdf..a03702da0 100644
--- a/pandas/core/indexes/datetimelike.py
+++ b/pandas/core/indexes/datetimelike.py
@@ -98,6 +98,17 @@ class DatetimeIndexOpsMixin(ExtensionOpsMixin):
     __iter__ = ea_passthrough(DatetimeLikeArrayMixin.__iter__)
     mean = ea_passthrough(DatetimeLikeArrayMixin.mean)
 
+    @property
+    def _index_data(self):
+        return self._data._data
+    
+    @_index_data.setter
+    def _index_data(self, data):
+        self._data._data = data
+
     @property
     def freq(self):
         """
diff --git a/pandas/core/indexes/period.py b/pandas/core/indexes/period.py
index a20290e77..8988307a4 100644
--- a/pandas/core/indexes/period.py
+++ b/pandas/core/indexes/period.py
@@ -297,7 +297,7 @@ class PeriodIndex(DatetimeIndexOpsMixin, Int64Index, PeriodDelegateMixin):
         result = object.__new__(cls)
         result._data = values
         # For groupby perf. See note in indexes/base about _index_data
-        result._index_data = values._data
         result.name = name
         result._reset_identity()
         return result
diff --git a/pandas/core/indexes/timedeltas.py b/pandas/core/indexes/timedeltas.py
index 2324b8cf7..b10a9c63f 100644
--- a/pandas/core/indexes/timedeltas.py
+++ b/pandas/core/indexes/timedeltas.py
@@ -272,7 +272,7 @@ class TimedeltaIndex(
         result._data = tdarr
         result.name = name
         # For groupby perf. See note in indexes/base about _index_data
-        result._index_data = tdarr._data
 
         result._reset_identity()
         return result

@jbrockmendel
Copy link
Member Author

Hmm, you could also formulate this as "we're going to give bugs to the users"?

I could, but don't. I formulate it as "there are bugs present, and this makes it easier to find and fix them"

I looked for a moment into the idea of "properly" setting the data (still a hack of course), seems to pass the groupby tests:

Comments around this hack suggest it is there for performance reasons. If the new hack you're suggesting doesn't involve a huge performance hit, then great. But getting rid of all hack-ness in/around _index_data is a bigger task that is outside the scope of this PR.

@jreback jreback added this to the 1.0 milestone Oct 25, 2019
@@ -262,7 +262,7 @@ def aggregate(self, func=None, *args, **kwargs):

try:
return self._python_agg_general(func, *args, **kwargs)
except (AssertionError, TypeError):
except (AssertionError, TypeError, AttributeError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to remove AttributeError from the clause below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you indicate where the errors come from here as much as possible

@jorisvandenbossche
Copy link
Member

I could, but don't. I formulate it as "there are bugs present, and this makes it easier to find and fix them"

It's not necessarily a user bug.
We know we have a hack with the _index_data, we know that a valid groupby operation can sometimes fail because of that hack (but in the fallback it is handled properly). Surfacing our failure to the user is not a good fix IMO.


Trying to see where is our disagreement:
Do you agree that this PR can cause a regression for a user? (of course every PR can give regression in unforeseen ways, but here we know it)

Assume we didn't have the test_groupby_reindex_inside_function. In that case, you wouldn't have thought about "fixing" the DatetimeIndex.astype method. So if we would start raising the AttributeError as you do in this PR (as it would seem there is no reason to catch it), that example would fail.
We have now a test that used astype (so you fixed astype), but there are many other operations that a user could do inside the applied function with the index that does not use astype but something else that relies on the fact that _data is an EA.

You say: "by raising the error, it makes it easier to find bugs".
I agree that this change might turn up more cases where an AttributeError is incorrectly raised that should be fixed. But we know such a case that might happen: because of our _index_data hack, we will see AttributeErrors. But the bug here is out _index_data hack, which we already know.

So maybe to summarize my reasoning: as long as we have this _index_data hack, I think we should continue catching the AttributeError to prevent users from seeing this (as it is something they cannot do anything about, it's something we need to fix in pandas).

@jbrockmendel jbrockmendel deleted the index_data branch November 21, 2019 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants