BUG: AttributeError raised by map inside groupby reduction #29160

jbrockmendel · 2019-10-22T15:07:14Z

Gets the last AttributeError following #29100.

jorisvandenbossche · 2019-10-22T20:03:45Z

pandas/core/indexes/datetimelike.py

-        new_values = self._data.astype(dtype, copy=copy)
+        # use _eadata since _data links to _index_data which can be
+        #  overwritten in groupby-reduction
+        new_values = self._eadata.astype(dtype, copy=copy)


Can you explain why this is needed in astype, while the comment is about groupby?

And it's not possible to prevent that _data gets written in the first place? (or is it done in a try/except that only fails later .. ). Adding yet another data like attribute doesn't seem ideal ..

And it's not possible to prevent that _data gets written in the first place?

Take a look at how we patch _index_data in libreduction. I spent some time trying to find a way around that and didn't find any.

Could we make the _index_data in a settable property that on setting also updates the underlying data of the EA? (to keep both synced)

(didn't look in detail, so this might be nonsense)

_index_data is set using object.__setattr__, so I don't think that would work.

_index_data is set using object.setattr, so I don't think that would work.

Based on a quick test, it seems that setattr still goes through a property's setter (now, if the actual idea can help, I would need to take a further look)

Do you have an example test that hits this (a failing test if you don't do the above change) ?

Do you have an example test that hits this (a failing test if you don't do the above change) ?

If you make the groupby change to stop catching AttributeError but don't change the index code, there is a test (dont know which off the top of my head) that raises an AttributeError in DatetimeIndex.map because index._data is an ndarray instead of a DTA.

jreback · 2019-10-23T12:40:11Z

pandas/core/groupby/generic.py

@@ -261,7 +261,7 @@ def aggregate(self, func=None, *args, **kwargs):

            try:
                return self._python_agg_general(func, *args, **kwargs)
-            except AssertionError:
+            except (AssertionError, AttributeError):


what case is this catching? can you comment.

its catching it just to re-raise instead of doing fallback/ignore

…dex_data

jorisvandenbossche · 2019-10-23T16:18:22Z

If you make the groupby change to stop catching AttributeError but don't change the index code, there is a test (dont know which off the top of my head) that raises an AttributeError in DatetimeIndex.map because index._data is an ndarray instead of a DTA.

Thanks! It is pandas/tests/groupby/test_groupby.py::test_groupby_reindex_inside_function.

That applies a function in groupby that uses index.map(..), and it is the DatetimeIndex.map function that calls astype. So that is the reason you need to fix astype to get the test passing in this PR.
But not sure that this is a proper fix then, as the fact that it is astype you need to fix is thus only accidental because the test was using that (the test could also use something else as astype, and it might then fail in another place?)

jbrockmendel · 2019-10-23T16:22:57Z

the test could also use something else as astype, and it might then fail in another place?

Yes, this is a side-effect of the _index_data workaround and we're going to surface and fix this bugs as we suppress fewer exceptions in the groupby code.

…dex_data

jorisvandenbossche · 2019-10-23T18:07:08Z

we're going to surface and fix this bugs

Hmm, you could also formulate this as "we're going to give bugs to the users"?
In my mind, it is currently not a "bug" if the function that you pass as a user to agg raises for some reason an AttributeError in the fast cython path, but works in the fallback slow path. Right now, we ensure that we at least try to slow path if the fast path does not work. If that is needed for certain functions, I don't think that is a bug.

I looked for a moment into the idea of "properly" setting the data (still a hack of course), seems to pass the groupby tests:

diff --git a/pandas/_libs/reduction.pyx b/pandas/_libs/reduction.pyx
index 7ed131e1c..10c113edc 100644
--- a/pandas/_libs/reduction.pyx
+++ b/pandas/_libs/reduction.pyx
@@ -409,7 +409,7 @@ cdef class SeriesGrouper:
                         cached_typ = self.typ(vslider.buf, index=cached_ityp,
                                               name=name)
                     else:
-                        object.__setattr__(cached_ityp, '_data', islider.buf)
+                        object.__setattr__(cached_ityp, '_index_data', islider.buf)
                         cached_ityp._engine.clear_mapping()
                         object.__setattr__(
                             cached_typ._data._block, 'values', vslider.buf)
diff --git a/pandas/core/groupby/generic.py b/pandas/core/groupby/generic.py
index c766fcaa4..023ace965 100644
--- a/pandas/core/groupby/generic.py
+++ b/pandas/core/groupby/generic.py
@@ -262,7 +262,7 @@ class SeriesGroupBy(GroupBy):
 
             try:
                 return self._python_agg_general(func, *args, **kwargs)
-            except (AssertionError, TypeError):
+            except (AssertionError, TypeError, AttributeError):
                 raise
             except Exception:
                 result = self._aggregate_named(func, *args, **kwargs)
diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index 4c15e4b26..81a0b0cef 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -533,6 +533,16 @@ class Index(IndexOpsMixin, PandasObject):
             setattr(result, k, v)
         return result._reset_identity()
 
+    @property
+    def _index_data(self):
+        return self._data
+    
+    @_index_data.setter
+    def _index_data(self, data):
+        self._data = data
+
     @cache_readonly
     def _constructor(self):
         return type(self)
diff --git a/pandas/core/indexes/datetimelike.py b/pandas/core/indexes/datetimelike.py
index bf89bbbdf..a03702da0 100644
--- a/pandas/core/indexes/datetimelike.py
+++ b/pandas/core/indexes/datetimelike.py
@@ -98,6 +98,17 @@ class DatetimeIndexOpsMixin(ExtensionOpsMixin):
     __iter__ = ea_passthrough(DatetimeLikeArrayMixin.__iter__)
     mean = ea_passthrough(DatetimeLikeArrayMixin.mean)
 
+    @property
+    def _index_data(self):
+        return self._data._data
+    
+    @_index_data.setter
+    def _index_data(self, data):
+        self._data._data = data
+
     @property
     def freq(self):
         """
diff --git a/pandas/core/indexes/period.py b/pandas/core/indexes/period.py
index a20290e77..8988307a4 100644
--- a/pandas/core/indexes/period.py
+++ b/pandas/core/indexes/period.py
@@ -297,7 +297,7 @@ class PeriodIndex(DatetimeIndexOpsMixin, Int64Index, PeriodDelegateMixin):
         result = object.__new__(cls)
         result._data = values
         # For groupby perf. See note in indexes/base about _index_data
-        result._index_data = values._data
         result.name = name
         result._reset_identity()
         return result
diff --git a/pandas/core/indexes/timedeltas.py b/pandas/core/indexes/timedeltas.py
index 2324b8cf7..b10a9c63f 100644
--- a/pandas/core/indexes/timedeltas.py
+++ b/pandas/core/indexes/timedeltas.py
@@ -272,7 +272,7 @@ class TimedeltaIndex(
         result._data = tdarr
         result.name = name
         # For groupby perf. See note in indexes/base about _index_data
-        result._index_data = tdarr._data
 
         result._reset_identity()
         return result

jbrockmendel · 2019-10-23T18:35:28Z

Hmm, you could also formulate this as "we're going to give bugs to the users"?

I could, but don't. I formulate it as "there are bugs present, and this makes it easier to find and fix them"

I looked for a moment into the idea of "properly" setting the data (still a hack of course), seems to pass the groupby tests:

Comments around this hack suggest it is there for performance reasons. If the new hack you're suggesting doesn't involve a huge performance hit, then great. But getting rid of all hack-ness in/around _index_data is a bigger task that is outside the scope of this PR.

…dex_data

jreback · 2019-10-25T12:51:11Z

pandas/core/groupby/generic.py

@@ -262,7 +262,7 @@ def aggregate(self, func=None, *args, **kwargs):

            try:
                return self._python_agg_general(func, *args, **kwargs)
-            except (AssertionError, TypeError):
+            except (AssertionError, TypeError, AttributeError):


you need to remove AttributeError from the clause below

can you indicate where the errors come from here as much as possible

jorisvandenbossche · 2019-10-25T13:26:38Z

I could, but don't. I formulate it as "there are bugs present, and this makes it easier to find and fix them"

It's not necessarily a user bug.
We know we have a hack with the _index_data, we know that a valid groupby operation can sometimes fail because of that hack (but in the fallback it is handled properly). Surfacing our failure to the user is not a good fix IMO.

Trying to see where is our disagreement:
Do you agree that this PR can cause a regression for a user? (of course every PR can give regression in unforeseen ways, but here we know it)

Assume we didn't have the test_groupby_reindex_inside_function. In that case, you wouldn't have thought about "fixing" the DatetimeIndex.astype method. So if we would start raising the AttributeError as you do in this PR (as it would seem there is no reason to catch it), that example would fail.
We have now a test that used astype (so you fixed astype), but there are many other operations that a user could do inside the applied function with the index that does not use astype but something else that relies on the fact that _data is an EA.

You say: "by raising the error, it makes it easier to find bugs".
I agree that this change might turn up more cases where an AttributeError is incorrectly raised that should be fixed. But we know such a case that might happen: because of our _index_data hack, we will see AttributeErrors. But the bug here is out _index_data hack, which we already know.

So maybe to summarize my reasoning: as long as we have this _index_data hack, I think we should continue catching the AttributeError to prevent users from seeing this (as it is something they cannot do anything about, it's something we need to fix in pandas).

BUG: AttributeError raised by map inside groupby reduction

155406f

gfyoung added Bug Groupby labels Oct 22, 2019

jorisvandenbossche reviewed Oct 22, 2019

View reviewed changes

jreback requested changes Oct 23, 2019

View reviewed changes

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

f18144d

…dex_data

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

4180d3b

…dex_data

jbrockmendel added 4 commits October 23, 2019 11:49

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

69426cc

…dex_data

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

e652b0d

…dex_data

CLN: revert catch

4582436

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

5a0ab54

…dex_data

jreback added this to the 1.0 milestone Oct 25, 2019

jreback requested changes Oct 25, 2019

View reviewed changes

jbrockmendel closed this Oct 25, 2019

jbrockmendel deleted the index_data branch November 21, 2019 20:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: AttributeError raised by map inside groupby reduction #29160

BUG: AttributeError raised by map inside groupby reduction #29160

jbrockmendel commented Oct 22, 2019

jorisvandenbossche Oct 22, 2019

jbrockmendel Oct 22, 2019

jorisvandenbossche Oct 22, 2019

jbrockmendel Oct 22, 2019

jorisvandenbossche Oct 23, 2019

jbrockmendel Oct 23, 2019

jreback Oct 23, 2019

jbrockmendel Oct 23, 2019

jorisvandenbossche commented Oct 23, 2019

jbrockmendel commented Oct 23, 2019

jorisvandenbossche commented Oct 23, 2019

jbrockmendel commented Oct 23, 2019

jreback Oct 25, 2019

jreback Oct 25, 2019

jorisvandenbossche commented Oct 25, 2019

BUG: AttributeError raised by map inside groupby reduction #29160

BUG: AttributeError raised by map inside groupby reduction #29160

Conversation

jbrockmendel commented Oct 22, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Oct 23, 2019

jbrockmendel commented Oct 23, 2019

jorisvandenbossche commented Oct 23, 2019

jbrockmendel commented Oct 23, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Oct 25, 2019