value_counts() crashes if groupby object contains empty groups #28479


Closed
xuancong84 opened this issue Sep 17, 2019 · 16 comments · Fixed by #28634

@xuancong84

When you group statistical counts by day, it is possible that some days have no counts at all. This results in empty groups in the groupby object, and calling value_counts() on such a groupby object causes a crash.

The following example illustrates the problem:

import pandas as pd

df = pd.DataFrame({'Timestamp': [1565083561, 1565083561+86400, 1565083561+86500,
                                 1565083561+86400*2, 1565083561+86400*3,
                                 1565083561+86500*3, 1565083561+86400*4],
                   'Food': ['apple', 'apple', 'banana', 'banana', 'orange', 'orange', 'pear']})
df['Datetime'] = pd.to_datetime(df['Timestamp'], unit='s')
display(df)

# Every day has at least one row, so value_counts() works:
dfg = df.groupby(pd.Grouper(freq='1D', key='Datetime'))
for g in dfg:
    print(g)
display(dfg['Food'].value_counts())

# Dropping row 3 leaves 2019-08-08 with no rows, and value_counts() crashes:
df = df.drop([3])
display(df)
dfg = df.groupby(pd.Grouper(freq='1D', key='Datetime'))
for g in dfg:
    print(g)
display(dfg['Food'].value_counts())

This table does not contain any day with empty data, so value_counts() does not crash:

Timestamp Food Datetime
1565083561 apple 2019-08-06 09:26:01
1565169961 apple 2019-08-07 09:26:01
1565170061 banana 2019-08-07 09:27:41
1565256361 banana 2019-08-08 09:26:01
1565342761 orange 2019-08-09 09:26:01
1565343061 orange 2019-08-09 09:31:01
1565429161 pear 2019-08-10 09:26:01

After grouping by day:
(Timestamp('2019-08-06 00:00:00', freq='D'), Timestamp Food Datetime
0 1565083561 apple 2019-08-06 09:26:01)
(Timestamp('2019-08-07 00:00:00', freq='D'), Timestamp Food Datetime
1 1565169961 apple 2019-08-07 09:26:01
2 1565170061 banana 2019-08-07 09:27:41)
(Timestamp('2019-08-08 00:00:00', freq='D'), Timestamp Food Datetime
3 1565256361 banana 2019-08-08 09:26:01)
(Timestamp('2019-08-09 00:00:00', freq='D'), Timestamp Food Datetime
4 1565342761 orange 2019-08-09 09:26:01
5 1565343061 orange 2019-08-09 09:31:01)
(Timestamp('2019-08-10 00:00:00', freq='D'), Timestamp Food Datetime
6 1565429161 pear 2019-08-10 09:26:01)

Result of value_counts():
Datetime Food
2019-08-06 apple 1
2019-08-07 apple 1
banana 1
2019-08-08 banana 1
2019-08-09 orange 2
2019-08-10 pear 1
Name: Food, dtype: int64

This table contains a day with empty data (2019-08-08), so value_counts() crashes:

Timestamp Food Datetime
1565083561 apple 2019-08-06 09:26:01
1565169961 apple 2019-08-07 09:26:01
1565170061 banana 2019-08-07 09:27:41
1565342761 orange 2019-08-09 09:26:01
1565343061 orange 2019-08-09 09:31:01
1565429161 pear 2019-08-10 09:26:01

After grouping by day (note the empty group on 2019-08-08):
(Timestamp('2019-08-06 00:00:00', freq='D'), Timestamp Food Datetime
0 1565083561 apple 2019-08-06 09:26:01)
(Timestamp('2019-08-07 00:00:00', freq='D'), Timestamp Food Datetime
1 1565169961 apple 2019-08-07 09:26:01
2 1565170061 banana 2019-08-07 09:27:41)
(Timestamp('2019-08-08 00:00:00', freq='D'), Empty DataFrame
Columns: [Timestamp, Food, Datetime]
Index: [])

(Timestamp('2019-08-09 00:00:00', freq='D'), Timestamp Food Datetime
4 1565342761 orange 2019-08-09 09:26:01
5 1565343061 orange 2019-08-09 09:31:01)
(Timestamp('2019-08-10 00:00:00', freq='D'), Timestamp Food Datetime
6 1565429161 pear 2019-08-10 09:26:01)

value_counts() crashes:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-543-5efc1c882109> in <module>
     14 [print(g) for g in dfg]
     15 print('This will cause crash:')
---> 16 display(dfg['Food'].value_counts())

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in value_counts(self, normalize, sort, ascending, bins, dropna)
   1137 
   1138         # multi-index components
-> 1139         labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
   1140         levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
   1141         names = self.grouper.names + [self._selection_name]

~/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
    469 
    470     """
--> 471     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    472 
    473 

~/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     54 def _wrapfunc(obj, method, *args, **kwds):
     55     try:
---> 56         return getattr(obj, method)(*args, **kwds)
     57 
     58     # An AttributeError occurs if the object does not have

ValueError: operands could not be broadcast together with shape (5,) (4,)

It turns out that this might result from a design flaw in DataFrame construction: it skips empty rows.
pd.DataFrame.from_dict(data={'row1':{'a':1, 'b':2}, 'row2': {'a':3, 'b':4, 'c':5}, 'row3':{}}, orient='index').fillna(0)

        a  b    c
row1    1  2  0.0
row2    3  4  5.0

Note that row3 is not constructed at all. The correct behavior should be:

        a  b    c
row1    1  2  0.0
row2    3  4  5.0
row3    0  0  0.0
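
A minimal workaround sketch for this construction quirk, assuming the full set of row labels is known up front, is to reindex after construction:

import pandas as pd

data = {'row1': {'a': 1, 'b': 2}, 'row2': {'a': 3, 'b': 4, 'c': 5}, 'row3': {}}
# from_dict drops the empty 'row3'; reindexing against the full key set
# restores it before the missing values are filled.
df = pd.DataFrame.from_dict(data, orient='index').reindex(list(data.keys())).fillna(0)
print(df)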
@Liam3851
Contributor

Simpler repro; happens on Series as well, not just DataFrame.

import pandas as pd
ser = pd.Series([1, 2], index=pd.DatetimeIndex(['2019-09-01', '2019-09-03']))
ser.groupby(pd.Grouper(freq='D')).value_counts()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-0a3140bd4ce6> in <module>
----> 1 ser.groupby(pd.Grouper(freq='D')).value_counts()

C:\Miniconda3\envs\py37\lib\site-packages\pandas\core\groupby\generic.py in value_counts(self, normalize, sort, ascending, bins, dropna)
   1242
   1243         # multi-index components
-> 1244         labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
   1245         levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
   1246         names = self.grouper.names + [self._selection_name]

C:\Miniconda3\envs\py37\lib\site-packages\numpy\core\fromnumeric.py in repeat(a, repeats, axis)
    469
    470     """
--> 471     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    472
    473

C:\Miniconda3\envs\py37\lib\site-packages\numpy\core\fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     54 def _wrapfunc(obj, method, *args, **kwds):
     55     try:
---> 56         return getattr(obj, method)(*args, **kwds)
     57
     58     # An AttributeError occurs if the object does not have

ValueError: operands could not be broadcast together with shape (3,) (2,)

@xuancong84
Author

I wonder how such a big issue has never been reported before. Grouping by day is a very common use case, right?

@Liam3851
Contributor

@xuancong84 I can reproduce on 0.19.2 (my earliest install), so this bug has been around a long time. My offhand guess is that groupby(...).value_counts() is a less common call, with users tending toward groupby(['a', 'b']).count() or some such rather than groupby('a')['b'].value_counts().
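
As a sketch, the multi-key spelling on the Series repro above avoids the crashing code path (empty days simply do not appear in the result):

import pandas as pd

ser = pd.Series([1, 2], index=pd.DatetimeIndex(['2019-09-01', '2019-09-03']))
# Group by the day bin and the values together, then count group sizes;
# this goes through size() rather than the broken value_counts() path.
ser.groupby([pd.Grouper(freq='D'), ser]).size()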

@TomAugspurger
Contributor

Thanks for the report. Anybody interested in working on this?

@TomAugspurger added this to the Contributions Welcome milestone Sep 19, 2019
@dongho-jung
Contributor

Hi, I'm interested in this! May I take this?

@dongho-jung
Contributor

I have found the cause of this issue using pdb and some monkeypatching.
The direct cause is that the rep function is functools.partial(<function repeat at 0x7fd2c1cbfdd0>, repeats=array([1, 1])), while self.grouper.recons_labels is [array([0, 1, 2])]:

# num. of times each group should be repeated
rep = partial(np.repeat, repeats=np.add.reduceat(inc, idx))
# multi-index components
labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
names = self.grouper.names + [self._selection_name]

So the array lengths don't match in map(rep, self.grouper.recons_labels) at line 1262.

To track this down further, I found the indirect cause: ids and val, since the rep function is built from them. With real code and pseudo-pdb output:

s = pd.Series([42, 99], index=pd.DatetimeIndex(['2019-09-10', '2019-09-12']))
sg = s.groupby(pd.Grouper(freq='1D'))
sg.value_counts()
    -> ...
    -> ids, val = ids[mask], val[mask]
    -> p ids, val
    [0, 2], [42, 99]
    -> ...
    -> *FAIL*

We need to supplement them when there are empty groups, as in this issue. In other words, we should densify them as follows (the first array in each pair is ids, the second is val):

[0, 2], [42, 99] -> [0, 1, 2], [42, None, 99]
[0, 0, 3, 3, 3], [111, 111, 222, 222, 333] -> [0, 0, 1, 2, 3, 3, 3], [111, 111, None, None, 222, 222, 333]
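
As a rough standalone sketch of that densifying step (the densify helper and its implementation are illustrative only, not the actual patch; it assumes ids is already sorted, as it is at this point in value_counts()):

import numpy as np

def densify(ids, val):
    # Fill the gaps in the sorted group ids with None placeholders,
    # e.g. [0, 2], [42, 99] -> [0, 1, 2], [42, None, 99].
    ids = np.asarray(ids)
    missing = np.setdiff1d(np.arange(ids.max() + 1), ids)
    dense_ids = np.sort(np.concatenate([ids, missing]))
    dense_val = np.full(len(dense_ids), None, dtype=object)
    # Slots whose id appeared in the original ids get the original values
    # back, in order; the gap slots keep the None placeholder.
    dense_val[np.isin(dense_ids, ids)] = val
    return dense_ids, dense_val

densify([0, 0, 3, 3, 3], [111, 111, 222, 222, 333])
# -> (array([0, 0, 1, 2, 3, 3, 3]), array([111, 111, None, None, 222, 222, 333], dtype=object))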

Now I have written the code block that makes ids and val dense, and I'm writing tests for it and figuring out what else to update, such as doc/source/whatsnew. Since this is my first contribution, I might miss something, so please let me know if I make a mistake or don't follow the rules :D

@dongho-jung
Contributor

However, this treatment only works when the index of the SeriesGroupBy is a DatetimeIndex, so I added a check for that before entering the code block I wrote. It's pretty ad hoc, though.

I think there are more fundamental solutions, but I'm still a pandas newbie, and all I was able to do was investigate this local code block. I'll upload the rest tomorrow because I'm totally exhausted now D:...

@dongho-jung
Contributor

I just tried it like this:

        # groupby removes null keys from groupings
        mask = ids != -1
        ids, val = ids[mask], val[mask]

        # when the index is sparse datetime, we should make it dense
        # otherwise, param of rep func below won't match with the labels
        if isinstance(self.obj.index, DatetimeIndex):
            dense_ids = np.arange(0, ids.max() + 1)

            if not np.array_equal(ids, dense_ids):
                unique_ids, count_ids = np.unique(ids, return_counts=True)
                additional_ids = unique_ids.repeat(count_ids - 1)
                dense_ids = np.sort(np.concatenate([dense_ids, additional_ids]))

                translate_idx = np.where(np.isin(dense_ids, ids), 1, None).nonzero()[0]

                dense_val = np.array([None]).repeat(dense_ids.shape)
                for i, v in zip(translate_idx, val):
                    dense_val[i] = v

                # [0, 2], [42, 77] --> [0, 1, 2], [42, None, 77]
                ids, val = dense_ids, dense_val

        if bins is None:
            lab, lev = algorithms.factorize(val, sort=True)
            llab = lambda lab, inc: lab[inc]

It solves this issue and passes the tests, though some of the benchmarks got significantly worse D:

      before           after         ratio
     [f08a1e62]       [c6cedf0b]
                      <fix-value_counts>
+         783±3μs          278±5ms   354.92  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'transformation')
+        792±10μs          279±3ms   352.98  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'direct')
+        959±10μs       3.40±0.1ms     3.54  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'transformation')
+         941±2μs      3.25±0.02ms     3.45  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'direct')
+         831±9μs       2.69±0.1ms     3.24  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'transformation')
+        850±20μs      2.55±0.01ms     3.00  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'direct')
+         841±3μs      2.51±0.02ms     2.99  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'direct')
+        832±20μs       2.40±0.1ms     2.88  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'transformation')
+         694±5μs         1.99±0ms     2.86  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'transformation')
+        813±10μs      2.32±0.02ms     2.85  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'direct')
+        852±10μs       2.42±0.1ms     2.84  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'transformation')
+         699±5μs      1.97±0.01ms     2.82  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'direct')
+       655±600μs         1.52±1ms     2.32  index_cached_properties.IndexCache.time_engine('MultiIndex')
+       51.5±20ms          107±1ms     2.07  multiindex_object.GetLoc.time_large_get_loc
+        67.1±1μs          134±8μs     2.00  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
...
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

So I think I should make self.grouper.recons_labels compact rather than making ids and val dense, but I'm not sure whether that will work.

@xuancong84
Author

@0xF4D3C0D3 Thank you for your attempt! :-) However, the benchmark results look terrible, so I do not think this is the right way to fix it. Please read my entire post, especially the last part about the DataFrame construction that skips empty rows. In general, the first step of value_counts() should not ignore empty groups, and the solution must be generic: it must apply to any groupby object, not just those with a DatetimeIndex.

Maybe you could add an option to the parent function controlling whether empty rows are ignored when constructing the DataFrame that is passed to the 2nd part of the value_counts() function (see the stack-trace).

@dongho-jung
Contributor

Oh, I overlooked the last part of your post; I had only looked at the time-series case. Thanks to your advice, I can now try another, more suitable approach. :D

But I don't understand what "the DataFrame to be passed to the 2nd part of the value_counts() function" means. Between
Series.value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True)
and
SeriesGroupBy.value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True),

what is the "2nd part"?

@xuancong84
Author

@0xF4D3C0D3 Thanks for looking into the other issue on DataFrame construction!

By "the 2nd part", I am referring to the crashing line in the stack-trace:
labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
Since the 1st part has ignored empty rows, the constructed DataFrame/Series has fewer rows than the groupby output. That discrepancy in the number of rows makes the broadcast operation fail, producing the stack-trace.
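
To see the failure in isolation, here is a minimal numpy reproduction of that broadcast error (the array contents are illustrative):

import numpy as np

labels = np.array([0, 1, 2])  # one reconstructed label per date bin
repeats = np.array([1, 1])    # but repeat counts only for the non-empty groups

# Mirrors the failing rep(recons_labels) call inside value_counts():
np.repeat(labels, repeats)
# ValueError: operands could not be broadcast together with shape (3,) (2,)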

@dongho-jung
Contributor

Oh wow, this is likely to be harder than I thought :D but thanks to your explanation, I've found what I can try. I'm not sure I'll manage it, but I'll do my best.

@dongho-jung
Contributor

dongho-jung commented Sep 23, 2019

I'm just looking here:

pandas/pandas/core/frame.py, lines 8402 to 8409 in 1285938:

def _from_nested_dict(data):
    # TODO: this should be seriously cythonized
    new_data = OrderedDict()
    for index, s in data.items():
        for col, v in s.items():
            new_data[col] = new_data.get(col, OrderedDict())
            new_data[col][index] = v
    return new_data

I modified it a little and am running the tests, feeling lucky.

EDIT: Agggh, never mind, I was looking at the wrong one.

@dongho-jung
Contributor

dongho-jung commented Sep 23, 2019

Hello guys! How about this?

diff --git a/pandas/core/groupby/generic.py b/pandas/core/groupby/generic.py
index e731cffea..47f3cca7c 100644
--- a/pandas/core/groupby/generic.py
+++ b/pandas/core/groupby/generic.py
@@ -9,6 +9,7 @@ from collections import OrderedDict, abc, namedtuple
 import copy
 import functools
 from functools import partial
+import itertools
 from textwrap import dedent
 import typing
 from typing import Any, Callable, FrozenSet, Iterator, Sequence, Type, Union
@@ -1264,9 +1265,12 @@ class SeriesGroupBy(GroupBy):
 
         # num. of times each group should be repeated
         rep = partial(np.repeat, repeats=np.add.reduceat(inc, idx))
-
-        # multi-index components
-        labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
+
+        # multi-index components
+        try:
+            labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
+        except ValueError:
+            labels = list(map(rep, [[k for k, _ in itertools.groupby(ids)]])) + [llab(lab, inc)]
         levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
         names = self.grouper.names + [self._selection_name]

I tried to make DataFrame construction not skip empty rows, and I partly succeeded: pd.DataFrame.from_dict(data={'row1':{'a':1, 'b':2}, 'row2': {'a':3, 'b':4, 'c':5}, 'row3':{}}, orient='index').fillna(0) now gives me a result that includes row3. But it didn't solve this issue D:, so I tried the other way I described above.

First, I think the design flaw that DataFrame construction ignores empty rows and the mismatch between rep and its argument are separate issues, though I may be misunderstanding. Is there any example that crashes like this where the index is not grouped by Grouper(freq=...) or something like a period? Since I haven't found such an example yet, I focused again on the mismatch between rep and its argument, and this time I tried to make the argument compact rather than expanding the parameter of rep. The diff above is the result: it passes all tests, including one for this issue, though I have not run the benchmarks yet.
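
For reference, a small illustration of what the itertools.groupby fallback in the diff computes:

import itertools

ids = [0, 0, 3, 3, 3]  # sorted ids of the non-empty groups; groups 1 and 2 are empty
# Dropping consecutive duplicates leaves one label per non-empty group,
# which matches the length of rep's repeats array.
compact = [k for k, _ in itertools.groupby(ids)]
print(compact)  # [0, 3]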

@dongho-jung
Contributor

dongho-jung commented Sep 24, 2019

I have found that the second way is more efficient than itertools.groupby, so I changed it and ran the benchmark again. (Is it normal for the benchmarks to be soooo slow??? It took maybe 5 hours on my Mac laptop and 3 hours on my desktop (32 GB RAM, i7)...) Btw, whenever I find something, I'll share it.

[screenshot of benchmark results]

dongho-jung added a commit to dongho-jung/pandas that referenced this issue Sep 24, 2019
…#28479)

* If applying rep to recons_labels fails, use ids with consecutive
  duplicates removed instead.
@biddwan09
Contributor

Hi @jreback, can I work on this issue? I really want to start contributing to this project.

@jreback modified the milestones: Contributions Welcome, 1.0 Nov 6, 2019