value_counts() crashes if groupby object contains empty groups #28479


Closed
xuancong84 opened this issue Sep 17, 2019 · 16 comments · Fixed by #28634

@xuancong84

When you group statistical counts by day, it is possible that some days have no counts at all. This results in empty groups in the groupby object, and calling value_counts() on such a groupby object causes a crash.

The following example illustrates the problem:

import pandas as pd

df = pd.DataFrame({'Timestamp': [1565083561, 1565083561+86400, 1565083561+86500,
                                 1565083561+86400*2, 1565083561+86400*3,
                                 1565083561+86500*3, 1565083561+86400*4],
                   'Food': ['apple', 'apple', 'banana', 'banana', 'orange', 'orange', 'pear']})
df['Datetime'] = pd.to_datetime(df['Timestamp'], unit='s')
display(df)

# Every day has at least one row, so value_counts() works:
dfg = df.groupby(pd.Grouper(freq='1D', key='Datetime'))
for g in dfg:
    print(g)
display(dfg['Food'].value_counts())

# Dropping row 3 leaves 2019-08-08 with no rows, and value_counts() crashes:
df = df.drop([3])
display(df)
dfg = df.groupby(pd.Grouper(freq='1D', key='Datetime'))
for g in dfg:
    print(g)
display(dfg['Food'].value_counts())

This table does not contain any day with empty data, so value_counts() does not crash:

Timestamp Food Datetime
1565083561 apple 2019-08-06 09:26:01
1565169961 apple 2019-08-07 09:26:01
1565170061 banana 2019-08-07 09:27:41
1565256361 banana 2019-08-08 09:26:01
1565342761 orange 2019-08-09 09:26:01
1565343061 orange 2019-08-09 09:31:01
1565429161 pear 2019-08-10 09:26:01

After grouping by day:
(Timestamp('2019-08-06 00:00:00', freq='D'), Timestamp Food Datetime
0 1565083561 apple 2019-08-06 09:26:01)
(Timestamp('2019-08-07 00:00:00', freq='D'), Timestamp Food Datetime
1 1565169961 apple 2019-08-07 09:26:01
2 1565170061 banana 2019-08-07 09:27:41)
(Timestamp('2019-08-08 00:00:00', freq='D'), Timestamp Food Datetime
3 1565256361 banana 2019-08-08 09:26:01)
(Timestamp('2019-08-09 00:00:00', freq='D'), Timestamp Food Datetime
4 1565342761 orange 2019-08-09 09:26:01
5 1565343061 orange 2019-08-09 09:31:01)
(Timestamp('2019-08-10 00:00:00', freq='D'), Timestamp Food Datetime
6 1565429161 pear 2019-08-10 09:26:01)

Result of value_counts():
Datetime Food
2019-08-06 apple 1
2019-08-07 apple 1
banana 1
2019-08-08 banana 1
2019-08-09 orange 2
2019-08-10 pear 1
Name: Food, dtype: int64

This table contains a day with empty data (2019-08-08), so value_counts() crashes:

Timestamp Food Datetime
1565083561 apple 2019-08-06 09:26:01
1565169961 apple 2019-08-07 09:26:01
1565170061 banana 2019-08-07 09:27:41
1565342761 orange 2019-08-09 09:26:01
1565343061 orange 2019-08-09 09:31:01
1565429161 pear 2019-08-10 09:26:01

After grouping by day (note the empty group on 2019-08-08):
(Timestamp('2019-08-06 00:00:00', freq='D'), Timestamp Food Datetime
0 1565083561 apple 2019-08-06 09:26:01)
(Timestamp('2019-08-07 00:00:00', freq='D'), Timestamp Food Datetime
1 1565169961 apple 2019-08-07 09:26:01
2 1565170061 banana 2019-08-07 09:27:41)
(Timestamp('2019-08-08 00:00:00', freq='D'), Empty DataFrame
Columns: [Timestamp, Food, Datetime]
Index: [])

(Timestamp('2019-08-09 00:00:00', freq='D'), Timestamp Food Datetime
4 1565342761 orange 2019-08-09 09:26:01
5 1565343061 orange 2019-08-09 09:31:01)
(Timestamp('2019-08-10 00:00:00', freq='D'), Timestamp Food Datetime
6 1565429161 pear 2019-08-10 09:26:01)

value_counts() crashes:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-543-5efc1c882109> in <module>
     14 [print(g) for g in dfg]
     15 print('This will cause crash:')
---> 16 display(dfg['Food'].value_counts())

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in value_counts(self, normalize, sort, ascending, bins, dropna)
   1137 
   1138         # multi-index components
-> 1139         labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
   1140         levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
   1141         names = self.grouper.names + [self._selection_name]

~/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
    469 
    470     """
--> 471     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    472 
    473 

~/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     54 def _wrapfunc(obj, method, *args, **kwds):
     55     try:
---> 56         return getattr(obj, method)(*args, **kwds)
     57 
     58     # An AttributeError occurs if the object does not have

ValueError: operands could not be broadcast together with shape (5,) (4,)

It turns out that this might result from a design flaw in DataFrame construction: it skips empty rows.
pd.DataFrame.from_dict(data={'row1':{'a':1, 'b':2}, 'row2': {'a':3, 'b':4, 'c':5}, 'row3':{}}, orient='index').fillna(0)

        a  b    c
row1    1  2  0.0
row2    3  4  5.0

Note that row3 is not constructed at all. The correct behavior should be:

        a  b    c
row1    1  2  0.0
row2    3  4  5.0
row3    0  0  0.0
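
A minimal workaround sketch for this construction quirk, assuming the full set of row labels is known up front, is to reindex after construction:

import pandas as pd

data = {'row1': {'a': 1, 'b': 2}, 'row2': {'a': 3, 'b': 4, 'c': 5}, 'row3': {}}
# from_dict drops the empty 'row3'; reindexing against the full key set
# restores it before the missing values are filled.
df = pd.DataFrame.from_dict(data, orient='index').reindex(list(data.keys())).fillna(0)
print(df)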
@Liam3851
Contributor

Simpler repro; happens on Series as well, not just DataFrame.

import pandas as pd
ser = pd.Series([1, 2], index=pd.DatetimeIndex(['2019-09-01', '2019-09-03']))
ser.groupby(pd.Grouper(freq='D')).value_counts()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-0a3140bd4ce6> in <module>
----> 1 ser.groupby(pd.Grouper(freq='D')).value_counts()

C:\Miniconda3\envs\py37\lib\site-packages\pandas\core\groupby\generic.py in value_counts(self, normalize, sort, ascending, bins, dropna)
   1242
   1243         # multi-index components
-> 1244         labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
   1245         levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
   1246         names = self.grouper.names + [self._selection_name]

C:\Miniconda3\envs\py37\lib\site-packages\numpy\core\fromnumeric.py in repeat(a, repeats, axis)
    469
    470     """
--> 471     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    472
    473

C:\Miniconda3\envs\py37\lib\site-packages\numpy\core\fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     54 def _wrapfunc(obj, method, *args, **kwds):
     55     try:
---> 56         return getattr(obj, method)(*args, **kwds)
     57
     58     # An AttributeError occurs if the object does not have

ValueError: operands could not be broadcast together with shape (3,) (2,)

@xuancong84
Author

I wonder how such a big issue has never been reported before. Grouping by day is a very common use case, right?

@Liam3851
Contributor

@xuancong84 I can reproduce on 0.19.2 (my earliest install), so this bug has been around a long time. My offhand guess is that groupby(...).value_counts() is a less common call, with users tending toward groupby(['a', 'b']).count() or some such rather than groupby('a')['b'].value_counts().
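
As a sketch, the multi-key spelling on the Series repro above avoids the crashing code path (empty days simply do not appear in the result):

import pandas as pd

ser = pd.Series([1, 2], index=pd.DatetimeIndex(['2019-09-01', '2019-09-03']))
# Group by the day bin and the values together, then count group sizes;
# this goes through size() rather than the broken value_counts() path.
ser.groupby([pd.Grouper(freq='D'), ser]).size()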

@TomAugspurger
Contributor

Thanks for the report. Anybody interested in working on this?

@TomAugspurger added this to the Contributions Welcome milestone Sep 19, 2019
@dongho-jung
Contributor

Hi, I'm interested in this! May I take this?

@dongho-jung
Contributor

I have found the cause of this issue using pdb and some monkeypatching.
The direct cause is that the rep function is functools.partial(<function repeat at 0x7fd2c1cbfdd0>, repeats=array([1, 1])), while self.grouper.recons_labels is [array([0, 1, 2])]:

# num. of times each group should be repeated
rep = partial(np.repeat, repeats=np.add.reduceat(inc, idx))
# multi-index components
labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
names = self.grouper.names + [self._selection_name]

So the array lengths don't match in map(rep, self.grouper.recons_labels) at line 1262.

To track this down further, I found the indirect cause: ids and val, since the rep function is built from them. With real code and pseudo-pdb output:

s = pd.Series([42, 99], index=pd.DatetimeIndex(['2019-09-10', '2019-09-12']))
sg = s.groupby(pd.Grouper(freq='1D'))
sg.value_counts()
    -> ...
    -> ids, val = ids[mask], val[mask]
    -> p ids, val
    [0, 2], [42, 99]
    -> ...
    -> *FAIL*

We need to supplement them when there are empty groups, as in this issue. In other words, we should densify them as follows (the first array in each pair is ids, the second is val):

[0, 2], [42, 99] -> [0, 1, 2], [42, None, 99]
[0, 0, 3, 3, 3], [111, 111, 222, 222, 333] -> [0, 0, 1, 2, 3, 3, 3], [111, 111, None, None, 222, 222, 333]
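
As a rough standalone sketch of that densifying step (the densify helper and its implementation are illustrative only, not the actual patch; it assumes ids is already sorted, as it is at this point in value_counts()):

import numpy as np

def densify(ids, val):
    # Fill the gaps in the sorted group ids with None placeholders,
    # e.g. [0, 2], [42, 99] -> [0, 1, 2], [42, None, 99].
    ids = np.asarray(ids)
    missing = np.setdiff1d(np.arange(ids.max() + 1), ids)
    dense_ids = np.sort(np.concatenate([ids, missing]))
    dense_val = np.full(len(dense_ids), None, dtype=object)
    # Slots whose id appeared in the original ids get the original values
    # back, in order; the gap slots keep the None placeholder.
    dense_val[np.isin(dense_ids, ids)] = val
    return dense_ids, dense_val

densify([0, 0, 3, 3, 3], [111, 111, 222, 222, 333])
# -> (array([0, 0, 1, 2, 3, 3, 3]), array([111, 111, None, None, 222, 222, 333], dtype=object))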

Now I have written the code block that makes ids and val dense, and I'm writing tests for it and figuring out what else to update, such as doc/source/whatsnew. Since this is my first contribution, I might miss something, so please let me know if I make a mistake or don't follow the rules :D

@dongho-jung
Contributor

However, this treatment only works when the index of the SeriesGroupBy is a DatetimeIndex, so I added a check for that before entering the code block I wrote. It's pretty ad hoc, though.

I think there are more fundamental solutions, but I'm still a pandas newbie, and all I was able to do was investigate this local code block. I'll upload the rest tomorrow because I'm totally exhausted now D:...

@dongho-jung
Contributor

I just tried it like this:

        # groupby removes null keys from groupings
        mask = ids != -1
        ids, val = ids[mask], val[mask]

        # when the index is sparse datetime, we should make it dense
        # otherwise, param of rep func below won't match with the labels
        if isinstance(self.obj.index, DatetimeIndex):
            dense_ids = np.arange(0, ids.max() + 1)

            if not np.array_equal(ids, dense_ids):
                unique_ids, count_ids = np.unique(ids, return_counts=True)
                additional_ids = unique_ids.repeat(count_ids - 1)
                dense_ids = np.sort(np.concatenate([dense_ids, additional_ids]))

                translate_idx = np.where(np.isin(dense_ids, ids), 1, None).nonzero()[0]

                dense_val = np.array([None]).repeat(dense_ids.shape)
                for i, v in zip(translate_idx, val):
                    dense_val[i] = v

                # [0, 2], [42, 77] --> [0, 1, 2], [42, None, 77]
                ids, val = dense_ids, dense_val

        if bins is None:
            lab, lev = algorithms.factorize(val, sort=True)
            llab = lambda lab, inc: lab[inc]

It solves this issue and passes the tests, though some of the benchmarks got significantly worse D:

      before           after         ratio
     [f08a1e62]       [c6cedf0b]
                      <fix-value_counts>
+         783±3μs          278±5ms   354.92  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'transformation')
+        792±10μs          279±3ms   352.98  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'direct')
+        959±10μs       3.40±0.1ms     3.54  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'transformation')
+         941±2μs      3.25±0.02ms     3.45  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'direct')
+         831±9μs       2.69±0.1ms     3.24  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'transformation')
+        850±20μs      2.55±0.01ms     3.00  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'direct')
+         841±3μs      2.51±0.02ms     2.99  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'direct')
+        832±20μs       2.40±0.1ms     2.88  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'transformation')
+         694±5μs         1.99±0ms     2.86  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'transformation')
+        813±10μs      2.32±0.02ms     2.85  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'direct')
+        852±10μs       2.42±0.1ms     2.84  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'transformation')
+         699±5μs      1.97±0.01ms     2.82  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'direct')
+       655±600μs         1.52±1ms     2.32  index_cached_properties.IndexCache.time_engine('MultiIndex')
+       51.5±20ms          107±1ms     2.07  multiindex_object.GetLoc.time_large_get_loc
+        67.1±1μs          134±8μs     2.00  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
...
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

So I think I should make self.grouper.recons_labels compact rather than making ids and val dense, but I'm not sure whether that will work.

@xuancong84
Author

@0xF4D3C0D3 Thank you for your attempt! :-) However, the benchmark results look terrible, so I do not think this is the right way to fix it. Please read my entire post, especially the last part about the DataFrame construction that skips empty rows. In general, the first step of value_counts() should not ignore empty groups, and the solution must be generic: it must apply to any groupby object, not just those with a DatetimeIndex.

Maybe you could add an option to the parent function controlling whether empty rows are ignored when constructing the DataFrame that is passed to the 2nd part of the value_counts() function (see the stack-trace).

@dongho-jung
Contributor

Oh, I overlooked the last part of your post; I had only looked at the time-series case. Thanks to your advice, I can now try another, more suitable approach. :D

But I don't understand what "the DataFrame to be passed to the 2nd part of the value_counts() function" means. Between
Series.value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True)
and
SeriesGroupBy.value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True),

what is the "2nd part"?

@xuancong84
Author

@0xF4D3C0D3 Thanks for looking into the other issue on DataFrame construction!

By "the 2nd part", I am referring to the crashing line in the stack-trace:
labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
Since the 1st part has ignored empty rows, the constructed DataFrame/Series has fewer rows than the groupby output. That discrepancy in the number of rows makes the broadcast operation fail, producing the stack-trace.
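
To see the failure in isolation, here is a minimal numpy reproduction of that broadcast error (the array contents are illustrative):

import numpy as np

labels = np.array([0, 1, 2])  # one reconstructed label per date bin
repeats = np.array([1, 1])    # but repeat counts only for the non-empty groups

# Mirrors the failing rep(recons_labels) call inside value_counts():
np.repeat(labels, repeats)
# ValueError: operands could not be broadcast together with shape (3,) (2,)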

@dongho-jung
Contributor

Oh wow, this is likely to be harder than I thought :D but thanks to your explanation, I've found what I can try. I'm not sure I'll manage it, but I'll do my best.

@dongho-jung
Contributor

dongho-jung commented Sep 23, 2019

I'm just looking here:

pandas/pandas/core/frame.py, lines 8402 to 8409 in 1285938:

def _from_nested_dict(data):
    # TODO: this should be seriously cythonized
    new_data = OrderedDict()
    for index, s in data.items():
        for col, v in s.items():
            new_data[col] = new_data.get(col, OrderedDict())
            new_data[col][index] = v
    return new_data

I modified it a little and am running the tests, feeling lucky.

EDIT: Agggh, never mind, I was looking at the wrong one.

@dongho-jung
Contributor

dongho-jung commented Sep 23, 2019

Hello guys! How about this?

diff --git a/pandas/core/groupby/generic.py b/pandas/core/groupby/generic.py
index e731cffea..47f3cca7c 100644
--- a/pandas/core/groupby/generic.py
+++ b/pandas/core/groupby/generic.py
@@ -9,6 +9,7 @@ from collections import OrderedDict, abc, namedtuple
 import copy
 import functools
 from functools import partial
+import itertools
 from textwrap import dedent
 import typing
 from typing import Any, Callable, FrozenSet, Iterator, Sequence, Type, Union
@@ -1264,9 +1265,12 @@ class SeriesGroupBy(GroupBy):
 
         # num. of times each group should be repeated
         rep = partial(np.repeat, repeats=np.add.reduceat(inc, idx))
-
-        # multi-index components
-        labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
+
+        # multi-index components
+        try:
+            labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
+        except ValueError:
+            labels = list(map(rep, [[k for k, _ in itertools.groupby(ids)]])) + [llab(lab, inc)]
         levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
         names = self.grouper.names + [self._selection_name]

I tried to make DataFrame construction not skip empty rows, and I partly succeeded: pd.DataFrame.from_dict(data={'row1':{'a':1, 'b':2}, 'row2': {'a':3, 'b':4, 'c':5}, 'row3':{}}, orient='index').fillna(0) now gives me a result that includes row3. But it didn't solve this issue D:, so I tried the other way I described above.

First, I think the design flaw that DataFrame construction ignores empty rows and the mismatch between rep and its argument are separate issues, though I may be misunderstanding. Is there any example that crashes like this where the index is not grouped by Grouper(freq=...) or something like a period? Since I haven't found such an example yet, I focused again on the mismatch between rep and its argument, and this time I tried to make the argument compact rather than expanding the parameter of rep. The diff above is the result: it passes all tests, including one for this issue, though I have not run the benchmarks yet.
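
For reference, a small illustration of what the itertools.groupby fallback in the diff computes:

import itertools

ids = [0, 0, 3, 3, 3]  # sorted ids of the non-empty groups; groups 1 and 2 are empty
# Dropping consecutive duplicates leaves one label per non-empty group,
# which matches the length of rep's repeats array.
compact = [k for k, _ in itertools.groupby(ids)]
print(compact)  # [0, 3]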

@dongho-jung
Contributor

dongho-jung commented Sep 24, 2019

I have found that the second way is more efficient than itertools.groupby, so I changed it and ran the benchmark again. (Is it normal for the benchmarks to be soooo slow??? It took maybe 5 hours on my Mac laptop and 3 hours on my desktop (32 GB RAM, i7)...) Btw, whenever I find something, I'll share it.

[screenshot of benchmark results]

dongho-jung added a commit to dongho-jung/pandas that referenced this issue Sep 24, 2019
…#28479)

* If applying rep to recons_labels fails, use ids with consecutive
  duplicates removed instead.
@biddwan09
Contributor

Hi @jreback, can I work on this issue? I really want to start contributing to this project.

@jreback modified the milestones: Contributions Welcome, 1.0 Nov 6, 2019