value_counts() crashes if groupby object contains empty groups #28479
Simpler repro; happens on Series as well, not just DataFrame.

```python
import pandas as pd

ser = pd.Series([1, 2], index=pd.DatetimeIndex(['2019-09-01', '2019-09-03']))
ser.groupby(pd.Grouper(freq='D')).value_counts()
```

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-0a3140bd4ce6> in <module>
----> 1 ser.groupby(pd.Grouper(freq='D')).value_counts()

C:\Miniconda3\envs\py37\lib\site-packages\pandas\core\groupby\generic.py in value_counts(self, normalize, sort, ascending, bins, dropna)
   1242
   1243         # multi-index components
-> 1244         labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
   1245         levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
   1246         names = self.grouper.names + [self._selection_name]

C:\Miniconda3\envs\py37\lib\site-packages\numpy\core\fromnumeric.py in repeat(a, repeats, axis)
    469
    470     """
--> 471     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    472
    473

C:\Miniconda3\envs\py37\lib\site-packages\numpy\core\fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     54 def _wrapfunc(obj, method, *args, **kwds):
     55     try:
---> 56         return getattr(obj, method)(*args, **kwds)
     57
     58     # An AttributeError occurs if the object does not have

ValueError: operands could not be broadcast together with shape (3,) (2,)
```
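The broadcast error arises because `np.repeat` is handed one repeat count per daily bin (including the empty bin) but labels for only the non-empty groups. A minimal sketch of the mismatch, using made-up arrays rather than pandas internals:

```python
import numpy as np

# Illustrative arrays (not pandas internals): three daily bins, the middle one empty.
repeats = np.array([1, 0, 1])   # rows per bin, including the empty bin
labels = np.array([0, 2])       # reconstructed labels cover only the non-empty groups

try:
    np.repeat(labels, repeats)  # lengths 2 vs 3 cannot be broadcast against each other
except ValueError as exc:
    print(exc)
```

This reproduces the same `ValueError` text as the traceback above: `repeats` has shape `(3,)` while the label array has shape `(2,)`.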
I wonder how such a big issue has never been reported before. Grouping by day is a very common use case, right?
@xuancong84 I can reproduce on 0.19.2 (my earliest install), so this bug has been around a long time. My offhand guess is that …
Thanks for the report. Anybody interested in working on this? |
Hi, I'm interested in this! May I take this? |
I have found a cause of this issue using pdb and some monkeypatching: pandas/pandas/core/groupby/generic.py, lines 1258 to 1264 in f08a1e6.

So the number of params doesn't match at …. To solve this, I managed to find an indirect cause, and it was …. We need to supplement them when there are empty groups, as in this issue. In other words, we should …; the former is …. Now I have written a code block for making dense ….
However, this treatment only works when the index of the SeriesGroupBy is a DatetimeIndex. I think there are more fundamental solutions for this, but I'm still a pandas newbie, and all I was able to do was investigate this local code block. I'll upload the rest tomorrow because I'm totally exhausted now D:...
I just tried it like this:

It solves this issue and passes the tests, though some of the benchmarks got significantly worse D: so I think that I should make …
@0xF4D3C0D3 Thank you for your attempt! :-) Though, the benchmark result looks terrible; I do not think this is the right way to fix it. In fact, you need to read my entire post, especially the last part, which causes missing rows. In general, the first step of value_counts() should not ignore empty groups, and the solution must be generic and apply to any groupby object, not just those with a DatetimeIndex. Maybe you can add an additional option in the parent function, controlling whether to ignore empty rows when constructing the data frame to be passed to the 2nd part of the value_counts() function (see the stack trace).
Oh, I just overlooked the last part of your post; I only attempted the time-series case. Thanks to your advice, I'll be able to try another, more suitable way. :D But I don't understand the meaning of …
@0xF4D3C0D3 Thanks for looking into the other issue on DataFrame construction! By "the 2nd part", I am referring to the crashing line in the stack trace.
Oh wow, this is likely to be harder than I thought :D but thanks to your help, I've found what I can try.
I'm just looking at lines 8402 to 8409 in 1285938, and I modified it a little bit and am running the tests, feeling lucky. EDIT: Agggh, no, never mind, I was looking at the wrong one.
Hello guys! How about this?

```diff
diff --git a/pandas/core/groupby/generic.py b/pandas/core/groupby/generic.py
index e731cffea..47f3cca7c 100644
--- a/pandas/core/groupby/generic.py
+++ b/pandas/core/groupby/generic.py
@@ -9,6 +9,7 @@ from collections import OrderedDict, abc, namedtuple
 import copy
 import functools
 from functools import partial
+import itertools
 from textwrap import dedent
 import typing
 from typing import Any, Callable, FrozenSet, Iterator, Sequence, Type, Union
@@ -1264,9 +1265,12 @@ class SeriesGroupBy(GroupBy):

         # num. of times each group should be repeated
         rep = partial(np.repeat, repeats=np.add.reduceat(inc, idx))
-
-        # multi-index components
-        labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
+
+        # multi-index components
+        try:
+            labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
+        except ValueError:
+            labels = list(map(rep, [[k for k, _ in itertools.groupby(ids)]])) + [llab(lab, inc)]
         levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
         names = self.grouper.names + [self._selection_name]
```

I tried to make DataFrame construction not skip empty rows, and I somewhat made it work. First, I think the design flaw that DataFrame construction ignores empty rows and the mismatch between rep and its callee are separate issues, or I may be misunderstanding it. Is there any example …
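The `except ValueError` branch in the patch relies on `itertools.groupby` to collapse consecutive duplicate group ids into one label per *observed* group. A small illustration (the `ids` values here are made up for demonstration):

```python
import itertools

# Per-row group ids in sorted order; group id 2 never appears because its bin is empty.
ids = [0, 0, 1, 1, 1, 3]

# Collapse runs of consecutive duplicates: one label per observed group.
dedup = [k for k, _ in itertools.groupby(ids)]
print(dedup)  # [0, 1, 3]
```

This yields exactly as many labels as there are non-empty groups, which is why it lines up with the repeat counts when `recons_labels` does not.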
I have found that the second way is more efficient than …
…#28479) * If applying rep to recons_labels fails, use ids which have no consecutive duplicates instead.
Hi @jreback, can I work on this issue? I really want to start contributing to this project.
When you group some statistical counts by day, it is possible that on some days there are no counts at all. This results in empty groups in the groupby object, and calling value_counts() on such a groupby object crashes.
The following example illustrates the problem:
This table does not contain days with empty data, so value_counts() does not crash:
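The original code block was lost when this page was scraped; the following sketch rebuilds the frame from the group printouts that follow (the column values are copied from them, but the construction itself is my reconstruction, not the reporter's exact script):

```python
import pandas as pd

# Reconstructed input frame; values taken from the group printouts below.
df = pd.DataFrame({
    'Timestamp': [1565083561, 1565169961, 1565170061, 1565256361,
                  1565342761, 1565343061, 1565429161],
    'Food': ['apple', 'apple', 'banana', 'banana', 'orange', 'orange', 'pear'],
    'Datetime': pd.to_datetime(['2019-08-06 09:26:01', '2019-08-07 09:26:01',
                                '2019-08-07 09:27:41', '2019-08-08 09:26:01',
                                '2019-08-09 09:26:01', '2019-08-09 09:31:01',
                                '2019-08-10 09:26:01']),
})

grouped = df.groupby(pd.Grouper(key='Datetime', freq='D'))
counts = grouped['Food'].value_counts()  # works here: no daily bin is empty
print(counts)
```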
After groupby each day:

```
(Timestamp('2019-08-06 00:00:00', freq='D'),     Timestamp   Food            Datetime
0  1565083561  apple 2019-08-06 09:26:01)
(Timestamp('2019-08-07 00:00:00', freq='D'),     Timestamp    Food            Datetime
1  1565169961   apple 2019-08-07 09:26:01
2  1565170061  banana 2019-08-07 09:27:41)
(Timestamp('2019-08-08 00:00:00', freq='D'),     Timestamp    Food            Datetime
3  1565256361  banana 2019-08-08 09:26:01)
(Timestamp('2019-08-09 00:00:00', freq='D'),     Timestamp    Food            Datetime
4  1565342761  orange 2019-08-09 09:26:01
5  1565343061  orange 2019-08-09 09:31:01)
(Timestamp('2019-08-10 00:00:00', freq='D'),     Timestamp  Food            Datetime
6  1565429161  pear 2019-08-10 09:26:01)
```
Result of value_counts():

```
Datetime    Food
2019-08-06  apple     1
2019-08-07  apple     1
            banana    1
2019-08-08  banana    1
2019-08-09  orange    2
2019-08-10  pear      1
Name: Food, dtype: int64
```
This table contains a day with empty data (2019-08-08), so value_counts() will crash:
After groupby each day (note the empty group on 2019-08-08):

```
(Timestamp('2019-08-06 00:00:00', freq='D'),     Timestamp   Food            Datetime
0  1565083561  apple 2019-08-06 09:26:01)
(Timestamp('2019-08-07 00:00:00', freq='D'),     Timestamp    Food            Datetime
1  1565169961   apple 2019-08-07 09:26:01
2  1565170061  banana 2019-08-07 09:27:41)
(Timestamp('2019-08-08 00:00:00', freq='D'), Empty DataFrame
Columns: [Timestamp, Food, Datetime]
Index: [])
(Timestamp('2019-08-09 00:00:00', freq='D'),     Timestamp    Food            Datetime
4  1565342761  orange 2019-08-09 09:26:01
5  1565343061  orange 2019-08-09 09:31:01)
(Timestamp('2019-08-10 00:00:00', freq='D'),     Timestamp  Food            Datetime
6  1565429161  pear 2019-08-10 09:26:01)
```
value_counts() crashes:
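The failing script was likewise lost in the scrape. A sketch, assuming the same reconstructed frame with the single 2019-08-08 row removed; on affected versions (e.g. 0.25, as in the stack trace above) the `value_counts()` call raises the broadcast `ValueError`:

```python
import pandas as pd

# Same reconstructed frame as before, minus the single 2019-08-08 row,
# so the daily grouper produces an empty bin for that date.
df = pd.DataFrame({
    'Timestamp': [1565083561, 1565169961, 1565170061,
                  1565342761, 1565343061, 1565429161],
    'Food': ['apple', 'apple', 'banana', 'orange', 'orange', 'pear'],
    'Datetime': pd.to_datetime(['2019-08-06 09:26:01', '2019-08-07 09:26:01',
                                '2019-08-07 09:27:41', '2019-08-09 09:26:01',
                                '2019-08-09 09:31:01', '2019-08-10 09:26:01']),
})

grouped = df.groupby(pd.Grouper(key='Datetime', freq='D'))
sizes = grouped.size()
print(sizes)  # the 2019-08-08 bin has size 0

try:
    print(grouped['Food'].value_counts())
except ValueError as exc:  # raised on affected pandas versions
    print('crashed:', exc)
```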
It turns out that this might result from a design flaw in DataFrame construction, namely that it skips empty rows:

```python
pd.DataFrame.from_dict(data={'row1': {'a': 1, 'b': 2}, 'row2': {'a': 3, 'b': 4, 'c': 5}, 'row3': {}}, orient='index').fillna(0)
```

Take note that row3 is not constructed at all. The correct behavior should output:
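The expected-output block was lost in the scrape. A sketch of the expectation, plus a `reindex` workaround that restores the missing row (the workaround is my suggestion, not from the thread):

```python
import pandas as pd

# 'row3' is an empty dict; on affected versions from_dict drops it entirely.
data = {'row1': {'a': 1, 'b': 2},
        'row2': {'a': 3, 'b': 4, 'c': 5},
        'row3': {}}

df = pd.DataFrame.from_dict(data, orient='index').fillna(0)

# Reindexing restores the missing row (and is a no-op if it was kept):
df_full = df.reindex(list(data), fill_value=0)
print(df_full)
```

The expected frame has all three rows, with row3 as an all-zero row after `fillna(0)`.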