BUG: groupby upon categorical and sort=False triggers ValueError #13179
Comments
So the purpose of the So in this case you are removing the values for category '1', BUT that should still show up in the results as it's a categorical. for
I suppose for I think would just do this in groupby (or maybe add a kw arg to
cc @sinhrks
Hi @jreback - Thanks for receiving the bug report. I have just a little doubt as a layman here: is it convention to return a 'group' (in the groupby) for all the categories even though there is no data for them available in the supplied data? Imagine I make a query for just chromosomes:
query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].groupby('chromosomes').A.sum()
chromosomes
4 195.0
5 394.0
Name: A, dtype: float64
# as opposed to :
chromosomes
1 NaN
2 NaN
3 NaN
4 195.0
5 394.0
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
X NaN
Y NaN
Name: A, dtype: float64
@mpschr yes, this is the purpose of If you think about it, it would be buggy to remove them! IOW, how would the code know it's 'ok' to drop them?
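For readers on newer pandas: releases later than the 0.18.1 used in this thread added an `observed` keyword to `groupby`, which lets the caller state explicitly whether unused categories should appear in the result. A minimal sketch of the two behaviours being debated, assuming a small made-up DataFrame (the column names `chromosomes` and `A` come from the thread; the values and the shortened category list are invented for illustration):

```python
import pandas as pd

# Hypothetical, shortened category list (the thread uses 24 chromosomes).
chroms = ['1', '2', '3', '4', '5']
df = pd.DataFrame({
    'chromosomes': pd.Categorical(['3', '4', '4', '5'],
                                  categories=chroms, ordered=True),
    'A': [10.0, 1.0, 2.0, 3.0],  # invented values
})

# Keep only rows for chromosomes 4 and 5; the dtype still knows all 5 categories.
subset = df[df.chromosomes.isin(['4', '5'])]

# observed=False: one group per category, including categories with no rows.
all_groups = subset.groupby('chromosomes', observed=False).A.sum()

# observed=True: only categories that actually occur in the data.
seen_groups = subset.groupby('chromosomes', observed=True).A.sum()

print(len(all_groups), len(seen_groups))
print(seen_groups)
```

The value reported for empty groups under `observed=False` depends on the pandas version, so the sketch only compares the number of groups.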
Hi @jreback I am not sure if we are talking about the same thing. I elaborate: I was referring to the data available in the
query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].chromosomes
71 4
72 4
73 4
74 5
75 5
76 5
77 5
78 5
79 5
80 5
81 5
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]
But, analogously to this I would expect the following output after doing groupby:
df[df.chromosomes.isin(query_chroms)].groupby('chromosomes').A.sum().reset_index().chromosomes
#expected output:
chromosomes
1 4
2 5
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]
#but actual output is.
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
20 21
21 22
22 X
23 Y
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]
The actual output (2nd option) we get here is misleading since all chromosomes except 4 and 5 are not in the supplied data, they are just 'acceptable' options. Is it possible that these two different viewpoints may contribute to the bug reported here?
@mpschr:
I think there was a specific reason why unique is now not returning the whole categories (as far as I remember, the first implementation simply returned the categories). I think because someone argued that the implicit API contract for
Ok, so this is the current behaviour:
# 1.
query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].chromosomes.unique()
#output
[4, 5]
Categories (2, object): [4 < 5]
Again, here what a layman like me would expect is the following.
# 2.
query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].chromosomes.unique()
#output
[4, 5]
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]
Now I understood the bug :) The
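The difference between the two outputs above is only in which categories the result of `unique()` carries, not in its values. A small sketch, assuming an invented Series with a shortened category list, that shows how to inspect both and how to drop unused categories explicitly with `Series.cat.remove_unused_categories()`:

```python
import pandas as pd

# Hypothetical, shortened category list (the thread uses 24 chromosomes).
chroms = ['1', '2', '3', '4', '5']
s = pd.Series(pd.Categorical(['4', '4', '5'], categories=chroms, ordered=True))

uniq = s.unique()

# The unique values are the observed ones in either behaviour.
print(list(uniq))

# How many categories the result keeps has differed between pandas versions,
# so it is printed here rather than assumed.
print(len(uniq.categories))

# Dropping unused categories explicitly does not depend on that choice.
trimmed = s.cat.remove_unused_categories()
print(list(trimmed.cat.categories))
```

Calling `remove_unused_categories()` makes the intent explicit, which avoids relying on whichever behaviour of `unique()` the installed version implements.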
@mpschr not sure what you mean. This is as expected. The point is that the category dtype IS propagated to ALL operations. There is extensive documentation on this. What exactly is not clear? (the bug in this issue is independent / not related to this).
Yep @jreback - I think I went a bit off-topic with the groupby behaviour (including unused categories in the output of group aggregations). In any case I totally agree with you on the matter with the
yeah I don't really recall all of the discussion about Yeah I can see how we just return the observed values
Has this not been fixed yet? (Just out of curiosity.)
@mpschr issues get closed when they are fixed. You are welcome to submit a PR to fix this. Community PRs push things along.
closes pandas-dev#13179
Author: Kernc <[email protected]>
Closes pandas-dev#15439 from kernc/Categorical.unique-nostrip-unused and squashes the following commits:
55733b8 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing
2aec326 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing
c813146 [Kernc] PERF: add asv for categorical grouping
0c550e6 [Kernc] BUG: Fix .groupby(categorical, sort=False) failing
Code that triggers ValueError
The combination of sort=False and a missing category in the data causes the bug - see below
First off, see this notebook which showcases the bug nicely: github.com/mpschr/pandas_missing_cat_bug
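The notebook's code is not reproduced on this page. As a rough stand-in, here is a minimal sketch of the failing combination, assuming a small made-up DataFrame (the column names `chromosomes` and `A` come from the thread; the values and the shortened category list are invented for illustration):

```python
import pandas as pd

# Hypothetical, shortened category list (the report uses 24 chromosomes).
chroms = ['1', '2', '3', '4', '5']
df = pd.DataFrame({
    'chromosomes': pd.Categorical(['1', '2', '3', '4', '5', '5'],
                                  categories=chroms, ordered=True),
    'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # invented values
})

# Filter out chromosome 1 so that one category has no rows left.
filtered = df[df.chromosomes != '1']

# Group on the categorical column with sort=False, the combination
# the report describes as failing on pandas 0.18.1.
try:
    result = filtered.groupby('chromosomes', sort=False).A.sum()
    outcome = 'ok'
except ValueError as exc:
    result = None
    outcome = 'ValueError: {}'.format(exc)

print(pd.__version__, outcome)
```

On a pandas version that includes the fix referenced in this thread, the call should complete and `outcome` should be `'ok'`; recent releases may additionally emit a warning about the default of the `observed` keyword.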
Summaries of the scenarios where this bug appears:
Bug scenarios with ordered categories:
- sort = True: No error
- chromosome 1 filtered out and sort=True: No error
- chromosome 1 filtered out and sort=False: Error
- sort = False: Error
Bug scenarios without ordered categories:
the 4 scenarios:
- sort = True: No error
- chromosome 1 filtered out and sort=True: No error
- sort = False: No error
- chromosome 1 filtered out and sort=False: Error
Expected Output
Not an error, but this:
output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
nose: 1.3.4
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.22
numpy: 1.10.4
scipy: 0.16.0
statsmodels: 0.6.0.dev-9ce1605
xarray: None
IPython: 4.1.2
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.36.0
pandas_datareader: None