Groupby inconsistency with categorical values #17032
Comments
@aplavin : We've had this issue before with I'm a little wary about adding a parameter to That being said, this method is a little different from
@gfyoung I'm not talking about adding a parameter to As of now the only way I see to have nonempty groups only (as in non-categorical groupby) is something like
I see, though outside of categorical, what would this parameter mean? I would still be wary of adding another parameter and functionality just because of how special-cased it is in the context of
xref #8559 (comment) Rather than changing some general purpose method(s) such as
@toobaz if I understand you correctly, then it probably won't work in general. Suppose there is column
In light of @toobaz's reference, I think this raises a larger question of how we want to handle counting of It might be worthwhile to create one umbrella issue that addresses this from an API standpoint and close the duplicates, as this issue seems to have cropped up on several occasions. @jreback : what do you think?
Yes, you understood me correctly... and yes, you're right...
See the extensive discussion in #8559; closing as a duplicate.
Code Sample
Problem description
I'm using categorical columns to save memory and possibly improve performance when there are lots of strings, and would expect grouping by them to behave the same whether the string column is left as-is or converted to category. If keeping empty groups is the expected behavior for categorical groupby, is it possible to at least provide a boolean parameter to `groupby`, like `empty_groups`? Or maybe an even simpler solution exists, but I couldn't find it.
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.16-gentoo
machine: x86_64
processor: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8
pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
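The inconsistency described under "Problem description" can be reproduced with a small sketch like the one below. The frame, the column names `key` and `value`, and the unused category `"c"` are made up for illustration and are not the original code sample. The sketch also uses the `observed` keyword of `groupby`, which was added in pandas 0.23 and so is not available in the 0.20.2 version reported here; it is passed explicitly because its default differs between pandas releases.

```python
import pandas as pd

# Hypothetical data: two keys that occur, grouped by a plain string column.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
by_str = df.groupby("key")["value"].sum()

# Same data with the key stored as a categorical that also declares
# a category "c" which never occurs in the rows.
df_cat = df.assign(key=pd.Categorical(df["key"], categories=["a", "b", "c"]))

# observed=False keeps a (zero-sum) group for the unused category "c".
with_empty = df_cat.groupby("key", observed=False)["value"].sum()

# observed=True keeps only categories that actually appear,
# matching the result of grouping by the string column.
only_observed = df_cat.groupby("key", observed=True)["value"].sum()

print(list(by_str.index))         # ['a', 'b']
print(list(with_empty.index))     # ['a', 'b', 'c']
print(list(only_observed.index))  # ['a', 'b']
```

On versions without `observed`, the empty groups would have to be filtered out after aggregation instead.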