Reading with read_stata in chunks messes up categories #31544
Seems like a bug. Might need to handle categories using a new path when chunked.
bashtage pushed a commit to bashtage/pandas that referenced this issue May 12, 2020
Return categoricals with the same categories if possible when reading data through an iterator. Warn if not possible. closes pandas-dev#31544
bashtage pushed a commit to bashtage/pandas that referenced this issue Jun 2, 2020
Return categoricals with the same categories if possible when reading data through an iterator. Warn if not possible. closes pandas-dev#31544
Code Sample, a copy-pastable example if possible
Problem description
My data has categories, but they are lost simply because I'm reading it in chunks. I noticed this while reading a large database in chunks, of which I only needed a subset of columns. Ironically, reading it in chunks is precisely what made memory usage explode when I concatenated the chunks back together.
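The memory effect described above can be sketched without a Stata file. This is a minimal illustration, not the original code sample from this issue: `chunk1` and `chunk2` are hypothetical stand-ins for two chunks of the same labelled column, each of which happened to see only one of the two labels. It shows that plain concatenation of mismatched categoricals drops the categorical dtype, and two possible workarounds on the user side: `union_categoricals`, or casting each chunk to one shared `CategoricalDtype` before concatenating.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical stand-ins for two chunks of one labelled column; each chunk
# only saw one of the two labels, so their categorical dtypes differ.
chunk1 = pd.Series(["a"] * 10_000, dtype="category")
chunk2 = pd.Series(["b"] * 10_000, dtype="category")

# Plain concatenation of mismatched categoricals drops the categorical dtype.
naive = pd.concat([chunk1, chunk2], ignore_index=True)

# Option 1: merge the categories explicitly.
merged = pd.Series(union_categoricals([chunk1, chunk2]))

# Option 2: cast every chunk to one shared dtype before concatenating.
# The category list is assumed to be known up front here.
shared = pd.CategoricalDtype(categories=["a", "b"])
recast = pd.concat(
    [chunk1.astype(shared), chunk2.astype(shared)], ignore_index=True
)

print(naive.dtype, merged.dtype, recast.dtype)
print(naive.memory_usage(deep=True), merged.memory_usage(deep=True))
```

Both workarounds require touching every chunk by hand, which is why having the reader return identical categories for all chunks would be preferable.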
And by the way, Out[8] shows that pandas is aware of the actual categories even before iterating, so this is the information that should be used to recreate them consistently, and all chunks should have exactly the same (as in `is`) categorical dtype.
Expected Output
Out[10] should feature both categories, and Out[13] should still be a categorical.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.0-6-amd64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8
pandas : 1.1.0.dev0+276.g2495068ad
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 18.1
setuptools : 41.0.1
Cython : 0.29.13
pytest : 4.6.3
hypothesis : 3.71.11
sphinx : 1.8.4
blosc : 1.7.0
feather : None
xlsxwriter : 0.9.3
lxml.etree : 4.3.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.7 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.2
matplotlib : 3.0.2
numexpr : 2.6.9
odfpy : None
openpyxl : 2.4.9
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 4.6.3
pyxlsb : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.18
tables : 3.4.4
tabulate : 0.8.3
xarray : 0.11.3
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : 0.9.3
numba : 0.45.0