BUG: read_csv(dtype='category') raises with many categories #18186
Comments
Thanks for the report. Something like

diff --git a/pandas/_libs/parsers.pyx b/pandas/_libs/parsers.pyx
index 85857c158..e3a2a2186 100644
--- a/pandas/_libs/parsers.pyx
+++ b/pandas/_libs/parsers.pyx
@@ -2228,8 +2228,9 @@ def _concatenate_chunks(list chunks):
         arrs = [chunk.pop(name) for chunk in chunks]
         # Check each arr for consistent types.
         dtypes = set([a.dtype for a in arrs])
-        if len(dtypes) > 1:
-            common_type = np.find_common_type(dtypes, [])
+        numpy_dtypes = {x for x in dtypes if not is_categorical_dtype(x)}
+        if len(numpy_dtypes) > 1:
+            common_type = np.find_common_type(numpy_dtypes, [])
             if common_type == np.object:
                 warning_columns.append(str(name))

is what we want, though that's specific to Categoricals. We would want to avoid sending any of our extension dtypes to
use
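The idea behind the patch can be illustrated outside the parser. The following is a minimal sketch using the public pandas API, not the Cython code in `parsers.pyx`: the chunk values are made up, `isinstance(..., pd.CategoricalDtype)` stands in for `is_categorical_dtype`, and combining the chunks with `union_categoricals` is an assumption about how categorical chunks might be merged rather than something stated in this thread.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two chunks of the same column; each chunk infers its own categories.
# (Hypothetical values, chosen only for illustration.)
chunk_a = pd.Categorical(["a", "b"])
chunk_b = pd.Categorical(["c", "d"])

# Different categories give different CategoricalDtype objects,
# so the set of dtypes has more than one element.
dtypes = {chunk_a.dtype, chunk_b.dtype}

# Filter out categorical dtypes before any NumPy common-type check,
# mirroring the `numpy_dtypes` set in the proposed patch.
numpy_dtypes = {d for d in dtypes if not isinstance(d, pd.CategoricalDtype)}

# Assumption: the categorical chunks are combined by taking the
# union of their categories instead of going through NumPy.
combined = union_categoricals([chunk_a, chunk_b])
```

With only categorical chunks, `numpy_dtypes` is empty, so nothing categorical would reach the NumPy common-type lookup.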
@adbull do you have time to submit a PR with a fix like that, along with some tests?
Is there a workaround which would help us run existing code on 0.21.0?
@tomanizer not sure off the top of my head. Reading them as strings and then converting to categorical, I suppose, but that may not be an option depending on memory usage. You could presumably use
We're doing a bugfix release Wednesday or Thursday. If you have a chance to submit a PR before then, we can get it merged.
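The read-as-strings workaround mentioned above might look like the following. This is a rough sketch, not code from the thread: the in-memory CSV and the column name `col` are made up, and, as noted, the intermediate string column may use considerably more memory than reading directly as a category.

```python
import io

import pandas as pd

# Hypothetical data: one column with many distinct values.
csv_text = "col\n" + "\n".join(f"v{i}" for i in range(1000)) + "\n"

# Step 1: read the column as plain strings instead of dtype='category'.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Step 2: convert to categorical after the whole column is loaded.
df["col"] = df["col"].astype("category")
```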
Code Sample, a copy-pastable example if possible
results in
Problem description
read_csv now raises when reading a column with many unique values as a category. This appears to be a regression in 0.21.0, due to the introduction of CategoricalDtype.
Expected Output
No exception.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.11-300.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None
pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: 0.1.3
pandas_gbq: None
pandas_datareader: None