pd.read_json ignores 'category' dtypes #21892


Closed
alessandrobenedetti opened this issue Jul 13, 2018 · 2 comments · Fixed by #42819
Labels
Categorical · Enhancement · IO JSON
Comments

@alessandrobenedetti

alessandrobenedetti commented Jul 13, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

dtypes_dict = {'interactionRelevance': 'uint8', 'productBrand': 'category', ...}
data_frame = pd.read_json(input_file, lines=True, dtype=dtypes_dict)

Problem description

The 'category' dtype is simply ignored: the columns specified as 'category' come out of the JSON as 'object', with the corresponding memory consumption.

While debugging, I noticed this in /usr/local/lib/python3.6/site-packages/pandas/io/json/json.py:677:

                if dtype is not None:
                    try:
                        dtype = np.dtype(dtype)
                        return data.astype(dtype), True
                    except (TypeError, ValueError):
                        return data, False

np.dtype('category') doesn't look correct here.
Is this a bug, or have I misunderstood something?
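Indeed, 'category' is a pandas extension dtype rather than a NumPy dtype, so np.dtype('category') raises TypeError and the except branch above silently returns the data unconverted. A quick check illustrating this:

```python
import numpy as np
import pandas as pd

# 'category' is a pandas extension dtype, not a plain NumPy dtype:
print(pd.api.types.pandas_dtype("category"))  # category

try:
    np.dtype("category")
    converted = True
except TypeError as exc:
    # This is the exception the json.py snippet above swallows,
    # silently returning the data unconverted.
    converted = False
    print("np.dtype failed:", exc)
```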

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.3
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

I suspect that code was written before categoricals were implemented.

Someone could fix this if they want, but we generally recommend orient='table' if you need to round-trip DataFrames:

In [11]: df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'c'])})

In [12]: pd.read_json(df.to_json(orient='table'), orient='table').dtypes
Out[12]:
A    category
dtype: object

or converting your strings to categoricals later if you don't control the source.
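The latter can be sketched as follows (the column name and the inline JSON-lines data are hypothetical stand-ins for the external source):

```python
import io
import pandas as pd

# Hypothetical JSON-lines input standing in for the external source
raw = '{"productBrand": "acme"}\n{"productBrand": "beta"}\n{"productBrand": "acme"}\n'

df = pd.read_json(io.StringIO(raw), lines=True)

# Convert the string columns to categoricals immediately after parsing
for col in ["productBrand"]:  # hypothetical list of categorical columns
    df[col] = df[col].astype("category")
```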

@TomAugspurger added the IO JSON and Categorical labels Jul 13, 2018
@alessandrobenedetti
Author

Thanks Tom!
I appreciate the workaround, but I still believe it is a bug (low priority, of course).

In my case the JSON source is external; think of it as a log file.
Converting the columns immediately after parsing would be a fine option.
I wanted to reduce the memory footprint as much as possible before starting the processing steps.

This should do the trick for single-valued categoricals:
interactions[col] = interactions[col].astype('category')

What about multi-valued categoricals, where the value of a cell is a list of categories, such as ['item1', 'item2']?
Considering that I later one-hot encode them into separate columns, would it be memory-efficient to turn them into lists of categories as soon as the JSON parsing finishes?
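For reference, one way to one-hot encode such a multi-valued column after parsing can be sketched like this (the "tags" column name and data are hypothetical; Series.explode requires pandas >= 0.25, newer than the 0.23.3 in this report):

```python
import pandas as pd

# Hypothetical multi-valued column, as described above
df = pd.DataFrame({"tags": [["item1", "item2"], ["item2"], []]})

# A categorical column cannot hold lists directly, so explode first
# (one row per list element), categorize, then one-hot encode and
# collapse back to one row per original record.
exploded = df["tags"].explode().astype("category")
dummies = pd.get_dummies(exploded).groupby(level=0).max()
result = df.join(dummies)
```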
