read_json ignores dictionary as dtype #33205

Closed
dtrizna opened this issue Apr 1, 2020 · 5 comments · Fixed by #42819
Labels: Bug, Dtype Conversions, IO JSON

dtrizna commented Apr 1, 2020

Code Sample, a copy-pastable example if possible

import pandas as pd

dtypes = {
    'created': 'int64',
    'eventType': 'category',
    'severity': 'category'
}

df = pd.read_json('dataset.json', lines=True, dtype=dtypes)
df.info()

Results in:

created          int64
eventType        object
severity         object

Using .astype() instead converts types correctly:

df.astype(dtypes).info()
created          int64
eventType        category
severity         category

Problem description

read_json should apply the specified data types while loading the DataFrame from disk.
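
Until this is fixed, a minimal workaround sketch (reusing the dtypes dict and the dataset.json file from the example above) is to chain the conversion onto the read call:

import pandas as pd

dtypes = {
    'created': 'int64',
    'eventType': 'category',
    'severity': 'category'
}

# The dtype argument is ignored for category dtypes (the bug reported
# here), so convert explicitly right after reading.
df = pd.read_json('dataset.json', lines=True).astype(dtypes)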

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 0.25.3
numpy            : 1.17.4
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : 0.4.0
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None
@mekeenan1

take

@mekeenan1

Would like to work on this if that's ok! (New to the open source community)

@simonjayhawkins (Member)

Thanks @dtrizna for the report. Can you update the OP with a minimal reproducible example?

https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports provides details on how to provide the necessary information for us to reproduce your bug.

@simonjayhawkins added the IO JSON, Needs Info, and Dtype Conversions labels Apr 8, 2020

dtrizna commented Apr 8, 2020

Try it out with the following JSON file:

> type test.json
{"created": 1585669938386, "eventType": "TEST", "severity": "INFO"}
{"created": 1585669938387, "eventType": "TEST2", "severity": "INFO"}

When passing the dtype keyword argument to read_json, pandas just ignores the setting (note the data types come out as "object", not "category" as specified in the dtypes dictionary):

>>> import pandas as pd
>>> dtypes = {
...     'created': 'int64',
...     'eventType' : 'category',
...     'severity' : 'category'
...     }
>>> a = pd.read_json('test.json', lines=True, dtype=dtypes)
>>> a.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
created      2 non-null int64
eventType    2 non-null object
severity     2 non-null object
dtypes: int64(1), object(2)
memory usage: 176.0+ bytes

If we use the same dtypes dictionary with the DataFrame's astype method, the setting is applied (note the correct data types):

>>> a.astype(dtypes).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
created      2 non-null int64
eventType    2 non-null category
severity     2 non-null category
dtypes: category(2), int64(1)
memory usage: 332.0 bytes

This causes problems with large datasets, where reading the data with the correct types would drastically reduce RAM usage.
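
For large files, a chunked workaround sketch (assuming the same dataset.json and dtypes as above; the chunksize of 100_000 is an arbitrary choice) keeps peak memory lower by down-casting each chunk before the full frame is assembled:

import pandas as pd

dtypes = {
    'created': 'int64',
    'eventType': 'category',
    'severity': 'category'
}

# Read the file in chunks so each chunk can be converted before the
# whole object-dtype frame ever exists in memory at once.
reader = pd.read_json('dataset.json', lines=True, chunksize=100_000)
chunks = [chunk.astype(dtypes) for chunk in reader]

# Caveat: concatenating category columns whose categories differ between
# chunks falls back to object dtype, so re-apply astype once at the end.
df = pd.concat(chunks, ignore_index=True).astype(dtypes)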

@MarcoGorelli removed the Needs Info label May 14, 2020

jake9wi commented May 19, 2021

Have there been any updates on this? I am experiencing this issue with pandas 1.2.4.
