read_json ignores dictionary as dtype #33205

Closed
dtrizna opened this issue Apr 1, 2020 · 5 comments · Fixed by #42819
Labels: Bug, Dtype Conversions, IO JSON

dtrizna commented Apr 1, 2020

Code Sample, a copy-pastable example if possible

import pandas as pd

dtypes = {
    'created': 'int64',
    'eventType': 'category',
    'severity': 'category'
}

df = pd.read_json('dataset.json', lines=True, dtype=dtypes)
df.info()

Results in:

created          int64
eventType        object
severity         object

Using .astype() instead converts types correctly:

df.astype(dtypes).info()
created          int64
eventType        category
severity         category

Problem description

read_json should apply the specified data types while loading the DataFrame from disk.
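
Until this is fixed, a minimal workaround sketch (reusing the dtypes dict and the dataset.json file from the example above) is to chain the conversion onto the read call:

import pandas as pd

dtypes = {
    'created': 'int64',
    'eventType': 'category',
    'severity': 'category'
}

# The dtype argument is ignored for category dtypes (the bug reported
# here), so convert explicitly right after reading.
df = pd.read_json('dataset.json', lines=True).astype(dtypes)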

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 0.25.3
numpy            : 1.17.4
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : 0.4.0
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None
@mekeenan1

take

@mekeenan1

Would like to work on this if that's ok! (New to the open source community)

@simonjayhawkins (Member)

Thanks @dtrizna for the report. Can you update the OP with a minimal reproducible example?

https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports provides details on how to provide the necessary information for us to reproduce your bug.

@simonjayhawkins added the IO JSON, Needs Info, and Dtype Conversions labels Apr 8, 2020

dtrizna commented Apr 8, 2020

Try it out with the following JSON file:

> type test.json
{"created": 1585669938386, "eventType": "TEST", "severity": "INFO"}
{"created": 1585669938387, "eventType": "TEST2", "severity": "INFO"}

When passing the dtype keyword argument to read_json, pandas just ignores the setting (note the data types come out as "object", not "category" as specified in the dtypes dictionary):

>>> import pandas as pd
>>> dtypes = {
...     'created': 'int64',
...     'eventType' : 'category',
...     'severity' : 'category'
...     }
>>> a = pd.read_json('test.json', lines=True, dtype=dtypes)
>>> a.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
created      2 non-null int64
eventType    2 non-null object
severity     2 non-null object
dtypes: int64(1), object(2)
memory usage: 176.0+ bytes

If we use the same dtypes dictionary with the DataFrame's astype method, the setting is applied (note the correct data types):

>>> a.astype(dtypes).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
created      2 non-null int64
eventType    2 non-null category
severity     2 non-null category
dtypes: category(2), int64(1)
memory usage: 332.0 bytes

This causes problems with large datasets, where reading the data with the correct types would drastically reduce RAM usage.
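
For large files, a chunked workaround sketch (assuming the same dataset.json and dtypes as above; the chunksize of 100_000 is an arbitrary choice) keeps peak memory lower by down-casting each chunk before the full frame is assembled:

import pandas as pd

dtypes = {
    'created': 'int64',
    'eventType': 'category',
    'severity': 'category'
}

# Read the file in chunks so each chunk can be converted before the
# whole object-dtype frame ever exists in memory at once.
reader = pd.read_json('dataset.json', lines=True, chunksize=100_000)
chunks = [chunk.astype(dtypes) for chunk in reader]

# Caveat: concatenating category columns whose categories differ between
# chunks falls back to object dtype, so re-apply astype once at the end.
df = pd.concat(chunks, ignore_index=True).astype(dtypes)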

@MarcoGorelli removed the Needs Info label May 14, 2020

jake9wi commented May 19, 2021

Have there been any updates on this? I am experiencing this issue with pandas 1.2.4.
