BUG: Pandas changing values when inferring dtypes #40992

Closed

giovannirescia opened this issue Apr 17, 2021 · 6 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions) · Duplicate Report (Duplicate issue or pull request) · Enhancement · IO JSON (read_json, to_json, json_normalize)

Comments

@giovannirescia

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd

# Read the JSON Lines file shown below; column dtypes are inferred by default.
df = pd.read_json('file.jsonl', orient='records', lines=True)

df['uuid']

Problem description

I have a JSON Lines file (file.jsonl) that looks like this:

{"id": 1, "uuid": "1344800117571260417"}
{"id": 2, "uuid": "1344900117571260918"}

After reading that file using the above code, this is what I get:

    id                 uuid
0   1  1344800117571260416
1   2  1344900117571260928

So the uuids have changed.

Expected Output

My output should look like:

    id                 uuid
0   1  1344800117571260417
1   2  1344900117571260918

On Stack Overflow, someone answered my post with the following explanation:

int(float("1344900117571260918")) is 1344900117571260928. I assume pandas first uses floats and afterwards converts to int, so precision is lost.

So the problem is how pandas is inferring the types.
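
A minimal check of that explanation (plain Python; the values are the ones from the answer above):

# float64 has a 53-bit mantissa, so integers above 2**53 cannot all be
# represented exactly; parsing the string as a float rounds to the nearest
# representable value.
value = "1344900117571260918"
print(int(value))         # 1344900117571260918 (exact)
print(int(float(value)))  # 1344900117571260928 (precision lost)
print(2 ** 53)            # 9007199254740992, the exact-integer limit for float64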

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 2cb9652
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-49-generic
Version : #55~20.04.1-Ubuntu SMP Fri Mar 26 01:01:07 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.18.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.4.3
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

@giovannirescia giovannirescia added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Apr 17, 2021
@phofl
Member

phofl commented Apr 17, 2021

This gives you your expected result:

data = """
{"id": 1, "uuid": "1344800117571260417"}
{"id": 2, "uuid": "1344900117571260918"}
"""

df = pd.read_json(StringIO(data), orient='records', lines=True, dtype="int64")

@phofl
Member

phofl commented Apr 17, 2021

The inference is done with astype and float64, which causes the precision loss. Maybe we could document this more clearly.
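
A rough illustration of that conversion path (a sketch, not the actual read_json internals):

import pandas as pd

s = pd.Series(["1344800117571260417", "1344900117571260918"])

# Round-tripping through float64, as the inference does, loses precision:
print(s.astype("float64").astype("int64"))  # 1344800117571260416, 1344900117571260928

# Converting the strings directly to int64 is exact:
print(s.astype("int64"))                    # 1344800117571260417, 1344900117571260918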

@phofl phofl added the Docs and IO JSON (read_json, to_json, json_normalize) labels and removed the Bug and Needs Triage labels Apr 17, 2021
@giovannirescia
Author

This gives you your expected result:

from io import StringIO

import pandas as pd

data = """
{"id": 1, "uuid": "1344800117571260417"}
{"id": 2, "uuid": "1344900117571260918"}
"""

df = pd.read_json(StringIO(data), orient='records', lines=True, dtype="int64")

Yes, even using dtype='int' or dtype=False solves the problem, but I was looking for an explanation of why this was happening, which you just confirmed.
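
For reference, the dtype=False variant (a sketch; it disables dtype inference, so the quoted uuids stay strings and nothing is rounded):

from io import StringIO

import pandas as pd

data = '{"id": 1, "uuid": "1344800117571260417"}\n{"id": 2, "uuid": "1344900117571260918"}'

# With dtype=False the 'uuid' column keeps its JSON string values.
df = pd.read_json(StringIO(data), orient='records', lines=True, dtype=False)
print(df['uuid'].dtype)  # object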

@asishm
Contributor

asishm commented Apr 17, 2021

IMO this should be a bug; having a default that leads to a loss of precision is not desirable. This is also a bit counterintuitive and inconsistent, because this doesn't occur with CSVs.
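
For comparison, the same values read from a CSV (a quick sketch):

from io import StringIO

import pandas as pd

csv_data = "id,uuid\n1,1344800117571260417\n2,1344900117571260918"

# read_csv parses these as int64 directly, without a float64 round trip,
# so the values survive intact.
df = pd.read_csv(StringIO(csv_data))
print(df['uuid'])  # 1344800117571260417, 1344900117571260918 (dtype: int64)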

@phofl
Member

phofl commented Apr 17, 2021

I think this would be more of an enhancement of the inference logic in read_json.

@lithomas1 lithomas1 added the Enhancement and Dtype Conversions (Unexpected or buggy dtype conversions) labels and removed the Docs label Apr 18, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Apr 19, 2021
@asishm
Contributor

asishm commented Apr 19, 2021

Can close this as a duplicate of #20608, where it's classified as a bug.

Reading through that, it seems to me that a potential solution could be either of the following (a rough sketch of the first option follows the list):

  • converting to int first (instead of float by default), in which case OverflowErrors also need to be handled; this approach would probably require changes to tests, as some of them check for a float dtype (for empty dfs), or

  • documenting this limitation.
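
A hypothetical sketch of the first option (the helper name and structure are illustrative, not pandas internals):

import numpy as np

def infer_integers_first(values):
    # Illustrative only: try an exact int64 conversion before falling
    # back to the current lossy float64 path.
    try:
        return np.asarray(values, dtype="int64")
    except (OverflowError, ValueError):
        # e.g. values too large for int64, or non-integer strings like "1.5"
        return np.asarray(values, dtype="float64")

print(infer_integers_first(["1344900117571260918"]))  # exact int64
print(infer_integers_first(["1.5"]))                  # falls back to float64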

@jreback jreback added the Duplicate Report (Duplicate issue or pull request) label Apr 19, 2021
@jreback jreback closed this as completed Apr 19, 2021
@lithomas1 lithomas1 removed this from the Contributions Welcome milestone Apr 20, 2021