BUG: Pandas changing values when inferring dtypes #40992

Closed

giovannirescia opened this issue Apr 17, 2021 · 6 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions) · Duplicate Report (Duplicate issue or pull request) · Enhancement · IO JSON (read_json, to_json, json_normalize)

Comments

@giovannirescia

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd

# Read the JSON Lines file shown below; column dtypes are inferred by default.
df = pd.read_json('file.jsonl', orient='records', lines=True)

df['uuid']

Problem description

I have a JSON Lines file (file.jsonl) that looks like this:

{"id": 1, "uuid": "1344800117571260417"}
{"id": 2, "uuid": "1344900117571260918"}

After reading that file using the above code, this is what I get:

    id                 uuid
0   1  1344800117571260416
1   2  1344900117571260928

So the uuids have changed.

Expected Output

My output should look like:

    id                 uuid
0   1  1344800117571260417
1   2  1344900117571260918

On Stack Overflow, someone answered my post with the following explanation:

int(float("1344900117571260918")) is 1344900117571260928. I assume pandas first uses floats and afterwards converts to int, so precision is lost.

So the problem is how pandas is inferring the types.
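
A minimal check of that explanation (plain Python; the values are the ones from the answer above):

# float64 has a 53-bit mantissa, so integers above 2**53 cannot all be
# represented exactly; parsing the string as a float rounds to the nearest
# representable value.
value = "1344900117571260918"
print(int(value))         # 1344900117571260918 (exact)
print(int(float(value)))  # 1344900117571260928 (precision lost)
print(2 ** 53)            # 9007199254740992, the exact-integer limit for float64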

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 2cb9652
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-49-generic
Version : #55~20.04.1-Ubuntu SMP Fri Mar 26 01:01:07 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.18.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.4.3
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

@giovannirescia giovannirescia added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Apr 17, 2021
@phofl
Member

phofl commented Apr 17, 2021

This gives you your expected result:

data = """
{"id": 1, "uuid": "1344800117571260417"}
{"id": 2, "uuid": "1344900117571260918"}
"""

df = pd.read_json(StringIO(data), orient='records', lines=True, dtype="int64")

@phofl
Member

phofl commented Apr 17, 2021

The inference is done with astype and float64, which causes the precision loss. Maybe we could document this more clearly.
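
A rough illustration of that conversion path (a sketch, not the actual read_json internals):

import pandas as pd

s = pd.Series(["1344800117571260417", "1344900117571260918"])

# Round-tripping through float64, as the inference does, loses precision:
print(s.astype("float64").astype("int64"))  # 1344800117571260416, 1344900117571260928

# Converting the strings directly to int64 is exact:
print(s.astype("int64"))                    # 1344800117571260417, 1344900117571260918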

@phofl phofl added the Docs and IO JSON (read_json, to_json, json_normalize) labels and removed the Bug and Needs Triage labels Apr 17, 2021
@giovannirescia
Author

This gives you your expected result:

from io import StringIO

import pandas as pd

data = """
{"id": 1, "uuid": "1344800117571260417"}
{"id": 2, "uuid": "1344900117571260918"}
"""

df = pd.read_json(StringIO(data), orient='records', lines=True, dtype="int64")

Yes, even using dtype='int' or dtype=False solves the problem, but I was looking for an explanation of why this was happening, which you just confirmed.
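
For reference, the dtype=False variant (a sketch; it disables dtype inference, so the quoted uuids stay strings and nothing is rounded):

from io import StringIO

import pandas as pd

data = '{"id": 1, "uuid": "1344800117571260417"}\n{"id": 2, "uuid": "1344900117571260918"}'

# With dtype=False the 'uuid' column keeps its JSON string values.
df = pd.read_json(StringIO(data), orient='records', lines=True, dtype=False)
print(df['uuid'].dtype)  # object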

@asishm
Contributor

asishm commented Apr 17, 2021

IMO this should be a bug; having a default that leads to a loss of precision is not desirable. This is also a bit counterintuitive and inconsistent, because this doesn't occur with CSVs.
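
For comparison, the same values read from a CSV (a quick sketch):

from io import StringIO

import pandas as pd

csv_data = "id,uuid\n1,1344800117571260417\n2,1344900117571260918"

# read_csv parses these as int64 directly, without a float64 round trip,
# so the values survive intact.
df = pd.read_csv(StringIO(csv_data))
print(df['uuid'])  # 1344800117571260417, 1344900117571260918 (dtype: int64)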

@phofl
Member

phofl commented Apr 17, 2021

I think this would be more of an enhancement of the inference logic in read_json.

@lithomas1 lithomas1 added the Enhancement and Dtype Conversions (Unexpected or buggy dtype conversions) labels and removed the Docs label Apr 18, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Apr 19, 2021
@asishm
Contributor

asishm commented Apr 19, 2021

Can close this as a duplicate of #20608, where it's classified as a bug.

Reading through that, it seems to me that a potential solution could be either of the following (a rough sketch of the first option follows the list):

  • converting to int first (instead of float by default), in which case OverflowErrors also need to be handled; this approach would probably require changes to tests, as some of them check for a float dtype (for empty dfs), or

  • documenting this limitation.
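
A hypothetical sketch of the first option (the helper name and structure are illustrative, not pandas internals):

import numpy as np

def infer_integers_first(values):
    # Illustrative only: try an exact int64 conversion before falling
    # back to the current lossy float64 path.
    try:
        return np.asarray(values, dtype="int64")
    except (OverflowError, ValueError):
        # e.g. values too large for int64, or non-integer strings like "1.5"
        return np.asarray(values, dtype="float64")

print(infer_integers_first(["1344900117571260918"]))  # exact int64
print(infer_integers_first(["1.5"]))                  # falls back to float64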

@jreback jreback added the Duplicate Report (Duplicate issue or pull request) label Apr 19, 2021
@jreback jreback closed this as completed Apr 19, 2021
@lithomas1 lithomas1 removed this from the Contributions Welcome milestone Apr 20, 2021