
read_json reads large integers as strings incorrectly if dtype not explicitly mentioned #20608


Closed
Udayraj123 opened this issue Apr 4, 2018 · 18 comments · Fixed by #59284
Labels: Bug, IO JSON (read_json, to_json, json_normalize)
Milestone: 3.0

@Udayraj123

Udayraj123 commented Apr 4, 2018

Code Sample (Original Problem)

import pandas as pd

json_content = """
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
df = pd.read_json(json_content,
                  orient='index',       # read as transposed
                  convert_axes=False,   # don't convert keys to dates
                  )
print(df.info())
print(df)

Problem description

I'm using pandas to load JSON data, but found some strange behaviour in the read_json function.
In the code above, the integers given as strings aren't read correctly, even though there shouldn't be any overflow: the values are well within the int64 range.

It reads correctly on explicitly specifying the argument dtype=int, but I don't understand why. What changes when we specify the dtype?
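For reference, this is the variant with the explicit dtype that reads the values correctly (a sketch based on the description above; json_content is the sample string, df_ok just an illustrative name):

df_ok = pd.read_json(json_content,
                     orient='index',
                     convert_axes=False,
                     dtype=int)          # per the above: skips the lossy float step
print(df_ok['tid'].tolist())
# [9999999999999998, 9999999999999999, 10000000000000001, 10000000000000002]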

Corresponding SO discussion here:

Current Output

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2  10000000000000000
3  10000000000000000
4  10000000000000002

Expected Output

The tid values should have been preserved exactly:

None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002

A minimal pytest example

import pytest
from pandas import DataFrame, read_json
from pandas.util import testing as tm

@pytest.mark.parametrize('dtype', ['int'])
def test_large_ints_from_json_strings(dtype):
    # GH 20608
    df1 = DataFrame([9999999999999999, 10000000000000001], columns=['tid'])
    df_temp = df1.copy().astype(str)
    df2 = read_json(df_temp.to_json())
    assert (df1 == df2).all().all()  # currently False

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Apr 4, 2018

Hmm, well, the problem stems from the code in _try_convert_data:

def _try_convert_data(self, name, data, use_dtypes=True,

It looks like that method converts the parsed object to a float and then to an int, which causes the loss of precision on your large numbers:

In [10]: int(float("9999999999999999"))
Out[10]: 10000000000000000
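(Editorial context: the loss is inherent to IEEE-754 doubles, not pandas-specific. float64 has a 53-bit significand, so not every integer above 2**53 is representable:)

>>> 2 ** 53
9007199254740992
>>> float(9999999999999999)   # rounds to the nearest representable double
1e+16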

Want to try a fix in a PR?

@Udayraj123
Author

Sure! I'm new to the pandas source code, but I can try :)

@WillAyd
Member

WillAyd commented Apr 5, 2018

Feel free to ask any questions here, or if you submit a PR, make sure to reference this issue. There are plenty of helpful people here to get you through the process.

@Udayraj123
Author

Udayraj123 commented Apr 5, 2018

Actually, I have a few questions:

  1. assert_frame_equal is also seemingly failing to detect the difference between 9999999999999999 and 10000000000000000. Currently I've made a (temporary) test file at pandas/tests/io/json/my_test.py which contains the following code:
import pytest
from pandas import DataFrame, read_json
from pandas.util import testing as tm

@pytest.fixture
def input_json():
    json_content = """
    {
        "0": {"tid": "9999999999999999"},
        "1": {"tid": "10000000000000001"}
    }
    """
    return json_content

@pytest.fixture
def expected_df():
    return DataFrame([9999999999999999, 10000000000000001], columns=['tid'])


@pytest.mark.parametrize('dtype', ['int64', 'int'])
def test_large_ints_from_json_strings(dtype, input_json, expected_df):
    # GH 20608
    # fixtures are requested as arguments rather than called directly
    data = read_json(input_json, orient='index')
    # data = read_json(input_json, orient='index', dtype={'tid': dtype})
    new_data = expected_df
    print('')
    print(new_data)
    print(data)
    tm.assert_frame_equal(new_data, data)

I am testing with the following command:
pytest my_test.py -v -s --full-trace
and getting PASSED output. Is this expected? If yes, how can I verify that the correct dataframe is generated after a fix?

  2. After finalizing, I'm thinking of placing this into pandas/tests/io/json/test_pandas.py; the function name would be test_large_ints_from_json_strings. Does this seem fine?

I've gone through the following links so far:
https://pandas.pydata.org/pandas-docs/stable/contributing.html
https://github.com/pandas-dev/pandas/wiki/Testing

@Udayraj123
Author

Also (in reference to the above), during the test the value 9999999999999999 is changed from int to float at line 693:

data = data.astype('float64')

and then converted back to int at this line:

new_data = data.astype('int64')

which changes its value to 10000000000000000. Just below that is a check for data equivalence, but it also passes without an issue.

If we pass dtype={'tid': int}, it returns directly as int without converting to float. But I'm assuming we can't do the same in the above case (as it may affect pandas' default type conversions)?
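(Editorial sketch reproducing that astype round trip in isolation, mirroring the two lines quoted above:)

import pandas as pd

s = pd.Series(["9999999999999999"])
# the intermediate float64 step rounds the value; str -> int64 directly would be exact
print(s.astype('float64').astype('int64')[0])  # 10000000000000000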

@WillAyd
Member

WillAyd commented Apr 5, 2018

  1. Just in general, for your test you don't need a fixture unless you plan on reusing it elsewhere. For now, just put all of your items in the method definition. And for what it's worth, shouldn't your test be failing? Isn't that the bug you are trying to fix?

  2. Yes, I think that location is fine.

  3. That equivalence test is performed against a data variable that was already cast to float, so hasn't the precision already been lost?

@Udayraj123
Author

Udayraj123 commented Apr 5, 2018

  1. Okay, I was just trying that out 😄. And yes, it should be failing: the two dataframes are different (one with 999… and the other with 100…), but assert_frame_equal is not detecting that. Can you please show me a sample way of testing that those two dataframes are different?

  2. Cool!

  3. Yes, the precision issue is there. Maybe we can cast it back to int and then compare, is that right?

Adding to point 1: comparing with (new_data == data).all() gives the correct result (False), while assert_frame_equal still passes the test. I don't understand why:

def test_large_ints_from_json_strings(dtype):
    json_content = """
    {
        "0": {"tid": "9999999999999999"},
        "1": {"tid": "10000000000000001"}
    }
    """
    data = read_json(json_content, orient='index')
    new_data = DataFrame([9999999999999999, 10000000000000001], columns=['tid'])
    # new_data = new_data.astype('int64')
    print((new_data == data).all())        # False
    tm.assert_frame_equal(new_data, data)  # PASSED?
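(Editorial note, not from the thread: a likely explanation is that assert_frame_equal defaults to approximate value comparison (check_exact=False), and the relative difference between 9999999999999999 and 10000000000000000 is about 1e-16, far below the default tolerance. Forcing exact comparison should surface the mismatch:)

tm.assert_frame_equal(new_data, data, check_exact=True)  # should raise AssertionError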

@WillAyd
Member

WillAyd commented Apr 5, 2018

Sorry, I misread your original point, but the below definitely fails for me:

In [3]: json_content="""{ 
   ...:     "0" : {"tid":"9999999999999999"},
   ...:     "1" : {"tid":"10000000000000001"}
   ...: }"""
In [4]: exp = pd.DataFrame([9999999999999999,10000000000000001],columns=['tid'])
In [6]: tm.assert_frame_equal(pd.read_json(json_content), exp)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

Perhaps try using the complete example from your original post.

As to your solution: once the precision is lost, casting back to int will not work. You'll have to think of another way - maybe we should attempt int first instead of float? Not saying that's the answer, but think it through.
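(Editorial sketch of the "int first" idea raised above - an illustration only, not the actual pandas patch: attempt an exact int64 conversion before falling back to float64. The helper name is hypothetical.)

import pandas as pd

def try_convert_ints_first(data):
    # Sketch: object/str -> int64 is exact per element, so try it before any float cast.
    try:
        return data.astype('int64')
    except (ValueError, TypeError, OverflowError):
        pass
    try:
        return data.astype('float64')  # fallback, e.g. for decimals or missing values
    except (ValueError, TypeError):
        return data                    # leave unconverted if nothing applies

print(try_convert_ints_first(pd.Series(['9999999999999999']))[0])  # 9999999999999999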

@Udayraj123
Author

Udayraj123 commented Apr 5, 2018

In your example, orient='index' is missing; it transposes the dataframe (making tid a column). So the assert is failing because the transpose isn't applied.

Instead of a raw JSON string, I've thought of this more intuitive way using to_json() (which is still passing when it shouldn't):

@pytest.mark.parametrize('dtype', ['int'])
def test_large_ints_from_json_strings(dtype):
    # GH 20608
    df1 = DataFrame([9999999999999999, 10000000000000001], columns=['tid'])
    df_temp = df1.copy().astype(str)
    df2 = read_json(df_temp.to_json())
    print((df1 == df2).all())        # False
    tm.assert_frame_equal(df1, df2)  # PASSED?

Hmm, I did think about the order of casting types; it seems there has to be an additional case. I'll send a patch when I find a solution, but I would appreciate it if someone else worked on this too.

@Udayraj123
Author

I've updated a working test in the main issue at the top.

@WillAyd
Member

WillAyd commented Apr 5, 2018

I haven't looked into your latest issue, but you are making this more complicated than it needs to be. Just have one frame constructed via read_json and the other constructed manually, along the lines of the sketch below. Don't do any type conversions, printing, or copying. You also don't need to parametrize it.
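(Editorial sketch of that shape - not the final committed test; check_exact=True avoids the approximate comparison discussed earlier, and at this point the test would still fail, which is the point:)

def test_read_json_large_ints_from_strings():
    # GH 20608: integers given as JSON strings should round-trip exactly
    result = read_json('{"tid": {"0": "9999999999999999", "1": "10000000000000001"}}')
    expected = DataFrame({'tid': [9999999999999999, 10000000000000001]})
    tm.assert_frame_equal(result, expected, check_exact=True)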

@gfyoung gfyoung added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and IO JSON (read_json, to_json, json_normalize) labels Apr 10, 2018
@gfyoung
Member

gfyoung commented Apr 10, 2018

@Udayraj123 : Do you have a patch for your original bug? I would submit that as a PR. We can move the discussion about the actual changes / test there.

@Udayraj123
Author

@gfyoung : No I don't have a patch yet.

@eavidan

eavidan commented May 27, 2020

I have just encountered this issue as well. Indeed, it seems that the following line is always reached:

self._try_convert_types()

which leads to _try_convert_data(), as @WillAyd mentioned.

Looking at _try_convert_data(), a workaround would be to supply read_json with an empty dtype dictionary; this way the data conversion will not happen:

if self.dtype is False:

The following now works as expected:

import pandas as pd

json_content = """
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
df = pd.read_json(json_content,
                  orient='index',      # read as transposed
                  convert_axes=False,  # don't convert keys to dates
                  dtype={}             # disable dtype coercion
                  )
print(df.info())
print(df)

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tid     4 non-null      object
dtypes: object(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002
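(Editorial follow-up to this workaround: once the column comes back as object, an explicit cast runs Python's exact str-to-int conversion, with no float detour:)

df['tid'] = df['tid'].astype('int64')
print(df['tid'].tolist())
# [9999999999999998, 9999999999999999, 10000000000000001, 10000000000000002]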

@WillAyd, what would be the correct way to handle this? Is convert_axes=False the correct option to control this conversion?

What is the role of self._try_convert_types(), given that _try_convert_data is already executed in _convert_axes?

@WillAyd
Member

WillAyd commented May 27, 2020

Thanks for taking a look. convert_axes should be unrelated. The problem should be in _try_convert_data, which immediately converts to float and from there to integer. I'm not sure why it does this, so if you want to try your hand at a PR, you might have some luck refactoring that method.

@eavidan

eavidan commented May 31, 2020

This is strange behavior. I believe there should be a way to disable this conversion, and it should be documented.
Right now it is not obvious that you can disable it simply by passing dtype={} or anything else equivalent to false.
I can try to fix _try_convert_data, but there are many cases that will be hard to catch. For instance, currently the string "00010" results in the int 10, which is not desired behavior.
I assume that simply adding this fact to the docs may be sufficient for now.
Does this make sense? Any other suggestions?
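(Editorial reproduction of the "00010" case mentioned above, assuming the default orient='columns':)

import pandas as pd

df = pd.read_json('{"val": {"0": "00010"}}')
print(df['val'][0])  # 10 -- the leading zeros are lost in the numeric coercion
print(df.dtypes)     # val    int64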

@lithomas1 lithomas1 added this to the Contributions Welcome milestone Apr 20, 2021
@lithomas1 lithomas1 self-assigned this Apr 20, 2021
@lithomas1 lithomas1 removed their assignment Apr 27, 2021
@mroeschke mroeschke removed the Numeric Operations (Arithmetic, Comparison, and Logical operations) label Jun 19, 2021
@ghost

ghost commented Jul 15, 2022

up

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
fawazahmed0 added a commit to fawazahmed0/pandas that referenced this issue Jul 20, 2024
fawazahmed0 added a commit to fawazahmed0/pandas that referenced this issue Sep 21, 2024
@rhshadrach rhshadrach added this to the 3.0 milestone Sep 23, 2024