
read_json reads large integers as strings incorrectly if dtype not explicitly mentioned #20608


Closed
Udayraj123 opened this issue Apr 4, 2018 · 18 comments · Fixed by #59284
Labels: Bug, IO JSON (read_json, to_json, json_normalize)
Milestone: 3.0

@Udayraj123

Udayraj123 commented Apr 4, 2018

Code Sample (Original Problem)

import pandas as pd

json_content = """
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
df = pd.read_json(json_content,
                  orient='index',       # read as transposed
                  convert_axes=False,   # don't convert keys to dates
                  )
print(df.info())
print(df)

Problem description

I'm using pandas to load JSON data, but found some strange behaviour in the read_json function.
In the code above, the integers given as strings aren't read correctly, even though there shouldn't be any overflow: the values are well within the int64 range.

It reads correctly on explicitly specifying the argument dtype=int, but I don't understand why. What changes when we specify the dtype?
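For reference, this is the variant with the explicit dtype that reads the values correctly (a sketch based on the description above; json_content is the sample string, df_ok just an illustrative name):

df_ok = pd.read_json(json_content,
                     orient='index',
                     convert_axes=False,
                     dtype=int)          # per the above: skips the lossy float step
print(df_ok['tid'].tolist())
# [9999999999999998, 9999999999999999, 10000000000000001, 10000000000000002]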

Corresponding SO discussion here:

Current Output

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2  10000000000000000
3  10000000000000000
4  10000000000000002

Expected Output

The tid values should have been preserved exactly:

None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002

A minimal pytest example

import pytest
from pandas import DataFrame, read_json
from pandas.util import testing as tm

@pytest.mark.parametrize('dtype', ['int'])
def test_large_ints_from_json_strings(dtype):
    # GH 20608
    df1 = DataFrame([9999999999999999, 10000000000000001], columns=['tid'])
    df_temp = df1.copy().astype(str)
    df2 = read_json(df_temp.to_json())
    assert (df1 == df2).all().all()  # currently False

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Apr 4, 2018

Hmm, well, the problem stems from the code in _try_convert_data:

def _try_convert_data(self, name, data, use_dtypes=True,

It looks like that method converts the parsed object to a float and then to an int, which causes the loss of precision on your large numbers:

In [10]: int(float("9999999999999999"))
Out[10]: 10000000000000000
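(Editorial context: the loss is inherent to IEEE-754 doubles, not pandas-specific. float64 has a 53-bit significand, so not every integer above 2**53 is representable:)

>>> 2 ** 53
9007199254740992
>>> float(9999999999999999)   # rounds to the nearest representable double
1e+16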

Want to try a fix in a PR?

@Udayraj123
Author

Sure! I'm new to the pandas source code, but I can try :)

@WillAyd
Member

WillAyd commented Apr 5, 2018

Feel free to ask any questions here, or if you submit a PR, make sure to reference this issue. There are plenty of helpful people here to get you through the process.

@Udayraj123
Author

Udayraj123 commented Apr 5, 2018

Actually, I have a few questions:

  1. assert_frame_equal is also seemingly failing to detect the difference between 9999999999999999 and 10000000000000000. Currently I've made a (temporary) test file at pandas/tests/io/json/my_test.py which contains the following code:
import pytest
from pandas import DataFrame, read_json
from pandas.util import testing as tm

@pytest.fixture
def input_json():
    json_content = """
    {
        "0": {"tid": "9999999999999999"},
        "1": {"tid": "10000000000000001"}
    }
    """
    return json_content

@pytest.fixture
def expected_df():
    return DataFrame([9999999999999999, 10000000000000001], columns=['tid'])


@pytest.mark.parametrize('dtype', ['int64', 'int'])
def test_large_ints_from_json_strings(dtype, input_json, expected_df):
    # GH 20608
    # fixtures are requested as arguments rather than called directly
    data = read_json(input_json, orient='index')
    # data = read_json(input_json, orient='index', dtype={'tid': dtype})
    new_data = expected_df
    print('')
    print(new_data)
    print(data)
    tm.assert_frame_equal(new_data, data)

I am testing with the following command:
pytest my_test.py -v -s --full-trace
and getting PASSED output. Is this expected? If yes, how can I verify that the correct dataframe is generated after a fix?

  2. After finalizing, I'm thinking of placing this into pandas/tests/io/json/test_pandas.py; the function name would be test_large_ints_from_json_strings. Does this seem fine?

I've gone through the following links so far:
https://pandas.pydata.org/pandas-docs/stable/contributing.html
https://github.com/pandas-dev/pandas/wiki/Testing

@Udayraj123
Author

Also (in reference to the above), during the test the value 9999999999999999 is changed from int to float at line 693:

data = data.astype('float64')

and then converted back to int at this line:

new_data = data.astype('int64')

which changes its value to 10000000000000000. Just below that is a check for data equivalence, but it also passes without an issue.

If we pass dtype={'tid': int}, it returns directly as int without converting to float. But I'm assuming we can't do the same in the above case (as it may affect pandas' default type conversions)?
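(Editorial sketch reproducing that astype round trip in isolation, mirroring the two lines quoted above:)

import pandas as pd

s = pd.Series(["9999999999999999"])
# the intermediate float64 step rounds the value; str -> int64 directly would be exact
print(s.astype('float64').astype('int64')[0])  # 10000000000000000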

@WillAyd
Member

WillAyd commented Apr 5, 2018

  1. Just in general, for your test you don't need a fixture unless you plan on reusing it elsewhere. For now, just put all of your items in the method definition. And for what it's worth, shouldn't your test be failing? Isn't that the bug you are trying to fix?

  2. Yes, I think that location is fine.

  3. That equivalence test is performed against a data variable that was already cast to float, so hasn't the precision already been lost?

@Udayraj123
Author

Udayraj123 commented Apr 5, 2018

  1. Okay, I was just trying that out 😄. And yes, it should be failing: the two dataframes are different (one with 999… and the other with 100…), but assert_frame_equal is not detecting that. Can you please show me a sample way of testing that those two dataframes are different?

  2. Cool!

  3. Yes, the precision issue is there. Maybe we can cast it back to int and then compare, is that right?

Adding to point 1: comparing with (new_data == data).all() gives the correct result (False), while assert_frame_equal still passes the test. I don't understand why:

def test_large_ints_from_json_strings(dtype):
    json_content = """
    {
        "0": {"tid": "9999999999999999"},
        "1": {"tid": "10000000000000001"}
    }
    """
    data = read_json(json_content, orient='index')
    new_data = DataFrame([9999999999999999, 10000000000000001], columns=['tid'])
    # new_data = new_data.astype('int64')
    print((new_data == data).all())        # False
    tm.assert_frame_equal(new_data, data)  # PASSED?
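(Editorial note, not from the thread: a likely explanation is that assert_frame_equal defaults to approximate value comparison (check_exact=False), and the relative difference between 9999999999999999 and 10000000000000000 is about 1e-16, far below the default tolerance. Forcing exact comparison should surface the mismatch:)

tm.assert_frame_equal(new_data, data, check_exact=True)  # should raise AssertionError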

@WillAyd
Member

WillAyd commented Apr 5, 2018

Sorry, I misread your original point, but the below definitely fails for me:

In [3]: json_content="""{ 
   ...:     "0" : {"tid":"9999999999999999"},
   ...:     "1" : {"tid":"10000000000000001"}
   ...: }"""
In [4]: exp = pd.DataFrame([9999999999999999,10000000000000001],columns=['tid'])
In [6]: tm.assert_frame_equal(pd.read_json(json_content), exp)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

Perhaps try using the complete example from your original post.

As to your solution: once the precision is lost, casting back to int will not work. You'll have to think of another way - maybe we should attempt int first instead of float? Not saying that's the answer, but think it through.
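(Editorial sketch of the "int first" idea raised above - an illustration only, not the actual pandas patch: attempt an exact int64 conversion before falling back to float64. The helper name is hypothetical.)

import pandas as pd

def try_convert_ints_first(data):
    # Sketch: object/str -> int64 is exact per element, so try it before any float cast.
    try:
        return data.astype('int64')
    except (ValueError, TypeError, OverflowError):
        pass
    try:
        return data.astype('float64')  # fallback, e.g. for decimals or missing values
    except (ValueError, TypeError):
        return data                    # leave unconverted if nothing applies

print(try_convert_ints_first(pd.Series(['9999999999999999']))[0])  # 9999999999999999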

@Udayraj123
Author

Udayraj123 commented Apr 5, 2018

In your example, orient='index' is missing; it transposes the dataframe (making tid a column). So the assert is failing because the transpose isn't applied.

Instead of a raw JSON string, I've thought of this more intuitive way using to_json() (which is still passing when it shouldn't):

@pytest.mark.parametrize('dtype', ['int'])
def test_large_ints_from_json_strings(dtype):
    # GH 20608
    df1 = DataFrame([9999999999999999, 10000000000000001], columns=['tid'])
    df_temp = df1.copy().astype(str)
    df2 = read_json(df_temp.to_json())
    print((df1 == df2).all())        # False
    tm.assert_frame_equal(df1, df2)  # PASSED?

Hmm, I did think about the order of casting types; it seems there has to be an additional case. I'll send a patch when I find a solution, but I would appreciate it if someone else worked on this too.

@Udayraj123
Author

I've updated a working test in the main issue at the top.

@WillAyd
Member

WillAyd commented Apr 5, 2018

I haven't looked into your latest issue, but you are making this more complicated than it needs to be. Just have one frame constructed via read_json and the other constructed manually, along the lines of the sketch below. Don't do any type conversions, printing, or copying. You also don't need to parametrize it.
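(Editorial sketch of that shape - not the final committed test; check_exact=True avoids the approximate comparison discussed earlier, and at this point the test would still fail, which is the point:)

def test_read_json_large_ints_from_strings():
    # GH 20608: integers given as JSON strings should round-trip exactly
    result = read_json('{"tid": {"0": "9999999999999999", "1": "10000000000000001"}}')
    expected = DataFrame({'tid': [9999999999999999, 10000000000000001]})
    tm.assert_frame_equal(result, expected, check_exact=True)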

@gfyoung gfyoung added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and IO JSON (read_json, to_json, json_normalize) labels Apr 10, 2018
@gfyoung
Member

gfyoung commented Apr 10, 2018

@Udayraj123 : Do you have a patch for your original bug? I would submit that as a PR. We can move the discussion about the actual changes / test there.

@Udayraj123
Author

@gfyoung : No I don't have a patch yet.

@eavidan

eavidan commented May 27, 2020

I have just encountered this issue as well. Indeed, it seems that the following line is always reached:

self._try_convert_types()

which leads to _try_convert_data(), as @WillAyd mentioned.

Looking at _try_convert_data(), a workaround would be to supply read_json with an empty dtype dictionary; this way the data conversion will not happen:

if self.dtype is False:

The following now works as expected:

import pandas as pd

json_content = """
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
df = pd.read_json(json_content,
                  orient='index',      # read as transposed
                  convert_axes=False,  # don't convert keys to dates
                  dtype={}             # disable dtype coercion
                  )
print(df.info())
print(df)

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tid     4 non-null      object
dtypes: object(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002
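(Editorial follow-up to this workaround: once the column comes back as object, an explicit cast runs Python's exact str-to-int conversion, with no float detour:)

df['tid'] = df['tid'].astype('int64')
print(df['tid'].tolist())
# [9999999999999998, 9999999999999999, 10000000000000001, 10000000000000002]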

@WillAyd, what would be the correct way to handle this? Is convert_axes=False the correct option to control this conversion?

What is the role of self._try_convert_types(), given that _try_convert_data is already executed in _convert_axes?

@WillAyd
Member

WillAyd commented May 27, 2020

Thanks for taking a look. convert_axes should be unrelated. The problem should be in _try_convert_data, which immediately converts to float and from there to integer. I'm not sure why it does this, so if you want to try your hand at a PR, you might have some luck refactoring that method.

@eavidan

eavidan commented May 31, 2020

This is strange behavior. I believe there should be a way to disable this conversion, and it should be documented.
Right now it is not obvious that you can disable it simply by passing dtype={} or anything else equivalent to false.
I can try to fix _try_convert_data, but there are many cases that will be hard to catch. For instance, currently the string "00010" results in the int 10, which is not desired behavior.
I assume that simply adding this fact to the docs may be sufficient for now.
Does this make sense? Any other suggestions?
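(Editorial reproduction of the "00010" case mentioned above, assuming the default orient='columns':)

import pandas as pd

df = pd.read_json('{"val": {"0": "00010"}}')
print(df['val'][0])  # 10 -- the leading zeros are lost in the numeric coercion
print(df.dtypes)     # val    int64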

@lithomas1 lithomas1 added this to the Contributions Welcome milestone Apr 20, 2021
@lithomas1 lithomas1 self-assigned this Apr 20, 2021
@lithomas1 lithomas1 removed their assignment Apr 27, 2021
@mroeschke mroeschke removed the Numeric Operations (Arithmetic, Comparison, and Logical operations) label Jun 19, 2021
@ghost

ghost commented Jul 15, 2022

up

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
fawazahmed0 added a commit to fawazahmed0/pandas that referenced this issue Jul 20, 2024
fawazahmed0 added a commit to fawazahmed0/pandas that referenced this issue Sep 21, 2024
@rhshadrach rhshadrach added this to the 3.0 milestone Sep 23, 2024