read_json reads large integers as strings incorrectly if dtype not explicitly mentioned #20608
Hmm, the problem stems from the code at Line 659 in 6d610a4.
It looks like that method converts the parsed object to a float and then to an int, which causes the loss of precision on your large numbers:

```python
In [10]: int(float("9999999999999999"))
Out[10]: 10000000000000000
```

Want to try a fix in a PR?
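(For reference, the cast is lossy because float64 has a 53-bit significand, so not every integer above 2**53 is representable; a quick sketch:)

```python
# float64 stores 53 significand bits, so 16-digit integers are not all
# representable; int(float(x)) silently rounds to the nearest float64.
print(2 ** 53)                        # 9007199254740992
print(int(float(2 ** 53 - 1)))       # 9007199254740991 -- still exact
print(int(float(9999999999999999)))  # 10000000000000000 -- precision lost
```
|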
Sure! I'm new to the pandas source code, but I can try :) |
Feel free to ask any questions here, or if you submit a PR, make sure to reference this issue. There are plenty of helpful people here to get you through the process. |
Actually, I have a few questions.
I am testing with the following command.
I've gone through the following links so far. |
Also (in reference to the above), the test changes 9999999999999999 from int to float at Line 693 in 6d610a4, and then converts it back to int at Line 714 in 6d610a4, which changes its value to 10000000000000000. Just below it is a test for data equivalence, but it is also passing without an issue. If we are passing
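(That equivalence check can pass even though precision was lost, because both sides go through the same lossy round-trip; a minimal sketch, using the era-appropriate pandas.util.testing:)

```python
import pandas as pd
import pandas.util.testing as tm  # pandas.testing in modern versions

# Both frames are built through the same lossy int(float(...)) round-trip,
# so they match each other even though neither holds the original value.
lossy = int(float(9999999999999999))  # 10000000000000000
left = pd.DataFrame({"tid": [lossy]})
right = pd.DataFrame({"tid": [int(float(9999999999999999))]})
tm.assert_frame_equal(left, right)    # passes -- no error raised
```
|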
Adding to point 1: Comparing with the … |
Sorry, I misread your original point, but the below definitely fails for me:

```python
In [3]: json_content="""{
   ...: "0" : {"tid":"9999999999999999"},
   ...: "1" : {"tid":"10000000000000001"}
   ...: }"""

In [4]: exp = pd.DataFrame([9999999999999999, 10000000000000001], columns=['tid'])

In [6]: tm.assert_frame_equal(pd.read_json(json_content), exp)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
```

Perhaps try using the complete example from your original post. As to your solution, once the precision is lost, casting back to int will not work. You'll have to think through another way; maybe we should be attempting int first instead of float? Not saying that's the answer, but think through that.
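One way to sketch that ordering (this is not pandas' actual _try_convert_data, just an illustration; the helper name is made up):

```python
import pandas as pd

def try_int_first(data):
    """Sketch: attempt an exact int64 parse before falling back to float64."""
    try:
        return data.astype("int64")    # exact for integers up to 2**63 - 1
    except (ValueError, TypeError, OverflowError):
        pass
    try:
        return data.astype("float64")  # handles decimals, lossy above 2**53
    except (ValueError, TypeError):
        return data                    # leave as object if neither parse works

s = pd.Series(["9999999999999999", "10000000000000001"], dtype=object)
print(try_int_first(s).tolist())       # [9999999999999999, 10000000000000001]
```
|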
In your example, the … Instead of json_string, I've thought of this more intuitive way using … |
Hmm, I did think about the order of casting types; it seems there has to be an additional case. I'll send a patch when I find a solution, but I would appreciate it if someone else were working on this too. |
I've updated a working test in the main issue at the top |
I haven't looked into your latest issue, but you are making this more complicated than it needs to be. Just have one frame constructed via read_json and the other constructed manually. Don't do any type conversions, printing, or copying. You also don't need to parametrize it.
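Something along these lines (a sketch of the kind of test being described; on affected versions it fails, which is the point):

```python
import pandas as pd
import pandas.util.testing as tm  # pandas.testing in modern versions

def test_read_json_large_ints():
    # One frame via read_json, the other constructed manually -- no
    # conversions, printing, or copying in between.
    json_content = '{"0": {"tid": "9999999999999999"}, "1": {"tid": "10000000000000001"}}'
    result = pd.read_json(json_content, orient="index")
    expected = pd.DataFrame({"tid": [9999999999999999, 10000000000000001]})
    tm.assert_frame_equal(result, expected)
```
|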
@Udayraj123 : Do you have a patch for your original bug? I would submit that as a PR. We can move the discussion about the actual changes / test there. |
@gfyoung : No, I don't have a patch yet. |
I have just encountered this issue as well. Indeed, it seems that the following line is always reached: Line 644 in 6d610a4, which leads to _try_convert_data() as @WillAyd mentioned.
Looking at Line 665 in 6d610a4,
the following now works as expected:

```python
import pandas as pd

json_content = """
{
    "1": {"tid": "9999999999999998"},
    "2": {"tid": "9999999999999999"},
    "3": {"tid": "10000000000000001"},
    "4": {"tid": "10000000000000002"}
}
"""

df = pd.read_json(json_content,
                  orient='index',      # read as transposed
                  convert_axes=False,  # don't convert keys to dates
                  dtype={})
print(df.info())
print(df)
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   tid     4 non-null      object
dtypes: object(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002
```

@WillAyd, what would be the correct way to handle this? Is … What is the role of … |
Thanks for taking a look. |
This is strange behavior. I believe there should be a way to disable this conversion, and it should be documented.
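For what it's worth, read_json does accept dtype=False ("don't infer dtypes at all"), which together with convert_axes=False leaves the parsed values untouched; a small sketch:

```python
import pandas as pd

json_content = '{"0": {"tid": "9999999999999999"}, "1": {"tid": "10000000000000001"}}'

# dtype=False disables dtype inference for the data, so the strings survive
# as object dtype; convert_axes=False likewise leaves the axis labels alone.
df = pd.read_json(json_content, orient="index", dtype=False, convert_axes=False)
print(df["tid"].tolist())  # ['9999999999999999', '10000000000000001']
```
|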
up |
Code Sample (Original Problem)
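A representative reproduction, reconstructed from the values quoted throughout the thread (the exact original sample may have differed):

```python
import pandas as pd

json_content = '{"0": {"tid": "9999999999999999"}, "1": {"tid": "10000000000000001"}}'

df = pd.read_json(json_content, orient="index")
# On affected versions, 9999999999999999 is read back as 10000000000000000.
print(df)
```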
Problem description
I'm using pandas to load JSON data, but found some strange behaviour in the read_json function. In the above code, the integers given as strings aren't read correctly, though there shouldn't be an overflow, as the values are well within the 64-bit integer range.
It reads correctly on explicitly specifying the argument dtype=int, but I don't understand why. What changes when we specify the dtype?
Corresponding SO discussion here:
Current Output
Expected Output
The tid values should have been stored correctly.
A minimal pytest example
Output of pd.show_versions():

```
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```