Int64Dtype in read_csv leads to unexpected values #26259

alohr · 2019-05-01T16:54:24Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import io

t = io.StringIO('''\
event,timestamp
a,1556559573141592653
b,1556559573141592654
c,
d,1556559573141592655
''')

# Reading the timestamps as strings works fine
print("\nExpected output:")
print(pd.read_csv(t, dtype={'timestamp': object}))

# Now with Int64Dtype
t.seek(0)
print("\nActual output:")
print(pd.read_csv(t, dtype={'timestamp': pd.Int64Dtype()}))

Problem description

I would like to read CSV files with nullable (big) integers into a dataframe. The integers represent nanoseconds since the UNIX epoch (1970-01-01). Using the Int64Dtype introduced in 0.24.0 seems like the way to go. I quote from the user guide:

https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions

If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas

Expected Output

  event            timestamp
0     a  1556559573141592653
1     b  1556559573141592654
2     c                  NaN
3     d  1556559573141592655

Actual Output

  event            timestamp
0     a  1556559573141592576
1     b  1556559573141592576
2     c                  NaN
3     d  1556559573141592576
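
The pattern in the actual output is consistent with each value passing through a 64-bit float at some point. That is an assumption about the cause, not something confirmed from the pandas source here. The following pandas-free sketch round-trips the three reported timestamps through Python's `float` (an IEEE 754 double); `via_float64` is a hypothetical helper name used only for illustration.

```python
import math

# Timestamps from the report (nanoseconds since the UNIX epoch).
expected = [1556559573141592653, 1556559573141592654, 1556559573141592655]


def via_float64(n: int) -> int:
    """Hypothetical helper: round-trip an integer through a 64-bit float."""
    return int(float(n))


observed = [via_float64(n) for n in expected]

# Gap between adjacent representable floats at this magnitude
# (math.ulp requires Python 3.9 or later).
spacing = math.ulp(float(expected[0]))

print(observed)  # all three collapse to 1556559573141592576
print(spacing)   # 256.0
```

At this magnitude adjacent doubles are 256 apart, so three consecutive integers all land on the same representable value, which matches the table above.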

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


WillAyd · 2019-05-01T17:51:20Z

Yeah, I think there may be some undesired casting to float64 going on behind the scenes. Looks OK up to 2^53, but beyond that things get dicey:

In [15]: 2**53
Out[15]: 9007199254740992

t = io.StringIO('''\
event,timestamp
a,9007199254740992
b,9007199254740993
c,
d,9007199254740994
''')

In [20]: print(pd.read_csv(t, dtype={'timestamp': pd.Int64Dtype()}))
  event         timestamp
0     a  9007199254740992
1     b  9007199254740992
2     c               NaN
3     d  9007199254740994

Investigation and PRs are certainly welcome
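
The 2^53 boundary in the example above can be checked without pandas. This is a minimal sketch, assuming the loss comes from a float64 intermediate as suggested in this comment; `survives_float64` is a hypothetical name, not a pandas function.

```python
def survives_float64(n: int) -> bool:
    """Return True if n is unchanged after a round-trip through a 64-bit float."""
    return int(float(n)) == n


# The three values from the example: 2**53, 2**53 + 1, 2**53 + 2.
for n in (2**53, 2**53 + 1, 2**53 + 2):
    print(n, survives_float64(n))

# Every integer in a window just below the boundary is represented exactly.
below_ok = all(survives_float64(n) for n in range(2**53 - 1000, 2**53 + 1))
print(below_ok)
```

Above 2^53 only every second integer is representable, which is why 9007199254740993 comes back as 9007199254740992 while 9007199254740994 is preserved.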

phofl · 2022-02-16T21:15:11Z

Duplicate of #32134

WillAyd added the Bug, Dtype Conversions, and ExtensionArray labels May 1, 2019

WillAyd added this to the Contributions Welcome milestone May 1, 2019

gfyoung added the IO CSV label May 2, 2019

WillAyd mentioned this issue May 3, 2019

WIP: Maintain Int64 Precision on Construction #26272

Closed

4 tasks

simonjayhawkins mentioned this issue Dec 19, 2019

solve "Int64 with null value mangles large-ish integers" problem #30282

Closed

5 tasks

jbrockmendel added the NA - MaskedArrays label Jan 8, 2022

phofl marked this as a duplicate of #32134 Feb 16, 2022

phofl closed this as completed Feb 16, 2022

phofl modified the milestones: Contributions Welcome, No action Feb 16, 2022

JanEricNitschke mentioned this issue Jan 21, 2023

attackerSteamIDs in DF['damages'] rounding error if there is any World or C4 damage done pnxenopoulos/awpy#213

Closed
