Int64Dtype in read_csv leads to unexpected values #26259

Closed
alohr opened this issue May 1, 2019 · 2 comments
Labels
Bug; Dtype Conversions (unexpected or buggy dtype conversions); ExtensionArray (extending pandas with custom dtypes or arrays); IO CSV (read_csv, to_csv); NA - MaskedArrays (related to pd.NA and nullable extension arrays)

Comments


alohr commented May 1, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
import io

t = io.StringIO('''\
event,timestamp
a,1556559573141592653
b,1556559573141592654
c,
d,1556559573141592655
''')

# Reading the timestamps as strings works fine
print("\nExpected output:")
print(pd.read_csv(t, dtype={'timestamp': object}))

# Now with Int64Dtype
t.seek(0)
print("\nActual output:")
print(pd.read_csv(t, dtype={'timestamp': pd.Int64Dtype()}))

Problem description

I would like to read CSV files containing nullable (big) integers into a DataFrame. The integers represent nanoseconds since the UNIX epoch (1970-01-01). Using the Int64Dtype introduced in 0.24.0 seems like the way to go. I quote from the FAQ:

https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions

If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas

Expected Output

  event            timestamp
0     a  1556559573141592653
1     b  1556559573141592654
2     c                  NaN
3     d  1556559573141592655

Actual Output

  event            timestamp
0     a  1556559573141592576
1     b  1556559573141592576
2     c                  NaN
3     d  1556559573141592576
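
For comparison, the same data can be parsed with the standard library alone, where every value stays a Python int and never passes through float64. This is only a sketch for checking what the exact values should be: `parse_timestamps` is a hypothetical helper, not part of pandas and not something proposed in this report.

```python
import csv
import io


def parse_timestamps(text):
    """Return (event, timestamp) pairs; empty timestamp fields become None.

    Hypothetical helper for illustration only. Values are converted with
    int(), so no precision is lost for large integers.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["timestamp"]
        rows.append((row["event"], int(raw) if raw else None))
    return rows


data = (
    "event,timestamp\n"
    "a,1556559573141592653\n"
    "b,1556559573141592654\n"
    "c,\n"
    "d,1556559573141592655\n"
)

rows = parse_timestamps(data)
for event, ts in rows:
    print(event, ts)
```

The three non-missing timestamps keep their distinct final digits here, which is the behaviour expected from the Int64Dtype column.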

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

WillAyd (Member) commented May 1, 2019

Yeah, I think there may be some undesired casting to float64 going on behind the scenes. Values look OK up to 2^53, but beyond that things get dicey:

In [15]: 2**53
Out[15]: 9007199254740992

t = io.StringIO('''\
event,timestamp
a,9007199254740992
b,9007199254740993
c,
d,9007199254740994
''')

In [20]: print(pd.read_csv(t, dtype={'timestamp': pd.Int64Dtype()}))
  event         timestamp
0     a  9007199254740992
1     b  9007199254740992
2     c               NaN
3     d  9007199254740994

Investigation and PRs are certainly welcome
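
The rounding seen in both examples matches what a round trip through float64 would produce, which supports the casting hypothesis without proving where in the parser it happens. A minimal sketch in plain Python, assuming only that Python's `float` is an IEEE 754 binary64 value (`via_float64` is a name made up for this illustration):

```python
def via_float64(n):
    """Round-trip an integer through a Python float (IEEE 754 binary64)."""
    return int(float(n))


# float64 has a 53-bit significand, so above 2**53 not every integer
# is representable and odd neighbours collapse onto even ones.
small = [via_float64(2**53 + k) for k in range(3)]
print(small)

# The timestamps from the original report are near 2**60, where adjacent
# float64 values are 256 apart, so all three land on the same value.
timestamps = [1556559573141592653, 1556559573141592654, 1556559573141592655]
rounded = [via_float64(ts) for ts in timestamps]
print(rounded)
```

Both results agree with the values printed by read_csv above: 9007199254740993 becomes 9007199254740992, and all three timestamps become 1556559573141592576.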

WillAyd added the Bug, Dtype Conversions, and ExtensionArray labels May 1, 2019
WillAyd added this to the Contributions Welcome milestone May 1, 2019
gfyoung added the IO CSV label May 2, 2019
jbrockmendel added the NA - MaskedArrays label Jan 8, 2022
phofl (Member) commented Feb 16, 2022

Duplicate of #32134

phofl marked this as a duplicate of #32134 Feb 16, 2022
phofl closed this as completed Feb 16, 2022
phofl modified the milestones: Contributions Welcome, No action Feb 16, 2022
5 participants