WIP: Maintain Int64 Precision on Construction #26272

Closed
wants to merge 3 commits into from
18 changes: 14 additions & 4 deletions pandas/core/arrays/integer.py
@@ -1,11 +1,12 @@
import copy
import sys
-from typing import Type
+from typing import Sequence, Type
import warnings

import numpy as np

from pandas._libs import lib
from pandas._typing import Dtype
from pandas.compat import set_function_name
from pandas.util._decorators import cache_readonly

@@ -304,9 +305,18 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
return integer_array(scalars, dtype=dtype, copy=copy)

@classmethod
-    def _from_sequence_of_strings(cls, strings, dtype=None, copy=False):
-        scalars = to_numeric(strings, errors="raise")
-        return cls._from_sequence(scalars, dtype, copy)
+    def _from_sequence_of_strings(cls,
+                                  strings: Sequence[str],
+                                  dtype: Dtype = None,
+                                  copy: bool = False) -> 'IntegerArray':
+        # Mask the NA location before sending to to_numeric to prevent
+        # undesirable cast to float which may lose precision
+        mask = isna(strings)
+        masked_strings = np.where(mask, 0, strings)

Contributor:
I think we can do this instead in to_numeric itself.
Member Author:
Was thinking that too, but the downside is we'd have to return both an array of values and a mask from that method; otherwise we'd keep hitting the same issue (unless we have pd.to_numeric construct the Int64 array directly).

Contributor:
> (unless we have pd.to_numeric construct the Int64 array directly)

Yes, that is a desirable outcome.

+        scalars = to_numeric(masked_strings, errors="raise")
+
+        return IntegerArray(scalars, mask)

@classmethod
def _from_factorized(cls, values, original):
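The masking technique in the patch above can be illustrated outside the class. A minimal sketch, assuming a recent pandas where `pd.arrays.IntegerArray` is publicly exposed (the PR itself applies this inside `_from_sequence_of_strings`):

```python
import numpy as np
import pandas as pd

strings = np.array(["1556559573141592653", None], dtype=object)

# Naive path: converting data that contains NA goes through float64,
# which cannot represent this 19-digit integer exactly.
as_float = pd.to_numeric(pd.Series(strings), errors="coerce")
assert int(as_float.iloc[0]) != 1556559573141592653  # precision lost

# Patch's approach: mask the NA positions first so to_numeric only
# sees integer strings and stays in int64, then rebuild with the mask.
mask = pd.isna(strings)
masked_strings = np.where(mask, "0", strings)
scalars = pd.to_numeric(masked_strings, errors="raise")
arr = pd.arrays.IntegerArray(scalars, mask)
assert arr[0] == 1556559573141592653  # precision kept
```

The placeholder `0` written at masked positions is never observed, because the mask marks those slots as missing in the resulting IntegerArray.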
24 changes: 24 additions & 0 deletions pandas/tests/io/parser/test_dtypes.py
@@ -509,3 +509,27 @@ def test_numeric_dtype(all_parsers, dtype):

result = parser.read_csv(StringIO(data), header=None, dtype=dtype)
tm.assert_frame_equal(expected, result)


def test_intna_precision(all_parsers):
Member:
Going to need several more tests to check this behavior, but we can revisit this point after the actual implementation has been critiqued.

parser = all_parsers
data = "1556559573141592653\n1556559573141592654\n\n1556559573141592655"
dtype = 'Int64'

expected = DataFrame([
[1556559573141592653],
[1556559573141592654],
[0],
Member Author:
Just to clarify the below TODO: if you use np.nan, the precision is lost before the DataFrame gets constructed, hence why I've constructed this with just integers first and subsequently assigned nan here.

[1556559573141592655],
], dtype=dtype)
expected.iloc[2] = np.nan # TODO: fix general bug on df construction

result = parser.read_csv(StringIO(data), header=None, dtype=dtype,
skip_blank_lines=False)

tm.assert_frame_equal(result, expected)
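The construction workaround described in the comment above can be shown in isolation. A small sketch, mirroring the test's values; the NA-assignment behavior is as observed on current pandas:

```python
import numpy as np
import pandas as pd

# Build the frame from integers only, so construction never round-trips
# through float64, then assign the missing value afterwards.
expected = pd.DataFrame(
    [[1556559573141592653],
     [1556559573141592654],
     [0],
     [1556559573141592655]],
    dtype="Int64",
)
expected.iloc[2] = np.nan  # stored as missing in the Int64 column

assert expected.iloc[0, 0] == 1556559573141592653  # precision intact
assert pd.isna(expected.iloc[2, 0])
```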

# See why tm.assert_frame_equal doesn't fail...
Member:
Not sure I follow what's going on here exactly?

Member Author:
I didn't dig through the assert_* methods, but I'm pretty sure there is internal casting to float64 going on, so you can lose precision pretty easily.

Here's an example that passes without issue:

>>> ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')
>>> ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')
>>> tm.assert_series_equal(ser1, ser2)  # should fail but doesn't

Member Author:
Totally. I think quite a few more need to go into test_integers for construction.

When I started this I wasn't expecting the precision issue with DataFrame construction and the assert methods. It ended up being a Pandora's box pretty quickly, so I'm posting this for discussion to see how we want to tackle the various components.

Member (@gfyoung, May 3, 2019):
Hmm... that's awkward. We should open an issue about this. Our handling of large numbers is definitely more bug-prone across the codebase, relatively speaking.

Member:
FWIW, I can get assert_series_equal to work if check_exact=True is passed. My guess is that the float conversion happens due to defaulting to checking equality up to a given precision:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+476.g9feb3ad92c'

In [2]: ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')

In [3]: ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')

In [4]: pd.util.testing.assert_series_equal(ser1, ser2)

In [5]: pd.util.testing.assert_series_equal(ser1, ser2, check_exact=True)
---------------------------------------------------------------------------
AssertionError: Series are different

Series values are different (50.0 %)
[left]:  [1556559573141592653, 1556559573141592654]
[right]: [1556559573141592654, 1556559573141592654]

I guess it could make sense to default to exact checks for integer dtypes instead of doing precision checks, but it doesn't really seem like a clean solution since it'd require implementing dtype-specific behavior.

assert result.iloc[0] == expected.iloc[0]
assert result.iloc[1] == expected.iloc[1]
assert result.iloc[3] == expected.iloc[3]