solve "Int64 with null value mangles large-ish integers" problem #30282

rushabh-v · 2019-12-15T14:29:59Z

closes Int64 with null value mangles large-ish integers #30268
tests added/passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
pd.Series lost precision when given a list containing large integers and np.nan values together with dtype="Int64". The cause was the intermediate conversion with np.array, which infers float64 for such a list. This PR aims to fix that bug.
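A minimal sketch of the precision loss described above, using NumPy only. The value 2**53 + 1 is picked for illustration (it is the smallest positive integer that float64 cannot represent exactly) and is not taken from the linked issue.

```python
import numpy as np

big = 2**53 + 1  # 9007199254740993, not exactly representable as float64

# Without an explicit dtype, NumPy infers float64 because of the NaN,
# so the integer is rounded to the nearest representable float.
as_float = np.array([big, np.nan])

# With dtype=object the original Python int is stored untouched.
as_object = np.array([big, np.nan], dtype=object)

print(as_float.dtype, int(as_float[0]) == big)    # float64 False
print(as_object.dtype, as_object[0] == big)       # object True
```

Going through object dtype trades speed for exactness, which is why the discussion below is about when that conversion is actually needed.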

rushabh-v · 2019-12-15T14:34:35Z

@jorisvandenbossche
I have added conditions for np.ndarray and list, as you suggested in the issue. Please tell me which other data types I should handle and whether any other changes are needed.

simonjayhawkins · 2019-12-15T14:38:26Z

@rushabh-v Thanks for the PR. Can you add a whatsnew entry and a test?

rushabh-v · 2019-12-15T18:01:36Z

The tests are failing with TypeError: object cannot be converted to an IntegerDtype.
See here.

jorisvandenbossche · 2019-12-16T10:41:36Z

@rushabh-v you will need to accept "boolean" as an inferred dtype as well
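For context, a small sketch of what an "inferred dtype" looks like. It assumes the check in coerce_to_array relies on the same inference that pandas exposes publicly as pandas.api.types.infer_dtype; the internal call is not quoted in this thread. The ACCEPTED set is hypothetical and exists only to illustrate the idea of an allow-list.

```python
import numpy as np
from pandas.api.types import infer_dtype

# Hypothetical allow-list for illustration only; the real list lives in
# pandas/core/arrays/integer.py and is not quoted in this thread.
ACCEPTED = {"integer", "floating", "boolean"}

bools = np.array([True, False, True], dtype=object)
strings = np.array(["a", "b"], dtype=object)

for candidate in (bools, strings):
    kind = infer_dtype(candidate, skipna=True)
    # An object array of bools is inferred as "boolean", strings as "string".
    print(kind, "accepted" if kind in ACCEPTED else "rejected")
```

Once every list input is routed through an object array, booleans arrive with object dtype too, so an allow-list without "boolean" would reject inputs that were previously converted without complaint.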

TomAugspurger

Can you add a test and whatsnew note for this?

pandas/core/arrays/integer.py

WillAyd · 2019-12-16T20:31:39Z

pandas/core/arrays/integer.py

@@ -205,7 +205,14 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
            mask = mask.copy()
        return values, mask

-    values = np.array(values, copy=copy)
+    if isinstance(values, list):


How much more difficult would it be to do the proper fix as suggested by @jorisvandenbossche in the original issue?

Not much more difficult I think.
I assume we want to test whether the values are array-like (so basically already have a dtype, e.g. numpy array, Series, Index), and in that case convert to a numpy array preserving the type. Otherwise, always convert to an object array.

I am not fully sure how we do such a check in other places. Maybe just checking for a dtype with hasattr(values, 'dtype')? Or explicitly checking for the different classes?

Maybe that is better for now. Not sure if @jbrockmendel has come across this.

i think we usually check for a dtype attribute, but this might merit a helper function in case corner cases come up. IIRC the last time I tried to write something general-ish I ended up with:

    if not hasattr(data, "dtype"):
        # e.g. list, tuple
        if np.ndim(data) == 0:
            # i.e. generator
            data = list(data)
        data = np.asarray(data)

https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/datetimes.py#L1827
https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/timedeltas.py#L991
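The approach discussed above (keep the dtype of array-likes, send everything else through an object array) could be sketched as follows. The helper name to_ndarray_keep_ints is hypothetical and is not part of pandas; the generator check mirrors the snippet quoted above.

```python
import numpy as np


def to_ndarray_keep_ints(values):
    """Hypothetical helper: convert to ndarray without a lossy float cast."""
    if hasattr(values, "dtype"):
        # Array-likes (ndarray, Series, Index) already carry a dtype: keep it.
        return np.asarray(values)
    if np.ndim(values) == 0:
        # e.g. a generator: materialize it before building the array
        values = list(values)
    # Lists and tuples: object dtype keeps Python ints exact.
    return np.array(values, dtype=object)


big = 2**53 + 1
from_list = to_ndarray_keep_ints([big, np.nan])
from_gen = to_ndarray_keep_ints(x for x in (big, None))
from_arr = to_ndarray_keep_ints(np.array([1, 2], dtype="int64"))

print(from_list.dtype, from_gen.shape, from_arr.dtype)
```

This sketch does not handle scalars, which also have zero dimensions; a real helper would need to decide what to do with them.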

Please tell me if any further changes are needed now.

rushabh-v · 2019-12-17T16:51:48Z

@TomAugspurger Yes, I'll add tests once the checks pass. I have added the whatsnew note above (please tell me if it should go somewhere else).

jreback

this needs tests, even before an impl change

rushabh-v · 2019-12-17T16:57:35Z

Oh Okay, I am adding tests.

pep8speaks · 2019-12-18T16:09:28Z

Hello @rushabh-v! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-30 18:54:09 UTC

pandas/tests/arrays/test_integer.py

simonjayhawkins · 2019-12-19T12:25:08Z

@rushabh-v This may also close #26259?

rushabh-v · 2019-12-19T16:24:21Z

This may also close #26259?

Yes, it's related. But I don't understand why
pd.read_csv(t, dtype={'timestamp': object}) calls integer_array,
whereas pd.read_csv(t, dtype={'timestamp': pd.Int64Dtype()}) does not.

I have checked this locally.

rushabh-v · 2019-12-19T17:04:57Z

The tests are failing. It says,

AssertionError: Attributes of Series are different           
           Attribute "dtype" are different
           [left]:  bool
           [right]: Int64

see here.

simonjayhawkins

Can you add a whatsnew entry?

pandas/tests/arrays/test_integer.py

rushabh-v · 2019-12-20T16:45:54Z

What mistake have I made while adding whatsnew?

rushabh-v · 2019-12-20T20:12:43Z

It says "These conflicts are too complex to resolve in the web editor".

TomAugspurger · 2019-12-20T21:23:52Z

There's a merge conflict (you edited the same line as another pull request). You'll need to fix that as described in https://dev.pandas.io/docs/development/contributing.html#updating-your-pull-request.

rushabh-v · 2019-12-21T06:17:45Z

The tests are failing. It says,
AssertionError: Attributes of Series are different
Attribute "dtype" are different
[left]: bool
[right]: Int64
see here.

These failures started appearing after adding "boolean" as an accepted inferred type.

doc/source/whatsnew/v1.0.0.rst

jorisvandenbossche · 2019-12-23T09:07:02Z

pandas/core/arrays/integer.py

+        if isinstance(values, list):
+            values = np.array(values, dtype=object, copy=copy)
+        else:
+            values = np.array(values, copy=copy)


Is there a reason that this if/else is needed? Can't we do the np.array(values, dtype=object, copy=copy) in all cases?

Both work equally well, and I have updated the code accordingly.
But I am not understanding why try_cast_to_ea(self._values, new_values)
returns int values for new_values=list of bools and self._values=Series of integers.
Find test log here.

But I am not understanding why try_cast_to_ea(self._values, new_values)
returns int values for new_values=list of bools and self._values=Series of integers.

I think it creates an extension array with the dtype of the values in self._values. Can we pass something else as the first argument to get an EA with the dtype of new_values?

The only failing case is when self._values is an integer EA and new_values is a list of booleans: it then returns an EA of ints instead of an EA of bools. Can we just add a condition for that case, or is there any other suggestion, @jorisvandenbossche?

pandas/core/arrays/integer.py

pandas/tests/arrays/test_integer.py

rushabh-v · 2020-01-25T06:01:23Z

@rushabh-v thanks for the link where tests are failing.

@jorisvandenbossche Can we ignore those tests for this PR, Because the error is general and not caused by this PR? Or we should wait for #31108 to be fixed?

jreback · 2020-01-25T16:21:29Z

@rushabh-v thanks for the link where tests are failing.

@jorisvandenbossche Can we ignore those tests for this PR, Because the error is general and not caused by this PR? Or we should wait for #31108 to be fixed?

no cannot ignore tests.

jreback

looks a lot better. why is this failing?

jreback · 2020-01-25T16:22:43Z

pandas/core/arrays/integer.py

@@ -196,7 +196,10 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
            mask = mask.copy()
        return values, mask

-    values = np.array(values, copy=copy)
+    values = np.asarray(values, dtype=getattr(values, "dtype", object))
+    if copy:


you only need to copy if values has a dtype (otherwise np.asarray by definition will copy)
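The copy behaviour this comment relies on can be checked with NumPy alone. The coerce function below is a hypothetical stand-in for the relevant part of coerce_to_array, not the pandas implementation, and it only covers inputs whose dtype NumPy understands.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype="int64")
lst = [1, 2, 3]

# An existing ndarray with a matching dtype is returned as-is: no copy.
same = np.asarray(arr)

# A list has no buffer to share, so a fresh array is always allocated.
fresh = np.asarray(lst, dtype=object)


def coerce(values, copy=False):
    """Hypothetical sketch: copy only when the input could be aliased."""
    had_dtype = hasattr(values, "dtype")
    out = np.asarray(values, dtype=getattr(values, "dtype", object))
    if copy and had_dtype:
        out = out.copy()
    return out


print(same is arr)                                       # True
print(np.shares_memory(coerce(arr, copy=True), arr))     # False
```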

rushabh-v · 2020-01-25T20:15:47Z

np.asarray raises TypeError: data type not understood when the dtype is a pandas dtype (e.g. Int8Dtype(), Int16Dtype(), Int64Dtype(), UInt8Dtype(), etc.).
See:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=26801&view=logs&j=bef1c175-2c1b-51ae-044a-2437c76fc339&t=770e7bb1-09f5-5ebf-b63b-578d2906aac9&l=516

WillAyd · 2020-02-21T12:29:27Z

@rushabh-v IIUC can just map to the numpy equivalent dtype before the asarray call
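One possible reading of this suggestion, as a sketch: the nullable integer dtypes expose their NumPy counterpart through the numpy_dtype attribute, so the dtype can be translated before it reaches np.asarray. The helper name numpy_equivalent is hypothetical and not a pandas function.

```python
import numpy as np
import pandas as pd


def numpy_equivalent(dtype):
    """Hypothetical helper: return a dtype that NumPy understands."""
    # pandas nullable integer dtypes expose their NumPy counterpart
    # through numpy_dtype; plain NumPy dtypes pass through unchanged.
    return getattr(dtype, "numpy_dtype", dtype)


values = [1, 2, 3]
out = np.asarray(values, dtype=numpy_equivalent(pd.Int64Dtype()))
plain = np.asarray(values, dtype=numpy_equivalent(np.dtype("int8")))

print(out.dtype, plain.dtype)  # int64 int8
```

A NumPy integer dtype cannot hold missing values, so this mapping only helps for the data itself; the mask still has to be tracked separately, as coerce_to_array does by returning values and mask.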

WillAyd · 2020-03-25T00:22:22Z

@rushabh-v can you address comments and try to get green?

jorisvandenbossche · 2020-03-29T07:50:12Z

I think this is still blocked by the issues I mentioned in #30282 (comment) ?
I will try to take a look at those the coming week.

jreback · 2020-06-14T15:42:55Z

@rushabh-v I am not sure what the status of this is, as we have had many moving parts under the hood. Can you merge master and update?

WillAyd · 2020-07-29T20:51:51Z

@rushabh-v is this still active?

rushabh-v · 2020-07-31T06:24:17Z

Yes, it's active but the issues still persist!

WillAyd · 2020-09-10T19:10:11Z

@rushabh-v can you see if you can get this green? If so someone can review

rushabh-v · 2020-09-11T11:39:25Z

This PR is blocked by issue #31108

jreback · 2020-11-26T19:04:41Z

closing as blocked by #31108

simonjayhawkins added the Bug and ExtensionArray (Extending pandas with custom dtypes or arrays) labels Dec 15, 2019

TomAugspurger reviewed Dec 16, 2019

WillAyd reviewed Dec 16, 2019

jreback requested changes Dec 17, 2019

simonjayhawkins requested changes Dec 19, 2019

rushabh-v requested review from simonjayhawkins and jreback December 19, 2019 16:04

simonjayhawkins reviewed Dec 19, 2019

rushabh-v requested a review from simonjayhawkins December 20, 2019 15:48

rushabh-v force-pushed the master branch from e4c1853 to d031286 Compare December 22, 2019 17:15

jorisvandenbossche reviewed Dec 23, 2019

rushabh-v requested review from WillAyd, TomAugspurger, jbrockmendel and jorisvandenbossche December 25, 2019 09:29

jreback requested changes Jan 25, 2020

rushabh-v requested a review from jreback January 28, 2020 15:47

rushabh-v force-pushed the master branch from 112eec3 to 8dd9466 Compare June 17, 2020 16:44

rushabh-v closed this Jun 18, 2020

rushabh-v force-pushed the master branch from 8dd9466 to 72aed3e Compare June 18, 2020 11:03

rushabh-v reopened this Jun 18, 2020

merge master

be8e57a

rushabh-v force-pushed the master branch 4 times, most recently from f22c1f4 to 45341e6 Compare July 30, 2020 07:20

fix the dtype=None case

996f62b

rushabh-v force-pushed the master branch from 45341e6 to 996f62b Compare July 30, 2020 18:54

simonjayhawkins removed their request for review October 24, 2020 12:20

jreback closed this Nov 26, 2020

jorisvandenbossche mentioned this pull request Dec 6, 2020

API: add EA._from_scalars / stricter casting of result values back to EA dtype #38315

Closed

jorisvandenbossche mentioned this pull request Dec 30, 2020

BUG: Series constructor with nullable unsigned integer dtype fails with large number #38798

Closed

CFretter mentioned this pull request Feb 18, 2022

Bug: read_csv losing precision when reading Int64 data with N/A values #32134

Closed

