Skip to content

solve "Int64 with null value mangles large-ish integers" problem #30282

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

rushabh-v
Copy link
Contributor

@rushabh-v rushabh-v commented Dec 15, 2019

  • closes Int64 with null value mangles large-ish integers #30268
  • tests added/passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
    The pd.Series was giving less precise output when inputting a list containing large integers and np.nan values and passing dtype="Int64". Which was due to converting it to np.array. This PR aims to solve that bug.

@rushabh-v
Copy link
Contributor Author

rushabh-v commented Dec 15, 2019

@jorisvandenbossche
I have added conditions for np.ndarray and list as you told in the issue. Please tell me which other datatypes should I add as conditions and tell me if any other changes are needed.

@simonjayhawkins simonjayhawkins added Bug ExtensionArray Extending pandas with custom dtypes or arrays. labels Dec 15, 2019
@simonjayhawkins
Copy link
Member

@rushabh-v Thanks for the PR. can you add a whatsnew and test.

@rushabh-v
Copy link
Contributor Author

rushabh-v commented Dec 15, 2019

The Tests are failing.
It says TypeError: object cannot be converted to an IntegerDtype.
see here.

@jorisvandenbossche
Copy link
Member

@rushabh-v you will need to accept "boolean" as well as inferred dtype

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test and whatsnew note for this?

@@ -205,7 +205,14 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
mask = mask.copy()
return values, mask

values = np.array(values, copy=copy)
if isinstance(values, list):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much more difficult would it be to do the proper fix as suggested by @jorisvandenbossche in the original issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much more difficult I think.
I assume we want to test if the values are array-like (so basically already have a dtype, i.e. numpy array, Series, Index, ), and in that case convert to a numpy array preserving the type. And otherwise always convert to object array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not fully sure how we do such a check in other places, maybe just actually checking for dtype? hasattr(values, 'dtype') ? Or explicitly checking for the different classes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that is better for now. Not sure if @jbrockmendel has come across this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we usually check for a dtype attribute, but this might merit a helper function in case corner cases come up. IIRC the last time I tried to write something general-ish I ended up with:

    if not hasattr(data, "dtype"):
        # e.g. list, tuple
        if np.ndim(data) == 0:
            # i.e. generator
            data = list(data)
        data = np.asarray(data)

https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/datetimes.py#L1827
https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/timedeltas.py#L991

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tell me if there are any potential changes now?

@rushabh-v
Copy link
Contributor Author

Yes, I'll add tests once the checks pass and I have added the whatsnew note above(If it should be somewhere else then please tell me.). @TomAugspurger.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs tests, even before an impl change

@rushabh-v
Copy link
Contributor Author

Oh Okay, I am adding tests.

@pep8speaks
Copy link

pep8speaks commented Dec 18, 2019

Hello @rushabh-v! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-30 18:54:09 UTC

@simonjayhawkins
Copy link
Member

@rushabh-v This may also close #26259?

@rushabh-v
Copy link
Contributor Author

This may also close #26259?

Yes It's related to it. But I am not understanding why
pd.read_csv(t, dtype={'timestamp': object}) is calling integer_array,
whereas pd.read_csv(t, dtype={'timestamp': pd.Int64Dtype()}) is not calling integer_array.

I have checked this locally.

@rushabh-v
Copy link
Contributor Author

The tests are failing. It says,

AssertionError: Attributes of Series are different           
           Attribute "dtype" are different
           [left]:  bool
           [right]: Int64

see here.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a what's new.

@rushabh-v
Copy link
Contributor Author

What mistake have I made while adding whatsnew?

@rushabh-v
Copy link
Contributor Author

It says "These conflicts are too complex to resolve in the web editor".

@TomAugspurger
Copy link
Contributor

There's a merge conflict (you edited the same line as another pull request). You'll need to fix that as described in https://dev.pandas.io/docs/development/contributing.html#updating-your-pull-request.

@rushabh-v
Copy link
Contributor Author

The tests are failing. It says,
AssertionError: Attributes of Series are different
Attribute "dtype" are different
[left]: bool
[right]: Int64
see here.

They have started appearing after adding "boolean" as inffered type.

if isinstance(values, list):
values = np.array(values, dtype=object, copy=copy)
else:
values = np.array(values, copy=copy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that this if/else is needed? Can't we do the np.array(values, dtype=object, copy=copy) in all cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They both work equally. And I have updated that.
But I am not understanding why try_cast_to_ea(self._values, new_values)
returns int values for new_values=list of bools and self._values=Series of integers.
Find test log here.

Copy link
Contributor Author

@rushabh-v rushabh-v Dec 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I am not understanding why try_cast_to_ea(self._values, new_values)
returns int values for new_values=list of bools and self._values=Series of integers.

I think it creates an Extension array of type of values in self._values. Can we pass something else there as the first argument in order to get the EA of the type of new_values?

Copy link
Contributor Author

@rushabh-v rushabh-v Dec 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only failing case is that whenever it gets self._values as integer EA and the new_values is a list of boolean then it returns EA of ints instead of EA of bools. Can we just put a condition for that case or is there any other suggestion @jorisvandenbossche

@rushabh-v
Copy link
Contributor Author

rushabh-v commented Jan 25, 2020

@rushabh-v thanks for the link where tests are failing.

@jorisvandenbossche Can we ignore those tests for this PR, Because the error is general and not caused by this PR? Or we should wait for #31108 to be fixed?

@jreback
Copy link
Contributor

jreback commented Jan 25, 2020

@rushabh-v thanks for the link where tests are failing.

@jorisvandenbossche Can we ignore those tests for this PR, Because the error is general and not caused by this PR? Or we should wait for #31108 to be fixed?

no cannot ignore tests.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks a lot better. why is this failing?

@@ -196,7 +196,10 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
mask = mask.copy()
return values, mask

values = np.array(values, copy=copy)
values = np.asarray(values, dtype=getattr(values, "dtype", object))
if copy:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you only need to copy if values has a dtype (otherwise np.assarray by definition will copy)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@rushabh-v
Copy link
Contributor Author

rushabh-v commented Jan 25, 2020

np.asarray raises TypeError: data type not understood when dtypes are pandas dtypes(i.e, Int8Dtype(),Int16Dtype(), Int64Dtype(), UInt8Dtype(), etc).
See,
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=26801&view=logs&j=bef1c175-2c1b-51ae-044a-2437c76fc339&t=770e7bb1-09f5-5ebf-b63b-578d2906aac9&l=516

@rushabh-v rushabh-v requested a review from jreback January 28, 2020 15:47
@WillAyd
Copy link
Member

WillAyd commented Feb 21, 2020

@rushabh-v IIUC can just map to the numpy equivalent dtype before the asarray call

@WillAyd
Copy link
Member

WillAyd commented Mar 25, 2020

@rushabh-v can you address comments and try to get green?

@jorisvandenbossche
Copy link
Member

I think this is still blocked by the issues I mentioned in #30282 (comment) ?
I will try to take a look at those the coming week.

@jreback
Copy link
Contributor

jreback commented Jun 14, 2020

@rushabh-v i am not sure what the status of this as we have had many moving parts under the hood. can you merge master and update.

@WillAyd
Copy link
Member

WillAyd commented Jul 29, 2020

@rushabh-v is this still active?

@rushabh-v rushabh-v force-pushed the master branch 4 times, most recently from f22c1f4 to 45341e6 Compare July 30, 2020 07:20
@rushabh-v
Copy link
Contributor Author

Yes, it's active but the issues still persist!

@WillAyd
Copy link
Member

WillAyd commented Sep 10, 2020

@rushabh-v can you see if you can get this green? If so someone can review

@rushabh-v
Copy link
Contributor Author

rushabh-v commented Sep 11, 2020

This PR is blocked by issue #31108

@simonjayhawkins simonjayhawkins removed their request for review October 24, 2020 12:20
@jreback
Copy link
Contributor

jreback commented Nov 26, 2020

closing as blocked by #31108

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Int64 with null value mangles large-ish integers
8 participants