-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
solve "Int64 with null value mangles large-ish integers" problem #30282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jorisvandenbossche |
@rushabh-v Thanks for the PR. can you add a whatsnew and test. |
The Tests are failing. |
@rushabh-v you will need to accept "boolean" as well as inferred dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test and whatsnew note for this?
pandas/core/arrays/integer.py
Outdated
@@ -205,7 +205,14 @@ def coerce_to_array(values, dtype, mask=None, copy=False): | |||
mask = mask.copy() | |||
return values, mask | |||
|
|||
values = np.array(values, copy=copy) | |||
if isinstance(values, list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much more difficult would it be to do the proper fix as suggested by @jorisvandenbossche in the original issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not much more difficult I think.
I assume we want to test if the values
are array-like (so basically already have a dtype
, i.e. numpy array, Series, Index, ), and in that case convert to a numpy array preserving the type. And otherwise always convert to object array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not fully sure how we do such a check in other places, maybe just actually checking for dtype? hasattr(values, 'dtype')
? Or explicitly checking for the different classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe that is better for now. Not sure if @jbrockmendel has come across this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we usually check for a dtype
attribute, but this might merit a helper function in case corner cases come up. IIRC the last time I tried to write something general-ish I ended up with:
if not hasattr(data, "dtype"):
# e.g. list, tuple
if np.ndim(data) == 0:
# i.e. generator
data = list(data)
data = np.asarray(data)
https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/datetimes.py#L1827
https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/timedeltas.py#L991
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tell me if there are any potential changes now?
Yes, I'll add tests once the checks pass and I have added the whatsnew note above(If it should be somewhere else then please tell me.). @TomAugspurger. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs tests, even before an impl change
Oh Okay, I am adding tests. |
Hello @rushabh-v! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-07-30 18:54:09 UTC |
@rushabh-v This may also close #26259? |
Yes It's related to it. But I am not understanding why I have checked this locally. |
The tests are failing. It says,
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a what's new.
What mistake have I made while adding whatsnew? |
It says "These conflicts are too complex to resolve in the web editor". |
There's a merge conflict (you edited the same line as another pull request). You'll need to fix that as described in https://dev.pandas.io/docs/development/contributing.html#updating-your-pull-request. |
They have started appearing after adding "boolean" as inffered type. |
pandas/core/arrays/integer.py
Outdated
if isinstance(values, list): | ||
values = np.array(values, dtype=object, copy=copy) | ||
else: | ||
values = np.array(values, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that this if/else is needed? Can't we do the np.array(values, dtype=object, copy=copy)
in all cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They both work equally. And I have updated that.
But I am not understanding why try_cast_to_ea(self._values, new_values)
returns int values for new_values=list of bools
and self._values=Series of integers
.
Find test log here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I am not understanding why try_cast_to_ea(self._values, new_values)
returns int values for new_values=list of bools and self._values=Series of integers.
I think it creates an Extension array of type of values in self._values. Can we pass something else there as the first argument in order to get the EA of the type of new_values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only failing case is that whenever it gets self._values as integer EA and the new_values is a list of boolean then it returns EA of ints instead of EA of bools. Can we just put a condition for that case or is there any other suggestion @jorisvandenbossche
@jorisvandenbossche Can we ignore those tests for this PR, Because the error is general and not caused by this PR? Or we should wait for #31108 to be fixed? |
no cannot ignore tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks a lot better. why is this failing?
pandas/core/arrays/integer.py
Outdated
@@ -196,7 +196,10 @@ def coerce_to_array(values, dtype, mask=None, copy=False): | |||
mask = mask.copy() | |||
return values, mask | |||
|
|||
values = np.array(values, copy=copy) | |||
values = np.asarray(values, dtype=getattr(values, "dtype", object)) | |||
if copy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you only need to copy if values has a dtype (otherwise np.assarray
by definition will copy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
np.asarray raises |
@rushabh-v IIUC can just map to the numpy equivalent dtype before the asarray call |
@rushabh-v can you address comments and try to get green? |
I think this is still blocked by the issues I mentioned in #30282 (comment) ? |
@rushabh-v i am not sure what the status of this as we have had many moving parts under the hood. can you merge master and update. |
@rushabh-v is this still active? |
f22c1f4
to
45341e6
Compare
Yes, it's active but the issues still persist! |
@rushabh-v can you see if you can get this green? If so someone can review |
This PR is blocked by issue #31108 |
closing as blocked by #31108 |
The
pd.Series
was giving less precise output when inputting a list containing large integers and np.nan values and passingdtype="Int64"
. Which was due to converting it tonp.array
. This PR aims to solve that bug.