Skip to content

ENH: Use IntergerArray to avoid forced conversion from integer to float #27335

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Nov 13, 2019

Conversation

jiangyue12392
Copy link
Contributor

@jiangyue12392 jiangyue12392 commented Jul 11, 2019

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This is a new attempt to solve issue #16918 after an attempt with PR #27073

@jiangyue12392 jiangyue12392 changed the title Use IntergerArray to avoid forced conversion from integer to float ENH: Use IntergerArray to avoid forced conversion from integer to float Jul 11, 2019
@WillAyd
Copy link
Member

WillAyd commented Jul 11, 2019

Is there a way to make this work without adding a keyword to the DataFrame constructor? That approach seems off to me

Also be sure to add tests with this - its not entirely clear that this does anything as is

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiangyue12392 not what I wanted here. sure fixing maybe_convert_numeric to allow return integer_array might close #27283 , but the rest is completely orthogonal and we cannot propagate a keyword thru DataFrame.

@jiangyue12392
Copy link
Contributor Author

@WillAyd @jreback If the original implementation of using DataFrame to do the expansion of list of dict to 2d array and the type determination for columns is to be maintained then it seems to me that there is no way to circumvent propagating a new key in the DataFrame to make it backwards compatible. Is it more desirable to pull out the expansion and type determination logic from DataFrame and have some code duplications?

I will add tests later, but I want to know what is the more accepted methods here first.

@jreback
Copy link
Contributor

jreback commented Jul 11, 2019

@jiangyue12392 we are likely going to do the inference for expanding a int dtype with nulls to integer-na (rather than float) at some point, but currently this is NOT the default. We need some more infrastructure first (see the linked issue).

@jiangyue12392
Copy link
Contributor Author

@jreback Ok, I will wait till the infrastructure is in place after #27283 is sorted out

@jreback
Copy link
Contributor

jreback commented Jul 12, 2019

@jiangyue12392 actually part of this PR could partially satisfy that issue

@jiangyue12392
Copy link
Contributor Author

@jreback I suppose the part where the change in lib.pyx serves the purpose of distinguishing an integer_na. However, on that issue, both @TomAugspurger and @jorisvandenbossche express their concern to keep behaviour consistent with the past where the default datatype is float64. I am not sure how users can choose between the new Int64 behaviour or the old float64 for backward compatibility

@gfyoung gfyoung added Enhancement IO JSON read_json, to_json, json_normalize labels Jul 12, 2019
@jreback
Copy link
Contributor

jreback commented Jul 12, 2019

so what you do is make the return string integer-na (this is a small api change)
there are a small number of places then that would have to interpret this type (as float) for casting purposes

that’s it

@jreback
Copy link
Contributor

jreback commented Sep 8, 2019

@jiangyue12392 can you merge master

@jiangyue12392
Copy link
Contributor Author

@jiangyue12392 can you merge master

Sorry, I have been busy with other stuff recently, will merge as soon as I get my hands free

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

@jiangyue12392 if you can merge master; I think parts of this patch we would want to merge.

@jiangyue12392
Copy link
Contributor Author

@jiangyue12392 if you can merge master; I think parts of this patch we would want to merge.

Master merged. Sorry for the delay.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you just do the internal change in maybe_convert_objects and add a test for that we can get this in. pls merge master as wel.

@@ -397,6 +397,7 @@ def __init__(
columns: Optional[Axes] = None,
dtype: Optional[Dtype] = None,
copy: bool = False,
to_integer_array: bool = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't have this keyword here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

External keywords removed

@@ -1951,7 +1953,7 @@ def maybe_convert_numeric(ndarray[object] values, set na_values,
@cython.wraparound(False)
def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
bint safe=0, bint convert_datetime=0,
bint convert_timedelta=0):
bint convert_timedelta=0, to_integer_array=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with a keyword (here as this is in internal funtion), but can you name
bint convert_to_nullable_integer=False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable name refactored.

@jiangyue12392
Copy link
Contributor Author

if you just do the internal change in maybe_convert_objects and add a test for that we can get this in. pls merge master as wel.

I have made the required changes and added the test.

# GH27335
arr = np.array([2, np.NaN], dtype=object)
result = lib.maybe_convert_objects(arr, convert_to_nullable_integer=1)
from pandas.core.arrays import IntegerArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can import at the top.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shifted import statement

result = lib.maybe_convert_objects(arr, convert_to_nullable_integer=1)
from pandas.core.arrays import IntegerArray

exp = IntegerArray(np.array([2, 0], dtype="i8"), np.array([False, True]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this test with an int64 array as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameterized as requested

from pandas.core.arrays import IntegerArray

exp = IntegerArray(np.array([2, 0], dtype="i8"), np.array([False, True]))
tm.assert_equal(result, exp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use tm.assert_extension_array_equal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertion method changed

@@ -2085,11 +2094,19 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,

if not seen.object_:
if not safe:
if seen.null_:
if seen.null_ or seen.nan_:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate issue is we should change line 2083 to use a DatetimeArray (can be separate PR) or here if it works out.

can you also update the doc-string (well add it really :->) thanks for workign on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave line 2083 out as it is a separate issue.
Which version of the doc-string shall I modify?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment, pls rebase and ping on green.

@@ -1955,7 +1957,8 @@ def maybe_convert_numeric(ndarray[object] values, set na_values,
@cython.wraparound(False)
def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
bint safe=0, bint convert_datetime=0,
bint convert_timedelta=0):
bint convert_timedelta=0,
bint convert_to_nullable_integer=0):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can update this doc-string here e.g. Returns / Parameters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc-string added

@jiangyue12392
Copy link
Contributor Author

small comment, pls rebase and ping on green.

Green now.

@jreback jreback added this to the 1.0 milestone Nov 13, 2019
@jreback jreback merged commit 6b62e50 into pandas-dev:master Nov 13, 2019
@jreback
Copy link
Contributor

jreback commented Nov 13, 2019

thanks @jiangyue12392 nice patch. This should enable us to selectively convert to nullable integers with the user facing routines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants