Skip to content

BUG: use EA.astype in ExtensionBlock.to_native_types #28841

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jorisvandenbossche
Copy link
Member

Closes #28840

This implements a ExtensionBlock.to_native_types to override the base class which converts to a numpy array.

To test this is a bit tricky. The current JSONArray we have in the extension tests does not have this issue. I would need to implement and add tests for a new ListArray to check this (I can ensure that this fixes the geopandas issue)

@jorisvandenbossche jorisvandenbossche added Bug ExtensionArray Extending pandas with custom dtypes or arrays. labels Oct 8, 2019
@jorisvandenbossche
Copy link
Member Author

It is also annoying that this does not work for eg SparseArray, so currently I am falling back to the base implementation in case of any error ...

def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs):
"""override to use ExtensionArray astype for the conversion"""
try:
values = self.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of this can be build outside of the try/except right?

@jbrockmendel
Copy link
Member

to_native_types has always been a weird name to me. in practice it seems to be used for preparing __repr__ output, but elsewhere "native" refers to something like C types. Changing the naming scheme is definitely outside the scope of this PR, but maybe a comment or docstring indicating what this is doing?

Also, IIRC there is another similar piece of code in this file dealing with SparseArray.__setitem__ being disabled. Can any of that be shared?

@jorisvandenbossche
Copy link
Member Author

Also, IIRC there is another similar piece of code in this file dealing with SparseArray.setitem being disabled. Can any of that be shared?

There is indeed a if is_sparse(self.values): and a similar catching of an error in where, but since the code is for the test completely unrelated, I don't see any way to shared that.

to_native_types has always been a weird name to me.

Completely agree. We actually also have a public Index.to_native_types, that we should probably just deprecate.

@jorisvandenbossche
Copy link
Member Author

I get a mypy error: pandas/core/internals/blocks.py:2274: error: Definition of "to_native_types" in base class "ExtensionBlock" is incompatible with definition in base class "DatetimeBlock"

which is very strange, DatetimeBlock is not at all a base class of ExtensionBlock. The only relationship is that they both derive from the base Block class ..

cc @WillAyd @simonjayhawkins any idea how I can get this error go away?
So we currently already have multiple incompatible signatures. Eg DatetimeBlock.to_native_types and FloatBlock.to_native_types both have a different signature compared to Block.to_native_types. But now adding a ExtensionBlock.to_native_types (with the same signature as base Block.to_native_types) triggers this error.

@jorisvandenbossche
Copy link
Member Author

Ah, I understand now. I didn't look at the location of the error. It is reporting that about DatetimeTZBlock, which indeed subclasses both DatetimeBlock and ExtensionBlock

@jorisvandenbossche
Copy link
Member Author

Solved it (I think) by also overriding it in DatetimeTZBlock, which I think we needed to do anyway to pick DatetimeBlock's version (although the fact that no tests were failing might be a sign that the specialized DatetimeBlock.to_native_types is either not necessary or not properly tested)

@jorisvandenbossche
Copy link
Member Author

don't we already do something like this in the repr of EAs? we should have a same / generic mechanism, rather than implementing logic here.

I fully agree it would be nice to make that better, but right now, formatting on the one hand and csv on the other (which uses to_native_types) are two separate things, and fixing that seems out of scope for this PR.
For now, the only thing I am doing in this PR is actually using the EA's version of astype instead of converting to a ndarray first.

@jreback
Copy link
Contributor

jreback commented Oct 11, 2019

don't we already do something like this in the repr of EAs? we should have a same / generic mechanism, rather than implementing logic here.

I fully agree it would be nice to make that better, but right now, formatting on the one hand and csv on the other (which uses to_native_types) are two separate things, and fixing that seems out of scope for this PR.
For now, the only thing I am doing in this PR is actually using the EA's version of astype instead of converting to a ndarray first.

ok fair enough. do we have an issue for this distinction?

@jreback jreback added this to the 1.0 milestone Oct 11, 2019
@jorisvandenbossche
Copy link
Member Author

The main remaining question is whether we want tests for this (or if we are happy with it being indirectly tested on geopandas). Certainly, ideally we test it here as well. But, we currently don't have any EA that has that problem.

But will check if I can do a minimal ListArray EA (without implementing it fully) to reproduce this issue.

@jorisvandenbossche
Copy link
Member Author

OK, added a minimal ListArray and a test based on that just for this issue. And I checked that the test actually fails without the overriden to_native_types

@@ -0,0 +1,131 @@
"""Test extension array for storing nested data in a pandas container.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think newline after """


@classmethod
def construct_array_type(cls):
"""Return the array type associated with this dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline, period

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LIstArray tests look good

if copy:
return self.copy()
return self
elif pd.api.types.is_string_dtype(dtype) and not pd.api.types.is_object_dtype(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could import at the top

@jorisvandenbossche
Copy link
Member Author

Any opinions about including this in 0.25.2 ?

Normally it would not meet my requirements to be included in a bug fix release, but in this case it solves a regression in geopandas (due to moving to ExtensionArray, and it's not something I can fix in geopandas). But not fully how risky this change is for other EAs.

cc @TomAugspurger

@jreback
Copy link
Contributor

jreback commented Oct 14, 2019

seems ok for 0.25.2

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 14, 2019 via email

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.0, 0.25.2 Oct 14, 2019
@jorisvandenbossche
Copy link
Member Author

OK, then I will merge ;)

@jorisvandenbossche jorisvandenbossche merged commit 8c5941c into pandas-dev:master Oct 14, 2019
@jorisvandenbossche jorisvandenbossche deleted the EA-native-types branch October 14, 2019 23:33
@jorisvandenbossche
Copy link
Member Author

@meeseeksdev backport 0.25.x

@lumberbot-app
Copy link

lumberbot-app bot commented Oct 15, 2019

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
$ git checkout 0.25.x
$ git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 8c5941cd57e260f4c3552769e79d8d6dbd6283d9
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am 'Backport PR #28841: BUG: use EA.astype in ExtensionBlock.to_native_types'
  1. Push to a named branch :
git push YOURFORK 0.25.x:auto-backport-of-pr-28841-on-0.25.x
  1. Create a PR against branch 0.25.x, I would have named this PR:

"Backport PR #28841 on branch 0.25.x"

And apply the correct labels and milestones.

Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instruction are inaccurate, feel free to suggest an improvement.

@jorisvandenbossche
Copy link
Member Author

Manual backport: #28985

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: EAs with list-like values fail in to_csv (due to to_native_types)
4 participants