Skip to content

WIP: EA SetArray #22382

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed

WIP: EA SetArray #22382

wants to merge 6 commits into from

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Aug 16, 2018

This is very WIP, but might IMO be helpful in figuring out the EA interfaces, since it has some other requirements than the EAs so far. It makes some way towards #4480 and might be a basis for #21547, where the comments were effectively "make this an EA", e.g. @jreback:

EA is an extension array & a dtype, both of which are first class. That's how I'd like to see sets (and lists and dicts)

The code works so far, and tests pass, but there's a large amount of issues (and probably some conflicts with some other current EA-PRs). As a disclaimer, in each case it may well be that I'm misunderstanding something, but I've tried my best, and the same issues would probably be encountered by other EA authors.

  • several methods, including fillna and astype do not dispatch to the EA impl
  • same for df.fillna not dispatching to Series
  • set is a case, where the nan-value is incompatible with the dtype in normal operations (i.e. {1} | np.nan raises). This means that tests like TestMethods.test_combine_add and TestMethods.test_combine_le will always fail on missing data
  • Bool ops had no dispatching mechanism. I added one, but can probably be improved. These are also not tested as far as I can tell.
  • Bool ops in particular will have to figure out a contract which side is cast to what, e.g. if op(EA, np.ndarray) is the same as op(np.ndarray, EA), or if the latter downcasts EA.
  • data[-1 must be non-na for BaseGetitemTests.test_take
  • found tests/extension/conftest.py too late, will need to adapt my fixtures
  • I'm sure my changes to pandas/tests/extension/base/ops.py will cause friction, but those are just WIP as well. In any case, tests like BaseComparisonOpsTests.test_compare_array can never suceed with comparing 0 to a non-numeric dtype (and I wrapped it into a Series since I was getting errors with the dispatch as a list object)

If/when this is ever merged, I'd like add more options to .astype('set') (example in #4480), and add a set accessor for other methods (again #4480).

@pep8speaks
Copy link

pep8speaks commented Aug 16, 2018

Hello @h-vetinari! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 16, 2018 at 06:32 Hours UTC

@h-vetinari h-vetinari changed the title WIP: EA Setarray WIP: EA SetArray Aug 16, 2018
@jorisvandenbossche
Copy link
Member

@h-vetinari Thanks a lot for working on this!

To repeat my comment of the other PR, I personally don't think this should be necessarily included in pandas core, but would be perfect as an external package (now we have the EA interface).
That said, actually doing this (here or as external package) is indeed very good to iron out the interface, as arrays that contain sequences as scalars will certainly have some additional needs/complexities.

We have a dummy JSONArray implementation in the test suite that might run into similar problems, but for now that was not given much attention.

several methods, including fillna and astype do not dispatch to the EA impl

The astype is a know issue (#22343, as you are aware of), fillna should normally dispatch. Do you have a reproducible example to show it does not? If there are any others that don't dispatch, you can open separate issues/PRs for that.

Bool ops in particular will have to figure out a contract which side is cast to what, e.g. if op(EA, np.ndarray) is the same as op(np.ndarray, EA), or if the latter downcasts EA.

Can you give a more concrete example to show what you mean?

there's not a lot of documentation what the EA test suite expects. E.g. data must be length 100, data_missing must be [np.nan, instance of dtype] (but data can have missing values too?), data[-1] must non-na, etc.

The things you mention above (except the last one) are mentioned in the docstrings in pandas/tests/extensions/conftest.py ? But please feel free to add any better documentation on things that are unclear!
I think we assume that data has no missing values (and maybe also that all are unique, not sure).

@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Aug 16, 2018
@@ -1482,6 +1484,45 @@ def _bool_method_SERIES(cls, op, special):
code duplication.
"""

def dispatch_to_extension_op(op, left, right):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a repetition of the existing dispatch function?

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 16, 2018

The things you mention above (except the last one) are mentioned in the docstrings in pandas/tests/extensions/conftest.py ? But please feel free to add any better documentation on things that are unclear!
I think we assume that data has no missing values (and maybe also that all are unique, not sure).

I had edited the OP a few minutes before you responded, because I had finally found pandas/tests/extensions/conftest.py. Not all requirements are mentioned though, and test_integer.py does have missing values in data, for ex.

I'll respond to the rest later, no time right now.

@jreback
Copy link
Contributor

jreback commented Nov 23, 2018

closing as stale. would be more amenable as an EA.

@jreback jreback closed this Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants