ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow #28368

jorisvandenbossche · 2019-09-10T14:29:25Z

This adds a custom conversion of IntegerArray to an Arrow array, which also makes it possible to write it to Parquet.
Currently the conversion is one-way only: with read_parquet the data comes back as int or float (depending on the presence of missing values). Fixing this is also being discussed (https://issues.apache.org/jira/browse/ARROW-2428).

The tests currently require the master version of Arrow, which we don't test against on CI. I can confirm that the tests pass locally for me, but is this something we want to merge without coverage on CI?
(In principle we could add Arrow master to e.g. the numpy-dev build or another one, but that is a bit more work, so I am not sure it is worth it for the currently limited number of tests that rely on Arrow.)

TomAugspurger · 2019-09-10T15:31:19Z

No preference on CI. If there are nightly binaries available then OK with adding them to the numpydev build.

Question: does pa.array(obj) call obj.__arrow_array__ automatically?

jorisvandenbossche · 2019-09-10T15:35:29Z

Question: does pa.array(obj) call obj.__arrow_array__ automatically?

What do you mean by "automatically"? It will call the method if it exists, and for a Series it will also check whether the underlying values have the method (you can see the exact implementation here: apache/arrow#5106).
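
As a rough illustration of the lookup order described above (the object itself first, then a Series' underlying values, and only then generic iteration), here is a pure-Python sketch. It is not pyarrow's implementation (see apache/arrow#5106 for that); `to_arrow_like`, `FakeSeries`, and `CustomValues` are hypothetical names, and plain tuples stand in for real Arrow arrays.

```python
class CustomValues:
    """Hypothetical array-like that implements the conversion protocol."""

    def __init__(self, data):
        self.data = list(data)

    def __arrow_array__(self, type=None):
        # A real implementation would return a pyarrow.Array here.
        return ("protocol", self.data, type)


class FakeSeries:
    """Hypothetical Series stand-in that only wraps an underlying values object."""

    def __init__(self, values):
        self.values = values


def to_arrow_like(obj, type=None):
    """Simplified dispatch, assumed for illustration only."""
    # 1. The object itself implements the protocol.
    if hasattr(obj, "__arrow_array__"):
        return obj.__arrow_array__(type=type)
    # 2. A Series-like wrapper: look at the underlying values.
    values = getattr(obj, "values", None)
    if values is not None and hasattr(values, "__arrow_array__"):
        return values.__arrow_array__(type=type)
    # 3. Otherwise fall back to generic iteration.
    return ("iterated", list(obj), type)


direct = to_arrow_like(CustomValues([1, 2]))
wrapped = to_arrow_like(FakeSeries(CustomValues([3])))
fallback = to_arrow_like([4, 5])
```

In this sketch both the direct object and the wrapped one take the protocol path, while a plain list falls through to iteration.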

TomAugspurger · 2019-09-10T15:57:06Z

Yeah, that's what I meant by automatically (rather than going through __iter__).

WillAyd · 2019-09-11T00:14:31Z

Do we just want to take on pyarrow as a dependency? I know that discussion has come up in the past without resolution. I don't object to it, and if we did, it could simplify some things here.

jorisvandenbossche · 2019-09-11T06:36:41Z

For this PR pyarrow does not need to be a hard dependency, and actually here a lazy import is enough. It's only for #28371 that a lazy import is more problematic (but that still doesn't need to make it a hard dependency).

Since those PRs are about IO to pyarrow, you only need this if you are actually using pyarrow for other reasons (e.g. for IPC or for Parquet), not for internal functionality in pandas. So I would keep the discussion of making pyarrow a hard dependency for later, if we start making more use of it in pandas itself.

jbrockmendel · 2019-09-12T01:06:50Z

#28371 does something similar for Period and Interval. Are there more coming after this? Would it make sense to collect these somewhere like compat._arrow? I don't have a problem with the placement in this PR, just thinking out loud.

jorisvandenbossche · 2019-09-12T06:58:33Z

Would it make sense to collect these somewhere like compat._arrow?

The actual __arrow_array__ of course needs to be defined on the actual IntegerArray, but the implementation itself could call out to another module. For the type classes I am defining in #28371 this might make sense, but for this PR the implementation is literally one line (pa.array(self._data, mask=self._mask, type=type)), so IMO that is not worth putting in a separate function somewhere else.
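
For readers who want to see the shape of that one-liner in context, here is a minimal sketch. `MaskedInts` is a hypothetical stand-in for IntegerArray (data plus a boolean mask where True marks a missing value), not the pandas class itself. The sketch assumes NumPy is installed and only exercises the conversion if pyarrow (0.15 or later, where `pa.array` honors the protocol) is importable.

```python
import importlib.util

import numpy as np


class MaskedInts:
    """Hypothetical stand-in for IntegerArray: integer data plus a boolean mask."""

    def __init__(self, data, mask):
        self._data = np.asarray(data, dtype="int64")
        self._mask = np.asarray(mask, dtype=bool)  # True marks a missing value

    def __arrow_array__(self, type=None):
        # Lazy import: pyarrow is only needed when a conversion is requested.
        import pyarrow as pa

        return pa.array(self._data, mask=self._mask, type=type)


arr = MaskedInts([1, 2, 3], mask=[False, True, False])

# Only exercise the conversion when pyarrow is actually installed.
if importlib.util.find_spec("pyarrow") is not None:
    import pyarrow as pa

    converted = pa.array(arr)  # pyarrow looks up and calls arr.__arrow_array__
    print(converted.to_pylist())
```

Because the import lives inside the method, merely defining the class does not require pyarrow, which matches the point made earlier in the thread that a lazy import is enough here.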

But let's discuss further in #28371

jorisvandenbossche · 2019-09-12T15:31:01Z

Any code comments?
I think this one is good to go, #28371 needs more discussion though (where to put it, always try to import pyarrow or not, ..)

WillAyd

Just some stylistic comments. So is everyone else aligned on this dunder for conversion? Haven't been too involved in conversation but figured worth double checking before moving forward

WillAyd · 2019-09-12T15:34:29Z

pandas/tests/arrays/test_integer.py

@@ -19,6 +21,13 @@
 from pandas.tests.extension.base import BaseOpsUtil
 import pandas.util.testing as tm

+try:


You should be able to replace this with compat._optional.import_optional_dependency

I've never used compat._optional.import_optional_dependency before, so correct me if I am wrong: it seems that this function is typically used in library code (not tests) and is meant to either raise an error or return the module (e.g. in functions that use an optional dependency).
So I would still need to catch the error, I think? In that case I am not sure it is necessarily clearer, or that it deduplicates any code.

Ah, I see it has a raise_on_missing=False option. But in theory I then also need to specify on_version='ignore' to not have an error or warning on old pyarrow versions.

But, I can maybe actually replace it with pytest.importorskip inside the test, which seems better suited for test cases. That should also simplify it.

Or we could use our own wrapper around that, the td.skip_if_no decorator.

OK, simplified it with td.skip_if_no. I was actually already using it in the other test as well .. (so I was being a bit blind :-))
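
The skip-if-missing pattern settled on here can be sketched with the standard library alone. The helper below is a hypothetical analog built on `unittest.skipIf`; it is not pandas' `td.skip_if_no` and does not reproduce its version handling. In a pytest suite, calling `pytest.importorskip("pyarrow")` inside the test serves the same purpose. The package name `not_a_real_package_xyz` is a placeholder assumed to be absent.

```python
import importlib.util
import unittest


def skip_if_no(package):
    """Hypothetical helper: skip the decorated test when `package` is not importable."""
    missing = importlib.util.find_spec(package) is None
    return unittest.skipIf(missing, f"{package} is not installed")


class TestOptionalDependency(unittest.TestCase):
    @skip_if_no("json")  # stdlib module, always present, so this test runs
    def test_with_available_module(self):
        import json

        self.assertEqual(json.loads("[1, null, 3]"), [1, None, 3])

    @skip_if_no("not_a_real_package_xyz")  # placeholder name, assumed absent
    def test_with_missing_module(self):
        import not_a_real_package_xyz  # never reached: the test is skipped


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestOptionalDependency)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped), "skipped;", "ok" if result.wasSuccessful() else "failed")
```

Putting the availability check in a decorator keeps the test module free of try/except imports and module-level `*_INSTALLED` flags, which is the simplification discussed in this review thread.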

WillAyd · 2019-09-12T15:35:18Z

pandas/tests/arrays/test_integer.py

+    not _PYARROW_INSTALLED
+    or _PYARROW_INSTALLED
+    and LooseVersion(pyarrow.__version__) < LooseVersion("0.14.1.dev"),
+    reason="pyarrow >= 0.15.0 required",


Is there a particular reason for this requiring 0.15.0 but the next test requiring 0.14.1.dev?

Yes, this is to be able to test it with Arrow master locally. We can also wait until the final 0.15.0 is released, and then I can change this check. In practice the result is the same.

jorisvandenbossche · 2019-09-12T15:42:33Z

So is everyone else aligned on this dunder for conversion?

Well, it's a protocol of pyarrow, not pandas, and the decision is made in pyarrow now (it would be like pandas not liking numpy's __array__). Of course, feedback on the protocol for conversion to arrow is certainly welcome, it's very new so can still be changed.

TomAugspurger

+1, if you could comment on https://github.com/pandas-dev/pandas/pull/28368/files#r323810167.

I suspect that >= 0.14.1.dev is just going to be pyarrow 0.15.0 or above? No 0.14.2 planned?

jorisvandenbossche · 2019-09-12T17:52:09Z

it's a protocol of pyarrow,

For context, see also #20612 (comment)

WillAyd · 2019-09-12T18:06:28Z

Not at a computer to check but I think it returns the module or None, so you can use that instead of rolling your own ZZZ_INSTALLED globals

TomAugspurger · 2019-09-12T19:00:02Z

Probably want a release note? Or... perhaps not since pyarrow 0.15 isn't out yet?

Feel free to merge if a release note isn't needed.

TomAugspurger · 2019-09-12T20:57:02Z

Thanks.


ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow

253c6f4

jorisvandenbossche added the Compat and Enhancement labels Sep 10, 2019

jorisvandenbossche added this to the 1.0 milestone Sep 10, 2019

jorisvandenbossche mentioned this pull request Sep 10, 2019

ENH: add and register Arrow extension types for Period and Interval #28371

Merged

jorisvandenbossche mentioned this pull request Sep 12, 2019

Serialization / Deserialization of ExtensionArrays #20612

Open

WillAyd reviewed Sep 12, 2019


TomAugspurger reviewed Sep 12, 2019


simplify pyarrow version check in tests

e814271

TomAugspurger approved these changes Sep 12, 2019


add whatsnew

1e66165

TomAugspurger merged commit 34fff1f into pandas-dev:master Sep 12, 2019

jorisvandenbossche deleted the integer-arrow-array branch October 23, 2019 13:51

jorisvandenbossche mentioned this pull request Oct 23, 2019

ENH: Add StringArray.__arrow_array__ for conversion to Arrow #29182

Merged
