ENH: add and register Arrow extension types for Period and Interval #28371

jorisvandenbossche · 2019-09-10T15:33:31Z

Related to #28368, but now for Period and Interval, for which we define Arrow extension types so they can be stored in arrow together with their metadata.

Still needs some more tests and fixing corner cases.
We probably also want to consolidate the pyarrow import checking somewhat.

I think a main question is how to deal with the import of pyarrow. The way I did it now means that it is no longer a lazy import (as it currently is for the parquet functionality).

TomAugspurger

No strong thoughts on importing pyarrow. It'll take a bit of time which is unfortunate, but doing things lazily seems hard.

pandas/core/arrays/period.py

simonjayhawkins · 2019-09-10T17:33:39Z

pandas/tests/arrays/test_period.py

@@ -1,3 +1,5 @@
+from distutils.version import LooseVersion


can compat._optional.import_optional_dependency be used to simplify things?

From quickly looking at that function, I think it can only be used for lazy imports, which is not what I am doing here (but that is exactly something that we might need to discuss)

jbrockmendel · 2019-11-02T03:16:25Z

@jorisvandenbossche is this actionable?

jorisvandenbossche · 2019-11-03T20:51:54Z

I have been updating this PR locally last week, but I am still thinking through how we can deal with a delayed pyarrow import.

I think a clear disadvantage of always trying to import pyarrow is import time (something we have been trying to reduce).
But the main problem with only importing pyarrow when needed is that those extension types need to be registered by the time arrow is creating the data (e.g. when reading a parquet file, or receiving IPC messages). So it is not necessarily pandas that knows when those types need to be registered (e.g. when using pyarrow to read a parquet file and convert to pandas, instead of using pandas' read_parquet function).

So we could have a publicly exposed register_arrow_types function that does this, but that seems rather inconvenient if users have to call it manually in certain cases.

jorisvandenbossche · 2019-11-05T16:52:00Z

Updated this, still todo:

- Fix missing value handling for interval types
- Clean up conversion of numpy/pandas dtype to arrow dtype
- Move some arrow conversion code to common utilities (will also be useful for integer/string array)
- Look further into whether a lazy import is possible

jorisvandenbossche · 2019-11-14T10:05:25Z

Any more feedback on the import issue? (see #28371 (comment))

jreback · 2019-11-14T10:56:38Z

@jorisvandenbossche the way you did the imports is fine; we must lazily import pyarrow or make it a hard dep

I think it would be better to add some functions to reduce the copy-paste of what you did here in this PR. Basically, you isolate the arrow code (meaning you import pyarrow directly at the top of the module) in a separate module (file), then import that module conditionally.

jorisvandenbossche · 2019-11-14T12:31:04Z

> the way you did the imports is fine; we must lazily import pyarrow or make it a hard dep

Yes, but the lazy import can be done in two ways: always try on pandas import, or only try when someone uses the parquet functionality.

It's similar to plotting: matplotlib is not a hard dependency, but before, we always tried to import it (and register some converters), while now we have moved to only importing it when someone tries to plot.

jbrockmendel · 2019-11-14T17:12:33Z

pyarrow import takes ~160 ms for me, but 135 ms of that is numpy, so it would only add about 25 ms to our import time. If it's easy to avoid, that'd be nice, but it's not worth significant gymnastics.

jorisvandenbossche · 2019-11-14T20:07:10Z

Ah, that's interesting (I tried to time it a while ago, but I seem to remember it was rather fluctuating).
In that case, I would personally maybe prefer to always try to import (although matplotlib is also not that much slower to import ...)

jbrockmendel · 2019-12-01T01:14:21Z

@jorisvandenbossche pls rebase

jbrockmendel · 2019-12-21T04:54:19Z

@jorisvandenbossche pls rebase

jorisvandenbossche · 2020-01-01T13:50:20Z

I will try to rebase shortly. Any final concerns on the lazy vs non-lazy import?

Right now, I would propose to keep this PR as is (i.e. non-lazy import -> always try to import pyarrow on pandas import if it is installed; Brock showed above that the additional import time (on top of importing numpy) is rather limited), but as I mentioned above, I can also refactor it to have the lazy import

TomAugspurger · 2020-01-01T14:38:57Z

I’m ok with non-lazy.

jreback

ok with non lazy import check

should likely do this just once and export the _PYARROW_INSTALLED variable like we do for numexpr

jreback · 2020-01-01T14:36:41Z

pandas/core/arrays/interval.py

+    import pyarrow
+
+    _PYARROW_INSTALLED = True
+except ImportError:


can u make this into a function and put in common location

I moved this for now into an _arrow_utils.py file in the arrays directory (open for other names), we can then put some common functions in that file as well

pandas/core/arrays/interval.py

jreback · 2020-01-01T14:38:50Z

pandas/core/arrays/interval.py

@@ -1217,3 +1279,55 @@ def maybe_convert_platform_interval(values):
        values = np.asarray(values)

    return maybe_convert_platform(values)
+
+
+if _PYARROW_INSTALLED and LooseVersion(pyarrow.__version__) >= LooseVersion("0.15"):


_PYARROW_INSTALLED needs to incorporate the version check

I moved the version check into the separate file (and made it a variable), but kept it separate from the import check as different functionalities might need a different pyarrow version
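Keeping the version check apart from the import check, so that each functionality can ask for its own minimum, might look something like the sketch below. It is an assumption-laden illustration, not the contents of `_arrow_utils.py`: the helper names and the `MIN_VERSIONS` table are hypothetical, and a small tuple parser is used instead of `distutils.version.LooseVersion` because `distutils` has since been removed from the standard library (Python 3.12).

```python
import re


def version_tuple(version):
    """Parse the leading numeric components: '0.15.1.dev' -> (0, 15, 1)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum`; None means not installed."""
    if installed is None:
        return False
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.15' and '0.15.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Hypothetical per-feature minimums, kept separate from the
# "is it installed at all" question.
MIN_VERSIONS = {"to_arrow": "0.15", "roundtrip": "0.16"}


def feature_available(feature, installed):
    """Check one feature's minimum against the installed version string."""
    return meets_minimum(installed, MIN_VERSIONS[feature])
```

Note that this simple parser ignores pre-release ordering: it treats `0.15.1.dev` the same as `0.15.1`.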

jreback · 2020-01-01T14:39:18Z

pandas/core/arrays/period.py

@@ -49,6 +51,13 @@
 from pandas.tseries import frequencies
 from pandas.tseries.offsets import DateOffset, Tick, _delta_to_tick



same as above

pandas/tests/arrays/test_period.py

jorisvandenbossche · 2020-01-06T12:12:01Z

There is a CI failure for the Linux py36_locale build: apparently the pyarrow install failed there (and for the other tests, this gets ignored / skipped, so that's the reason we didn't see it yet). So that raises two issues: 1) we should fix that CI env and 2) we should probably catch a more general error, so that a broken pyarrow installation does not make the pandas import fail.

jreback · 2020-01-06T12:20:10Z

> There is a CI failure for the Linux py36_locale build: apparently the pyarrow install failed there (and for the other tests, this gets ignored / skipped, so that's the reason we didn't see it yet). So that raises two issues: 1) we should fix that CI env and 2) we should probably catch a more general error, so that a broken pyarrow installation does not make the pandas import fail.

absolutely not

we need to see failures that are not the result of expected things

jorisvandenbossche · 2020-01-06T13:08:32Z

OK, so the reason is not a failed installation, but actually that pyarrow up to version 0.12 also tries to import pandas, so you get a circular import, which manifests as the AttributeError.

So, one option is to increase the minimum pyarrow version to 0.13. But the problem with that is that it will still give a confusing error message when people have pyarrow 0.12 installed.

jreback · 2020-01-06T13:18:50Z

> OK, so the reason is not a failed installation, but actually that pyarrow up to version 0.12 also tries to import pandas, so you get a circular import, which manifests as the AttributeError.
>
> So, one option is to increase the minimum pyarrow version to 0.13. But the problem with that is that it will still give a confusing error message when people have pyarrow 0.12 installed.

we already bumped to 0.12 for various things, you could push this to 0.13 (no objection). these conversions require a higher version anyhow? (e.g. interval/period)?

jorisvandenbossche · 2020-01-08T12:46:55Z

OK, for now (to get this PR in a mergeable state), I moved away from always trying to import. This will give some realistic corner cases (like reading a parquet file with pyarrow instead of with pandas will not give the extension type), but we can see this new feature as experimental anyway, and to be improved later.

By not importing it by default, we avoid the circular import problem with old pyarrow versions. Bumping the required pyarrow version might solve this, but pyarrow would still become un-importable, with a very cryptic error message, if you install pyarrow 0.12 and a new pandas side by side.
I might still try to add something more clever (based on pkg_resources to check the installed pyarrow version, to avoid needing to import pyarrow to know if it is recent enough to import), but that can go in a separate PR.
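Checking the installed version without importing the package could be sketched as below. This swaps in the standard library's `importlib.metadata` (Python 3.8+) for the `pkg_resources` approach mentioned above; the function names are hypothetical and the comparison only looks at the major and minor components.

```python
import re
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string without importing the package."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def is_at_least(distribution, minimum):
    """True if `distribution` is installed at `minimum` (major.minor) or newer."""
    version = installed_version(distribution)
    if version is None:
        return False

    def key(text):
        # Only the first two numeric components are compared in this sketch.
        return tuple(int(part) for part in re.findall(r"\d+", text)[:2])

    return key(version) >= key(minimum)


# Only attempt the real import when the metadata says it is recent enough,
# so an old installation is never imported in the first place.
if is_at_least("pyarrow", "0.13"):
    pass  # the actual `import pyarrow` would go here
```

Because only the package metadata is read, an old pyarrow that would itself try to import pandas is never executed by this check.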

jorisvandenbossche · 2020-01-08T13:27:32Z

The remaining "failure" is codecov because the functionality I added is not run in the coverage build (only with arrow master)

TomAugspurger

Two questions:

1. ~~Do we have any tests that ensure pyarrow is imported with pandas? IIRC that was somewhat important for getting the types registered?~~ (just read ENH: add and register Arrow extension types for Period and Interval #28371 (comment))
2. What's the behavior for __arrow_array__ with pyarrow 0.15 and earlier? Is it just not called, so we don't need to worry about checking the pyarrow version within there? Or will the line from pandas.core.arrays._arrow_utils import ArrowIntervalType raise an ImportError?

jorisvandenbossche · 2020-01-08T14:06:25Z

> What's the behavior for __arrow_array__ with pyarrow 0.15 and earlier? Is it just not called, so we don't need to worry about checking the pyarrow version within there?

Indeed, it should never be called with versions older than 0.15 (unless you would manually call the method, but the method is only called by pyarrow starting from 0.15)

TomAugspurger · 2020-01-08T14:16:54Z

Great, +1 then.

jreback

lgtm. some small questions. ping on green.

jreback · 2020-01-09T03:18:08Z

doc/source/whatsnew/v1.0.0.rst

  (:meth:`~DataFrame.to_parquet` / :func:`read_parquet`) using the `'pyarrow'` engine
-  now preserve those data types with pyarrow >= 1.0.0 (:issue:`20612`).
+  now preserve those data types with pyarrow >= 0.16.0 (:issue:`20612`, :issue:`28371`).


should this be 0.15?

No, the pandas -> arrow conversion protocol was already included in 0.15, but the other direction (for a full roundtrip) only landed after 0.15. It was just decided that the next arrow release will be 0.16 and not 1.0, so I changed the text here accordingly.

jreback · 2020-01-09T03:21:10Z

pandas/tests/arrays/interval/test_interval.py

+# Arrow interaction
+
+
+pyarrow_skip = td.skip_if_no("pyarrow", min_version="0.15.1.dev")


is this right?

As mentioned above, the pandas -> arrow part already works on 0.15 in general, but due to a "bug" in pyarrow (caused by the use of .values, see apache/arrow#5753), period and interval ExtensionArrays specifically don't work yet (because .values returns an object array for those, and not an EA). Therefore, those tests for period and interval only work with pyarrow higher than 0.15.

…ith pandas master

https://issues.apache.org/jira/browse/ARROW-7527

Period dtype is now supported in pandas <-> arrow conversions with pandas master (pandas-dev/pandas#28371)

Closes #6147 from jorisvandenbossche/ARROW-7527 and squashes the following commits:

a64da2c <Joris Van den Bossche> ARROW-7527: Fix pandas/feather tests for unsupported types with pandas master

Authored-by: Joris Van den Bossche
Signed-off-by: François Saint-Jacques

jorisvandenbossche added 2 commits September 10, 2019 17:01

add PeriodType arrow extension type

e3ab110

add IntervalType arrow extension type

6c1300f

TomAugspurger reviewed Sep 10, 2019

simonjayhawkins reviewed Sep 10, 2019

rename + make hashable

5eb8ad6

This was referenced Sep 11, 2019

ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow #28368

Merged

Serialization / Deserialization of ExtensionArrays #20612

Open

Merge remote-tracking branch 'upstream/master' into arrow-extension-t…

47c4755

…ypes

jorisvandenbossche added 4 commits November 5, 2019 14:49

Merge remote-tracking branch 'upstream/master' into arrow-extension-t…

e7e0674

…ypes

better validation of types + tests

85bf36c

add tests for missing values with IntervalArray

f325ff1

Add arrow -> pandas conversion + tests

82589dd

jorisvandenbossche marked this pull request as ready for review November 5, 2019 16:49

jorisvandenbossche changed the title ~~[WIP] ENH: add and register Arrow extension types for Period and Interval~~ ENH: add and register Arrow extension types for Period and Interval Nov 5, 2019

jorisvandenbossche added this to the 1.0 milestone Nov 5, 2019

jorisvandenbossche added the ExtensionArray, Interval, and Period labels Nov 5, 2019

jorisvandenbossche added 4 commits November 8, 2019 13:32

Merge remote-tracking branch 'upstream/master' into arrow-extension-t…

64bf38b

…ypes

fix interval subtype and missing value handling

70e7023

Merge remote-tracking branch 'upstream/master' into arrow-extension-t…

b09f54d

…ypes

Merge remote-tracking branch 'upstream/master' into arrow-extension-t…

913f310

…ypes

jorisvandenbossche added 2 commits December 9, 2019 11:17

Merge remote-tracking branch 'upstream/master' into arrow-extension-t…

206c609

…ypes

period test only for pyarrow 0.15dev (in 0.15 .values was used which …

e9a032d

…does not use the EA)

jreback requested changes Jan 1, 2020

jorisvandenbossche added 2 commits January 6, 2020 10:35

Merge remote-tracking branch 'upstream/master' into arrow-extension-t…

16523af

…ypes

move common things to _arrow_utils

1b6f21e

jorisvandenbossche added 3 commits January 8, 2020 11:25

Merge remote-tracking branch 'upstream/master' into arrow-extension-t…

d39b8a3

…ypes

use commong function in IntDtype from_arrow

4156718

lazy import for now

92a1ede

update whatsnew for pyarrow next version

e303749

jorisvandenbossche mentioned this pull request Jan 8, 2020

COMPAT: bump minimum version to pyarrow 0.13 #30812

Merged

TomAugspurger reviewed Jan 8, 2020

jreback approved these changes Jan 9, 2020

jorisvandenbossche merged commit 2198f51 into pandas-dev:master Jan 9, 2020

jorisvandenbossche deleted the arrow-extension-types branch January 9, 2020 08:34

jorisvandenbossche mentioned this pull request Jan 9, 2020

ARROW-7527: [Python] Fix pandas/feather tests for unsupported types with pandas master apache/arrow#6147

Closed

jorisvandenbossche mentioned this pull request May 12, 2021

ENH: always register our Arrow extension types on pandas import #41432

Open

asfimport mentioned this pull request Nov 5, 2019

[Python] __arrow_array__ does not work for ExtensionTypes in Table.from_pandas apache/arrow#23334

Closed
