-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: add and register Arrow extension types for Period and Interval #28371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add and register Arrow extension types for Period and Interval #28371
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong thoughts on importing pyarrow. It'll take a bit of time which is unfortunate, but doing things lazily seems hard.
pandas/tests/arrays/test_period.py
Outdated
@@ -1,3 +1,5 @@ | |||
from distutils.version import LooseVersion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can compat._optional.import_optional_dependency be used to simplify things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From quickly looking at that function, I think it can only be used for lazy imports, which is not what I am doing here (but that is exactly something that we might need to discuss)
@jorisvandenbossche is this actionable? |
I have been updating this PR locally last week, but still thinking through how we can deal with a delayed pyarrow import. I think a clear disadvantage of always trying to import pyarrow is import time (something we have been trying to reduce). So we could have a publicly exposed |
Updated this, still todo:
|
Any more feedback on the import issue? (see #28371 (comment)) |
@jorisvandenbossche the way you did the imports is fine; we must lazily import pyarrow or make it a hard dep i think that would be better to make some functions to reduce the copy paste of what you did here in this PR. basically you isolate the arrow code (meaning you directly import it at the top of the module) in a separate module (file). then import that module conditionally. |
Yes, but the lazily import can be done in two ways: always try on pandas import, or only try when someone uses the parquet functionality. It's similar as with plotting: matplotlib is not a hard dependency, but before, we always tried to import it (and register some converters), while now we moved to only importing it when someone tries to plot. |
pyarrow import takes ~160 ms for me, but 135 of that is numpy, so would only add about 25 ms to our import time. If its easy to avoid that'd be nice, but not worth significant gymnastics |
Ah, that's interesting (I tried to time it a while ago, but seem to remember is was rather fluctuating). |
@jorisvandenbossche pls rebase |
@jorisvandenbossche pls rebase |
I will try to rebase shortly. Any final concerns on the lazy vs non-lazy import? Right now, I would propose to keep this PR as is (i.e. non-lazy import -> always try to import pyarrow on pandas import if it is installed; Brock showed above that the additional import time (on top of importing numpy) is rather limited), but as I mentioned above, I can also refactor it to have the lazy import |
I’m ok with non-lazy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok with non lazy import check
should likely do this just once and export the _PYARROW_INSTALLED variable like we do for numexpr
pandas/core/arrays/interval.py
Outdated
import pyarrow | ||
|
||
_PYARROW_INSTALLED = True | ||
except ImportError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u make this into a function and put in common location
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this for now into an _arrow_utils.py
file in the arrays
directory (open for other names), we can then put some common functions in that file as well
pandas/core/arrays/interval.py
Outdated
@@ -1217,3 +1279,55 @@ def maybe_convert_platform_interval(values): | |||
values = np.asarray(values) | |||
|
|||
return maybe_convert_platform(values) | |||
|
|||
|
|||
if _PYARROW_INSTALLED and LooseVersion(pyarrow.__version__) >= LooseVersion("0.15"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__PYARROW_INSTALLED needs to incorporate the version check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the version check into the separate file (and made it a variable), but kept it separate from the import check as different functionalities might need a different pyarrow version
pandas/core/arrays/period.py
Outdated
@@ -49,6 +51,13 @@ | |||
from pandas.tseries import frequencies | |||
from pandas.tseries.offsets import DateOffset, Tick, _delta_to_tick | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There is a CI failure for the Linux py36_locale build: apparently the pyarrow install failed there (and for the other tests, this gets ignored / skipped, so that's the reason we didn't see it yet). So that raises two issues: 1) we should fix that CI env and 2) we should probably catch a more general error to avoid that a wrong pyarrow installation let the pandas import fail. |
we need to see failures that are not the result of expected things |
OK, so the reason is not a failed installation, but actually that pyarrow up to version 0.12 also tries to import pandas, so you get a circular dependency which outs itself in the AttributeError. So, one option is to increase the minimum pyarrow version to 0.13. But, the problem with that is that this will still give a confusing error message when people have pyarrow 0.12 installed. |
we already bumped to 0.12 for various things, you could push this to 0.13 (no objection). these conversions require a higher version anyhow? (e.g. interval/period)? |
OK, for now (to get this PR in a mergeable state), I moved away from always trying to import. This will give some realistic corner cases (like reading a parquet file with pyarrow instead of with pandas will not give the extension type), but we can see this new feature as experimental anyway, and to be improved later. By not importing it by default, we avoid the circular import problem with old pyarrow versions. Bumping the required pyarrow version might solve this, but that still causes pyarrow to become un-importable, if you install pyarrow 0.12 and new pandas side by side with a very cryptic error message. |
The remaining "failure" is codecov because the functionality I added is not run in the coverage build (only with arrow master) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions:
Do we have any tests that ensure pyarrow is imported with pandas? IIRC that was somewhat important for getting the types registered?(just read ENH: add and register Arrow extension types for Period and Interval #28371 (comment))- What's the behavior for
__arrow_array__
pyarrow 0.15 and earlier? Is it just not called, so we don't need to worry about checking the pyarrow version within there? Or will the linefrom pandas.core.arrays._arrow_utils import ArrowIntervalType
raise an ImportError?
Indeed, it should never be called with versions older than 0.15 (unless you would manually call the method, but the method is only called by pyarrow starting from 0.15) |
Great, +1 then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. some small questions. ping on green.
(:meth:`~DataFrame.to_parquet` / :func:`read_parquet`) using the `'pyarrow'` engine | ||
now preserve those data types with pyarrow >= 1.0.0 (:issue:`20612`). | ||
now preserve those data types with pyarrow >= 0.16.0 (:issue:`20612`, :issue:`28371`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be 0.15?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the pandas -> arrow conversion protocol was already included in 0.15, but the other way (for a full roundtrip) only landed after 0.15. It was just decided that the next arrow release will be 0.16 and not 1.0, so therefore changed the text here.
# Arrow interaction | ||
|
||
|
||
pyarrow_skip = td.skip_if_no("pyarrow", min_version="0.15.1.dev") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above, the pandas -> arrow part already works on 0.15 in general, but due to a "bug" in pyarrow (due to the use of .values
, see apache/arrow#5753), period and interval ExtensionArrays specifically don't work yet (because .values
returns an object array for those, and not an EA). So therefore those tests for period and interval also only work with pyarrow higher than 0.15.
…ith pandas master https://issues.apache.org/jira/browse/ARROW-7527 Period dtype is now supported in pandas <-> arrow conversions with pandas master (pandas-dev/pandas#28371) Closes #6147 from jorisvandenbossche/ARROW-7527 and squashes the following commits: a64da2c <Joris Van den Bossche> ARROW-7527: Fix pandas/feather tests for unsupported types with pandas master Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: François Saint-Jacques <[email protected]>
Related to #28368, but now for Period and Interval for which we define extension types to store them with metadata in arrow.
Still needs some more tests and fixing corner cases.
We probably also want to consolidate the pyarrow import checking somewhat.
I think a main question is how to deal with the import of pyarrow. The way I did it now makes that it is no longer a lazy import (as it was currently for the parquet functionality).