-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow #28368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow #28368
Conversation
No preference on CI. If there are nightly binaries available then OK with adding them to the numpydev build. Question: does |
What do you mean with "automatically" ? It will call it if the method exists, and for Series it will also check if the underlying values has the method (you can see the exact implementation here: apache/arrow#5106) |
Yeah, that's what I meant by automatically (rather than going through |
Do we just want to take on array as a dependency? I know that discussion has come up in the past without resolution. I don't object to it so if so could simplify some things here |
For this PR Since those PRs are about IO to pyarrow, you only need this if you are actually using pyarrow for others reasons (eg for IPC, for parquet, ..), not for internal functionality in pandas. So I would keep the discussion of making pyarrow a hard dependency for later if we would make more use of it in pandas itself. |
#28371 does something similar for Period and Interval. Are there more coming after this? Would it make sense to collect these somewhere like compat._arrow? I don't have a problem with the placement in this PR, just thinking out loud. |
The actual But let's discuss further in #28371 |
Any code comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some stylistic comments. So is everyone else aligned on this dunder for conversion? Haven't been too involved in conversation but figured worth double checking before moving forward
pandas/tests/arrays/test_integer.py
Outdated
@@ -19,6 +21,13 @@ | |||
from pandas.tests.extension.base import BaseOpsUtil | |||
import pandas.util.testing as tm | |||
|
|||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to replace this with compat._optional.import_optional_dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never used compat._optional.import_optional_dependency
before, so correct if me if I am wrong: it seems that method is typically used in the code (not tests) and is meant to raise an error or return the module (so eg in functions that use an optional dependency).
So I would still need to catch the error, I think? In which case I am not sure it is necessarily clearer, or deduplicating code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see it has a raise_on_missing=False
option. But in theory I then also need to specify on_version='ignore'
to not have an error or warning on old pyarrow versions.
But, I can maybe actually replace it with pytest.importorskip
inside the test, which seems better suited for test cases. That should also simplify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, we have our own wrapper around that as td.skip_if_no
decorator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, simplified it with td.skip_if_no
. I was actually already using it in the other test as well .. (so I was being a bit blind :-))
pandas/tests/arrays/test_integer.py
Outdated
not _PYARROW_INSTALLED | ||
or _PYARROW_INSTALLED | ||
and LooseVersion(pyarrow.__version__) < LooseVersion("0.14.1.dev"), | ||
reason="pyarrow >= 0.15.0 required", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason for this requiring 0.15.0
but the next test requiring 0.14.1.dev
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is to be able to test it with arrow master locally. We can also wait until final 0.15.0 is released, and then I can change this check. But in practice it gives the same.
Well, it's a protocol of pyarrow, not pandas, and the decision is made in pyarrow now (it would be like pandas not liking numpy's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, if you could comment on https://github.com/pandas-dev/pandas/pull/28368/files#r323810167.
I suspect that >= 0.14.1.dev
is just going to be pyarrow 0.15.0 or above? No 0.14.2 planned?
For context, see also #20612 (comment) |
Not at a computer to check but I think it returns the module or None, so you can use that instead of rolling your own ZZZ_INSTALLED globals
…Sent from my iPhone
On Sep 12, 2019, at 10:59 AM, Joris Van den Bossche ***@***.***> wrote:
@jorisvandenbossche commented on this pull request.
In pandas/tests/arrays/test_integer.py:
> @@ -19,6 +21,13 @@
from pandas.tests.extension.base import BaseOpsUtil
import pandas.util.testing as tm
+try:
I've never used compat._optional.import_optional_dependency before, so correct if me if I am wrong: it seems that method is typically used in the code (not tests) and is meant to raise an error or return the module (so eg in functions that use an optional dependency).
So I would still need to catch the error, I think? In which case I am not sure it is necessarily clearer, or deduplicating code.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Probably want a release note? Or... perhaps not since pyarrow 0.15 isn't out yet? Feel free to merge if a release note isn't needed. |
Thanks. |
…andas-dev#28368) * ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow * simplify pyarrow version check in tests * add whatsnew
…andas-dev#28368) * ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow * simplify pyarrow version check in tests * add whatsnew
…andas-dev#28368) * ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow * simplify pyarrow version check in tests * add whatsnew
Adding custom conversion of IntegerArray to an Arrow array, which makes that this can also be written to parquet.
Currently it is only one way, for read_parquet it will come back as int or float (depending on presence of missing values), but fixing this is also being discussed (https://issues.apache.org/jira/browse/ARROW-2428).
The tests currently require the master version of Arrow, which we don't test. I can assure that the tests pass locally for me, but is this something we want to merge without coverage on CI?
(in principle we could add Arrow master in eg the numpy-dev build or another one, but that is a bit more work, so not sure that is worth the currently limited number of tests we have relying on arrow)