ENH: add and register Arrow extension types for Period and Interval #28371
Merged: jorisvandenbossche merged 24 commits into pandas-dev:master from jorisvandenbossche:arrow-extension-types on Jan 9, 2020
Commits (all by jorisvandenbossche):
e3ab110 add PeriodType arrow extension type
6c1300f add IntervalType arrow extension type
5eb8ad6 rename + make hashable
47c4755 Merge remote-tracking branch 'upstream/master' into arrow-extension-t…
e7e0674 Merge remote-tracking branch 'upstream/master' into arrow-extension-t…
85bf36c better validation of types + tests
f325ff1 add tests for missing values with IntervalArray
82589dd Add arrow -> pandas conversion + tests
64bf38b Merge remote-tracking branch 'upstream/master' into arrow-extension-t…
70e7023 fix interval subtype and missing value handling
b09f54d Merge remote-tracking branch 'upstream/master' into arrow-extension-t…
913f310 Merge remote-tracking branch 'upstream/master' into arrow-extension-t…
76a6f46 Merge remote-tracking branch 'upstream/master' into arrow-extension-t…
6587bd2 use skip_if_no decorator
5303bae add parquet tests
a97808c clean-up type conversion
206c609 Merge remote-tracking branch 'upstream/master' into arrow-extension-t…
e9a032d period test only for pyarrow 0.15dev (in 0.15 .values was used which …
16523af Merge remote-tracking branch 'upstream/master' into arrow-extension-t…
1b6f21e move common things to _arrow_utils
d39b8a3 Merge remote-tracking branch 'upstream/master' into arrow-extension-t…
4156718 use commong function in IntDtype from_arrow
92a1ede lazy import for now
e303749 update whatsnew for pyarrow next version
@@ -0,0 +1,124 @@
from distutils.version import LooseVersion
import json

import numpy as np
import pyarrow

from pandas.core.arrays.interval import _VALID_CLOSED

_pyarrow_version_ge_015 = LooseVersion(pyarrow.__version__) >= LooseVersion("0.15")


def pyarrow_array_to_numpy_and_mask(arr, dtype):
    """
    Convert a primitive pyarrow.Array to a numpy array and boolean mask based
    on the buffers of the Array.

    Parameters
    ----------
    arr : pyarrow.Array
    dtype : numpy.dtype

    Returns
    -------
    (data, mask)
        Tuple of two numpy arrays with the raw data (with specified dtype) and
        a boolean mask (validity mask, so False means missing)
    """
    buflist = arr.buffers()
    data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]
    bitmask = buflist[0]
    if bitmask is not None:
        mask = pyarrow.BooleanArray.from_buffers(
            pyarrow.bool_(), len(arr), [None, bitmask]
        )
        mask = np.asarray(mask)
    else:
        mask = np.ones(len(arr), dtype=bool)
    return data, mask


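The mask returned by `pyarrow_array_to_numpy_and_mask` comes from Arrow's packed validity bitmap, where each value occupies one bit and a set bit means "valid". The following is a minimal pure-Python sketch of that unpacking step, not code from this PR: `unpack_validity_bitmap` is a hypothetical helper, and it assumes Arrow's least-significant-bit-first ordering. The function above slices the data buffer by `arr.offset` but reads the bitmap from bit 0; the sketch takes an explicit `offset` argument on the assumption that a sliced array would need the same shift applied to its validity bits.

```python
def unpack_validity_bitmap(bitmap, length, offset=0):
    """Expand a packed validity bitmap into a list of booleans.

    Hypothetical helper for illustration only. Bit ``i`` lives in byte
    ``i // 8`` at position ``i % 8`` (least-significant bit first), and a
    set bit marks a valid (non-missing) value.
    """
    if bitmap is None:
        # No bitmap buffer means there are no missing values at all.
        return [True] * length
    return [
        bool((bitmap[i // 8] >> (i % 8)) & 1)
        for i in range(offset, offset + length)
    ]


# 0b00001101 has bits 0, 2 and 3 set: values 0, 2, 3 are valid, value 1 is missing.
mask = unpack_validity_bitmap(bytes([0b00001101]), length=4)
print(mask)  # [True, False, True, True]
```

As in the function above, `False` in the resulting mask marks a missing value.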
if _pyarrow_version_ge_015:
    # the pyarrow extension types are only available for pyarrow 0.15+

    class ArrowPeriodType(pyarrow.ExtensionType):
        def __init__(self, freq):
            # attributes need to be set first before calling
            # super init (as that calls serialize)
            self._freq = freq
            pyarrow.ExtensionType.__init__(self, pyarrow.int64(), "pandas.period")

        @property
        def freq(self):
            return self._freq

        def __arrow_ext_serialize__(self):
            metadata = {"freq": self.freq}
            return json.dumps(metadata).encode()

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            metadata = json.loads(serialized.decode())
            return ArrowPeriodType(metadata["freq"])

        def __eq__(self, other):
            if isinstance(other, pyarrow.BaseExtensionType):
                return type(self) == type(other) and self.freq == other.freq
            else:
                return NotImplemented

        def __hash__(self):
            return hash((str(self), self.freq))

    # register the type with a dummy instance
    _period_type = ArrowPeriodType("D")
    pyarrow.register_extension_type(_period_type)

    class ArrowIntervalType(pyarrow.ExtensionType):
        def __init__(self, subtype, closed):
            # attributes need to be set first before calling
            # super init (as that calls serialize)
            assert closed in _VALID_CLOSED
            self._closed = closed
            if not isinstance(subtype, pyarrow.DataType):
                subtype = pyarrow.type_for_alias(str(subtype))
            self._subtype = subtype

            storage_type = pyarrow.struct([("left", subtype), ("right", subtype)])
            pyarrow.ExtensionType.__init__(self, storage_type, "pandas.interval")

        @property
        def subtype(self):
            return self._subtype

        @property
        def closed(self):
            return self._closed

        def __arrow_ext_serialize__(self):
            metadata = {"subtype": str(self.subtype), "closed": self.closed}
            return json.dumps(metadata).encode()

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            metadata = json.loads(serialized.decode())
            subtype = pyarrow.type_for_alias(metadata["subtype"])
            closed = metadata["closed"]
            return ArrowIntervalType(subtype, closed)

        def __eq__(self, other):
            if isinstance(other, pyarrow.BaseExtensionType):
                return (
                    type(self) == type(other)
                    and self.subtype == other.subtype
                    and self.closed == other.closed
                )
            else:
                return NotImplemented

        def __hash__(self):
            return hash((str(self), str(self.subtype), self.closed))

    # register the type with a dummy instance
    _interval_type = ArrowIntervalType(pyarrow.int64(), "left")
    pyarrow.register_extension_type(_interval_type)
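
Both extension types carry their parameters as a small JSON payload, and both define `__eq__` together with `__hash__` so that deserialized instances compare equal to the originals. The sketch below mirrors only that metadata logic in plain Python, without pyarrow. `IntervalTypeMetadata` and its method names are hypothetical stand-ins, not part of this PR, and the allowed `closed` values are assumed to match pandas' `_VALID_CLOSED`.

```python
import json


class IntervalTypeMetadata:
    """Hypothetical stand-in for the metadata handling of ArrowIntervalType.

    It mirrors only the JSON (de)serialization, equality and hashing logic
    and does not depend on pyarrow.
    """

    def __init__(self, subtype, closed):
        # Assumed to match the values in pandas' _VALID_CLOSED.
        if closed not in {"left", "right", "both", "neither"}:
            raise ValueError(f"invalid closed value: {closed!r}")
        self.subtype = str(subtype)
        self.closed = closed

    def serialize(self):
        # Same shape as the payload built in __arrow_ext_serialize__.
        metadata = {"subtype": self.subtype, "closed": self.closed}
        return json.dumps(metadata).encode()

    @classmethod
    def deserialize(cls, serialized):
        metadata = json.loads(serialized.decode())
        return cls(metadata["subtype"], metadata["closed"])

    def __eq__(self, other):
        if isinstance(other, IntervalTypeMetadata):
            return (self.subtype, self.closed) == (other.subtype, other.closed)
        return NotImplemented

    def __hash__(self):
        # Defining __eq__ disables the inherited hash, so restore it explicitly.
        return hash((self.subtype, self.closed))


original = IntervalTypeMetadata("int64", "left")
restored = IntervalTypeMetadata.deserialize(original.serialize())
print(restored == original)  # True
```

Hashing on the same fields that `__eq__` compares keeps equal instances usable interchangeably as dictionary keys, which appears to be the motivation for the "rename + make hashable" commit.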
should this be 0.15?
No, the pandas -> arrow conversion protocol was already included in 0.15, but the other direction (needed for a full roundtrip) only landed after 0.15. It was just decided that the next Arrow release will be 0.16 rather than 1.0, so I changed the text here accordingly.
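
The distinction drawn in this reply could be expressed as two separate version checks: one for the pandas -> arrow direction and one for the full roundtrip. This is a rough sketch under stated assumptions, not code from the PR: `parse_release`, `supports_pandas_to_arrow` and `supports_roundtrip` are hypothetical names, the 0.15 and 0.16 thresholds are taken from the reply above, and plain tuples stand in for `LooseVersion`. Only the numeric release part is compared, so a development build such as `0.15.1.dev100` is treated as 0.15.1 even if it already contains the roundtrip support.

```python
import re


def parse_release(version):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def supports_pandas_to_arrow(pyarrow_version):
    # Per the reply: the pandas -> arrow protocol was already in 0.15.
    return parse_release(pyarrow_version) >= (0, 15)


def supports_roundtrip(pyarrow_version):
    # Per the reply: arrow -> pandas landed after 0.15 (next release: 0.16).
    return parse_release(pyarrow_version) >= (0, 16)


for version in ("0.14.1", "0.15.0", "0.16.0"):
    print(version, supports_pandas_to_arrow(version), supports_roundtrip(version))
```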