[EHN] pandas.DataFrame.to_orc #44554
Changes from 39 commits
@@ -159,6 +159,7 @@ ORC
    :toctree: api/

     read_orc
+    DataFrame.to_orc

 SAS
 ~~~
@@ -2858,6 +2858,7 @@ def to_parquet(
         See Also
         --------
         read_parquet : Read a parquet file.
+        DataFrame.to_orc : Write an orc file.
         DataFrame.to_csv : Write a csv file.
         DataFrame.to_sql : Write to a sql table.
         DataFrame.to_hdf : Write to hdf.

@@ -2901,6 +2902,93 @@ def to_parquet(
             **kwargs,
         )

+    def to_orc(
+        self,
+        path: FilePath | WriteBuffer[bytes] | None = None,
+        *,
+        engine: Literal["pyarrow"] = "pyarrow",
+        index: bool | None = None,
+        **kwargs,
+    ) -> bytes | None:
+        """
+        Write a DataFrame to the ORC format.
+
+        .. versionadded:: 1.5.0
+
+        Parameters
+        ----------
+        path : str, file-like object or None, default None
+            If a string, it will be used as Root Directory path
+            when writing a partitioned dataset. By file-like object,
+            we refer to objects with a write() method, such as a file handle
+            (e.g. via builtin open function). If path is None,
+            a bytes object is returned.
+        engine : {'pyarrow'}, default 'pyarrow'
+            ORC library to use, or the library itself; checked against the
+            'pyarrow' name, with version >= 7.0.0 required. Raises
+            ValueError if it is anything but 'pyarrow'.
+        index : bool, optional
+            If ``True``, include the dataframe's index(es) in the file output.
+            If ``False``, they will not be written to the file.
+            If ``None``, similar to ``infer``, the dataframe's index(es)
+            will be saved. However, instead of being saved as values,
+            a RangeIndex will be stored as a range in the metadata so it
+            doesn't require much space and is faster. Other indexes will
+            be included as columns in the file output.
+        **kwargs
+            Additional keyword arguments passed to the engine.
+
+        Returns
+        -------
+        bytes if no path argument is provided else None
+
+        Raises
+        ------
+        NotImplementedError
+            Dtype of one or more columns is category, unsigned integers,
+            interval, period or sparse.
+        ValueError
+            engine is not pyarrow.
+
+        See Also
+        --------
+        read_orc : Read an ORC file.
+        DataFrame.to_parquet : Write a parquet file.
+        DataFrame.to_csv : Write a csv file.
+        DataFrame.to_sql : Write to a sql table.
+        DataFrame.to_hdf : Write to hdf.
+
+        Notes
+        -----
+        * Before using this function you should read the :ref:`user guide about
+          ORC <io.orc>` and :ref:`install optional dependencies <install.warn_orc>`.
+        * This function requires the `pyarrow <https://arrow.apache.org/docs/python/>`_
+          library.
+        * Category, unsigned integers, interval, period and sparse Dtypes
+          are not supported yet.
+        * Currently timezones in datetime columns are not preserved when a
+          dataframe is converted into ORC files.
+
+        Examples
+        --------
+        >>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
+        >>> df.to_orc('df.orc')  # doctest: +SKIP
+        >>> pd.read_orc('df.orc')  # doctest: +SKIP
+           col1  col2
+        0     1     3
+        1     2     4
+
+        If you want to get a buffer to the orc content you can write it to io.BytesIO:
+
+        >>> import io
+        >>> b = io.BytesIO(df.to_orc())  # doctest: +SKIP
+        >>> b.seek(0)  # doctest: +SKIP
+        0
+        >>> content = b.read()  # doctest: +SKIP
+        """
+        from pandas.io.orc import to_orc
+
+        return to_orc(self, path, engine=engine, index=index, **kwargs)
+
     @Substitution(
         header_type="bool",
         header="Whether to print column labels, default True",

Inline review (on ``**kwargs``):

mroeschke: Could you name this ...? Also, is there documentation you can link from pyarrow on what other engine keyword arguments can be accepted?

iajoiner: @mroeschke You mean just like the excel methods but without having to support the legacy ... I've followed ...
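The path-or-bytes convention the docstring describes (return ``bytes`` when ``path`` is None, otherwise write out and return None) can be sketched on its own. The helper below is a hypothetical stand-in for illustration, not pandas code; the ``data`` argument stands in for already-serialized ORC bytes:

```python
import io


def write_or_return_bytes(data: bytes, path=None):
    # Mirrors to_orc's convention:
    #   no path            -> buffer in memory and return the bytes
    #   a str path         -> write a file and return None
    #   a file-like object -> write to it and return None
    was_none = path is None
    if was_none:
        path = io.BytesIO()
    if isinstance(path, str):
        with open(path, "wb") as fh:
            fh.write(data)
    else:
        path.write(data)
    if was_none:
        return path.getvalue()
    return None


print(write_or_return_bytes(b"orc-bytes"))  # b'orc-bytes'
```

Passing your own ``io.BytesIO`` returns None and leaves the bytes in the buffer, which matches the ``b.seek(0)`` round-trip shown in the docstring example.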
@@ -1,11 +1,17 @@
 """ orc compat """
 from __future__ import annotations

-from typing import TYPE_CHECKING
+import io
+from types import ModuleType
+from typing import (
+    TYPE_CHECKING,
+    Literal,
+)

 from pandas._typing import (
     FilePath,
     ReadBuffer,
+    WriteBuffer,
 )
 from pandas.compat._optional import import_optional_dependency

@@ -52,3 +58,106 @@ def read_orc(
     with get_handle(path, "rb", is_text=False) as handles:
         orc_file = orc.ORCFile(handles.handle)
         return orc_file.read(columns=columns, **kwargs).to_pandas()
+
+
+def to_orc(
+    df: DataFrame,
+    path: FilePath | WriteBuffer[bytes] | None = None,
+    *,
+    engine: Literal["pyarrow"] = "pyarrow",
+    index: bool | None = None,
+    **kwargs,
+) -> bytes | None:
+    """
+    Write a DataFrame to the ORC format.
+
+    .. versionadded:: 1.5.0
+
+    Parameters
+    ----------
+    df : DataFrame
+        The dataframe to be written to ORC. Raises NotImplementedError
+        if the dtype of one or more columns is category, unsigned integers,
+        intervals, periods or sparse.
+    path : str, file-like object or None, default None
+        If a string, it will be used as Root Directory path
+        when writing a partitioned dataset. By file-like object,
+        we refer to objects with a write() method, such as a file handle
+        (e.g. via builtin open function). If path is None,
+        a bytes object is returned.
+    engine : {'pyarrow'}, default 'pyarrow'
+        ORC library to use, or the library itself; checked against the
+        'pyarrow' name, with version >= 7.0.0 required. Raises ValueError
+        if it is anything but 'pyarrow'.
+    index : bool, optional
+        If ``True``, include the dataframe's index(es) in the file output.
+        If ``False``, they will not be written to the file.
+        If ``None``, similar to ``infer``, the dataframe's index(es)
+        will be saved. However, instead of being saved as values,
+        a RangeIndex will be stored as a range in the metadata so it
+        doesn't require much space and is faster. Other indexes will
+        be included as columns in the file output.
+    **kwargs
+        Additional keyword arguments passed to the engine.
+
+    Returns
+    -------
+    bytes if no path argument is provided else None
+
+    Raises
+    ------
+    NotImplementedError
+        Dtype of one or more columns is category, unsigned integers,
+        interval, period or sparse.
+    ValueError
+        engine is not pyarrow.
+
+    Notes
+    -----
+    * Before using this function you should read the
+      :ref:`user guide about ORC <io.orc>` and
+      :ref:`install optional dependencies <install.warn_orc>`.
+    * This function requires the `pyarrow <https://arrow.apache.org/docs/python/>`_
+      library.
+    * Category, unsigned integers, interval, period and sparse Dtypes
+      are not supported yet.
+    * Currently timezones in datetime columns are not preserved when a
+      dataframe is converted into ORC files.
+    """
+    if index is None:
+        index = df.index.names[0] is not None
+
+    # If unsupported dtypes are found raise NotImplementedError
+    for dtype in df.dtypes:
+        dtype_str = dtype.__str__().lower()
+        if (
+            "category" in dtype_str
+            or "interval" in dtype_str
+            or "sparse" in dtype_str
+            or "period" in dtype_str
+            or "uint" in dtype_str
+        ):
+            raise NotImplementedError(
+                "The dtype of one or more columns is unsigned integers, "
+                "intervals, periods, sparse or categorical which is not "
+                "supported yet."
+            )
+
+    if engine != "pyarrow":
+        raise ValueError("engine must be 'pyarrow'")
+    engine = import_optional_dependency(engine, min_version="7.0.0")
+    orc = import_optional_dependency("pyarrow.orc")
+
+    was_none = path is None
+    if was_none:
+        path = io.BytesIO()
+    assert path is not None  # For mypy
+    with get_handle(path, "wb", is_text=False) as handles:
+        assert isinstance(engine, ModuleType)  # For mypy
+        orc.write_table(
+            engine.Table.from_pandas(df, preserve_index=index), handles.handle, **kwargs
+        )
+
+    if was_none:
+        assert isinstance(path, io.BytesIO)  # For mypy
+        return path.getvalue()
+    return None

Inline review (on the dtype check):

mroeschke: Will pyarrow raise if these dtypes are passed? If so, can a pyarrow error be caught and reraised as a ``NotImplementedError``?

iajoiner: I need to test these types individually. Not sure right now. ... @mroeschke It seg faults out for all instances but sparse. I need to catch them in Arrow 9.0.0. Meanwhile can we use the current dtype filter?

mroeschke: Okay, this is fine then.
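The dtype gate in ``to_orc`` above can be exercised in isolation. This sketch reimplements the same substring check outside pandas' internals (the helper name is ours, not from the PR):

```python
import pandas as pd

# Dtype families the diff's check rejects before handing off to pyarrow.
_UNSUPPORTED = ("category", "interval", "sparse", "period", "uint")


def has_unsupported_orc_dtype(df: pd.DataFrame) -> bool:
    # Same substring test as in to_orc: lowercase each column dtype's
    # string form and look for an unsupported family name.
    return any(
        any(marker in str(dtype).lower() for marker in _UNSUPPORTED)
        for dtype in df.dtypes
    )


ok = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
bad = pd.DataFrame({"c": pd.Categorical(["x", "y"])})
print(has_unsupported_orc_dtype(ok))   # False
print(has_unsupported_orc_dtype(bad))  # True
```

Note that the substring match is deliberately broad: ``"uint"`` catches ``uint8`` through ``uint64``, and ``"sparse"`` catches any ``Sparse[...]`` dtype.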
Review thread (on the Notes section listing unsupported dtypes):

mroeschke: Is this documented in pyarrow? It would be great to reference the pyarrow documentation instead of listing like this, since these types of notes tend to get stale as libraries advance.

iajoiner: @mroeschke The Arrow version of the type restrictions is documented here: https://arrow.apache.org/docs/cpp/orc.html
Using the usual pyarrow-to-pandas correspondence, people may be able to deduce what dtypes are not allowed. Since it is not straightforward, maybe we should keep both the pandas doc and a link to the Arrow one?

mroeschke: I would still slightly prefer to link to https://arrow.apache.org/docs/cpp/orc.html#data-types

iajoiner: Sure! ... Done!
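One more detail worth calling out from the ``to_orc`` implementation: when ``index=None``, the index is preserved only if it is named (``df.index.names[0] is not None``). That inference rule can be checked standalone:

```python
import pandas as pd


def infer_preserve_index(df: pd.DataFrame) -> bool:
    # The diff's rule for index=None: keep the index in the ORC output
    # only when it carries a name; a default unnamed RangeIndex is not
    # preserved as a column.
    return df.index.names[0] is not None


unnamed = pd.DataFrame({"a": [1, 2]})
named = unnamed.rename_axis("row_id")
print(infer_preserve_index(unnamed))  # False
print(infer_preserve_index(named))    # True
```

The result is then passed straight through as ``preserve_index`` to ``pyarrow.Table.from_pandas``, so callers who want an unnamed index written out must pass ``index=True`` explicitly.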