Commit 15902bd

Authored by chloeandmargaret, NickFillot, and mroeschke

[ENH] pandas.DataFrame.to_orc (#44554)

* [ENH] to_orc pandas.io.orc.to_orc method definition
* pandas.DataFrame.to_orc set to_orc to pandas.DataFrame
* Cleaning
* Fix style & edit comments & change min dependency version to 5.0.0
* Fix style & add to see also
* Add ORC to documentation
* Changes according to review
* Fix problems mentioned in comment
* Linter compliance
* Address comments
* Add orc test
* Fixes from pre-commit [automated commit]
* Fix issues according to comments
* Simplify the code base after raising Arrow version to 7.0.0
* Fix min arrow version in to_orc
* Add to_orc test in line with other formats
* Add BytesIO support & test
* Fix some docs issues
* Use keyword only arguments
* Fix bug
* Fix param issue
* Doctest skipping due to minimal versions
* Doctest skipping due to minimal versions
* Improve spacing in docstring & remove orc test in test_common that has unusual pyarrow version requirement and is with a lot of other tests
* Fix docstring syntax
* ORC is not text
* Fix BytesIO bug && do not require orc to be explicitly imported before usage && all pytest tests have passed
* ORC writer does not work for categorical columns yet
* Appease mypy
* Appease mypy
* Edit according to reviews
* Fix path bug in test_orc
* Fix testdata tuple bug in test_orc
* Fix docstrings for check compliance
* read_orc does not have engine as a param
* Fix sphinx warnings
* Improve docs & rerun tests
* Force retrigger
* Fix test_orc according to review
* Rename some variables and func
* Update pandas/core/frame.py (Co-authored-by: Matthew Roeschke <[email protected]>)
* Fix issues according to review
* Forced reruns
* Fix issues according to review
* Reraise Pyarrow TypeError as NotImplementedError
* Fix bugs
* Fix expected error msg in orc tests
* Avoid deprecated functions
* Replace {} with None in arg

Co-authored-by: NickFillot <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>

1 parent 830130a, commit 15902bd

File tree

8 files changed (+372, -5 lines)


doc/source/reference/frame.rst (+1)

@@ -373,6 +373,7 @@ Serialization / IO / conversion
     DataFrame.from_dict
     DataFrame.from_records
+    DataFrame.to_orc
     DataFrame.to_parquet
     DataFrame.to_pickle
     DataFrame.to_csv

doc/source/reference/io.rst (+1)

@@ -159,6 +159,7 @@ ORC
    :toctree: api/

    read_orc
+   DataFrame.to_orc

 SAS
 ~~~

doc/source/user_guide/io.rst (+55, -4)

@@ -30,7 +30,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
     binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
     binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
     binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
-    binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`;
+    binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`;:ref:`to_orc<io.orc>`
     binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
     binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
     binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`;

@@ -5562,13 +5562,64 @@ ORC
 .. versionadded:: 1.0.0

 Similar to the :ref:`parquet <io.parquet>` format, the `ORC Format <https://orc.apache.org/>`__ is a binary columnar serialization
-for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the
-ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library.
+for data frames. It is designed to make reading data frames efficient. pandas provides both the reader and the writer for the
+ORC format, :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library.

 .. warning::

    * It is *highly recommended* to install pyarrow using conda due to some issues occurred by pyarrow.
-   * :func:`~pandas.read_orc` is not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies <install.warn_orc>`.
+   * :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0.
+   * :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc` are not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies <install.warn_orc>`.
+   * For supported dtypes please refer to `supported ORC features in Arrow <https://arrow.apache.org/docs/cpp/orc.html#data-types>`__.
+   * Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
+
+.. ipython:: python
+
+    df = pd.DataFrame(
+        {
+            "a": list("abc"),
+            "b": list(range(1, 4)),
+            "c": np.arange(4.0, 7.0, dtype="float64"),
+            "d": [True, False, True],
+            "e": pd.date_range("20130101", periods=3),
+        }
+    )
+
+    df
+    df.dtypes
+
+Write to an orc file.
+
+.. ipython:: python
+    :okwarning:
+
+    df.to_orc("example_pa.orc", engine="pyarrow")
+
+Read from an orc file.
+
+.. ipython:: python
+    :okwarning:
+
+    result = pd.read_orc("example_pa.orc")
+
+    result.dtypes
+
+Read only certain columns of an orc file.
+
+.. ipython:: python
+
+    result = pd.read_orc(
+        "example_pa.orc",
+        columns=["a", "b"],
+    )
+    result.dtypes
+
+
+.. ipython:: python
+    :suppress:
+
+    os.remove("example_pa.orc")

 .. _io.sql:
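The user-guide example above reads only columns ``a`` and ``b`` from the file, which is exactly what a columnar layout makes cheap. As an illustration of why, here is a minimal sketch of a toy columnar container in pure Python (this is *not* the real ORC file layout; `write_columnar` and `read_columnar` are hypothetical helpers): each column is serialized separately and a footer records byte offsets, so a reader can decode just the columns it was asked for.

```python
import io
import json
import struct

def write_columnar(columns: dict) -> bytes:
    """Serialize a dict of columns; a footer maps column name -> (offset, length)."""
    buf = io.BytesIO()
    offsets = {}
    for name, values in columns.items():
        payload = json.dumps(values).encode()
        offsets[name] = (buf.tell(), len(payload))
        buf.write(payload)
    footer = json.dumps(offsets).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # footer length stored at the end
    return buf.getvalue()

def read_columnar(data: bytes, columns=None) -> dict:
    """Decode only the requested columns, using the footer to locate them."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    offsets = json.loads(data[-4 - footer_len : -4])
    if columns is None:
        columns = list(offsets)
    return {
        name: json.loads(data[offsets[name][0] : offsets[name][0] + offsets[name][1]])
        for name in columns
    }
```

With this layout, ``read_columnar(blob, ["a"])`` never touches the bytes of column ``b``, which is the same property that makes ``pd.read_orc(..., columns=["a", "b"])`` efficient on large files.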

doc/source/whatsnew/v1.5.0.rst (+22)

@@ -100,6 +100,28 @@ as seen in the following example.
 1 2021-01-02 08:00:00    4
 2 2021-01-02 16:00:00    5

+.. _whatsnew_150.enhancements.orc:
+
+Writing to ORC files
+^^^^^^^^^^^^^^^^^^^^
+
+The new method :meth:`DataFrame.to_orc` allows writing to ORC files (:issue:`43864`).
+
+This functionality depends the `pyarrow <http://arrow.apache.org/docs/python/>`__ library. For more details, see :ref:`the IO docs on ORC <io.orc>`.
+
+.. warning::
+
+   * It is *highly recommended* to install pyarrow using conda due to some issues occurred by pyarrow.
+   * :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0.
+   * :func:`~pandas.DataFrame.to_orc` is not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies <install.warn_orc>`.
+   * For supported dtypes please refer to `supported ORC features in Arrow <https://arrow.apache.org/docs/cpp/orc.html#data-types>`__.
+   * Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
+
+.. code-block:: python
+
+    df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
+    df.to_orc("./out.orc")
+
 .. _whatsnew_150.enhancements.tar:

 Reading directly from TAR archives
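The whatsnew warning above lists the dtypes the new writer rejects (category, unsigned integers, interval, period, sparse). A minimal sketch of such a validation gate, working on plain dtype-name strings (the real check in this commit uses pandas' `is_categorical_dtype` and friends from `pandas.core.dtypes.common`; `check_orc_dtypes` here is a hypothetical stand-in):

```python
# Dtype names the toy gate rejects; prefixes chosen to match common
# pandas dtype string forms such as "uint8", "period[D]", "Sparse[float64]".
UNSUPPORTED_PREFIXES = ("category", "uint", "interval", "period", "Sparse")

def check_orc_dtypes(dtype_names):
    """Raise NotImplementedError if any dtype name looks unsupported."""
    for name in dtype_names:
        if name.startswith(UNSUPPORTED_PREFIXES):
            raise NotImplementedError(
                "The dtype of one or more columns is not supported yet."
            )

# Supported dtypes pass silently.
check_orc_dtypes(["int64", "float64", "bool", "datetime64[ns]"])
```

Raising before handing the frame to the engine gives the caller one consistent error message instead of whatever the underlying library would produce.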

pandas/core/frame.py (+88)

@@ -2854,6 +2854,7 @@ def to_parquet(
         See Also
         --------
         read_parquet : Read a parquet file.
+        DataFrame.to_orc : Write an orc file.
         DataFrame.to_csv : Write a csv file.
         DataFrame.to_sql : Write to a sql table.
         DataFrame.to_hdf : Write to hdf.

@@ -2897,6 +2898,93 @@ def to_parquet(
             **kwargs,
         )

+    def to_orc(
+        self,
+        path: FilePath | WriteBuffer[bytes] | None = None,
+        *,
+        engine: Literal["pyarrow"] = "pyarrow",
+        index: bool | None = None,
+        engine_kwargs: dict[str, Any] | None = None,
+    ) -> bytes | None:
+        """
+        Write a DataFrame to the ORC format.
+
+        .. versionadded:: 1.5.0
+
+        Parameters
+        ----------
+        path : str, file-like object or None, default None
+            If a string, it will be used as Root Directory path
+            when writing a partitioned dataset. By file-like object,
+            we refer to objects with a write() method, such as a file handle
+            (e.g. via builtin open function). If path is None,
+            a bytes object is returned.
+        engine : str, default 'pyarrow'
+            ORC library to use. Pyarrow must be >= 7.0.0.
+        index : bool, optional
+            If ``True``, include the dataframe's index(es) in the file output.
+            If ``False``, they will not be written to the file.
+            If ``None``, similar to ``infer`` the dataframe's index(es)
+            will be saved. However, instead of being saved as values,
+            the RangeIndex will be stored as a range in the metadata so it
+            doesn't require much space and is faster. Other indexes will
+            be included as columns in the file output.
+        engine_kwargs : dict[str, Any] or None, default None
+            Additional keyword arguments passed to :func:`pyarrow.orc.write_table`.
+
+        Returns
+        -------
+        bytes if no path argument is provided else None
+
+        Raises
+        ------
+        NotImplementedError
+            Dtype of one or more columns is category, unsigned integers, interval,
+            period or sparse.
+        ValueError
+            engine is not pyarrow.
+
+        See Also
+        --------
+        read_orc : Read a ORC file.
+        DataFrame.to_parquet : Write a parquet file.
+        DataFrame.to_csv : Write a csv file.
+        DataFrame.to_sql : Write to a sql table.
+        DataFrame.to_hdf : Write to hdf.
+
+        Notes
+        -----
+        * Before using this function you should read the :ref:`user guide about
+          ORC <io.orc>` and :ref:`install optional dependencies <install.warn_orc>`.
+        * This function requires `pyarrow <https://arrow.apache.org/docs/python/>`_
+          library.
+        * For supported dtypes please refer to `supported ORC features in Arrow
+          <https://arrow.apache.org/docs/cpp/orc.html#data-types>`__.
+        * Currently timezones in datetime columns are not preserved when a
+          dataframe is converted into ORC files.
+
+        Examples
+        --------
+        >>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]})
+        >>> df.to_orc('df.orc')  # doctest: +SKIP
+        >>> pd.read_orc('df.orc')  # doctest: +SKIP
+           col1  col2
+        0     1     4
+        1     2     3
+
+        If you want to get a buffer to the orc content you can write it to io.BytesIO
+
+        >>> import io
+        >>> b = io.BytesIO(df.to_orc())  # doctest: +SKIP
+        >>> b.seek(0)  # doctest: +SKIP
+        0
+        >>> content = b.read()  # doctest: +SKIP
+        """
+        from pandas.io.orc import to_orc
+
+        return to_orc(
+            self, path, engine=engine, index=index, engine_kwargs=engine_kwargs
+        )
+
     @Substitution(
         header_type="bool",
         header="Whether to print column labels, default True",
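The new method's signature uses a bare `*`, so `engine`, `index`, and `engine_kwargs` are keyword-only (one of the commit's review items was "Use keyword only arguments"). A minimal sketch of that signature style, with `to_orc_like` as a hypothetical stand-in for the real method:

```python
def to_orc_like(path=None, *, engine="pyarrow", index=None, engine_kwargs=None):
    """Everything after the bare `*` must be passed by keyword, so call
    sites stay readable and new options can be added without breaking
    existing positional calls."""
    if engine != "pyarrow":
        raise ValueError("engine must be 'pyarrow'")
    return {
        "path": path,
        "engine": engine,
        "index": index,
        "engine_kwargs": engine_kwargs or {},
    }

# Keyword call is fine; a positional second argument raises TypeError.
to_orc_like("out.orc", index=True)
```

Passing `engine` positionally (`to_orc_like("out.orc", "pyarrow")`) fails with a TypeError at the call site rather than silently binding the wrong parameter, which is the point of the design choice.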

pandas/core/generic.py (+1)

@@ -2630,6 +2630,7 @@ def to_hdf(
         See Also
         --------
         read_hdf : Read from HDF file.
+        DataFrame.to_orc : Write a DataFrame to the binary orc format.
         DataFrame.to_parquet : Write a DataFrame to the binary parquet format.
         DataFrame.to_sql : Write to a SQL table.
         DataFrame.to_feather : Write out feather-format for DataFrames.

pandas/io/orc.py (+123, -1)

@@ -1,14 +1,28 @@
 """ orc compat """
 from __future__ import annotations

-from typing import TYPE_CHECKING
+import io
+from types import ModuleType
+from typing import (
+    TYPE_CHECKING,
+    Any,
+    Literal,
+)

 from pandas._typing import (
     FilePath,
     ReadBuffer,
+    WriteBuffer,
 )
 from pandas.compat._optional import import_optional_dependency

+from pandas.core.dtypes.common import (
+    is_categorical_dtype,
+    is_interval_dtype,
+    is_period_dtype,
+    is_unsigned_integer_dtype,
+)
+
 from pandas.io.common import get_handle

 if TYPE_CHECKING:

@@ -52,3 +66,111 @@ def read_orc(
     with get_handle(path, "rb", is_text=False) as handles:
         orc_file = orc.ORCFile(handles.handle)
         return orc_file.read(columns=columns, **kwargs).to_pandas()
+
+
+def to_orc(
+    df: DataFrame,
+    path: FilePath | WriteBuffer[bytes] | None = None,
+    *,
+    engine: Literal["pyarrow"] = "pyarrow",
+    index: bool | None = None,
+    engine_kwargs: dict[str, Any] | None = None,
+) -> bytes | None:
+    """
+    Write a DataFrame to the ORC format.
+
+    .. versionadded:: 1.5.0
+
+    Parameters
+    ----------
+    df : DataFrame
+        The dataframe to be written to ORC. Raises NotImplementedError
+        if dtype of one or more columns is category, unsigned integers,
+        intervals, periods or sparse.
+    path : str, file-like object or None, default None
+        If a string, it will be used as Root Directory path
+        when writing a partitioned dataset. By file-like object,
+        we refer to objects with a write() method, such as a file handle
+        (e.g. via builtin open function). If path is None,
+        a bytes object is returned.
+    engine : str, default 'pyarrow'
+        ORC library to use. Pyarrow must be >= 7.0.0.
+    index : bool, optional
+        If ``True``, include the dataframe's index(es) in the file output. If
+        ``False``, they will not be written to the file.
+        If ``None``, similar to ``infer`` the dataframe's index(es)
+        will be saved. However, instead of being saved as values,
+        the RangeIndex will be stored as a range in the metadata so it
+        doesn't require much space and is faster. Other indexes will
+        be included as columns in the file output.
+    engine_kwargs : dict[str, Any] or None, default None
+        Additional keyword arguments passed to :func:`pyarrow.orc.write_table`.
+
+    Returns
+    -------
+    bytes if no path argument is provided else None
+
+    Raises
+    ------
+    NotImplementedError
+        Dtype of one or more columns is category, unsigned integers, interval,
+        period or sparse.
+    ValueError
+        engine is not pyarrow.
+
+    Notes
+    -----
+    * Before using this function you should read the
+      :ref:`user guide about ORC <io.orc>` and
+      :ref:`install optional dependencies <install.warn_orc>`.
+    * This function requires `pyarrow <https://arrow.apache.org/docs/python/>`_
+      library.
+    * For supported dtypes please refer to `supported ORC features in Arrow
+      <https://arrow.apache.org/docs/cpp/orc.html#data-types>`__.
+    * Currently timezones in datetime columns are not preserved when a
+      dataframe is converted into ORC files.
+    """
+    if index is None:
+        index = df.index.names[0] is not None
+    if engine_kwargs is None:
+        engine_kwargs = {}
+
+    # If unsupported dtypes are found raise NotImplementedError
+    # In Pyarrow 9.0.0 this check will no longer be needed
+    for dtype in df.dtypes:
+        if (
+            is_categorical_dtype(dtype)
+            or is_interval_dtype(dtype)
+            or is_period_dtype(dtype)
+            or is_unsigned_integer_dtype(dtype)
+        ):
+            raise NotImplementedError(
+                "The dtype of one or more columns is not supported yet."
+            )
+
+    if engine != "pyarrow":
+        raise ValueError("engine must be 'pyarrow'")
+    engine = import_optional_dependency(engine, min_version="7.0.0")
+    orc = import_optional_dependency("pyarrow.orc")
+
+    was_none = path is None
+    if was_none:
+        path = io.BytesIO()
+    assert path is not None  # For mypy
+    with get_handle(path, "wb", is_text=False) as handles:
+        assert isinstance(engine, ModuleType)  # For mypy
+        try:
+            orc.write_table(
+                engine.Table.from_pandas(df, preserve_index=index),
+                handles.handle,
+                **engine_kwargs,
+            )
+        except TypeError as e:
+            raise NotImplementedError(
+                "The dtype of one or more columns is not supported yet."
+            ) from e
+
+    if was_none:
+        assert isinstance(path, io.BytesIO)  # For mypy
+        return path.getvalue()
+    return None
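The heart of `pandas/io/orc.py` above is the `was_none` / `BytesIO` dance: when no `path` is given, the function writes into an in-memory buffer and returns its bytes, otherwise it writes to the destination and returns `None`. A self-contained sketch of that pattern, with `write_payload` as a hypothetical helper standing in for the pyarrow-backed writer:

```python
import io

def write_payload(payload: bytes, path=None):
    """Write payload to `path`; if path is None, return the bytes instead,
    mirroring to_orc's was_none / BytesIO pattern."""
    was_none = path is None
    if was_none:
        path = io.BytesIO()
    if hasattr(path, "write"):
        # file-like object: write to it but do not close it (caller owns it)
        handle, close = path, False
    else:
        # string path: open, write, close
        handle, close = open(path, "wb"), True
    try:
        handle.write(payload)
    finally:
        if close:
            handle.close()
    if was_none:
        return path.getvalue()  # buffer was ours, so hand back its contents
    return None
```

The same shape appears in the real function: `get_handle` abstracts the open/close bookkeeping, and the trailing `path.getvalue()` is what makes `io.BytesIO(df.to_orc())` in the docstring example work.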

0 commit comments
