[EHN] pandas.DataFrame.to_orc #44554


Merged (49 commits, Jun 14, 2022)
Changes from all commits (49)
- 9a7b29a [ENH] to_orc (Oct 3, 2021)
- d11026f pandas.DataFrame.to_orc (Oct 3, 2021)
- 0146ac3 Cleaning (Oct 3, 2021)
- 0571602 Fix style & edit comments & change min dependency version to 5.0.0 (chloeandmargaret, Nov 21, 2021)
- d970b58 Fix style & add to see also (chloeandmargaret, Nov 21, 2021)
- 8b12e9f Add ORC to documentation (chloeandmargaret, Nov 21, 2021)
- 65e6b7a Changes according to review (chloeandmargaret, Nov 22, 2021)
- 2114616 Fix problems mentioned in comment (chloeandmargaret, Nov 24, 2021)
- e4b40ef Linter compliance (chloeandmargaret, Nov 24, 2021)
- a7aa3e0 Address comments (chloeandmargaret, Nov 24, 2021)
- 1ab9b6c Add orc test (chloeandmargaret, Dec 2, 2021)
- 96969d5 Fixes from pre-commit [automated commit] (chloeandmargaret, Dec 3, 2021)
- 2a54b8c Fix issues according to comments (chloeandmargaret, Mar 20, 2022)
- 1caec9e Simplify the code base after raising Arrow version to 7.0.0 (chloeandmargaret, Mar 21, 2022)
- 6f0a538 Fix min arrow version in to_orc (chloeandmargaret, Mar 21, 2022)
- ae65214 Add to_orc test in line with other formats (chloeandmargaret, Mar 21, 2022)
- 045c411 Add BytesIO support & test (chloeandmargaret, Mar 22, 2022)
- c00ed0f Fix some docs issues (chloeandmargaret, Mar 22, 2022)
- fe275d7 Use keyword only arguments (chloeandmargaret, Mar 25, 2022)
- 9d3e0df Fix bug (chloeandmargaret, May 12, 2022)
- 971f31c Fix param issue (chloeandmargaret, May 29, 2022)
- 52b68a0 Doctest skipping due to minimal versions (chloeandmargaret, May 29, 2022)
- 76437ba Doctest skipping due to minimal versions (chloeandmargaret, May 29, 2022)
- c5d5852 Improve spacing in docstring & remove orc test in test_common that ha… (chloeandmargaret, May 29, 2022)
- b5cd022 Fix docstring syntax (chloeandmargaret, May 29, 2022)
- 7ad3df9 ORC is not text (chloeandmargaret, May 29, 2022)
- a73bb70 Fix BytesIO bug && do not require orc to be explicitly imported befor… (chloeandmargaret, May 29, 2022)
- 20aefe7 ORC writer does not work for categorical columns yet (chloeandmargaret, May 29, 2022)
- e7e81fe Appease mypy (chloeandmargaret, May 29, 2022)
- 6b659f7 Appease mypy (chloeandmargaret, May 29, 2022)
- 18e5429 Edit according to reviews (chloeandmargaret, May 30, 2022)
- 21cba6e Fix path bug in test_orc (chloeandmargaret, May 30, 2022)
- c7bf39f Fix testdata tuple bug in test_orc (chloeandmargaret, May 30, 2022)
- e43c6dd Fix docstrings for check compliance (chloeandmargaret, May 30, 2022)
- afa0a8a read_orc does not have engine as a param (chloeandmargaret, May 30, 2022)
- cd585e6 Fix sphinx warnings (chloeandmargaret, May 30, 2022)
- b509c3c Improve docs & rerun tests (chloeandmargaret, May 30, 2022)
- 1001002 Force retrigger (chloeandmargaret, May 30, 2022)
- 55cab6e Fix test_orc according to review (chloeandmargaret, Jun 7, 2022)
- 89283e0 Rename some variables and func (chloeandmargaret, Jun 7, 2022)
- 989468a Update pandas/core/frame.py (chloeandmargaret, Jun 7, 2022)
- a7fca36 Fix issues according to review (chloeandmargaret, Jun 12, 2022)
- 7fc338c Forced reruns (chloeandmargaret, Jun 12, 2022)
- 91d1556 Fix issues according to review (chloeandmargaret, Jun 13, 2022)
- a28c5a8 Reraise Pyarrow TypeError as NotImplementedError (chloeandmargaret, Jun 13, 2022)
- 162e5bb Fix bugs (chloeandmargaret, Jun 13, 2022)
- b230583 Fix expected error msg in orc tests (chloeandmargaret, Jun 13, 2022)
- e16edab Avoid deprecated functions (chloeandmargaret, Jun 13, 2022)
- e4770b8 Replace {} with None in arg (chloeandmargaret, Jun 13, 2022)
1 change: 1 addition & 0 deletions doc/source/reference/frame.rst
@@ -373,6 +373,7 @@ Serialization / IO / conversion

DataFrame.from_dict
DataFrame.from_records
DataFrame.to_orc
DataFrame.to_parquet
DataFrame.to_pickle
DataFrame.to_csv
1 change: 1 addition & 0 deletions doc/source/reference/io.rst
@@ -159,6 +159,7 @@ ORC
:toctree: api/

read_orc
DataFrame.to_orc

SAS
~~~
59 changes: 55 additions & 4 deletions doc/source/user_guide/io.rst
@@ -30,7 +30,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`;
binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`;:ref:`to_orc<io.orc>`
binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`;
@@ -5562,13 +5562,64 @@ ORC
.. versionadded:: 1.0.0

Similar to the :ref:`parquet <io.parquet>` format, the `ORC Format <https://orc.apache.org/>`__ is a binary columnar serialization
for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the
ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library.
for data frames. It is designed to make reading data frames efficient. pandas provides both the reader and the writer for the
ORC format, :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library.

.. warning::

* It is *highly recommended* to install pyarrow using conda due to some issues caused by pyarrow.
* :func:`~pandas.read_orc` is not supported on Windows yet; you can find valid environments in :ref:`install optional dependencies <install.warn_orc>`.
* :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0.
* :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc` are not supported on Windows yet; you can find valid environments in :ref:`install optional dependencies <install.warn_orc>`.
* For supported dtypes please refer to `supported ORC features in Arrow <https://arrow.apache.org/docs/cpp/orc.html#data-types>`__.
* Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.

.. ipython:: python

df = pd.DataFrame(
{
"a": list("abc"),
"b": list(range(1, 4)),
"c": np.arange(4.0, 7.0, dtype="float64"),
"d": [True, False, True],
"e": pd.date_range("20130101", periods=3),
}
)

df
df.dtypes

Write to an ORC file.

.. ipython:: python
:okwarning:

df.to_orc("example_pa.orc", engine="pyarrow")

Read from an ORC file.

.. ipython:: python
:okwarning:

result = pd.read_orc("example_pa.orc")

result.dtypes

Read only certain columns of an ORC file.

.. ipython:: python

result = pd.read_orc(
"example_pa.orc",
columns=["a", "b"],
)
result.dtypes


.. ipython:: python
:suppress:

os.remove("example_pa.orc")


.. _io.sql:

22 changes: 22 additions & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -100,6 +100,28 @@ as seen in the following example.
1 2021-01-02 08:00:00 4
2 2021-01-02 16:00:00 5

.. _whatsnew_150.enhancements.orc:

Writing to ORC files
^^^^^^^^^^^^^^^^^^^^

The new method :meth:`DataFrame.to_orc` allows writing to ORC files (:issue:`43864`).

This functionality depends on the `pyarrow <http://arrow.apache.org/docs/python/>`__ library. For more details, see :ref:`the IO docs on ORC <io.orc>`.

.. warning::

* It is *highly recommended* to install pyarrow using conda due to some issues caused by pyarrow.
* :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0.
* :func:`~pandas.DataFrame.to_orc` is not supported on Windows yet; you can find valid environments in :ref:`install optional dependencies <install.warn_orc>`.
* For supported dtypes please refer to `supported ORC features in Arrow <https://arrow.apache.org/docs/cpp/orc.html#data-types>`__.
* Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.

.. code-block:: python

df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
df.to_orc("./out.orc")

.. _whatsnew_150.enhancements.tar:

Reading directly from TAR archives
88 changes: 88 additions & 0 deletions pandas/core/frame.py
@@ -2858,6 +2858,7 @@ def to_parquet(
See Also
--------
read_parquet : Read a parquet file.
DataFrame.to_orc : Write an orc file.
DataFrame.to_csv : Write a csv file.
DataFrame.to_sql : Write to a sql table.
DataFrame.to_hdf : Write to hdf.
@@ -2901,6 +2902,93 @@ def to_parquet(
**kwargs,
)

def to_orc(
self,
path: FilePath | WriteBuffer[bytes] | None = None,
*,
engine: Literal["pyarrow"] = "pyarrow",
index: bool | None = None,
engine_kwargs: dict[str, Any] | None = None,
) -> bytes | None:
"""
Write a DataFrame to the ORC format.

.. versionadded:: 1.5.0

Parameters
----------
path : str, file-like object or None, default None
If a string, it will be used as Root Directory path
when writing a partitioned dataset. By file-like object,
we refer to objects with a write() method, such as a file handle
(e.g. via builtin open function). If path is None,
a bytes object is returned.
engine : str, default 'pyarrow'
ORC library to use. Pyarrow must be >= 7.0.0.
index : bool, optional
If ``True``, include the dataframe's index(es) in the file output.
If ``False``, they will not be written to the file.
If ``None``, similar to ``infer`` the dataframe's index(es)
will be saved. However, instead of being saved as values,
the RangeIndex will be stored as a range in the metadata so it
doesn't require much space and is faster. Other indexes will
be included as columns in the file output.
engine_kwargs : dict[str, Any] or None, default None
Additional keyword arguments passed to :func:`pyarrow.orc.write_table`.

Returns
-------
bytes if no path argument is provided else None

Raises
------
NotImplementedError
Dtype of one or more columns is category, unsigned integers, interval,
period or sparse.
ValueError
engine is not pyarrow.

See Also
--------
read_orc : Read an ORC file.
DataFrame.to_parquet : Write a parquet file.
DataFrame.to_csv : Write a csv file.
DataFrame.to_sql : Write to a sql table.
DataFrame.to_hdf : Write to hdf.

Notes
-----
* Before using this function you should read the :ref:`user guide about
ORC <io.orc>` and :ref:`install optional dependencies <install.warn_orc>`.
* This function requires the `pyarrow <https://arrow.apache.org/docs/python/>`_
library.
* For supported dtypes please refer to `supported ORC features in Arrow
<https://arrow.apache.org/docs/cpp/orc.html#data-types>`__.
* Currently timezones in datetime columns are not preserved when a
dataframe is converted into ORC files.

Examples
--------
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]})
>>> df.to_orc('df.orc') # doctest: +SKIP
>>> pd.read_orc('df.orc') # doctest: +SKIP
col1 col2
0 1 4
1 2 3

If you want to get a buffer to the ORC content you can write it to io.BytesIO:

>>> import io
>>> b = io.BytesIO(df.to_orc()) # doctest: +SKIP
>>> b.seek(0) # doctest: +SKIP
0
>>> content = b.read() # doctest: +SKIP
"""
from pandas.io.orc import to_orc

return to_orc(
self, path, engine=engine, index=index, engine_kwargs=engine_kwargs
)
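
The ``index=None`` default described in the docstring above amounts to checking whether the DataFrame's index carries a name: a bare ``RangeIndex`` is recorded in metadata rather than written as values. A sketch of that inference, where ``infer_write_index`` is a hypothetical helper, not a pandas API:

```python
import pandas as pd

# Hypothetical restatement of the index=None default: the index is written
# as a column only when it has a name; an unnamed (Range)Index is not.
def infer_write_index(df):
    return df.index.names[0] is not None

# A default RangeIndex has no name, so it is not written as a column.
assert infer_write_index(pd.DataFrame({"a": [1, 2]})) is False

# A named index would be included in the file output.
named = pd.DataFrame({"a": [1, 2]}, index=pd.Index([10, 20], name="id"))
assert infer_write_index(named) is True
```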

@Substitution(
header_type="bool",
header="Whether to print column labels, default True",
1 change: 1 addition & 0 deletions pandas/core/generic.py
@@ -2629,6 +2629,7 @@ def to_hdf(
See Also
--------
read_hdf : Read from HDF file.
DataFrame.to_orc : Write a DataFrame to the binary orc format.
DataFrame.to_parquet : Write a DataFrame to the binary parquet format.
DataFrame.to_sql : Write to a SQL table.
DataFrame.to_feather : Write out feather-format for DataFrames.
124 changes: 123 additions & 1 deletion pandas/io/orc.py
@@ -1,14 +1,28 @@
""" orc compat """
from __future__ import annotations

from typing import TYPE_CHECKING
import io
from types import ModuleType
from typing import (
TYPE_CHECKING,
Any,
Literal,
)

from pandas._typing import (
FilePath,
ReadBuffer,
WriteBuffer,
)
from pandas.compat._optional import import_optional_dependency

from pandas.core.dtypes.common import (
is_categorical_dtype,
is_interval_dtype,
is_period_dtype,
is_unsigned_integer_dtype,
)

from pandas.io.common import get_handle

if TYPE_CHECKING:
@@ -52,3 +66,111 @@ def read_orc(
with get_handle(path, "rb", is_text=False) as handles:
orc_file = orc.ORCFile(handles.handle)
return orc_file.read(columns=columns, **kwargs).to_pandas()


def to_orc(
df: DataFrame,
path: FilePath | WriteBuffer[bytes] | None = None,
*,
engine: Literal["pyarrow"] = "pyarrow",
index: bool | None = None,
engine_kwargs: dict[str, Any] | None = None,
) -> bytes | None:
"""
Write a DataFrame to the ORC format.

.. versionadded:: 1.5.0

Parameters
----------
df : DataFrame
The dataframe to be written to ORC. Raises NotImplementedError
if dtype of one or more columns is category, unsigned integers,
intervals, periods or sparse.
path : str, file-like object or None, default None
If a string, it will be used as Root Directory path
when writing a partitioned dataset. By file-like object,
we refer to objects with a write() method, such as a file handle
(e.g. via builtin open function). If path is None,
a bytes object is returned.
engine : str, default 'pyarrow'
ORC library to use. Pyarrow must be >= 7.0.0.
index : bool, optional
If ``True``, include the dataframe's index(es) in the file output. If
``False``, they will not be written to the file.
If ``None``, similar to ``infer`` the dataframe's index(es)
will be saved. However, instead of being saved as values,
the RangeIndex will be stored as a range in the metadata so it
doesn't require much space and is faster. Other indexes will
be included as columns in the file output.
engine_kwargs : dict[str, Any] or None, default None
Additional keyword arguments passed to :func:`pyarrow.orc.write_table`.

Returns
-------
bytes if no path argument is provided else None

Raises
------
NotImplementedError
Dtype of one or more columns is category, unsigned integers, interval,
period or sparse.
ValueError
engine is not pyarrow.

Notes
-----
* Before using this function you should read the
:ref:`user guide about ORC <io.orc>` and
:ref:`install optional dependencies <install.warn_orc>`.
* This function requires the `pyarrow <https://arrow.apache.org/docs/python/>`_
library.
* For supported dtypes please refer to `supported ORC features in Arrow
<https://arrow.apache.org/docs/cpp/orc.html#data-types>`__.
* Currently timezones in datetime columns are not preserved when a
dataframe is converted into ORC files.
"""
if index is None:
index = df.index.names[0] is not None
if engine_kwargs is None:
engine_kwargs = {}

# If unsupported dtypes are found raise NotImplementedError
# In Pyarrow 9.0.0 this check will no longer be needed
for dtype in df.dtypes:
if (
is_categorical_dtype(dtype)
or is_interval_dtype(dtype)
or is_period_dtype(dtype)
or is_unsigned_integer_dtype(dtype)
):
raise NotImplementedError(
"The dtype of one or more columns is not supported yet."
)

Review thread on this dtype check:

mroeschke (Member): Will pyarrow raise if these dtypes are passed? If so, can a pyarrow error be caught and reraised as a NotImplementedError so this can be more flexible to other potential dtypes not supported in the future?

iajoiner (Contributor, Author): I need to test these types individually. Not sure right now.

iajoiner (Contributor, Author, Jun 12, 2022): @mroeschke It segfaults for all instances but sparse. I need to catch them in Arrow 9.0.0. Meanwhile can we use the current dtype filter?

mroeschke (Member): Okay, this is fine then given:

  1. Could you use the type checking functions in pandas.core.dtypes.common instead? e.g. is_categorical_dtype(dtype)?
  2. Could you make a note that in pyarrow 9.0.0 this checking should not be needed?

iajoiner (Contributor, Author): Sure!

iajoiner (Contributor, Author, Jun 13, 2022): Done! Since for sparse dtypes we get a TypeError from Arrow when converting the dataframe to a pyarrow table, I plan to use TypeError for the other 4 in pyarrow 9.0.0 as well. The try-except block has been added in addition to the type checks for the 4 that segfault right now, with the note.
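
The guard above can be restated as a standalone predicate. The following sketch uses dtype-object checks in place of the ``is_*_dtype`` helpers (equivalent in spirit, and avoiding their deprecation in newer pandas); ``has_unsupported_orc_dtype`` is a hypothetical name, not part of pandas:

```python
import pandas as pd

# Sketch of the dtype guard: categorical, interval and period columns are
# detected via their extension dtype classes; unsigned integers via the
# numpy dtype kind code "u".
def has_unsupported_orc_dtype(df):
    return any(
        isinstance(d, (pd.CategoricalDtype, pd.IntervalDtype, pd.PeriodDtype))
        or getattr(d, "kind", None) == "u"
        for d in df.dtypes
    )

assert has_unsupported_orc_dtype(pd.DataFrame({"c": pd.Categorical(["x", "y"])}))
assert not has_unsupported_orc_dtype(pd.DataFrame({"n": [1, 2]}))
```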

if engine != "pyarrow":
raise ValueError("engine must be 'pyarrow'")
engine = import_optional_dependency(engine, min_version="7.0.0")
orc = import_optional_dependency("pyarrow.orc")

was_none = path is None
if was_none:
path = io.BytesIO()
assert path is not None # For mypy
with get_handle(path, "wb", is_text=False) as handles:
assert isinstance(engine, ModuleType) # For mypy
try:
orc.write_table(
engine.Table.from_pandas(df, preserve_index=index),
handles.handle,
**engine_kwargs,
)
except TypeError as e:
raise NotImplementedError(
"The dtype of one or more columns is not supported yet."
) from e

if was_none:
assert isinstance(path, io.BytesIO) # For mypy
return path.getvalue()
return None
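
The ``path is None`` handling at the end of ``to_orc`` follows a common path-or-buffer pattern: write into an in-memory ``BytesIO`` when no path is given, then return its contents; otherwise write to the caller's target and return ``None``. A pure-Python sketch of the same pattern, where ``write_payload`` is a hypothetical stand-in for ``orc.write_table``:

```python
import io

# Stand-in for orc.write_table: writes bytes to an open binary handle.
def write_payload(handle, payload):
    handle.write(payload)

def to_buffer_or_path(path=None, payload=b"ORC"):
    was_none = path is None
    if was_none:
        # No target given: write into an in-memory buffer instead.
        path = io.BytesIO()
    write_payload(path, payload)
    if was_none:
        # Return the buffer contents, mirroring to_orc's bytes return.
        return path.getvalue()
    return None

assert to_buffer_or_path() == b"ORC"
assert to_buffer_or_path(io.BytesIO()) is None
```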