[EHN] pandas.DataFrame.to_orc #44554

iajoiner · 2021-11-21T08:29:36Z

closes ENH: to_orc #43864
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

pep8speaks · 2021-11-21T08:29:40Z

Hello @iajoiner! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-06-13 21:37:24 UTC

iajoiner · 2021-11-21T08:30:54Z

Thanks @NickFillot for your effort in integrating my ORC writer into pandas. Since your PR has been closed, I will take over modifying it and getting it approved from now on.

Here is the link to Nick's PR, which I unfortunately cannot reopen: #43860

pandas/core/frame.py

pandas/io/orc.py

iajoiner · 2021-11-25T07:22:27Z

Since apache/arrow#9702 will significantly add to the ORC writer API shall we delay the merge until late Jan when Arrow 7.0.0 is released?

twoertwein · 2021-11-25T15:04:31Z

> Since apache/arrow#9702 will significantly add to the ORC writer API shall we delay the merge until late Jan when Arrow 7.0.0 is released?

I can't judge this. Probably depends at least on 1) whether it would create inconsistencies (early implementation and then with the new pyarrow later) and on 2) how soon you want it (pandas 1.4 is scheduled for New Year's Eve, 1.5/2.0 will then probably be a year after that).

pandas/io/orc.py

iajoiner · 2021-11-26T01:49:54Z

@twoertwein I don't think there will be inconsistencies. I'm really just adding optional writer arguments that can be **kwargs in Pandas. Yes I want it out there ASAP.

iajoiner · 2021-12-03T07:55:14Z

@github-actions pre-commit

pandas/tests/io/test_orc.py

pandas/core/frame.py

pandas/io/orc.py

github-actions · 2022-01-05T00:04:01Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

iajoiner · 2022-01-05T00:56:14Z

I’m still around. Will update this PR this month. It’s just that I have a few Arrow tickets to deal with now.

jreback · 2022-01-16T19:11:33Z

love to have this, if you can merge master and address comments we can look again

mroeschke · 2022-06-07T16:07:12Z

pandas/core/frame.py

+            a bytes object is returned.
+        engine : {{'pyarrow'}}, default 'pyarrow'
+            ORC library to use, or library it self, checked with 'pyarrow' name
+            and version >= 7.0.0. Raises ValueError if it is anything but


IMO "Raises ValueError if it is anything but..." is redundant with the Raises section below, so I think this can be removed.

Sure! I will fix that tonight.

Co-authored-by: Matthew Roeschke <[email protected]>

mroeschke · 2022-06-07T16:13:06Z

pandas/io/orc.py

+    # If unsupported dtypes are found raise NotImplementedError
+    for dtype in df.dtypes:
+        dtype_str = dtype.__str__().lower()
+        if (


Will pyarrow raise if these dtypes are passed? If so, can a pyarrow error be caught and re-raised as a NotImplementedError, so this stays flexible for other dtypes that may not be supported in the future?

I need to test these types individually. Not sure right now.

@mroeschke It segfaults for all of them except sparse. I need to catch them in Arrow 9.0.0. Meanwhile, can we use the current dtype filter?

Okay, this is fine then given:

Could you use the type checking functions in pandas.core.dtypes.common instead? e.g. is_categorical_dtype(dtype)?

Could you make a note that in pyarrow 9.0.0 this checking should not be needed?

Done!

Since Arrow raises a TypeError for sparse dtypes when converting the DataFrame to a pyarrow table, I plan to use TypeError for the other 4 in pyarrow 9.0.0 as well. A try-except block has been added, with the note, in addition to the type checks for the 4 that currently segfault.
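The interim guard discussed in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: the thread does not list the four segfaulting dtypes, so the set checked here (categorical, interval, period, unsigned integer) is assumed, `check_orc_supported` and `write_guarded` are hypothetical names, `write` stands in for the engine call, and `isinstance` checks are used in place of `is_categorical_dtype`, which newer pandas releases deprecate.

```python
import pandas as pd
from pandas.api.types import is_unsigned_integer_dtype

# Assumed set of extension dtypes to reject up front (not confirmed by the thread).
_REJECTED_EXTENSION_DTYPES = (pd.CategoricalDtype, pd.IntervalDtype, pd.PeriodDtype)


def check_orc_supported(df: pd.DataFrame) -> None:
    """Raise NotImplementedError before any engine call if a dtype is rejected."""
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, _REJECTED_EXTENSION_DTYPES) or is_unsigned_integer_dtype(
            dtype
        ):
            raise NotImplementedError(
                f"Column {name!r} has unsupported dtype {dtype}."
            )


def write_guarded(df: pd.DataFrame, write) -> None:
    """Run the filter, then re-raise an engine TypeError as NotImplementedError.

    `write` is a hypothetical callable standing in for the engine's writer.
    """
    check_orc_supported(df)
    try:
        write(df)
    except TypeError as err:
        # Chain the original error so the engine's message is not lost.
        raise NotImplementedError(
            "One or more column dtypes are not supported by the engine."
        ) from err
```

Running the filter first matters here because, per the thread, the engine may segfault rather than raise, and a segfault cannot be caught by the try-except block.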

mroeschke · 2022-06-07T16:15:05Z

pandas/tests/io/test_orc.py

+    }
+    expected = pd.DataFrame.from_dict(data)
+
+    outputfile = os.path.join(dirpath, "TestOrcFile.testReadWrite.orc")


Please use tm.ensure_clean("TestOrcFile.testReadWrite.orc") as a context manager
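`tm.ensure_clean` lives in pandas' private testing module, so the sketch below uses a standard-library stand-in to show the pattern being requested, assuming the point is that the output file is created and removed within the test itself. `temp_output_path` and `roundtrip_bytes` are hypothetical names, and plain bytes replace a real ORC write.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_output_path(filename: str):
    """Yield a path inside a fresh temporary directory; delete it on exit."""
    with tempfile.TemporaryDirectory() as tmpdir:
        yield os.path.join(tmpdir, filename)


def roundtrip_bytes(payload: bytes) -> bytes:
    """Write and read back a payload without leaving a file behind."""
    with temp_output_path("TestOrcFile.testReadWrite.orc") as path:
        with open(path, "wb") as fh:
            fh.write(payload)
        with open(path, "rb") as fh:
            return fh.read()
```

Because the path only exists inside the `with` block, no data file needs to be committed to the repository for the test.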

iajoiner · 2022-06-07T16:15:42Z

@mroeschke Really thanks for reviewing this PR! I have one question about What’s New. Does the ORC writer qualify as a major enhancement? Personally I think it does and is the 4th most important major enhancement in 1.5.0.

pandas/tests/io/test_orc.py

mroeschke · 2022-06-07T16:17:09Z

pandas/tests/io/test_orc.py

+def test_orc_writer_dtypes_not_supported(df_not_supported):
+    # GH44554
+    # PyArrow gained ORC write support with the current argument order
+    msg = """The dtype of one or more columns is unsigned integers,


Nit: Single quotes please

Looks like this wasn't changed still.

Oops. Now it has been changed.

@mroeschke Ah that’s because black replaces single quotes with double ones automatically.

mroeschke

Looks like there's also a pandas/tests/io/data/orc/TestOrcFile.testReadWrite.orc file that was added that can just be created/destroyed in the test.

mroeschke · 2022-06-07T16:21:13Z

@mroeschke Really thanks for reviewing this PR! I have one question about What’s New. Does the ORC writer qualify as a major enhancement? Personally I think it does and is the 4th most important major enhancement in 1.5.0.

Sure! I think the whatsnew entry can be a small section instead of a one line mention

iajoiner · 2022-06-12T05:02:36Z

@mroeschke The only red is likely transient and completely unrelated to the PR. What's new has been updated.

The only issue you pointed out that I haven't fixed is the PyArrow error issue, since unsupported dtypes often cause segfaults in pyarrow. I have filed an Arrow ticket to fix it here: https://issues.apache.org/jira/browse/ARROW-16817 . Meanwhile, since pandas 1.5.0 will be released before Arrow 9.0.0, can we use a filter in pandas for now so that users who try unsupported dtypes don't hit a segfault? We can then change it once the Arrow issue is fixed.

mroeschke · 2022-06-12T17:45:33Z

pandas/core/frame.py

+            the RangeIndex will be stored as a range in the metadata so it
+            doesn't require much space and is faster. Other indexes will
+            be included as columns in the file output.
+        **kwargs


Could you name this engine_kwargs and have this take a Dict[str, Any] instead? It's the pattern we've been using in other methods.

Also, is there documentation from pyarrow you can link that describes which other engine keyword arguments are accepted?

@mroeschke You mean just like the excel methods but without having to support the legacy **kwargs approach? Sure!

I've followed the to_feather convention and added :func:`pyarrow.orc.write_table`, which should link to the correct documentation.
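One way to picture the `engine_kwargs` pattern is the sketch below. It is only an illustration: `fake_write_table` is a hypothetical stand-in for an engine writer such as `pyarrow.orc.write_table`, `to_orc_sketch` is not the pandas implementation, and the option name `"compression"` is used purely as sample data.

```python
from __future__ import annotations

from typing import Any


def fake_write_table(rows: list, sink: list, **options: Any) -> None:
    """Hypothetical engine writer; records the rows and options it was given."""
    sink.append((list(rows), dict(options)))


def to_orc_sketch(
    rows: list,
    sink: list,
    *,
    index: bool | None = None,
    engine_kwargs: dict[str, Any] | None = None,
) -> None:
    """Wrapper options are named explicitly; engine options travel in one dict."""
    options = dict(engine_kwargs or {})
    if index:
        # Illustrative stand-in for "include the index in the output".
        rows = list(enumerate(rows))
    fake_write_table(rows, sink, **options)
```

With a dict, a misspelled wrapper argument such as `indx=True` fails immediately with a `TypeError` instead of being forwarded silently to the engine, which is one practical difference from a catch-all `**kwargs`.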

mroeschke · 2022-06-12T17:48:40Z

pandas/core/frame.py

+            (e.g. via builtin open function). If path is None,
+            a bytes object is returned.
+        engine : str, default 'pyarrow'
+            ORC library to use, or library it self, checked with 'pyarrow' name


Suggested change

-            ORC library to use, or library it self, checked with 'pyarrow' name
+            ORC library to use. Pyarrow must be >= 7.0.0.

"Library it self" is a little unclear to me here.

Sure. Fixed!
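A minimum-version requirement like "Pyarrow must be >= 7.0.0" could be enforced along these lines. This is a standard-library sketch, not how pandas performs the check: `parse_release` and `require_min_version` are hypothetical helpers, and raising `ImportError` is an assumption.

```python
from __future__ import annotations


def parse_release(version: str) -> tuple[int, ...]:
    """Keep the leading numeric components, e.g. '9.0.0.dev12' -> (9, 0, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component
        parts.append(int(piece))
    return tuple(parts)


def require_min_version(name: str, installed: str, minimum: str) -> None:
    """Raise ImportError if the installed release is older than the minimum."""
    if parse_release(installed) < parse_release(minimum):
        raise ImportError(f"{name} >= {minimum} is required; found {installed}.")
```

Comparing integer tuples rather than raw strings avoids treating "10.0.1" as older than "7.0.0".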

mroeschke · 2022-06-12T17:55:12Z

> @mroeschke The only red is likely transient and completely unrelated to the PR. What's new has been updated.
>
> The only issue you pointed out that I haven't fixed is the PyArrow error issue, since unsupported dtypes often cause segfaults in pyarrow. I have filed an Arrow ticket to fix it here: https://issues.apache.org/jira/browse/ARROW-16817 . Meanwhile, since pandas 1.5.0 will be released before Arrow 9.0.0, can we use a filter in pandas for now so that users who try unsupported dtypes don't hit a segfault? We can then change it once the Arrow issue is fixed.

Sure sounds good. Can use manual dtype filtering for now with some corrections in my review.

Could you also uncommit pandas/tests/io/data/orc/TestOrcFile.testReadWrite.orc?

iajoiner · 2022-06-13T11:13:50Z

@mroeschke Please review again. All issues have been taken care of with the exception of using single quotes for msg which is not possible due to black.

P.S. Red stuff is unrelated to the PR.

mroeschke · 2022-06-13T17:06:09Z

pandas/io/orc.py

+    *,
+    engine: Literal["pyarrow"] = "pyarrow",
+    index: bool | None = None,
+    engine_kwargs: dict[str, Any] = {},


Sorry, just realized: could you default this to engine_kwargs: dict[str, Any] | None = None and, if it is None, convert it to an empty dict inside the function?
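The likely motivation for this request is Python's mutable-default pitfall: a `{}` default is created once, when the function is defined, and then shared by every call. A small sketch of the difference (both function names are hypothetical and unrelated to the pandas code):

```python
from __future__ import annotations

from typing import Any


def write_shared(engine_kwargs: dict[str, Any] = {}) -> dict[str, Any]:
    """Anti-pattern: the same dict object is reused by every call."""
    engine_kwargs["calls"] = engine_kwargs.get("calls", 0) + 1
    return engine_kwargs


def write_fresh(engine_kwargs: dict[str, Any] | None = None) -> dict[str, Any]:
    """Preferred: default to None and build a new dict per call."""
    if engine_kwargs is None:
        engine_kwargs = {}
    engine_kwargs["calls"] = engine_kwargs.get("calls", 0) + 1
    return engine_kwargs
```

Calling `write_shared()` twice returns the same dict with state carried over from the first call, while `write_fresh()` starts from an empty dict each time.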

mroeschke

Thanks for sticking with it! One more minor comment then LGTM

iajoiner · 2022-06-13T22:03:30Z

@mroeschke Fixed haha. Really thanks for reviewing! Is the next step merging once it is clear that no error related to the PR exists? :)

mroeschke · 2022-06-14T00:02:45Z

Awesome, thanks for the responsiveness down the stretch!

* [ENH] to_orc pandas.io.orc.to_orc method definition
* pandas.DataFrame.to_orc set to_orc to pandas.DataFrame
* Cleaning
* Fix style & edit comments & change min dependency version to 5.0.0
* Fix style & add to see also
* Add ORC to documentation
* Changes according to review
* Fix problems mentioned in comment
* Linter compliance
* Address comments
* Add orc test
* Fixes from pre-commit [automated commit]
* Fix issues according to comments
* Simplify the code base after raising Arrow version to 7.0.0
* Fix min arrow version in to_orc
* Add to_orc test in line with other formats
* Add BytesIO support & test
* Fix some docs issues
* Use keyword only arguments
* Fix bug
* Fix param issue
* Doctest skipping due to minimal versions
* Doctest skipping due to minimal versions
* Improve spacing in docstring & remove orc test in test_common that has unusual pyarrow version requirement and is with a lot of other tests
* Fix docstring syntax
* ORC is not text
* Fix BytesIO bug && do not require orc to be explicitly imported before usage && all pytest tests have passed
* ORC writer does not work for categorical columns yet
* Appease mypy
* Appease mypy
* Edit according to reviews
* Fix path bug in test_orc
* Fix testdata tuple bug in test_orc
* Fix docstrings for check compliance
* read_orc does not have engine as a param
* Fix sphinx warnings
* Improve docs & rerun tests
* Force retrigger
* Fix test_orc according to review
* Rename some variables and func
* Update pandas/core/frame.py Co-authored-by: Matthew Roeschke <[email protected]>
* Fix issues according to review
* Forced reruns
* Fix issues according to review
* Reraise Pyarrow TypeError as NotImplementedError
* Fix bugs
* Fix expected error msg in orc tests
* Avoid deprecated functions
* Replace {} with None in arg

Co-authored-by: NickFillot <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>

iajoiner marked this pull request as draft November 21, 2021 08:29

iajoiner marked this pull request as ready for review November 21, 2021 09:29

iajoiner changed the title ~~Feature to orc~~ [EHN] pandas.DataFrame.to_orc Nov 21, 2021

iajoiner mentioned this pull request Nov 21, 2021

[ENH] to_orc #43860

Closed

twoertwein reviewed Nov 21, 2021

pandas/core/frame.py (outdated)

twoertwein reviewed Nov 21, 2021

pandas/core/frame.py (outdated)

twoertwein reviewed Nov 21, 2021

pandas/io/orc.py (outdated)

twoertwein reviewed Nov 21, 2021

pandas/io/orc.py (outdated)

twoertwein reviewed Nov 21, 2021

pandas/io/orc.py (outdated)

twoertwein reviewed Nov 22, 2021

pandas/io/orc.py (outdated)

lithomas1 added Arrow pyarrow functionality Enhancement IO Data IO issues that don't fit into a more specific label labels Nov 23, 2021

twoertwein reviewed Nov 24, 2021

pandas/io/orc.py (outdated)

twoertwein reviewed Nov 25, 2021

pandas/io/orc.py (outdated)

twoertwein reviewed Dec 5, 2021

pandas/tests/io/test_orc.py (outdated)

twoertwein reviewed Dec 5, 2021

pandas/core/frame.py

twoertwein reviewed Dec 5, 2021

pandas/io/orc.py

twoertwein reviewed Dec 5, 2021

pandas/io/orc.py (outdated)

github-actions bot added the Stale label Jan 5, 2022

jreback removed the Stale label Jan 16, 2022

mroeschke reviewed Jun 7, 2022

Update pandas/core/frame.py

989468a

Co-authored-by: Matthew Roeschke <[email protected]>

mroeschke reviewed Jun 7, 2022

pandas/tests/io/test_orc.py

mroeschke reviewed Jun 7, 2022

mroeschke requested changes Jun 7, 2022

Fix issues according to review

a7fca36

Forced reruns

7fc338c

mroeschke reviewed Jun 12, 2022

chloeandmargaret added 5 commits June 13, 2022 05:12

Fix issues according to review

91d1556

Reraise Pyarrow TypeError as NotImplementedError

a28c5a8

Fix bugs

162e5bb

Fix expected error msg in orc tests

b230583

Avoid deprecated functions

e16edab

mroeschke reviewed Jun 13, 2022

Replace {} with None in arg

e4770b8

mroeschke approved these changes Jun 14, 2022

mroeschke merged commit 15902bd into pandas-dev:main Jun 14, 2022

iajoiner deleted the feature-to-orc branch June 14, 2022 00:22
