Skip to content

[EHN] pandas.DataFrame.to_orc #44554

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 49 commits into from
Jun 14, 2022
Merged

Conversation

iajoiner
Copy link
Contributor

@iajoiner iajoiner commented Nov 21, 2021

  • closes ENH: to_orc #43864
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

@pep8speaks
Copy link

pep8speaks commented Nov 21, 2021

Hello @iajoiner! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-06-13 21:37:24 UTC

@iajoiner iajoiner marked this pull request as draft November 21, 2021 08:29
@iajoiner
Copy link
Contributor Author

iajoiner commented Nov 21, 2021

Thanks @NickFillot for your effort in integrating my ORC writer into Pandas. Now since your PR has been closed I will take care of modifying it and getting it approved from now on.

Here is the link to Nick's PR which I unfortunately can not reopen: #43860

@iajoiner iajoiner marked this pull request as ready for review November 21, 2021 09:29
@iajoiner iajoiner changed the title Feature to orc [EHN] pandas.DataFrame.to_orc Nov 21, 2021
@iajoiner iajoiner mentioned this pull request Nov 21, 2021
@lithomas1 lithomas1 added Arrow pyarrow functionality Enhancement IO Data IO issues that don't fit into a more specific label labels Nov 23, 2021
@iajoiner
Copy link
Contributor Author

Since apache/arrow#9702 will significantly add to the ORC writer API shall we delay the merge until late Jan when Arrow 7.0.0 is released?

@twoertwein
Copy link
Member

twoertwein commented Nov 25, 2021

Since apache/arrow#9702 will significantly add to the ORC writer API shall we delay the merge until late Jan when Arrow 7.0.0 is released?

I can't judge this. Probably depends at least on 1) whether it would create inconsistencies (early implementation and then with the new pyarrow later) and on 2) how soon you want it (pandas 1.4 is scheduled for New Year's Eve, 1.5/2.0 will then probably be a year after that).

@iajoiner
Copy link
Contributor Author

@twoertwein I don't think there will be inconsistencies. I'm really just adding optional writer arguments that can be **kwargs in Pandas. Yes I want it out there ASAP.

@iajoiner
Copy link
Contributor Author

iajoiner commented Dec 3, 2021

@github-actions pre-commit

@github-actions
Copy link
Contributor

github-actions bot commented Jan 5, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 5, 2022
@iajoiner
Copy link
Contributor Author

iajoiner commented Jan 5, 2022

I’m still around. Will update this PR this month. It’s just that I have a few Arrow tickets to deal with now.

@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

love to have this, if you can merge master and address comments we can look again

@jreback jreback removed the Stale label Jan 16, 2022
a bytes object is returned.
engine : {{'pyarrow'}}, default 'pyarrow'
ORC library to use, or library it self, checked with 'pyarrow' name
and version >= 7.0.0. Raises ValueError if it is anything but
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO Raises ValueError if it is anything but... is redundant with the raises section below so I think this can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I will fix that tonight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Co-authored-by: Matthew Roeschke <[email protected]>
# If unsupported dtypes are found raise NotImplementedError
for dtype in df.dtypes:
dtype_str = dtype.__str__().lower()
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will pyarrow raise if these dtypes are passed? If so, can a a pyarrow error be caught and reraised as a NotImplementedError so this can be more flexible to other potential dtypes not supported in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to test these types individually. Not sure right now.

Copy link
Contributor Author

@iajoiner iajoiner Jun 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke It seg faults out for all instances but sparse. I need to catch them in Arrow 9.0.0. Meanwhile can we use the current dtype filter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is fine then given:

  1. Could you use the type checking functions in pandas.core.dtypes.common instead? e.g. is_categorical_dtype(dtype)?
  2. Could you make a note that in pyarrow 9.0.0 this checking should not be needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

@iajoiner iajoiner Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Since for sparse dtypes we get a TypeError from Arrow when converting the dataframe to a pyarrow table I plan to use TypeError for the other 4 in pyarrow 9.0.0 as well. The try-except block has been added in addition to the type checks for the 4 that segfault out right now with the note.

}
expected = pd.DataFrame.from_dict(data)

outputfile = os.path.join(dirpath, "TestOrcFile.testReadWrite.orc")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use tm.ensure_clean("TestOrcFile.testReadWrite.orc") as a context manager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@iajoiner
Copy link
Contributor Author

iajoiner commented Jun 7, 2022

@mroeschke Really thanks for reviewing this PR! I have one question about What’s New. Does the ORC writer qualify as a major enhancement? Personally I think it does and is the 4th most important major enhancement in 1.5.0.

def test_orc_writer_dtypes_not_supported(df_not_supported):
# GH44554
# PyArrow gained ORC write support with the current argument order
msg = """The dtype of one or more columns is unsigned integers,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Single quotes please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this wasn't changed still.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Now it has been changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke Ah that’s because black replaces single quotes with double ones automatically.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's also a pandas/tests/io/data/orc/TestOrcFile.testReadWrite.orc file that was added that can just be created/destroyed in the test.

@mroeschke
Copy link
Member

@mroeschke Really thanks for reviewing this PR! I have one question about What’s New. Does the ORC writer qualify as a major enhancement? Personally I think it does and is the 4th most important major enhancement in 1.5.0.

Sure! I think the whatsnew entry can be a small section instead of a one line mention

@iajoiner
Copy link
Contributor Author

iajoiner commented Jun 12, 2022

@mroeschke The only red is likely transient and completely unrelated to the PR. What's new has been updated.

The only issue you pointed out which I haven't fixed is the PyArrow error issue since unsupport dtypes often cause segfaults in pyarrow. I have filed an Arrow ticket to fix it here: https://issues.apache.org/jira/browse/ARROW-16817 . Meanwhile due to the fact that Pandas 1.5.0 will be released before Arrow 9.0.0 in order not to have segfault when users do try to use unsupported dtypes can we use a filter in Pandas right now? We can then change it once the Arrow issue is fixed.

the RangeIndex will be stored as a range in the metadata so it
doesn't require much space and is faster. Other indexes will
be included as columns in the file output.
**kwargs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you name this engine_kwargs and have this take a Dict[str, Any] instead? It's the pattern we've been using in other methods.

Also is there there documentation you can link from pyarrow on what other engine keyword arguments can be accepted?

Copy link
Contributor Author

@iajoiner iajoiner Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke You mean just like the excel methods but without having to support the legacy **kwargs approach? Sure!

I've followed to_feather convention and added :func:pyarrow.orc.write_table which should link to the correct documentation.

(e.g. via builtin open function). If path is None,
a bytes object is returned.
engine : str, default 'pyarrow'
ORC library to use, or library it self, checked with 'pyarrow' name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ORC library to use, or library it self, checked with 'pyarrow' name
ORC library to use. Pyarrow must be >= 7.0.0.

Library it self is a little unclear to me here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Fixed!

@mroeschke
Copy link
Member

@mroeschke The only red is likely transient and completely unrelated to the PR. What's new has been updated.

The only issue you pointed out which I haven't fixed is the PyArrow error issue since unsupport dtypes often cause segfaults in pyarrow. I have filed an Arrow ticket to fix it here: https://issues.apache.org/jira/browse/ARROW-16817 . Meanwhile due to the fact that Pandas 1.5.0 will be released before Arrow 9.0.0 in order not to have segfault when users do try to use unsupported dtypes can we use a filter in Pandas right now? We can then change it once the Arrow issue is fixed.

Sure sounds good. Can use manual dtype filtering for now with some corrections in my review.

Could you also uncommit pandas/tests/io/data/orc/TestOrcFile.testReadWrite.orc?

@iajoiner
Copy link
Contributor Author

iajoiner commented Jun 13, 2022

@mroeschke Please review again. All issues have been taken care of with the exception of using single quotes for msg which is not possible due to black.

P.S. Red stuff is unrelated to the PR.

pandas/io/orc.py Outdated
*,
engine: Literal["pyarrow"] = "pyarrow",
index: bool | None = None,
engine_kwargs: dict[str, Any] = {},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry just realized. Could you default this here engine_kwargs: dict[str, Any] | None = None, and if None convert to empty dict in the function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with it! One more minor comment then LGTM

@iajoiner
Copy link
Contributor Author

iajoiner commented Jun 13, 2022

@mroeschke Fixed haha. Really thanks for reviewing! Is the next step merging once it is clear that no error related to the PR exists? :)

@mroeschke mroeschke merged commit 15902bd into pandas-dev:main Jun 14, 2022
@mroeschke
Copy link
Member

Awesome thanks for the responsiveness down stretch!

@iajoiner iajoiner deleted the feature-to-orc branch June 14, 2022 00:22
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
* [ENH] to_orc

pandas.io.orc.to_orc method definition

* pandas.DataFrame.to_orc

set to_orc to pandas.DataFrame

* Cleaning

* Fix style & edit comments & change min dependency version to 5.0.0

* Fix style & add to see also

* Add ORC to documentation

* Changes according to review

* Fix problems mentioned in comment

* Linter compliance

* Address comments

* Add orc test

* Fixes from pre-commit [automated commit]

* Fix issues according to comments

* Simplify the code base after raising Arrow version to 7.0.0

* Fix min arrow version in to_orc

* Add to_orc test in line with other formats

* Add BytesIO support & test

* Fix some docs issues

* Use keyword only arguments

* Fix bug

* Fix param issue

* Doctest skipping due to minimal versions

* Doctest skipping due to minimal versions

* Improve spacing in docstring & remove orc test in test_common that has unusual pyarrow version requirement and is with a lot of other tests

* Fix docstring syntax

* ORC is not text

* Fix BytesIO bug && do not require orc to be explicitly imported before usage && all pytest tests have passed

* ORC writer does not work for categorical columns yet

* Appease mypy

* Appease mypy

* Edit according to reviews

* Fix path bug in test_orc

* Fix testdata tuple bug in test_orc

* Fix docstrings for check compliance

* read_orc does not have engine as a param

* Fix sphinx warnings

* Improve docs & rerun tests

* Force retrigger

* Fix test_orc according to review

* Rename some variables and func

* Update pandas/core/frame.py

Co-authored-by: Matthew Roeschke <[email protected]>

* Fix issues according to review

* Forced reruns

* Fix issues according to review

* Reraise Pyarrow TypeError as NotImplementedError

* Fix bugs

* Fix expected error msg in orc tests

* Avoid deprecated functions

* Replace {} with None in arg

Co-authored-by: NickFillot <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Enhancement IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: to_orc
8 participants