[ENH] pandas.DataFrame.to_orc #44554
Conversation
iajoiner commented on Nov 21, 2021 (edited)
- closes ENH: to_orc #43864
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
Thanks @NickFillot for your effort in integrating my ORC writer into Pandas. Since your PR has been closed, I will take care of modifying it and getting it approved from now on. Here is the link to Nick's PR, which I unfortunately cannot reopen: #43860
Since apache/arrow#9702 will significantly add to the ORC writer API, shall we delay the merge until late January, when Arrow 7.0.0 is released?
I can't judge this. Probably depends at least on 1) whether it would create inconsistencies (early implementation and then with the new pyarrow later) and on 2) how soon you want it (pandas 1.4 is scheduled for New Year's Eve, 1.5/2.0 will then probably be a year after that).
@twoertwein I don't think there will be inconsistencies. I'm really just adding optional writer arguments that can be
@github-actions pre-commit
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
I'm still around. Will update this PR this month. It's just that I have a few Arrow tickets to deal with now.
Love to have this. If you can merge master and address comments, we can look again.
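For context, a rough sketch of the API this PR adds, assuming pyarrow >= 7.0.0 is installed; the file path and column names are purely illustrative, and the exact signature follows from the review below:

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write ORC to a path on disk ...
df.to_orc("example.orc")

# ... or serialize into an in-memory buffer and read it back.
buf = io.BytesIO()
df.to_orc(buf)
result = pd.read_orc(io.BytesIO(buf.getvalue()))
print(result.equals(df))  # should be True if the round trip preserves the data
```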
pandas/core/frame.py (outdated diff)
    a bytes object is returned.
engine : {{'pyarrow'}}, default 'pyarrow'
    ORC library to use, or library it self, checked with 'pyarrow' name
    and version >= 7.0.0. Raises ValueError if it is anything but
IMO `Raises ValueError if it is anything but...` is redundant with the Raises section below, so I think this can be removed.
Sure! I will fix that tonight.
Fixed.
# If unsupported dtypes are found raise NotImplementedError
for dtype in df.dtypes:
    dtype_str = dtype.__str__().lower()
    if (
Will pyarrow raise if these dtypes are passed? If so, can a pyarrow error be caught and reraised as a `NotImplementedError`, so this can be more flexible to other potential dtypes not supported in the future?
I need to test these types individually. Not sure right now.
@mroeschke It segfaults for all of these cases except sparse. I need to catch them in Arrow 9.0.0. In the meantime, can we use the current dtype filter?
Okay, this is fine then given:
- Could you use the type checking functions in `pandas.core.dtypes.common` instead? e.g. `is_categorical_dtype(dtype)`?
- Could you make a note that in pyarrow 9.0.0 this checking should not be needed?
Sure!
Done!
Since for sparse dtypes we get a `TypeError` from Arrow when converting the dataframe to a pyarrow table, I plan to use `TypeError` for the other 4 in pyarrow 9.0.0 as well. The try-except block has been added in addition to the type checks for the 4 dtypes that currently segfault, along with the requested note.
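For readers following along, a minimal sketch of the two layers just described (helper names here are illustrative, not the PR's actual code): a manual filter for the dtypes that currently segfault in the pyarrow ORC writer, plus a try/except that re-raises the `TypeError` pyarrow already gives for sparse columns as `NotImplementedError`.

```python
import pyarrow

from pandas.core.dtypes.common import (
    is_categorical_dtype,
    is_interval_dtype,
    is_period_dtype,
    is_unsigned_integer_dtype,
)


def _check_orc_dtypes(df) -> None:
    # In pyarrow 9.0.0 this manual check should no longer be needed, since
    # the ORC writer is expected to raise instead of segfaulting.
    for dtype in df.dtypes:
        if (
            is_categorical_dtype(dtype)
            or is_interval_dtype(dtype)
            or is_period_dtype(dtype)
            or is_unsigned_integer_dtype(dtype)
        ):
            raise NotImplementedError(
                "The dtype of one or more columns is not supported yet."
            )


def _to_arrow_table(df) -> pyarrow.Table:
    # Sparse columns raise TypeError during conversion; surface it as
    # NotImplementedError so unsupported dtypes fail consistently.
    try:
        return pyarrow.Table.from_pandas(df, preserve_index=False)
    except TypeError as err:
        raise NotImplementedError(
            "The dtype of one or more columns is not supported yet."
        ) from err
```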
pandas/tests/io/test_orc.py (outdated diff)
}
expected = pd.DataFrame.from_dict(data)

outputfile = os.path.join(dirpath, "TestOrcFile.testReadWrite.orc")
Please use `tm.ensure_clean("TestOrcFile.testReadWrite.orc")` as a context manager.
Sure!
Done!
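A minimal sketch of what the reviewer is asking for (the test name and data are illustrative): the file is created inside `tm.ensure_clean`, which removes it when the block exits, instead of being checked in under `pandas/tests/io/data/orc`.

```python
import pandas as pd
import pandas._testing as tm


def test_orc_round_trip_to_file():
    expected = pd.DataFrame({"int_col": [1, 2, 3], "str_col": ["a", "b", "c"]})

    # ensure_clean yields a temporary path and deletes the file afterwards
    with tm.ensure_clean("TestOrcFile.testReadWrite.orc") as path:
        expected.to_orc(path)
        result = pd.read_orc(path)

    tm.assert_frame_equal(result, expected)
```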
@mroeschke Thanks a lot for reviewing this PR! I have one question about What's New: does the ORC writer qualify as a major enhancement? Personally I think it does, and that it is the 4th most important major enhancement in 1.5.0.
pandas/tests/io/test_orc.py (outdated diff)
def test_orc_writer_dtypes_not_supported(df_not_supported):
    # GH44554
    # PyArrow gained ORC write support with the current argument order
    msg = """The dtype of one or more columns is unsigned integers,
Nit: Single quotes please
Sure!
Fixed!
Looks like this still wasn't changed.
Oops. Now it has been changed.
@mroeschke Ah that’s because black replaces single quotes with double ones automatically.
Looks like there's also a `pandas/tests/io/data/orc/TestOrcFile.testReadWrite.orc` file that was added; it can just be created/destroyed in the test.
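Roughly what the test under discussion ends up looking like (the exact error message in the PR is longer, and `df_not_supported` is a fixture of columns with the rejected dtypes):

```python
import pytest


def test_orc_writer_dtypes_not_supported(df_not_supported):
    # GH 44554 - unsupported dtypes should raise before reaching the
    # pyarrow ORC writer, which would otherwise segfault.
    msg = "The dtype of one or more columns is not supported yet"
    with pytest.raises(NotImplementedError, match=msg):
        df_not_supported.to_orc()
```

Note that black normalizes string quotes to double quotes, which is why the single-quote nit in the exchange above could not be applied.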
Sure! I think the whatsnew entry can be a small section instead of a one-line mention.
@mroeschke The only red CI check is likely transient and completely unrelated to the PR. What's New has been updated. The only issue you pointed out which I haven't fixed is the PyArrow error issue, since unsupported dtypes often cause segfaults in pyarrow. I have filed an Arrow ticket to fix it here: https://issues.apache.org/jira/browse/ARROW-16817. Meanwhile, since pandas 1.5.0 will be released before Arrow 9.0.0, can we use a dtype filter in pandas right now so that users who try unsupported dtypes don't hit a segfault? We can then change it once the Arrow issue is fixed.
pandas/core/frame.py (outdated diff)
    the RangeIndex will be stored as a range in the metadata so it
    doesn't require much space and is faster. Other indexes will
    be included as columns in the file output.
**kwargs
Could you name this `engine_kwargs` and have it take a `Dict[str, Any]` instead? It's the pattern we've been using in other methods.
Also, is there documentation you can link from pyarrow on what other engine keyword arguments can be accepted?
@mroeschke You mean just like the Excel methods but without having to support the legacy `**kwargs` approach? Sure!
I've followed the `to_feather` convention and added :func:`pyarrow.orc.write_table`, which should link to the correct documentation.
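Roughly what the rename amounts to at this point in the review (signature abbreviated and illustrative; the empty-dict default is revisited further down):

```python
from __future__ import annotations

from typing import Any, Literal


def to_orc(
    df,
    path=None,
    *,
    engine: Literal["pyarrow"] = "pyarrow",
    index: bool | None = None,
    engine_kwargs: dict[str, Any] = {},  # mutable default; revisited below
) -> bytes | None:
    """
    Write a DataFrame to the ORC format.

    Parameters
    ----------
    engine_kwargs : dict[str, Any], default {}
        Additional keyword arguments passed to :func:`pyarrow.orc.write_table`.
    """
    ...
```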
pandas/core/frame.py (outdated diff)
    (e.g. via builtin open function). If path is None,
    a bytes object is returned.
engine : str, default 'pyarrow'
    ORC library to use, or library it self, checked with 'pyarrow' name
Suggested change:
- ORC library to use, or library it self, checked with 'pyarrow' name
+ ORC library to use. Pyarrow must be >= 7.0.0.

`Library it self` is a little unclear to me here.
Sure. Fixed!
Sure, sounds good. Can use manual dtype filtering for now with some corrections in my review. Could you also uncommit
@mroeschke Please review again. All issues have been taken care of, with the exception of using single quotes for `msg`, which is not possible due to black. P.S. The red CI is unrelated to the PR.
pandas/io/orc.py (outdated diff)
*,
engine: Literal["pyarrow"] = "pyarrow",
index: bool | None = None,
engine_kwargs: dict[str, Any] = {},
Sorry, just realized: could you default this here to `engine_kwargs: dict[str, Any] | None = None`, and if `None`, convert it to an empty dict in the function?
Done!
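As a sketch, the final pattern: default the argument to `None` in the signature and normalize inside the function, which avoids sharing one mutable default dict across calls.

```python
from __future__ import annotations

from typing import Any, Literal


def to_orc(
    df,
    path=None,
    *,
    engine: Literal["pyarrow"] = "pyarrow",
    index: bool | None = None,
    engine_kwargs: dict[str, Any] | None = None,
) -> bytes | None:
    if engine_kwargs is None:
        # Normalize here so callers can omit the argument without the
        # function holding onto a shared mutable default.
        engine_kwargs = {}
    ...
```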
Thanks for sticking with it! One more minor comment, then LGTM.
@mroeschke Fixed, haha. Thanks again for reviewing! Is the next step merging once it's clear that no failure is related to the PR? :)
Awesome, thanks for the responsiveness down the stretch!
* [ENH] to_orc pandas.io.orc.to_orc method definition
* pandas.DataFrame.to_orc set to_orc to pandas.DataFrame
* Cleaning
* Fix style & edit comments & change min dependency version to 5.0.0
* Fix style & add to see also
* Add ORC to documentation
* Changes according to review
* Fix problems mentioned in comment
* Linter compliance
* Address comments
* Add orc test
* Fixes from pre-commit [automated commit]
* Fix issues according to comments
* Simplify the code base after raising Arrow version to 7.0.0
* Fix min arrow version in to_orc
* Add to_orc test in line with other formats
* Add BytesIO support & test
* Fix some docs issues
* Use keyword only arguments
* Fix bug
* Fix param issue
* Doctest skipping due to minimal versions
* Doctest skipping due to minimal versions
* Improve spacing in docstring & remove orc test in test_common that has unusual pyarrow version requirement and is with a lot of other tests
* Fix docstring syntax
* ORC is not text
* Fix BytesIO bug && do not require orc to be explicitly imported before usage && all pytest tests have passed
* ORC writer does not work for categorical columns yet
* Appease mypy
* Appease mypy
* Edit according to reviews
* Fix path bug in test_orc
* Fix testdata tuple bug in test_orc
* Fix docstrings for check compliance
* read_orc does not have engine as a param
* Fix sphinx warnings
* Improve docs & rerun tests
* Force retrigger
* Fix test_orc according to review
* Rename some variables and func
* Update pandas/core/frame.py (Co-authored-by: Matthew Roeschke <[email protected]>)
* Fix issues according to review
* Forced reruns
* Fix issues according to review
* Reraise Pyarrow TypeError as NotImplementedError
* Fix bugs
* Fix expected error msg in orc tests
* Avoid deprecated functions
* Replace {} with None in arg

Co-authored-by: NickFillot <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>