-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Allow compression in NDFrame.to_csv to be a dict with optional arguments (#26023) #26024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I am -1 on adding another parameter here. What makes Zip extraction different than the other compressions methods? |
Other compression methods apply to a single file and add an extension to the name (i.e. An alternative would be to infer the file name, but that would be a breaking change. |
So I understand the use-case. However, I do agree that adding another parameter is not necessarily the best way to do this. This parameter is really related to our This leads me to think that a cc @jreback (related to #25990 (comment)) |
Codecov Report
@@ Coverage Diff @@
## master #26024 +/- ##
==========================================
- Coverage 91.96% 91.95% -0.01%
==========================================
Files 175 175
Lines 52405 52425 +20
==========================================
+ Hits 48193 48207 +14
- Misses 4212 4218 +6
Continue to review full report at Codecov.
|
@gfyoung I agree that this approach makes more sense - I'll modify the functions to optionally take a |
@drew-heenan : Sounds good! One minor thing: let's use |
pandas/io/common.py
Outdated
.. versionchanged:: 0.25.0 | ||
|
||
May now be a dict with key 'method' as compression mode | ||
and 'arcname' as CSV file name if mode is 'zip' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make this whatsnew
note a little more generic. In reality, we should be just accepting any keyword arguments to BytesZipFile
. Also, we should have an example.
pandas/io/formats/csvs.py
Outdated
@@ -36,8 +36,20 @@ def __init__(self, obj, path_or_buf=None, sep=",", na_rep='', | |||
if path_or_buf is None: | |||
path_or_buf = StringIO() | |||
|
|||
self._compression_arg = compression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't you change get_filepath_or_buffer to already handle this? why is this special cased here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not modify get_filepath_or_buffer
, though I suppose I certainly could to support taking a dict
as compression
. The self._compression_arg
is there to avoid changing self.compression
from only ever holding the inferred compression method, while self._compression_arg
would include any additional arguments if compression
was passed as a dict
. I do now think that I should have instead had a self.compression_args
hold this dict with the method
key popped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback With regard to get_filepath_or_buffer
(and similarly _infer_compression
), I've added a function _get_compression_method
which handles the case where compression
is given as a dict
to to_csv
, CSVFormatter
or _get_handle
. It extracts the compression method string before passing to get_filepath_buffer
or _infer_compression
.
Would it be preferable then to keep both functions' original functionality where compression
may only be a string/None, as neither need the additional arguments, or include a call to _get_compression_method
in both to handle dicts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add type annotations for the changed / added parameters?
pandas/io/common.py
Outdated
elif compression == 'zip': | ||
zf = BytesZipFile(path_or_buf, mode) | ||
elif compression_method == 'zip': | ||
arcname = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename this to archive_name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd The specific instance of arcname
above is no longer there; Though, do you mean change arcname
to archive_name
in general, including the dict key?
6058bbb
to
b1889ef
Compare
@WillAyd I've added type annotations to parameters which I've changed, but it seems that doing so caused the typing validation check to fail on other parts of the files which I have not modified. |
b1889ef
to
a1cb3f7
Compare
Yea I think a general rename would make things clearer
…Sent from my iPhone
On Apr 13, 2019, at 6:24 PM, Drew Heenan ***@***.***> wrote:
@drew-heenan commented on this pull request.
In pandas/io/common.py:
> if is_path:
f = bz2.BZ2File(path_or_buf, mode)
else:
f = bz2.BZ2File(path_or_buf)
# ZIP Compression
- elif compression == 'zip':
- zf = BytesZipFile(path_or_buf, mode)
+ elif compression_method == 'zip':
+ arcname = None
@WillAyd The specific instance of arcname above is no longer there; Though, do you mean change arcname to archive_name in general, including the dict key?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@WillAyd : Do you plan on annotating on affected methods in this PR? I see that some of the inputs to some of the affected methods are annotated, but not all of them. |
I can take a look |
OK @gfyoung added annotations where I think feasible. Two blockers prevented full annotations of modified funcs, particularly _get_handle:
|
pandas/io/common.py
Outdated
@@ -435,13 +486,23 @@ class BytesZipFile(zipfile.ZipFile, BytesIO): # type: ignore | |||
""" | |||
|
|||
# GH 17778 | |||
def __init__(self, file, mode, compression=zipfile.ZIP_DEFLATED, **kwargs): | |||
def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might not be clear in the diff but note that I removed the compression
argument here. The reason for this was that it is never actually called in code.
It makes annotations more complex, because the keyword argument unpacking of compression_args
in this PR only ever has str
as keys and therefore mypy complains that there is a type mismatch because int
values are never unpacked. Figured easier to just remove than muck around types since it is not ever used
Merged again to keep fresh. There's a Mypy failure I'll have to look at later. @TomAugspurger @gfyoung would you still have time to review this one coming up? Have a lot in the queue so just want to prioritize workflow with you |
It looks pretty good overall! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I think good for review again @gfyoung
path_or_buf, mode, encoding=None, compression=None, memory_map=False, is_text=True | ||
path_or_buf, | ||
mode: str, | ||
encoding=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't annotate this particular argument due to a minor bug in typeshed. Fixed on master so maybe something we can come back to soon (typeshed updates are pretty quick)
Thanks @drew-heenan |
git diff upstream/master -u -- "*.py" | flake8 --diff