Skip to content

ENH: Allow compression in NDFrame.to_csv to be a dict with optional arguments (#26023) #26024

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 41 commits into from
Aug 26, 2019

Conversation

drew-heenan
Copy link
Contributor

@WillAyd
Copy link
Member

WillAyd commented Apr 8, 2019

I am -1 on adding another parameter here. What makes Zip extraction different than the other compressions methods?

@drew-heenan
Copy link
Contributor Author

drew-heenan commented Apr 8, 2019

Other compression methods apply to a single file and add an extension to the name (i.e. data.csv.gz extracted results in data.csv), but ZIP archives are meant to have each file within have a name. As is, calling df.to_csv('data.zip') creates an archive containing a CSV file named data.zip as well.

An alternative would be to infer the file name, but that would be a breaking change.

@drew-heenan drew-heenan marked this pull request as ready for review April 8, 2019 02:59
@gfyoung gfyoung added Enhancement IO CSV read_csv, to_csv labels Apr 8, 2019
@gfyoung
Copy link
Member

gfyoung commented Apr 8, 2019

So I understand the use-case. However, I do agree that adding another parameter is not necessarily the best way to do this. This parameter is really related to our compression parameter.

This leads me to think that a dict might make more sense. It allows us to configure the compression without the bloat. What do you think?

cc @jreback (related to #25990 (comment))

@codecov
Copy link

codecov bot commented Apr 8, 2019

Codecov Report

Merging #26024 into master will decrease coverage by <.01%.
The diff coverage is 92.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26024      +/-   ##
==========================================
- Coverage   91.96%   91.95%   -0.01%     
==========================================
  Files         175      175              
  Lines       52405    52425      +20     
==========================================
+ Hits        48193    48207      +14     
- Misses       4212     4218       +6
Flag Coverage Δ
#multiple 90.51% <92.59%> (-0.01%) ⬇️
#single 40.72% <37.03%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 93.54% <100%> (ø) ⬆️
pandas/io/formats/csvs.py 98.23% <100%> (+0.03%) ⬆️
pandas/io/common.py 91.5% <90.9%> (-0.33%) ⬇️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b90f9db...5853a28. Read the comment docs.

@drew-heenan
Copy link
Contributor Author

@gfyoung I agree that this approach makes more sense - I'll modify the functions to optionally take a dict for compression, i.e. {'compression': 'zip', 'arcname': 'data.csv'}. Thanks for the feedback!

@gfyoung
Copy link
Member

gfyoung commented Apr 8, 2019

@drew-heenan : Sounds good! One minor thing: let's use method as the key instead of compression. Otherwise, we'll be extracting compression["compression"], which is a little redundant.

.. versionchanged:: 0.25.0

May now be a dict with key 'method' as compression mode
and 'arcname' as CSV file name if mode is 'zip'
Copy link
Member

@gfyoung gfyoung Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this whatsnew note a little more generic. In reality, we should be just accepting any keyword arguments to BytesZipFile. Also, we should have an example.

@@ -36,8 +36,20 @@ def __init__(self, obj, path_or_buf=None, sep=",", na_rep='',
if path_or_buf is None:
path_or_buf = StringIO()

self._compression_arg = compression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't you change get_filepath_or_buffer to already handle this? why is this special cased here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not modify get_filepath_or_buffer, though I suppose I certainly could to support taking a dict as compression. The self._compression_arg is there to avoid changing self.compression from only ever holding the inferred compression method, while self._compression_arg would include any additional arguments if compression was passed as a dict. I do now think that I should have instead had a self.compression_args hold this dict with the method key popped.

Copy link
Contributor Author

@drew-heenan drew-heenan Apr 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback With regard to get_filepath_or_buffer (and similarly _infer_compression), I've added a function _get_compression_method which handles the case where compression is given as a dict to to_csv, CSVFormatter or _get_handle. It extracts the compression method string before passing to get_filepath_buffer or _infer_compression.
Would it be preferable then to keep both functions' original functionality where compression may only be a string/None, as neither need the additional arguments, or include a call to _get_compression_method in both to handle dicts?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add type annotations for the changed / added parameters?

elif compression == 'zip':
zf = BytesZipFile(path_or_buf, mode)
elif compression_method == 'zip':
arcname = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this to archive_name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd The specific instance of arcname above is no longer there; Though, do you mean change arcname to archive_name in general, including the dict key?

@drew-heenan drew-heenan force-pushed the issue-26023 branch 3 times, most recently from 6058bbb to b1889ef Compare April 12, 2019 06:48
@drew-heenan
Copy link
Contributor Author

@WillAyd I've added type annotations to parameters which I've changed, but it seems that doing so caused the typing validation check to fail on other parts of the files which I have not modified.

@drew-heenan drew-heenan changed the title ENH: Add arcname to to_csv for ZIP compressed CSV filename (#26023) ENH: Allow compression in NDFrame.to_csv to be a dict with optional arguments (#26023) Apr 14, 2019
@WillAyd
Copy link
Member

WillAyd commented Apr 14, 2019 via email

@gfyoung
Copy link
Member

gfyoung commented Jul 15, 2019

@WillAyd : Do you plan on annotating on affected methods in this PR? I see that some of the inputs to some of the affected methods are annotated, but not all of them.

@WillAyd
Copy link
Member

WillAyd commented Jul 15, 2019

I can take a look

@WillAyd
Copy link
Member

WillAyd commented Jul 16, 2019

OK @gfyoung added annotations where I think feasible. Two blockers prevented full annotations of modified funcs, particularly _get_handle:

  1. An actual bug in type shed (see TextIOWrapper.encoding missing Optional Argument python/typeshed#3124)
  2. Variable reuse (see f in _get_handle); this requires a refactor that I think expands the diff a little much so better done as follow up

@@ -435,13 +486,23 @@ class BytesZipFile(zipfile.ZipFile, BytesIO): # type: ignore
"""

# GH 17778
def __init__(self, file, mode, compression=zipfile.ZIP_DEFLATED, **kwargs):
def __init__(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be clear in the diff but note that I removed the compression argument here. The reason for this was that it is never actually called in code.

It makes annotations more complex, because the keyword argument unpacking of compression_args in this PR only ever has str as keys and therefore mypy complains that there is a type mismatch because int values are never unpacked. Figured easier to just remove than muck around types since it is not ever used

@WillAyd WillAyd added this to the 1.0 milestone Aug 24, 2019
@WillAyd
Copy link
Member

WillAyd commented Aug 24, 2019

Merged again to keep fresh. There's a Mypy failure I'll have to look at later.

@TomAugspurger @gfyoung would you still have time to review this one coming up? Have a lot in the queue so just want to prioritize workflow with you

@gfyoung
Copy link
Member

gfyoung commented Aug 24, 2019

It looks pretty good overall!

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think good for review again @gfyoung

path_or_buf, mode, encoding=None, compression=None, memory_map=False, is_text=True
path_or_buf,
mode: str,
encoding=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't annotate this particular argument due to a minor bug in typeshed. Fixed on master so maybe something we can come back to soon (typeshed updates are pretty quick)

see python/typeshed#3125

@WillAyd WillAyd merged commit 0d0daa8 into pandas-dev:master Aug 26, 2019
@WillAyd
Copy link
Member

WillAyd commented Aug 26, 2019

Thanks @drew-heenan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.DataFrame.to_csv('filename.zip') doesn't extract with a '.csv' extension
6 participants