DOC for refactored compression (GH14576) + BUG: bz2-compressed URL with C engine (GH14874) #14880
Conversation
url = ('https://github.com/pandas-dev/pandas/raw/master/' +
       'pandas/io/tests/parser/data/salaries.csv.bz2')
pd.read_table(url, compression='infer')  # default, infer compression
pd.read_table(url, compression='xz')     # explicitly specify compression
This will give an error, as the url is bz2, is that the intention?
No, this is a mistake... they should both be bz2 or xz.
Can you also add `.head()` or so, as otherwise the output will get a bit long for just demonstrating the compression keyword.
Or assign to `df =` without showing its content, also a possibility.
Didn't realize this code would get executed. It now assigns to `df` and then uses `head` to display the first two rows, in cb40a41.
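For reference, a minimal sketch of the shape the documentation example takes after these comments (both calls use bz2, the result is assigned to `df`, and only the first rows are shown); the exact wording in cb40a41 may differ:

```python
import pandas as pd

# URL copied from the diff above; both calls read the same bz2-compressed CSV.
url = ('https://github.com/pandas-dev/pandas/raw/master/' +
       'pandas/io/tests/parser/data/salaries.csv.bz2')

df = pd.read_table(url, compression='infer')  # default: inferred from the .bz2 extension
df = pd.read_table(url, compression='bz2')    # or name the compression explicitly
df.head(2)                                    # display only the first two rows
```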
'handle')
content = source.read()
source.close()
from pandas.compat import StringIO
`pandas.compat` is already imported as `compat`.
Fixed those two issues. Was a nice opportunity for me to try out `--fixup`. We can do the `--autosquash` when the PR looks ready to go.
@@ -44,6 +44,19 @@ fixed-width text files, and :func:`read_excel` for parsing Excel files.
pd.read_fwf(StringIO(data)).dtypes
pd.read_fwf(StringIO(data), dtype={'a':'float64', 'b':'object'}).dtypes
Reading dataframes from URLs, in :func:`read_csv` or :func:`read_table`, now |
I would make a sub-section
Reading dataframes from URLs, in :func:`read_csv` or :func:`read_table`, now
supports additional compression methods (`xz`, `bz2`, `zip`). Previously, only
`gzip` compression was supported. By default, compression of URLs and paths are
now both inferred using their file extensions.
can you add all of the issues that were closed here, e.g. (:issue:`....`, :issue:`...`)
lgtm (minor doc comment)
Okay, I cleaned up the commit history with `git rebase --interactive --autosquash 30025d82564fc27fbab58fbd791009e5b77a23db` followed by `git push --force`. Assuming tests pass, I think this PR is ready. @jreback / @jorisvandenbossche -- you may want to double check the what's new, since I made some additional changes in 3f4cd45.
Only some remaining lint:
pandas/io/tests/parser/compression.py:11:1: F401 'pandas.compat' imported but unused
Get file handle for given path/buffer and mode.
Get the compression method for filepath_or_buffer. If compression='infer',
the inferred compression method is returned. Otherwise, the input
compression method is returned unchanged, unless it's invalid, in which case
PEP8: line too long (see the bottom third of the Travis log).
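For context, the docstring quoted above describes the compression-inference helper touched by this PR. Below is a minimal sketch of the idea, assuming a simple extension-to-method mapping; it is an illustration, not the actual pandas implementation.

```python
# Hypothetical illustration of extension-based compression inference;
# not the actual pandas helper.
_COMPRESSION_BY_EXTENSION = {'.gz': 'gzip', '.bz2': 'bz2', '.zip': 'zip', '.xz': 'xz'}


def infer_compression(filepath_or_buffer, compression='infer'):
    """Return the compression method to use for a path or URL.

    With compression='infer', the method is guessed from the file
    extension; otherwise the given method is returned unchanged, unless
    it is invalid, in which case a ValueError is raised.
    """
    valid = {None} | set(_COMPRESSION_BY_EXTENSION.values())
    if compression == 'infer':
        if not isinstance(filepath_or_buffer, str):
            return None  # a buffer has no extension to inspect
        for ext, method in _COMPRESSION_BY_EXTENSION.items():
            if filepath_or_buffer.endswith(ext):
                return method
        return None  # unrecognized extension -> treat as uncompressed
    if compression in valid:
        return compression
    raise ValueError('Unrecognized compression type: {}'.format(compression))
```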
Current coverage is 84.55% (diff: 100%)

@@            master   #14880    diff @@
==========================================
  Files          144      144
  Lines        51015    51043     +28
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
- Hits          43522    43160    -362
- Misses         7493     7883    +390
  Partials         0        0
Add what's new corresponding to pandas-dev#14576.
Reference corresponding issues in What's New. Change code example to use string formatting for improved modularity. Add what's new id.
@jorisvandenbossche, I fixed the PEP8 issues. Had to rebase to resolve a conflict that arose. Also, there may be an issue with S3 tests failing. I was going to wait for the latest build and see if the issues are still there.
It's indeed failing. What is the reason for this? It seems you didn't change anything related to that?
Just as a personal recommendation, I would not put too much effort into it. Previously we asked people to squash their PRs, but now the GitHub interface (and our own tooling) has improved, and we squash on merge. Your previous PR was just a special case, as there were two commit authors to retain.
Some S3 tests were still expecting bz2 failure in Python 2. Fixed in 8568aed.
Instead of rebasing, is it acceptable to merge in the current master? This way we don't keep destroying references to old commit hashes.
dataframes from URLs in :func:`read_csv` or :func:`read_table` now supports
additional compression methods: ``xz``, ``bz2``, and ``zip`` (:issue:`14570`).
Previously, only ``gzip`` compression was supported. By default, compression of
URLs and paths are now both inferred using their file extensions. Additionally, |
1. The compression code
2. paths are now inferred using (remove "both")
3. Additionally, support for bz2 compression in the Python 2 C engine improved.
Addressed comments 1 and 3 in e1b5d42. @jreback, I didn't change:
By default, compression of URLs and paths are now both inferred using their file extensions.
Previously, compression of paths was by default inferred from their extension, but not URLs. Now both are inferred by their extension. Am I missing something?
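To illustrate the behaviour being described (the file paths below are hypothetical): with the default `compression='infer'`, the same extension-based inference now applies to local paths and to URLs alike.

```python
import pandas as pd

# The .bz2 extension triggers bz2 decompression in both cases; previously
# only the local path would have had its compression inferred.
df_local = pd.read_csv('data/salaries.csv.bz2')                  # hypothetical local path
df_remote = pd.read_csv('https://example.com/salaries.csv.bz2')  # hypothetical URL
```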
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Compression code was refactored (:issue:`12688`). As a result, reading
dataframes from URLs in :func:`read_csv` or :func:`read_table` now supports |
if there are any other issues that were closed by this, pls list them as well.
Rechecked... they're all already listed.
@dhimmel very minor doc changes. ping on when green. @TomAugspurger after this is merged, pls rebase #13137
@jreback @jorisvandenbossche -- we're green.
thanks @dhimmel great responsiveness!
DOC for refactored compression (GH14576) + BUG: bz2-compressed URL with C engine (GH14874)

Follow up on pandas-dev#14576, which refactored compression code to expand URL support.
Fixes up some small remaining issues and adds a what's new entry.

- [x] Closes pandas-dev#14874

Author: Daniel Himmelstein <[email protected]>

Closes pandas-dev#14880 from dhimmel/whats-new and squashes the following commits:

e1b5d42 [Daniel Himmelstein] Address what's new review comments
8568aed [Daniel Himmelstein] TST: Read bz2 files from S3 in PY2
09dcbff [Daniel Himmelstein] DOC: Improve what's new
c4ea3d3 [Daniel Himmelstein] STY: PEP8 fixes
f8a7900 [Daniel Himmelstein] TST: check bz2 compression in PY2 c engine
0e0fa0a [Daniel Himmelstein] DOC: Reword get_filepath_or_buffer docstring
210fb20 [Daniel Himmelstein] DOC: What's New for refactored compression code
cb91007 [Daniel Himmelstein] TST: Read compressed URLs with c engine
85630ea [Daniel Himmelstein] ENH: Support bz2 compression in PY2 for c engine
a7960f6 [Daniel Himmelstein] DOC: Improve _infer_compression docstring
Follow up on #14576, which refactored compression code to expand URL support.
Fixes up some small remaining issues and adds a what's new entry.