
DOC for refactored compression (GH14576) + BUG: bz2-compressed URL with C engine (GH14874) #14880


Closed. Wants to merge 10 commits from dhimmel/whats-new into master.

Conversation


@dhimmel dhimmel commented Dec 14, 2016

Follow up on #14576, which refactored compression code to expand URL support.

Fixes up some small remaining issues and adds a what's new entry.

url = ('https://github.com/pandas-dev/pandas/raw/master/'
       'pandas/io/tests/parser/data/salaries.csv.bz2')
pd.read_table(url, compression='infer')  # default, infer compression
pd.read_table(url, compression='xz')     # explicitly specify compression
Member:

This will give an error, as the url is bz2, is that the intention?

Contributor Author:

No, this is a mistake... they should both be bz2 or xz.

Member:

Can you also add .head() or so, as otherwise the output will get a bit long for just demonstrating the compression keyword

Member:

or assign to df = without showing its content, also a possibility

Contributor Author:

Didn't realize this code would get executed. Now assigns to df and then uses head to display first two rows. In cb40a41.
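For reference, the corrected example would look roughly like the sketch below. The real snippet reads salaries.csv.bz2 from a GitHub URL; a locally written bz2 file stands in here so the sketch runs without network access (the file name and contents are illustrative):

```python
# Sketch of the corrected what's-new example: assign to df, then use
# head() so the rendered output stays short. A local bz2 file replaces
# the GitHub URL from the PR so this runs offline.
import bz2
import os
import tempfile

import pandas as pd

csv_text = "name\tsalary\nalice\t100\nbob\t90\ncarol\t80\n"
path = os.path.join(tempfile.mkdtemp(), "salaries.tsv.bz2")
with open(path, "wb") as fh:
    fh.write(bz2.compress(csv_text.encode("utf-8")))

df = pd.read_table(path, compression="infer")  # inferred from the .bz2 extension
df = pd.read_table(path, compression="bz2")    # or specified explicitly
print(df.head(2))                              # display only the first two rows
```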

'handle')
content = source.read()
source.close()
from pandas.compat import StringIO
Member:

pandas.compat is already imported as compat

Contributor Author:

Fixed those two issues. Was a nice opportunity for me to try out --fixup. We can do the --autosquash when the PR looks ready to go.

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Dec 14, 2016
@jorisvandenbossche jorisvandenbossche added the IO label (Data IO issues that don't fit into a more specific label) Dec 14, 2016
@@ -44,6 +44,19 @@ fixed-width text files, and :func:`read_excel` for parsing Excel files.
pd.read_fwf(StringIO(data)).dtypes
pd.read_fwf(StringIO(data), dtype={'a':'float64', 'b':'object'}).dtypes

Reading dataframes from URLs, in :func:`read_csv` or :func:`read_table`, now
Contributor:

I would make a sub-section

Reading dataframes from URLs, in :func:`read_csv` or :func:`read_table`, now
supports additional compression methods (`xz`, `bz2`, `zip`). Previously, only
`gzip` compression was supported. By default, compression of URLs and paths are
now both inferred using their file extensions.
Contributor:

can you add all of the issues that were closed here, e.g. (:issue:`....`, :issue:`...`)


jreback commented Dec 14, 2016

lgtm (minor doc comment)


dhimmel commented Dec 14, 2016

Okay I cleaned up the commit history with

git rebase --interactive --autosquash 30025d82564fc27fbab58fbd791009e5b77a23db
git push --force

Assuming tests pass, I think this PR is ready. @jreback / @jorisvandenbossche -- you may want to double check the what's new, since I made some additional changes in 3f4cd45.

@jorisvandenbossche jorisvandenbossche changed the title Make refactored compression code release ready DOC for refactored compression (GH14576) + BUG: bz2-compressed URL with C engine (GH14874) Dec 15, 2016
@jorisvandenbossche jorisvandenbossche left a comment (Member):

Only some remaining lint:

pandas/io/tests/parser/compression.py:11:1: F401 'pandas.compat' imported but unused

Get file handle for given path/buffer and mode.

Get the compression method for filepath_or_buffer. If compression='infer',
the inferred compression method is returned. Otherwise, the input
compression method is returned unchanged, unless it's invalid, in which case
Member:

PEP8 line too long (see bottom third travis test)
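As an aside, the behaviour the docstring above describes can be sketched as follows. This is an illustrative stand-in, not the actual pandas `_infer_compression` implementation:

```python
# Illustrative sketch of extension-based compression inference, as the
# docstring above describes it (not the real pandas internals).
_EXTENSION_MAP = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}
_VALID = {None, "gzip", "bz2", "zip", "xz"}

def infer_compression(filepath_or_buffer, compression):
    if compression == "infer":
        if not isinstance(filepath_or_buffer, str):
            return None  # buffers carry no extension to inspect
        for ext, method in _EXTENSION_MAP.items():
            if filepath_or_buffer.endswith(ext):
                return method
        return None  # no recognized extension
    if compression in _VALID:
        return compression  # returned unchanged
    raise ValueError("Unrecognized compression type: %s" % compression)

print(infer_compression("salaries.csv.bz2", "infer"))  # bz2
```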


codecov-io commented Dec 15, 2016

Current coverage is 84.55% (diff: 100%)

Merging #14880 into master will decrease coverage by 0.75%

@@             master     #14880   diff @@
==========================================
  Files           144        144          
  Lines         51015      51043    +28   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43522      43160   -362   
- Misses         7493       7883   +390   
  Partials          0          0          

Powered by Codecov. Last update d1b1720...e1b5d42

Reference corresponding issues in What's New.

Change code example to use string formatting for improved modularity.

Add what's new id

dhimmel commented Dec 15, 2016

@jorisvandenbossche, I fixed the PEP8 issues. Had to rebase to resolve a conflict that arose in doc/source/whatsnew/v0.20.0.txt and did some squashing/rearranging. As a result, the chronological commit order GitHub shows isn't the actual commit order.

Also there may be an issue with S3 tests failing. I was going to wait for the latest build and see if the issues are still there.

@jorisvandenbossche (Member):

It's indeed failing. What is the reason for this? It seems you didn't change anything related to that?

did some squashing/rearranging

Just as a personal recommendation, I would not put too much effort in it. Previously we asked people to squash their PRs, but now the github interface (and our own tooling) has improved, and we squash on merge. Your previous PR was just a special case as there were two commit authors to retain.


dhimmel commented Dec 16, 2016

It's indeed failing. What is the reason for this? It seems you didn't change anything related to that?

Some S3 tests were still expecting bz2 failure in Python 2. Fixed in 8568aed.

Previously we asked people to squash their PRs, but now the github interface (and our own tooling) has improved, and we squash on merge.

Instead of rebasing, is it acceptable to merge in the current master? This way we don't keep destroying references to old commit hashes.
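(A quick self-contained demonstration of the point: merging upstream keeps existing commit hashes reachable in history, whereas a rebase would rewrite them. The repo, branch, and file names below are throwaway examples.)

```shell
set -eu
# Build a throwaway repo with a feature branch.
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main repo
cd repo
git config user.email dev@example.com
git config user.name dev
echo base > f.txt; git add f.txt; git commit -qm "base"
git checkout -qb whats-new
echo feature > g.txt; git add g.txt; git commit -qm "feature work"
feat=$(git rev-parse HEAD)          # hash reviewers may have linked to
# Upstream moves on.
git checkout -q main
echo more > f.txt; git add f.txt; git commit -qm "upstream change"
# Merging main into the branch: the old feature commit survives intact.
git checkout -q whats-new
git merge -q -m "merge main into whats-new" main
git log --format=%H | grep -q "$feat" && echo "feature commit preserved: $feat"
```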

@dhimmel dhimmel mentioned this pull request Dec 16, 2016
dataframes from URLs in :func:`read_csv` or :func:`read_table` now supports
additional compression methods: ``xz``, ``bz2``, and ``zip`` (:issue:`14570`).
Previously, only ``gzip`` compression was supported. By default, compression of
URLs and paths are now both inferred using their file extensions. Additionally,
Contributor:

The compression code

paths are now inferred using (remove "both")

Additionally, support for bz2 compression in the Python 2 C engine improved.

Contributor Author:

Addressed comments 1 and 3 in e1b5d42. @jreback, I didn't change:

By default, compression of URLs and paths are now both inferred using their file extensions.

Previously, compression of paths was by default inferred from their extension, but not URLs. Now both are inferred by their extension. Am I missing something?


Compression code was refactored (:issue:`12688`). As a result, reading
dataframes from URLs in :func:`read_csv` or :func:`read_table` now supports
Contributor:

if there are any other issues that were closed by this, pls list them as well.

Contributor Author:

Rechecked... they're all already listed.


jreback commented Dec 16, 2016

@dhimmel very minor doc changes. ping on when green.

@TomAugspurger after this is merged, pls rebase #13137


dhimmel commented Dec 17, 2016

@dhimmel very minor doc changes. ping on when green.

@jreback @jorisvandenbossche -- we're green.

@jreback jreback closed this in e80a2b9 Dec 17, 2016

jreback commented Dec 17, 2016

thanks @dhimmel great responsiveness!

ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
…th C engine (GH14874)

Follow up on pandas-dev#14576, which
refactored compression code to expand URL support.    Fixes up some
small remaining issues and adds a what's new entry.    - [x] Closes
pandas-dev#14874

Author: Daniel Himmelstein <[email protected]>

Closes pandas-dev#14880 from dhimmel/whats-new and squashes the following commits:

e1b5d42 [Daniel Himmelstein] Address what's new review comments
8568aed [Daniel Himmelstein] TST: Read bz2 files from S3 in PY2
09dcbff [Daniel Himmelstein] DOC: Improve what's new
c4ea3d3 [Daniel Himmelstein] STY: PEP8 fixes
f8a7900 [Daniel Himmelstein] TST: check bz2 compression in PY2 c engine
0e0fa0a [Daniel Himmelstein] DOC: Reword get_filepath_or_buffer docstring
210fb20 [Daniel Himmelstein] DOC: What's New for refactored compression code
cb91007 [Daniel Himmelstein] TST: Read compressed URLs with c engine
85630ea [Daniel Himmelstein] ENH: Support bz2 compression in PY2 for c engine
a7960f6 [Daniel Himmelstein] DOC: Improve _infer_compression docstring
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016