DOC for refactored compression (GH14576) + BUG: bz2-compressed URL with C engine (GH14874) #14880
Conversation
url = ('https://github.com/pandas-dev/pandas/raw/master/' +
       'pandas/io/tests/parser/data/salaries.csv.bz2')
pd.read_table(url, compression='infer')  # default, infer compression
pd.read_table(url, compression='xz')     # explicitly specify compression
This will give an error, as the url is bz2, is that the intention?
No, this is a mistake... they should both be bz2 or xz.
Can you also add `.head()` or so, as otherwise the output will get a bit long for just demonstrating the compression keyword.
Or assign to `df =` without showing its content, also a possibility.
Didn't realize this code would get executed. It now assigns to `df` and then uses `head` to display the first two rows, in cb40a41.
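For reference, a minimal sketch of the shape the documentation example takes after these comments (both calls use bz2, the result is assigned to `df`, and only the first rows are shown); the exact wording in cb40a41 may differ:

```python
import pandas as pd

# URL copied from the diff above; both calls read the same bz2-compressed CSV.
url = ('https://github.com/pandas-dev/pandas/raw/master/' +
       'pandas/io/tests/parser/data/salaries.csv.bz2')

df = pd.read_table(url, compression='infer')  # default: inferred from the .bz2 extension
df = pd.read_table(url, compression='bz2')    # or name the compression explicitly
df.head(2)                                    # display only the first two rows
```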
'handle')
content = source.read()
source.close()
from pandas.compat import StringIO
`pandas.compat` is already imported as `compat`.
Fixed those two issues. Was a nice opportunity for me to try out `--fixup`. We can do the `--autosquash` when the PR looks ready to go.
@@ -44,6 +44,19 @@ fixed-width text files, and :func:`read_excel` for parsing Excel files.
pd.read_fwf(StringIO(data)).dtypes
pd.read_fwf(StringIO(data), dtype={'a':'float64', 'b':'object'}).dtypes
Reading dataframes from URLs, in :func:`read_csv` or :func:`read_table`, now |
I would make a sub-section
Reading dataframes from URLs, in :func:`read_csv` or :func:`read_table`, now
supports additional compression methods (`xz`, `bz2`, `zip`). Previously, only
`gzip` compression was supported. By default, compression of URLs and paths are
now both inferred using their file extensions.
can you add all of the issues that were closed here, e.g. (:issue:`....`, :issue:`...`)
lgtm (minor doc comment)
Okay, I cleaned up the commit history with `git rebase --interactive --autosquash 30025d82564fc27fbab58fbd791009e5b77a23db` followed by `git push --force`. Assuming tests pass, I think this PR is ready. @jreback / @jorisvandenbossche -- you may want to double check the what's new, since I made some additional changes in 3f4cd45.
Only some remaining lint:
pandas/io/tests/parser/compression.py:11:1: F401 'pandas.compat' imported but unused
Get file handle for given path/buffer and mode.
Get the compression method for filepath_or_buffer. If compression='infer',
the inferred compression method is returned. Otherwise, the input
compression method is returned unchanged, unless it's invalid, in which case
PEP8: line too long (see the bottom third of the Travis log).
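For context, the docstring quoted above describes the compression-inference helper touched by this PR. Below is a minimal sketch of the idea, assuming a simple extension-to-method mapping; it is an illustration, not the actual pandas implementation.

```python
# Hypothetical illustration of extension-based compression inference;
# not the actual pandas helper.
_COMPRESSION_BY_EXTENSION = {'.gz': 'gzip', '.bz2': 'bz2', '.zip': 'zip', '.xz': 'xz'}


def infer_compression(filepath_or_buffer, compression='infer'):
    """Return the compression method to use for a path or URL.

    With compression='infer', the method is guessed from the file
    extension; otherwise the given method is returned unchanged, unless
    it is invalid, in which case a ValueError is raised.
    """
    valid = {None} | set(_COMPRESSION_BY_EXTENSION.values())
    if compression == 'infer':
        if not isinstance(filepath_or_buffer, str):
            return None  # a buffer has no extension to inspect
        for ext, method in _COMPRESSION_BY_EXTENSION.items():
            if filepath_or_buffer.endswith(ext):
                return method
        return None  # unrecognized extension -> treat as uncompressed
    if compression in valid:
        return compression
    raise ValueError('Unrecognized compression type: {}'.format(compression))
```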
Current coverage is 84.55% (diff: 100%)

@@            master   #14880    diff @@
==========================================
  Files          144      144
  Lines        51015    51043     +28
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
- Hits          43522    43160    -362
- Misses         7493     7883    +390
  Partials         0        0
Add what's new corresponding to pandas-dev#14576.
Reference corresponding issues in What's New. Change code example to use string formatting for improved modularity. Add what's new id.
@jorisvandenbossche, I fixed the PEP8 issues. Had to rebase to resolve a conflict that arose. Also, there may be an issue with S3 tests failing. I was going to wait for the latest build and see if the issues are still there.
It's indeed failing. What is the reason for this? It seems you didn't change anything related to that?
Just as a personal recommendation, I would not put too much effort into it. Previously we asked people to squash their PRs, but now the GitHub interface (and our own tooling) has improved, and we squash on merge. Your previous PR was just a special case, as there were two commit authors to retain.
Some S3 tests were still expecting bz2 failure in Python 2. Fixed in 8568aed.
Instead of rebasing, is it acceptable to merge in the current master? This way we don't keep destroying references to old commit hashes.
dataframes from URLs in :func:`read_csv` or :func:`read_table` now supports
additional compression methods: ``xz``, ``bz2``, and ``zip`` (:issue:`14570`).
Previously, only ``gzip`` compression was supported. By default, compression of
URLs and paths are now both inferred using their file extensions. Additionally, |
1. The compression code
2. paths are now inferred using (remove "both")
3. Additionally, support for bz2 compression in the Python 2 C engine improved.
Addressed comments 1 and 3 in e1b5d42. @jreback, I didn't change:
By default, compression of URLs and paths are now both inferred using their file extensions.
Previously, compression of paths was by default inferred from their extension, but not URLs. Now both are inferred by their extension. Am I missing something?
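To illustrate the behaviour being described (the file paths below are hypothetical): with the default `compression='infer'`, the same extension-based inference now applies to local paths and to URLs alike.

```python
import pandas as pd

# The .bz2 extension triggers bz2 decompression in both cases; previously
# only the local path would have had its compression inferred.
df_local = pd.read_csv('data/salaries.csv.bz2')                  # hypothetical local path
df_remote = pd.read_csv('https://example.com/salaries.csv.bz2')  # hypothetical URL
```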
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Compression code was refactored (:issue:`12688`). As a result, reading
dataframes from URLs in :func:`read_csv` or :func:`read_table` now supports |
if there are any other issues that were closed by this, pls list them as well.
Rechecked... they're all already listed.
@dhimmel very minor doc changes. ping on when green. @TomAugspurger after this is merged, pls rebase #13137
@jreback @jorisvandenbossche -- we're green.
thanks @dhimmel great responsiveness!
DOC for refactored compression (GH14576) + BUG: bz2-compressed URL with C engine (GH14874)

Follow up on pandas-dev#14576, which refactored compression code to expand URL support.
Fixes up some small remaining issues and adds a what's new entry.

- [x] Closes pandas-dev#14874

Author: Daniel Himmelstein <[email protected]>

Closes pandas-dev#14880 from dhimmel/whats-new and squashes the following commits:

e1b5d42 [Daniel Himmelstein] Address what's new review comments
8568aed [Daniel Himmelstein] TST: Read bz2 files from S3 in PY2
09dcbff [Daniel Himmelstein] DOC: Improve what's new
c4ea3d3 [Daniel Himmelstein] STY: PEP8 fixes
f8a7900 [Daniel Himmelstein] TST: check bz2 compression in PY2 c engine
0e0fa0a [Daniel Himmelstein] DOC: Reword get_filepath_or_buffer docstring
210fb20 [Daniel Himmelstein] DOC: What's New for refactored compression code
cb91007 [Daniel Himmelstein] TST: Read compressed URLs with c engine
85630ea [Daniel Himmelstein] ENH: Support bz2 compression in PY2 for c engine
a7960f6 [Daniel Himmelstein] DOC: Improve _infer_compression docstring
Follow up on #14576, which refactored compression code to expand URL support.
Fixes up some small remaining issues and adds a what's new entry.