-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix the error when reading the compressed UTF-16 file #18091
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix the error when reading the compressed UTF-16 file #18091
Conversation
a327715
to
2f29a61
Compare
2f29a61
to
ae3d0d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. ex small comments. can you add a whatsnew note, bug fix io for 0.21.1
@@ -684,6 +684,12 @@ cdef class TextReader: | |||
else: | |||
raise ValueError('Unrecognized compression type: %s' % | |||
self.compression) | |||
|
|||
if b'utf-16' in (self.encoding or b''): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here on what is going on
pandas/io/parsers.py
Outdated
@@ -1671,7 +1671,8 @@ def __init__(self, src, **kwds): | |||
|
|||
ParserBase.__init__(self, kwds) | |||
|
|||
if 'utf-16' in (kwds.get('encoding') or ''): | |||
if kwds.get('compression') is None \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, use parentheses instead of the backslash to wrap multi-line conditional.
pandas/tests/io/parser/common.py
Outdated
@@ -750,6 +750,15 @@ def test_utf16_example(self): | |||
result = self.read_table(buf, encoding='utf-16') | |||
assert len(result) == 50 | |||
|
|||
def test_compressed_utf16_example(self): | |||
path = tm.get_data_path('utf16_ex.zip') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number here
Codecov Report
@@ Coverage Diff @@
## master #18091 +/- ##
==========================================
- Coverage 91.27% 91.26% -0.02%
==========================================
Files 163 163
Lines 50120 50120
==========================================
- Hits 45749 45740 -9
- Misses 4371 4380 +9
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #18091 +/- ##
==========================================
- Coverage 91.25% 91.23% -0.02%
==========================================
Files 163 163
Lines 50120 50120
==========================================
- Hits 45737 45728 -9
- Misses 4383 4392 +9
Continue to review full report at Codecov.
|
ae3d0d0
to
ea8d6bb
Compare
Thanks @jreback, fixed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments
doc/source/whatsnew/v0.21.1.txt
Outdated
@@ -76,7 +76,7 @@ I/O | |||
^^^ | |||
|
|||
- Bug in class:`~pandas.io.stata.StataReader` not converting date/time columns with display formatting addressed (:issue:`17990`). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects. | |||
|
|||
- Bug in :func:`read_table` when reading the compressed UTF-16 file (:issue:`18071`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change to read_csv which is the common spelling here
pandas/_libs/parsers.pyx
Outdated
|
||
if b'utf-16' in (self.encoding or b''): | ||
# if source is utf-16, convert source to utf-8 | ||
source = com.UTF8Recoder(source, self.encoding.decode('utf-8')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a short 'why' we are doing this?
@Licht-T : Don't forget to add a test and check that the Python parser doesn't need patching either. |
@gfyoung I added the test in [pandas] python3 -m pytest pandas -k test_compressed_utf16_example -v 2:50:49 ☁ fix-read-zipped-utf-16-file ☀
========================================================================== test session starts ==========================================================================
platform darwin -- Python 3.6.2, pytest-3.2.3, py-1.4.34, pluggy-0.4.0 -- /usr/local/opt/python3/bin/python3.6
cachedir: .cache
metadata: {'Python': '3.6.2', 'Platform': 'Darwin-15.4.0-x86_64-i386-64bit', 'Packages': {'pytest': '3.2.3', 'py': '1.4.34', 'pluggy': '0.4.0'}, 'Plugins': {'metadata': '1.5.0', 'html': '1.15.2'}}
rootdir: /Users/rito/GitHub/pandas, inifile: setup.cfg
plugins: metadata-1.5.0, html-1.15.2
collected 15953 items / 2 skipped
pandas/tests/io/parser/test_parsers.py::TestCParserHighMemory::test_compressed_utf16_example <- pandas/tests/io/parser/common.py PASSED
pandas/tests/io/parser/test_parsers.py::TestCParserLowMemory::test_compressed_utf16_example <- pandas/tests/io/parser/common.py PASSED
pandas/tests/io/parser/test_parsers.py::TestPythonParser::test_compressed_utf16_example <- pandas/tests/io/parser/common.py PASSED
======================================================================== 15950 tests deselected =========================================================================
======================================================== 3 passed, 2 skipped, 15950 deselected in 16.35 seconds =========================================================
|
ea8d6bb
to
9da8edd
Compare
@jreback Fixed! |
pandas/tests/io/parser/common.py
Outdated
expected = self.read_table(expected_path, encoding='utf-16') | ||
|
||
tm.assert_frame_equal(result, expected) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Move this test to
compression.py
( same directory) - Explicitly construct the expected table instead of reading it via a text file
doc/source/whatsnew/v0.21.1.txt
Outdated
@@ -76,7 +76,7 @@ I/O | |||
^^^ | |||
|
|||
- Bug in class:`~pandas.io.stata.StataReader` not converting date/time columns with display formatting addressed (:issue:`17990`). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects. | |||
|
|||
- Bug in :func:`read_csv` when reading the compressed UTF-16 file (:issue:`18071`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the compressed UTF-16" --> "a compressed UTF-16 encoded"
9da8edd
to
b2a3f97
Compare
Thanks @gfyoung, fixed! |
pandas/_libs/parsers.pyx
Outdated
@@ -374,6 +374,17 @@ cdef class TextReader: | |||
float_precision=None, | |||
skip_blank_lines=True): | |||
|
|||
# encoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that you copied and pasted this, but let's take this opportunity to provide a much-more informative comment about this whole block of logic (a sentence is sufficient).
@gfyoung Added comment. |
thanks @Licht-T |
(cherry picked from commit e0c9c6)
git diff upstream/master -u -- "*.py" | flake8 --diff