Skip to content

BUG: Fix the error when reading the compressed UTF-16 file #18091

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Nov 4, 2017

Conversation

Licht-T
Copy link
Contributor

@Licht-T Licht-T commented Nov 3, 2017

@Licht-T Licht-T force-pushed the fix-read-zipped-utf-16-file branch from a327715 to 2f29a61 Compare November 3, 2017 11:45
@Licht-T Licht-T force-pushed the fix-read-zipped-utf-16-file branch from 2f29a61 to ae3d0d0 Compare November 3, 2017 12:42
@jreback jreback added Bug IO CSV read_csv, to_csv labels Nov 3, 2017
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. ex small comments. can you add a whatsnew note, bug fix io for 0.21.1

@@ -684,6 +684,12 @@ cdef class TextReader:
else:
raise ValueError('Unrecognized compression type: %s' %
self.compression)

if b'utf-16' in (self.encoding or b''):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here on what is going on

@@ -1671,7 +1671,8 @@ def __init__(self, src, **kwds):

ParserBase.__init__(self, kwds)

if 'utf-16' in (kwds.get('encoding') or ''):
if kwds.get('compression') is None \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment here

Copy link
Member

@gfyoung gfyoung Nov 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, use parentheses instead of the backslash to wrap multi-line conditional.

@@ -750,6 +750,15 @@ def test_utf16_example(self):
result = self.read_table(buf, encoding='utf-16')
assert len(result) == 50

def test_compressed_utf16_example(self):
path = tm.get_data_path('utf16_ex.zip')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number here

@codecov
Copy link

codecov bot commented Nov 3, 2017

Codecov Report

Merging #18091 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18091      +/-   ##
==========================================
- Coverage   91.27%   91.26%   -0.02%     
==========================================
  Files         163      163              
  Lines       50120    50120              
==========================================
- Hits        45749    45740       -9     
- Misses       4371     4380       +9
Flag Coverage Δ
#multiple 89.07% <100%> (ø) ⬆️
#single 40.32% <100%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.51% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4375bd...ae3d0d0. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 3, 2017

Codecov Report

Merging #18091 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18091      +/-   ##
==========================================
- Coverage   91.25%   91.23%   -0.02%     
==========================================
  Files         163      163              
  Lines       50120    50120              
==========================================
- Hits        45737    45728       -9     
- Misses       4383     4392       +9
Flag Coverage Δ
#multiple 89.04% <100%> (ø) ⬆️
#single 40.32% <100%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.51% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/core/reshape/merge.py 94.26% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27bbea7...1a06857. Read the comment docs.

@Licht-T Licht-T force-pushed the fix-read-zipped-utf-16-file branch from ae3d0d0 to ea8d6bb Compare November 3, 2017 13:34
@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 3, 2017

Thanks @jreback, fixed!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments

@@ -76,7 +76,7 @@ I/O
^^^

- Bug in class:`~pandas.io.stata.StataReader` not converting date/time columns with display formatting addressed (:issue:`17990`). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects.

- Bug in :func:`read_table` when reading the compressed UTF-16 file (:issue:`18071`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change to read_csv which is the common spelling here


if b'utf-16' in (self.encoding or b''):
# if source is utf-16, convert source to utf-8
source = com.UTF8Recoder(source, self.encoding.decode('utf-8'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a short 'why' we are doing this?

@jreback jreback added this to the 0.21.1 milestone Nov 3, 2017
@gfyoung
Copy link
Member

gfyoung commented Nov 3, 2017

@Licht-T : Don't forget to add a test and check that the Python parser doesn't need patching either.

@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 3, 2017

@gfyoung I added the test in common.py, and seems that running in both parser types. Is this enough?

[pandas] python3 -m pytest pandas -k test_compressed_utf16_example -v                                                          2:50:49  ☁  fix-read-zipped-utf-16-file ☀
========================================================================== test session starts ==========================================================================
platform darwin -- Python 3.6.2, pytest-3.2.3, py-1.4.34, pluggy-0.4.0 -- /usr/local/opt/python3/bin/python3.6
cachedir: .cache
metadata: {'Python': '3.6.2', 'Platform': 'Darwin-15.4.0-x86_64-i386-64bit', 'Packages': {'pytest': '3.2.3', 'py': '1.4.34', 'pluggy': '0.4.0'}, 'Plugins': {'metadata': '1.5.0', 'html': '1.15.2'}}
rootdir: /Users/rito/GitHub/pandas, inifile: setup.cfg
plugins: metadata-1.5.0, html-1.15.2
collected 15953 items / 2 skipped

pandas/tests/io/parser/test_parsers.py::TestCParserHighMemory::test_compressed_utf16_example <- pandas/tests/io/parser/common.py PASSED
pandas/tests/io/parser/test_parsers.py::TestCParserLowMemory::test_compressed_utf16_example <- pandas/tests/io/parser/common.py PASSED
pandas/tests/io/parser/test_parsers.py::TestPythonParser::test_compressed_utf16_example <- pandas/tests/io/parser/common.py PASSED

======================================================================== 15950 tests deselected =========================================================================
======================================================== 3 passed, 2 skipped, 15950 deselected in 16.35 seconds =========================================================

@Licht-T Licht-T force-pushed the fix-read-zipped-utf-16-file branch from ea8d6bb to 9da8edd Compare November 3, 2017 18:02
@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 3, 2017

@jreback Fixed!

expected = self.read_table(expected_path, encoding='utf-16')

tm.assert_frame_equal(result, expected)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Move this test to compression.py ( same directory)
  • Explicitly construct the expected table instead of reading it via a text file

@@ -76,7 +76,7 @@ I/O
^^^

- Bug in class:`~pandas.io.stata.StataReader` not converting date/time columns with display formatting addressed (:issue:`17990`). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects.

- Bug in :func:`read_csv` when reading the compressed UTF-16 file (:issue:`18071`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the compressed UTF-16" --> "a compressed UTF-16 encoded"

@Licht-T Licht-T force-pushed the fix-read-zipped-utf-16-file branch from 9da8edd to b2a3f97 Compare November 3, 2017 19:42
@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 3, 2017

Thanks @gfyoung, fixed!

@@ -374,6 +374,17 @@ cdef class TextReader:
float_precision=None,
skip_blank_lines=True):

# encoding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that you copied and pasted this, but let's take this opportunity to provide a much-more informative comment about this whole block of logic (a sentence is sufficient).

@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 3, 2017

@gfyoung Added comment.

@jreback jreback merged commit e0c9c67 into pandas-dev:master Nov 4, 2017
@jreback
Copy link
Contributor

jreback commented Nov 4, 2017

thanks @Licht-T

gfyoung added a commit that referenced this pull request Nov 4, 2017
gfyoung pushed a commit that referenced this pull request Nov 4, 2017
1kastner pushed a commit to 1kastner/pandas that referenced this pull request Nov 5, 2017
1kastner pushed a commit to 1kastner/pandas that referenced this pull request Nov 5, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reading zipped utf-16 file: AttributeError: 'UTF8Recoder' object has no attribute 'seek'
4 participants