BUG: Fix the error when reading the compressed UTF-16 file #18091

Licht-T · 2017-11-03T11:42:20Z

- closes #18071 (Reading zipped utf-16 file: AttributeError: 'UTF8Recoder' object has no attribute 'seek')
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
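
For context, the failure mode named in the linked issue can be illustrated with the standard library alone. This is only a sketch, not pandas code: `Utf8Recoder`, `make_zip`, and `open_error` are hypothetical names, and the class merely mimics a recoding wrapper that exposes `read()` but no `seek()`. Handing such a wrapper to `zipfile.ZipFile`, which needs a seekable file object, fails with the same kind of `AttributeError`.

```python
import codecs
import io
import zipfile


class Utf8Recoder:
    """Hypothetical stand-in for a recoding wrapper: read-only, no seek()."""

    def __init__(self, stream, encoding):
        self.reader = codecs.getreader(encoding)(stream)

    def read(self, size=-1):
        # Decode from the source encoding and hand back UTF-8 bytes.
        return self.reader.read(size).encode("utf-8")


def make_zip(text, encoding):
    """Build an in-memory ZIP holding one member encoded with `encoding`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("data.csv", text.encode(encoding))
    buf.seek(0)
    return buf


def open_error(fileobj):
    """Return the AttributeError message raised by ZipFile, or None on success."""
    try:
        zipfile.ZipFile(fileobj).close()
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling `open_error(make_zip("a\tb\n", "utf-16"))` succeeds, while wrapping the same buffer in `Utf8Recoder` before opening it reports the missing `seek` attribute.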

jreback

LGTM, except for a few small comments. Can you add a whatsnew note under the I/O bug fixes for 0.21.1?

jreback · 2017-11-03T12:51:24Z

pandas/_libs/parsers.pyx

@@ -684,6 +684,12 @@ cdef class TextReader:
            else:
                raise ValueError('Unrecognized compression type: %s' %
                                 self.compression)
+
+            if b'utf-16' in (self.encoding or b''):


can you add a comment here on what is going on

jreback · 2017-11-03T12:51:31Z

pandas/io/parsers.py

@@ -1671,7 +1671,8 @@ def __init__(self, src, **kwds):

        ParserBase.__init__(self, kwds)

-        if 'utf-16' in (kwds.get('encoding') or ''):
+        if kwds.get('compression') is None \


comment here

Also, use parentheses instead of the backslash to wrap multi-line conditional.
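
A small sketch of the parenthesized form being requested. The helper name `should_recode_in_python` is hypothetical, and the second clause is assumed from the removed line in the hunk above, since the continuation of the new condition is not shown:

```python
def should_recode_in_python(kwds):
    """Hypothetical helper showing a multi-line condition wrapped in parentheses."""
    # Backslash form flagged in the review:
    #     if kwds.get('compression') is None \
    #             and ...:
    # Parenthesized form: the expression continues implicitly inside the
    # parentheses, so no trailing backslash is needed.
    return (kwds.get("compression") is None
            and "utf-16" in (kwds.get("encoding") or ""))
```

The parenthesized form also avoids the backslash pitfall where stray whitespace after the backslash turns the line into a syntax error.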

jreback · 2017-11-03T12:51:38Z

pandas/tests/io/parser/common.py

@@ -750,6 +750,15 @@ def test_utf16_example(self):
            result = self.read_table(buf, encoding='utf-16')
            assert len(result) == 50

+    def test_compressed_utf16_example(self):
+        path = tm.get_data_path('utf16_ex.zip')


can you add the issue number here

codecov · 2017-11-03T13:19:52Z

Codecov Report

Merging #18091 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18091      +/-   ##
==========================================
- Coverage   91.27%   91.26%   -0.02%     
==========================================
  Files         163      163              
  Lines       50120    50120              
==========================================
- Hits        45749    45740       -9     
- Misses       4371     4380       +9

Flag	Coverage Δ
#multiple	`89.07% <100%> (ø)`	⬆️
#single	`40.32% <100%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/parsers.py	`95.51% <100%> (ø)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.75% <0%> (-0.1%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b4375bd...ae3d0d0.

codecov · 2017-11-03T13:19:54Z

Codecov Report

Merging #18091 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18091      +/-   ##
==========================================
- Coverage   91.25%   91.23%   -0.02%     
==========================================
  Files         163      163              
  Lines       50120    50120              
==========================================
- Hits        45737    45728       -9     
- Misses       4383     4392       +9

Flag	Coverage Δ
#multiple	`89.04% <100%> (ø)`	⬆️
#single	`40.32% <100%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/parsers.py	`95.51% <100%> (ø)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.75% <0%> (-0.1%)`	⬇️
pandas/core/reshape/merge.py	`94.26% <0%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 27bbea7...1a06857.

Licht-T · 2017-11-03T13:35:45Z

Thanks @jreback, fixed!

jreback

comments

jreback · 2017-11-03T16:11:40Z

doc/source/whatsnew/v0.21.1.txt

@@ -76,7 +76,7 @@ I/O
 ^^^

 - Bug in class:`~pandas.io.stata.StataReader` not converting date/time columns with display formatting addressed (:issue:`17990`). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects.
-
+- Bug in :func:`read_table` when reading the compressed UTF-16 file (:issue:`18071`)


can you change to read_csv which is the common spelling here

jreback · 2017-11-03T16:11:58Z

pandas/_libs/parsers.pyx

+
+            if b'utf-16' in (self.encoding or b''):
+                # if source is utf-16, convert source to utf-8
+                source = com.UTF8Recoder(source, self.encoding.decode('utf-8'))


can you add a short 'why' we are doing this?
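
One way to read the ordering in this hunk: the recoding step comes after the compression handling, so the decompressed stream, not the archive itself, is what gets converted to UTF-8. A minimal stdlib sketch of that order, assuming a single-member ZIP. `read_zipped_as_utf8` and `make_zip` are hypothetical helpers, and `codecs.EncodedFile` stands in for `UTF8Recoder` here:

```python
import codecs
import io
import zipfile


def make_zip(text, encoding):
    """Build an in-memory ZIP holding one member encoded with `encoding`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("data.csv", text.encode(encoding))
    buf.seek(0)
    return buf


def read_zipped_as_utf8(fileobj, encoding):
    """Decompress first, then recode the member's bytes to UTF-8."""
    with zipfile.ZipFile(fileobj) as zf:
        # Assumption: the archive contains exactly one member.
        (name,) = zf.namelist()
        with zf.open(name) as member:
            # Bytes read from `member` are decoded as `encoding`
            # and re-encoded as UTF-8 on the way out.
            recoded = codecs.EncodedFile(member, "utf-8", encoding)
            return recoded.read()
```

The ZIP reader sees the original seekable buffer, and only the extracted member passes through the recoding wrapper.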

gfyoung · 2017-11-03T16:59:49Z

@Licht-T : Don't forget to add a test and check that the Python parser doesn't need patching either.

Licht-T · 2017-11-03T17:53:15Z

@gfyoung I added the test in common.py, and it seems to run under both parser types. Is this enough?

[pandas] python3 -m pytest pandas -k test_compressed_utf16_example -v
========================================================================== test session starts ==========================================================================
platform darwin -- Python 3.6.2, pytest-3.2.3, py-1.4.34, pluggy-0.4.0 -- /usr/local/opt/python3/bin/python3.6
cachedir: .cache
metadata: {'Python': '3.6.2', 'Platform': 'Darwin-15.4.0-x86_64-i386-64bit', 'Packages': {'pytest': '3.2.3', 'py': '1.4.34', 'pluggy': '0.4.0'}, 'Plugins': {'metadata': '1.5.0', 'html': '1.15.2'}}
rootdir: /Users/rito/GitHub/pandas, inifile: setup.cfg
plugins: metadata-1.5.0, html-1.15.2
collected 15953 items / 2 skipped

pandas/tests/io/parser/test_parsers.py::TestCParserHighMemory::test_compressed_utf16_example <- pandas/tests/io/parser/common.py PASSED
pandas/tests/io/parser/test_parsers.py::TestCParserLowMemory::test_compressed_utf16_example <- pandas/tests/io/parser/common.py PASSED
pandas/tests/io/parser/test_parsers.py::TestPythonParser::test_compressed_utf16_example <- pandas/tests/io/parser/common.py PASSED

======================================================================== 15950 tests deselected =========================================================================
======================================================== 3 passed, 2 skipped, 15950 deselected in 16.35 seconds =========================================================
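
The output above shows one test from common.py being collected under several parser test classes. A rough sketch of how such sharing can work, using plain `unittest`: the class names, the `engine` attribute, and the toy `read_table` are all hypothetical and do not reproduce the pandas test suite.

```python
import io
import unittest


class ParserTests:
    """Shared tests; each concrete subclass supplies its own `engine`."""

    engine = None

    def read_table(self, text):
        # Toy stand-in for a parser call: split tab-separated lines.
        return [line.split("\t") for line in text.splitlines()]

    def test_two_rows(self):
        result = self.read_table("a\tb\n1\t2")
        self.assertEqual(result, [["a", "b"], ["1", "2"]])


class TestCParser(ParserTests, unittest.TestCase):
    engine = "c"


class TestPythonParser(ParserTests, unittest.TestCase):
    engine = "python"


def run_shared_tests():
    """Run the shared test once per concrete class; return (count, success)."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (TestCParser, TestPythonParser):
        suite.addTests(loader.loadTestsFromTestCase(case))
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.testsRun, result.wasSuccessful()
```

Writing the test once in the shared class means every concrete parser class picks it up without duplication.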

Licht-T · 2017-11-03T18:02:44Z

@jreback Fixed!

gfyoung · 2017-11-03T18:51:20Z

pandas/tests/io/parser/common.py

+        expected = self.read_table(expected_path, encoding='utf-16')
+
+        tm.assert_frame_equal(result, expected)
+


Move this test to compression.py (same directory)

Explicitly construct the expected table instead of reading it via a text file
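
A stdlib-only sketch of what an explicitly constructed expectation could look like. The real test compares DataFrames with `tm.assert_frame_equal`; here `zipped_csv` and `read_zipped_rows` are hypothetical helpers and plain lists of rows stand in for the frame:

```python
import csv
import io
import zipfile


def zipped_csv(text, encoding, name="data.csv"):
    """Return an in-memory ZIP with one tab-separated member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, text.encode(encoding))
    buf.seek(0)
    return buf


def read_zipped_rows(buf, encoding, sep="\t"):
    """Read the single member of the ZIP as rows of strings."""
    with zipfile.ZipFile(buf) as zf:
        (name,) = zf.namelist()
        with zf.open(name) as member:
            text = io.TextIOWrapper(member, encoding=encoding, newline="")
            return list(csv.reader(text, delimiter=sep))


def check_compressed_utf16():
    result = read_zipped_rows(zipped_csv("a\tb\n1\t2\n3\t4\n", "utf-16"), "utf-16")
    # The expectation is spelled out literally rather than read back from
    # a second file, so a shared reader bug cannot hide in both sides.
    expected = [["a", "b"], ["1", "2"], ["3", "4"]]
    assert result == expected
    return result
```

Spelling out the expected rows keeps the test independent of the code path it is checking.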

gfyoung · 2017-11-03T18:52:02Z

doc/source/whatsnew/v0.21.1.txt

@@ -76,7 +76,7 @@ I/O
 ^^^

 - Bug in class:`~pandas.io.stata.StataReader` not converting date/time columns with display formatting addressed (:issue:`17990`). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects.
-
+- Bug in :func:`read_csv` when reading the compressed UTF-16 file (:issue:`18071`)


"the compressed UTF-16" --> "a compressed UTF-16 encoded"

Licht-T · 2017-11-03T19:42:40Z

Thanks @gfyoung, fixed!

gfyoung · 2017-11-03T20:14:33Z

pandas/_libs/parsers.pyx

@@ -374,6 +374,17 @@ cdef class TextReader:
                  float_precision=None,
                  skip_blank_lines=True):

+        # encoding


I know that you copied and pasted this, but let's take this opportunity to provide a much more informative comment about this whole block of logic (a sentence is sufficient).

Licht-T · 2017-11-03T20:29:47Z

@gfyoung Added comment.

jreback · 2017-11-04T15:20:03Z

thanks @Licht-T

Licht-T force-pushed the fix-read-zipped-utf-16-file branch from a327715 to 2f29a61 Compare November 3, 2017 11:45

BUG: Fix the error when reading the compressed UTF-16 file

973a2d8

Licht-T force-pushed the fix-read-zipped-utf-16-file branch from 2f29a61 to ae3d0d0 Compare November 3, 2017 12:42

jreback added Bug IO CSV read_csv, to_csv labels Nov 3, 2017

jreback requested changes Nov 3, 2017

TST: Add test for reading the zipped UTF-16 file

52d4266

Licht-T force-pushed the fix-read-zipped-utf-16-file branch from ae3d0d0 to ea8d6bb Compare November 3, 2017 13:34

jreback requested changes Nov 3, 2017

jreback added this to the 0.21.1 milestone Nov 3, 2017

jreback added the Needs Backport label Nov 3, 2017

DOC: Add comments about UTF-16 source conversion

abfdadd

Licht-T force-pushed the fix-read-zipped-utf-16-file branch from ea8d6bb to 9da8edd Compare November 3, 2017 18:02

gfyoung reviewed Nov 3, 2017

Licht-T added 2 commits November 4, 2017 04:02

DOC: Add whatsnew note about fixing the bug of reading compressed UTF-16 file

bacf224

TST: Move and change the test case

b2a3f97

Licht-T force-pushed the fix-read-zipped-utf-16-file branch from 9da8edd to b2a3f97 Compare November 3, 2017 19:42

Use parentheses instead of the backslash to wrap multi-line conditional

012b496

gfyoung reviewed Nov 3, 2017

Add comment for encoding settings

1a06857

jreback approved these changes Nov 4, 2017

jreback merged commit e0c9c67 into pandas-dev:master Nov 4, 2017

gfyoung mentioned this pull request Nov 4, 2017

BLD: Make sure to copy ZIP files for parser tests #18108

Merged

gfyoung added a commit that referenced this pull request Nov 4, 2017

BLD: Make sure to copy ZIP files for parser tests (#18108)

a61bb64

Follow-up to gh-18091.

gfyoung pushed a commit that referenced this pull request Nov 4, 2017

BUG: Fix the error when reading the compressed UTF-16 file (#18091)

0c4cc0d

(cherry picked from commit e0c9c6)

1kastner pushed a commit to 1kastner/pandas that referenced this pull request Nov 5, 2017

BUG: Fix the error when reading the compressed UTF-16 file (pandas-dev#18091)

c440981

1kastner pushed a commit to 1kastner/pandas that referenced this pull request Nov 5, 2017

BLD: Make sure to copy ZIP files for parser tests (pandas-dev#18108)

00f61bb

Follow-up to pandas-devgh-18091.

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017

BUG: Fix the error when reading the compressed UTF-16 file (pandas-dev#18091)

b68eb34

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017

BLD: Make sure to copy ZIP files for parser tests (pandas-dev#18108)

13cb774

Follow-up to pandas-devgh-18091.

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017

BUG: Fix the error when reading the compressed UTF-16 file (pandas-dev#18091)

877917b

(cherry picked from commit e0c9c67)

TomAugspurger removed the Needs Backport label Dec 11, 2017
