REGR: memory_map with non-UTF8 encoding #40994

twoertwein · 2021-04-17T02:09:57Z

closes BUG: read_csv is failing with an encoding different that UTF-8 and memory_map set to True in version 1.2.4 #40986
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

My best guess is that memory_map=True always assumed UTF-8 with the python engine. Now that the c and python engines share the same IO code, the c engine assumes UTF-8 as well.
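To illustrate the failure mode described above, here is a minimal standard-library sketch. It is not the pandas code path: `read_mapped_text` and `demo` are hypothetical helpers that memory-map a small latin-1 file and show that decoding the mapped bytes as UTF-8 fails, while decoding with the file's real encoding works.

```python
from __future__ import annotations

import mmap
import os
import tempfile


def read_mapped_text(path: str, encoding: str) -> str:
    """Memory-map `path` and decode its whole content with `encoding`."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[:].decode(encoding)


def demo() -> tuple[str, bool]:
    """Return the correctly decoded text and whether a UTF-8 decode failed."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as fh:
            # "\xe9" (e with acute accent) is the single byte 0xE9 in latin-1,
            # which is not valid UTF-8 when followed by a newline.
            fh.write("name\ncaf\xe9\n".encode("latin-1"))
        decoded = read_mapped_text(path, "latin-1")
        try:
            read_mapped_text(path, "utf-8")
            utf8_failed = False
        except UnicodeDecodeError:
            utf8_failed = True
        return decoded, utf8_failed
    finally:
        os.remove(path)
```

Any reader that hard-codes UTF-8 for the mapped bytes would hit the same `UnicodeDecodeError` regardless of the `encoding` the caller passed.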

pep8speaks · 2021-04-17T02:10:00Z

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-25 14:56:08 UTC

twoertwein · 2021-04-25T15:22:44Z

pandas/io/common.py

@@ -836,19 +852,30 @@ def __getattr__(self, name: str):
    def __iter__(self) -> _MMapWrapper:
        return self

+    def read(self, size: int = -1) -> str | bytes:


This function could be removed if the c-engine handled non-utf-8 better.

With the c-engine, the PR in its current form will: 1) use mmap to read the entire file, 2) decode it appropriately (this function), and 3) have the c-code encode the decoded string into UTF-8 bytes again. It would be more efficient if the c-engine supported non-UTF-8 encodings in more places. I will look into that, but it might take some time.
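The three steps above can be sketched with the standard library alone. This only illustrates the decode and re-encode round trip, not the actual `_MMapWrapper.read` or c-engine code; `reencode_mapped_file` and `roundtrip_demo` are hypothetical names.

```python
import mmap
import os
import tempfile


def reencode_mapped_file(path: str, encoding: str) -> bytes:
    """Read a file through mmap, decode it, and re-encode it as UTF-8."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            raw = mapped.read()          # 1) read the entire file via mmap
    text = raw.decode(encoding)          # 2) decode with the file's encoding
    return text.encode("utf-8")          # 3) encode again, as a UTF-8 consumer would


def roundtrip_demo() -> bytes:
    """Write a small cp1252 file and return its UTF-8 re-encoding."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as fh:
            # The euro sign is the single byte 0x80 in cp1252.
            fh.write("price\n\u20ac1\n".encode("cp1252"))
        return reencode_mapped_file(path, "cp1252")
    finally:
        os.remove(path)
```

The file content is held in memory twice (raw bytes and decoded string) before the final encode, which is the inefficiency the comment refers to.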

jreback

lgtm. minor comments for future.

jreback · 2021-04-26T12:19:26Z

pandas/io/common.py

@@ -618,7 +618,12 @@ def get_handle(

    # memory mapping needs to be the first step
    handle, memory_map, handles = _maybe_memory_map(
-        handle, memory_map, ioargs.encoding, ioargs.mode, errors
+        handle,


wouldn't object to passing by kwargs for easier reading

jreback · 2021-04-26T12:20:20Z

pandas/io/common.py

@@ -836,19 +852,30 @@ def __getattr__(self, name: str):
    def __iter__(self) -> _MMapWrapper:
        return self

+    def read(self, size: int = -1) -> str | bytes:


prob not worth the effort to handle non-utf-8 better, but if you want to look...ok

jreback · 2021-04-26T12:21:04Z

@meeseeksdev backport 1.2.x

jreback · 2021-04-26T12:21:13Z

thanks @twoertwein keep em coming!

amznero · 2021-04-26T12:34:29Z

pandas/io/common.py

-        return newline
+
+        # IncrementalDecoder seems to push newline to the next line
+        return newline.lstrip("\n")


It may leave \r at the end of the line when the newline is CRLF, which is common on Windows.

Maybe try newline.lstrip("\n").rstrip("\r")?

twoertwein · 2021-04-27

Thank you! It might be worth adding tests to directly test the output of the mmap wrapper independent of read_csv.

I would assume that CRLF is covered by some of the Windows CI but it might be that the c-engine and python's csv are robust enough to ignore that :)
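The CRLF concern can be reproduced outside pandas with an incremental decoder. The sketch below makes two assumptions: `io.BytesIO.readline` stands in for `mmap.readline` (both split on the byte `b"\n"`), and UTF-16-LE is used as an example of an encoding where the newline spans two bytes. `decode_lines` is a hypothetical helper, not the wrapper's real method.

```python
from __future__ import annotations

import codecs
import io


def decode_lines(data: bytes, encoding: str) -> list[str]:
    """Split `data` on the byte b"\\n" and decode each piece incrementally."""
    decoder = codecs.getincrementaldecoder(encoding)()
    stream = io.BytesIO(data)
    lines = []
    while True:
        chunk = stream.readline()
        if not chunk:
            break
        # The decoder buffers an incomplete code unit, so the "\n" whose
        # second byte arrives with the next chunk shows up on the next line.
        lines.append(decoder.decode(chunk))
    return lines


raw_lines = decode_lines("a,b\r\nc,d\r\n".encode("utf-16-le"), "utf-16-le")
# raw_lines == ["a,b\r", "\nc,d\r", "\n"]

only_lstrip = [line.lstrip("\n") for line in raw_lines]
# only_lstrip == ["a,b\r", "c,d\r", ""]  (the "\r" survives)

with_rstrip = [line.lstrip("\n").rstrip("\r") for line in raw_lines]
# with_rstrip == ["a,b", "c,d", ""]
```

Whether the trailing "\r" matters in practice depends on how tolerant the downstream parser is, which is the open question in the reply above.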

Co-authored-by: phofl

twoertwein added the IO CSV (read_csv, to_csv) and Regression (functionality that used to work in a prior pandas version) labels Apr 17, 2021

twoertwein marked this pull request as draft April 17, 2021 02:26

twoertwein marked this pull request as ready for review April 25, 2021 14:20

REGR: memory_map with non-UTF8 encoding

148881f

jreback added this to the 1.2.5 milestone Apr 26, 2021

jreback approved these changes Apr 26, 2021

jreback merged commit 0a0540c into pandas-dev:master Apr 26, 2021

lumberbot-app bot added the Still Needs Manual Backport label Apr 26, 2021

amznero reviewed Apr 26, 2021

simonjayhawkins mentioned this pull request May 2, 2021

Backport PR #40994: REGR: memory_map with non-UTF8 encoding #41257

Merged

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

REGR: memory_map with non-UTF8 encoding (pandas-dev#40994)

c2a9853

simonjayhawkins pushed a commit that referenced this pull request May 25, 2021

Backport PR #40994: REGR: memory_map with non-UTF8 encoding (#41257)

e64410f

Co-authored-by: phofl

simonjayhawkins removed the Still Needs Manual Backport label May 25, 2021

twoertwein deleted the memory_map branch June 5, 2021 20:50

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

REGR: memory_map with non-UTF8 encoding (pandas-dev#40994)

78d4978
