Skip to content

REGR: memory_map with non-UTF8 encoding #40994

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 26, 2021
Merged

REGR: memory_map with non-UTF8 encoding #40994

merged 1 commit into from
Apr 26, 2021

Conversation

twoertwein
Copy link
Member

@twoertwein twoertwein commented Apr 17, 2021

My best guess is that memory_map=True always assumed UTF-8 with the python engine. Now that the c and python engine use the same IO code, the c engine assumed UTF-8 as well.

@pep8speaks
Copy link

pep8speaks commented Apr 17, 2021

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-25 14:56:08 UTC

@twoertwein twoertwein added IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version labels Apr 17, 2021
@twoertwein twoertwein marked this pull request as draft April 17, 2021 02:26
@twoertwein twoertwein marked this pull request as ready for review April 25, 2021 14:20
@@ -836,19 +852,30 @@ def __getattr__(self, name: str):
def __iter__(self) -> _MMapWrapper:
return self

def read(self, size: int = -1) -> str | bytes:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function could be removed if the c-engine handled non-utf-8 better.

The PR in its current form will in case of the c-engine: 1) use mmap to read the entire file 2) decode it appropriately (this function) 3) the c-code will encode the now utf-8 string into bytes again. It would be more efficient if the c-engine supported non-utf-8 in more places. I will look into that, but that might take some time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob not worth the effort to handle non-utf-8 better, but if you want to look...ok

@jreback jreback added this to the 1.2.5 milestone Apr 26, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor comments for future.

@@ -618,7 +618,12 @@ def get_handle(

# memory mapping needs to be the first step
handle, memory_map, handles = _maybe_memory_map(
handle, memory_map, ioargs.encoding, ioargs.mode, errors
handle,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't object to passing by kwargs for easier reading

@@ -836,19 +852,30 @@ def __getattr__(self, name: str):
def __iter__(self) -> _MMapWrapper:
return self

def read(self, size: int = -1) -> str | bytes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob not worth the effort to handle non-utf-8 better, but if you want to look...ok

@jreback jreback merged commit 0a0540c into pandas-dev:master Apr 26, 2021
@jreback
Copy link
Contributor

jreback commented Apr 26, 2021

@meeseeksdev backport 1.2.x

@jreback
Copy link
Contributor

jreback commented Apr 26, 2021

thanks @twoertwein keep em coming!

@lumberbot-app

This comment has been minimized.

return newline

# IncrementalDecoder seems to push newline to the next line
return newline.lstrip("\n")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may leave \r at the end of the line when newline is CRLF, which is often used in windows.

Maybe try newline.lstrip("\n").rstrip("\r")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! It might be worth adding tests to directly test the output of the mmap wrapper independent of read_csv.

I would assume that CRLF is covered by some of the Windows CI but it might be that the c-engine and python's csv are robust enough to ignore that :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_csv is failing with an encoding different that UTF-8 and memory_map set to True in version 1.2.4
5 participants