-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: memory_map with non-UTF8 encoding #40994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-04-25 14:56:08 UTC |
@@ -836,19 +852,30 @@ def __getattr__(self, name: str): | |||
def __iter__(self) -> _MMapWrapper: | |||
return self | |||
|
|||
def read(self, size: int = -1) -> str | bytes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function could be removed if the c-engine handled non-utf-8 better.
The PR in its current form will in case of the c-engine: 1) use mmap to read the entire file 2) decode it appropriately (this function) 3) the c-code will encode the now utf-8 string into bytes again. It would be more efficient if the c-engine supported non-utf-8 in more places. I will look into that, but that might take some time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob not worth the effort to handle non-utf-8 better, but if you want to look...ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. minor comments for future.
@@ -618,7 +618,12 @@ def get_handle( | |||
|
|||
# memory mapping needs to be the first step | |||
handle, memory_map, handles = _maybe_memory_map( | |||
handle, memory_map, ioargs.encoding, ioargs.mode, errors | |||
handle, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't object to passing by kwargs for easier reading
@@ -836,19 +852,30 @@ def __getattr__(self, name: str): | |||
def __iter__(self) -> _MMapWrapper: | |||
return self | |||
|
|||
def read(self, size: int = -1) -> str | bytes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob not worth the effort to handle non-utf-8 better, but if you want to look...ok
@meeseeksdev backport 1.2.x |
thanks @twoertwein keep em coming! |
This comment has been minimized.
This comment has been minimized.
return newline | ||
|
||
# IncrementalDecoder seems to push newline to the next line | ||
return newline.lstrip("\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may leave \r
at the end of the line when newline
is CRLF, which is often used in windows.
Maybe try newline.lstrip("\n").rstrip("\r")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! It might be worth adding tests to directly test the output of the mmap wrapper independent of read_csv
.
I would assume that CRLF is covered by some of the Windows CI but it might be that the c-engine and python's csv are robust enough to ignore that :)
Co-authored-by: phofl <[email protected]>
My best guess is that
memory_map=True
always assumed UTF-8 with the python engine. Now that the c and python engine use the same IO code, the c engine assumed UTF-8 as well.