ENH: 'encoding_errors' argument for read_csv/json #39777

Merged (2 commits) into pandas-dev:master on Mar 9, 2021

Conversation

twoertwein
Member

@twoertwein twoertwein commented Feb 12, 2021

Should encoding_errors be added to to_csv and should errors in to_csv show a DeprecationWarning (for consistent naming)?

Contributor

@jreback jreback left a comment

lgtm. cc @WillAyd, who had some opinions on the naming

@jreback jreback added the IO CSV (read_csv, to_csv), IO JSON (read_json, to_json, json_normalize), and Unicode (Unicode strings) labels Feb 12, 2021
@WillAyd
Member

WillAyd commented Feb 12, 2021

I think errors is fine since it aligns with the stdlib. I do remember a similar conversation where I asked for something besides errors as a parameter, but I think that was scoped to very rare cases with old .xls file types which was a little different
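For context, the stdlib convention referred to here: both open() and bytes.decode() accept the same errors argument with the same handler names. A minimal sketch (not pandas code; the byte string is an assumed example):

```python
# Latin-1 encoded bytes that are invalid as UTF-8
raw = b"caf\xe9"

try:
    raw.decode("utf-8")  # errors="strict" is the default and raises here
except UnicodeDecodeError:
    pass

# The same handler names work for open(), bytes.decode(), str.encode(), ...
print(raw.decode("utf-8", errors="replace"))  # -> 'caf\ufffd'
```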

@jreback
Contributor

jreback commented Feb 12, 2021

> I think errors is fine since it aligns with the stdlib. I do remember a similar conversation where I asked for something besides errors as a parameter, but I think that was scoped to very rare cases with old .xls file types which was a little different

hmm my concern here is that errors could mean errors in the lines themselves

though we have

    error_bad_lines=True,
    warn_bad_lines=True,

for these.

@WillAyd
Member

WillAyd commented Feb 12, 2021 via email

@twoertwein twoertwein changed the title ENH: 'errors' argument for read_csv/json ENH: 'encoding_errors' argument for read_csv/json Feb 12, 2021
@twoertwein
Member Author

twoertwein commented Feb 24, 2021

I don't understand why the C engine doesn't work. I think it either a) gets the original byte sequence and cannot convert it to UTF-8, or b) it gets the corrected string ("corrected" = "surrogatepass") but doesn't like it?

edit: probably the second case as "replace" seems to work.
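To illustrate the two cases, a small sketch with a lone surrogate like the one in the test (the byte sequence below is an assumed example: the "surrogatepass" encoding of U+DF7F):

```python
# "surrogatepass" encoding of the lone surrogate U+DF7F (assumed example)
raw = b"\xed\xbd\xbf"

try:
    raw.decode("utf-8")  # strict: case (a), refuses surrogate sequences
except UnicodeDecodeError:
    pass

# case (b): "surrogatepass" yields a str containing the lone surrogate
assert raw.decode("utf-8", "surrogatepass") == "\udf7f"

# "replace" yields only U+FFFD replacement characters, no surrogates
assert set(raw.decode("utf-8", "replace")) == {"\ufffd"}
```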

@WillAyd
Member

WillAyd commented Feb 24, 2021

Is the error happening in the tokenizer perhaps? You could try to enable more verbose debugging of that if you uncomment this line and rebuild the extensions:

// #define VERBOSE

It will give a lot of detail, so you will want to have a minimal example

@twoertwein
Member Author

It doesn't print much for pytest -s -k test_encoding_surrogatepass[c_high] pandas/tests/io/parser/common/test_common_basic.py:

_tokenize_helper: Asked to tokenize 2 rows, datapos=0, datalen=0
parser_buffer_bytes self->cb_io: nbytes=262144, datalen: 13083248, status=0
free_if_not_null 0x20bef00
free_if_not_null (nil)
free_if_not_null 0x2476220
free_if_not_null 0x254dce0
free_if_not_null 0x24c3b10
free_if_not_null 0x255bd30
free_if_not_null 0x24bccc0
UnicodeEncodeError: 'utf-8' codec can't encode character '\udf7f' in position 0: surrogates not allowed

\udf7f is the string after surrogatepass. Maybe my test just uses a weird example? Based on the error it seems that the c-engine tries to encode \udf7f (convert the string back to bytes) and fails during that - but it shouldn't need to do that (at least the python engine doesn't do that).
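That re-encode failure can be reproduced outside the parser (assuming the same surrogate as in the test):

```python
s = "\udf7f"  # the lone surrogate that "surrogatepass" decoding produced

try:
    s.encode("utf-8")  # strict re-encode, as the C engine effectively does
except UnicodeEncodeError:
    pass  # "'utf-8' codec can't encode character ... surrogates not allowed"

# With the same handler, the original bytes round-trip:
assert s.encode("utf-8", "surrogatepass") == b"\xed\xbd\xbf"
```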

@WillAyd
Member

WillAyd commented Feb 25, 2021

Debugged a little bit and noticed that the C extension fails very early on, before tokenization even happens. This branch gets hit when just trying to interact with the file handle:

if (result == NULL) {

Not sure why the read() call is failing a few lines up, but providing it in case it's helpful

@twoertwein
Member Author

Thank you for looking into it! I don't "feel at home" debugging the C extensions ;) but I will try to stare at it a bit more.

I assume

args = Py_BuildValue("(i)", nbytes);
func = PyObject_GetAttrString(src->obj, "read");
result = PyObject_CallObject(func, args);

should correspond to

obj = open(...)  # done in python
obj.read(nbytes)

I'm surprised it tries to read a specific number of bytes (maybe the variable name is just misleading), since the C engine is always given a string buffer.
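For reference, the stdlib behavior this maps onto: a text-mode handle's read(n) returns n characters as a str, not n bytes, so the "nbytes" count is really a character count once the handle is text-mode. A small sketch:

```python
import io

# A text-mode handle (like what open(..., encoding=...) returns):
# read(n) counts characters and returns str, which is why the C reader
# has to encode the result back to bytes before tokenizing it.
handle = io.StringIO("a,b\n1,2\n")
chunk = handle.read(4)
assert isinstance(chunk, str)
assert chunk == "a,b\n"
```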

@WillAyd
Member

WillAyd commented Feb 26, 2021 via email

@WillAyd
Member

WillAyd commented Feb 26, 2021

Ah sorry I was wrong before. I think the issue actually stems from this line in the extension

tmp = PyUnicode_AsUTF8String(result);

According to the Python docs the error handling for that function is always "strict"

https://docs.python.org/3/c-api/unicode.html?highlight=pyunicode_asutf8string#c.PyUnicode_AsUTF8String

@WillAyd
Member

WillAyd commented Feb 26, 2021

Changing that line to tmp = PyUnicode_AsEncodedString(result, "utf-8", "surrogatepass"); at least moves the failure elsewhere (obviously wouldn't want to hard code the errors argument, but providing now as demo)

@WillAyd
Member

WillAyd commented Feb 26, 2021

OK this diff should get your test to pass:

diff --git a/pandas/_libs/parsers.pyx b/pandas/_libs/parsers.pyx
index c4d98ccb8..c403eaf18 100644
--- a/pandas/_libs/parsers.pyx
+++ b/pandas/_libs/parsers.pyx
@@ -632,7 +632,7 @@ cdef class TextReader:
             char *word
             object name, old_name
             uint64_t hr, data_line = 0
-            char *errors = "strict"
+            char *errors = "surrogatepass"
             StringPath path = _string_path(self.c_encoding)
             list header = []
             set unnamed_cols = set()
@@ -673,11 +673,8 @@ cdef class TextReader:
                 for i in range(field_count):
                     word = self.parser.words[start + i]
 
-                    if path == UTF8:
-                        name = PyUnicode_FromString(word)
-                    elif path == ENCODED:
-                        name = PyUnicode_Decode(word, strlen(word),
-                                                self.c_encoding, errors)
+                    name = PyUnicode_Decode(word, strlen(word),
+                                            self.c_encoding, errors)
 
                     # We use this later when collecting placeholder names.
                     old_name = name
diff --git a/pandas/_libs/src/parser/io.c b/pandas/_libs/src/parser/io.c
index 51504527d..6abce4ce5 100644
--- a/pandas/_libs/src/parser/io.c
+++ b/pandas/_libs/src/parser/io.c
@@ -191,7 +191,7 @@ void *buffer_rd_bytes(void *source, size_t nbytes, size_t *bytes_read,
         *status = CALLING_READ_FAILED;
         return NULL;
     } else if (!PyBytes_Check(result)) {
-        tmp = PyUnicode_AsUTF8String(result);
+      tmp = PyUnicode_AsEncodedString(result, "utf-8", "surrogatepass");
         Py_DECREF(result);
         if (tmp == NULL) {
             PyGILState_Release(state);

These are just hard coded arguments so will need to figure out a way to pass them along through the parser, but should hopefully give you the gist of what we need to do.

Separately, I think we might want to audit our uses of PyUnicode_FromString and replace them with PyUnicode_Decode if we want generic support for the errors argument. A lot of the code in the parser module seems to use the former, which I think just defaults to strict error handling
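In Python terms, the difference between the two C-API calls looks roughly like this (the byte string is an assumed example):

```python
word = b"\xed\xbd\xbf"  # bytes as the tokenizer might hand them back

# PyUnicode_FromString(word) behaves like strict UTF-8 decoding:
try:
    word.decode("utf-8")
except UnicodeDecodeError:
    pass  # strict handling rejects the surrogate bytes

# PyUnicode_Decode(word, strlen(word), encoding, errors) is the general
# form that accepts an encoding and an error handler:
assert word.decode("utf-8", "surrogatepass") == "\udf7f"
```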

@twoertwein
Member Author

thank you @WillAyd very very much! That is super helpful!

I will look into how I can best incorporate that :)

@twoertwein
Member Author

Now it should forward encoding_errors :) I replaced only enough PyUnicode_FromString calls with PyUnicode_Decode to get (most) tests working.

When running pytest -k test_encoding_surrogatepass[c_high] pandas/tests/io/parser/common/test_common_basic.py,
self.encoding_errors becomes b'' at some point after the TextReader is initialized.

@WillAyd
Member

WillAyd commented Mar 1, 2021

I think the C / Cython edits are reasonable. It looks like the test is failing due to passing keyword arguments from Python down to Cython. I think this is because encoding_errors isn't specified as an option in pandas/io/parsers/readers.py

@twoertwein
Member Author

encoding_errors is parsed as part of the kwargs. Adding print() to the Cython code shows that it is passed through to the Cython code. I think there might be something wrong with either setting the encoding_errors attribute (it has the intended value inside the constructor, but afterwards it is empty/pointing to a random array) or some other code resets it after the constructor.

@twoertwein
Member Author

could it be that the attribute is garbage collected after the constructor because no pure python code references it?

@WillAyd
Member

WillAyd commented Mar 2, 2021

Actually, looking at this again, I think you need to update the declarations in tokenizer.h that are being imported into parsers.pyx. There might be some mangling of the signature since the header files don't match what is actually being referenced

@twoertwein
Member Author

This makes it work (at least locally for me):
e19a0c6#diff-09751231cb113bbb35c3bf29ffe2302664d76cd6667aa3edf44f3d886f083bacR382-R383
but I assume this is not an elegant solution. If I had to guess why it works: Python now knows that there is still a reference to the encoding_errors object.

@twoertwein
Member Author

I think I should call Py_DECREF, but calling Py_DECREF(PyBytes_FromString(self.encoding_errors)) in close() causes a segfault the first time it is called. Printing self.encoding_errors right before the Py_DECREF still shows the expected content, though.
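A plausible explanation for why that DECREF doesn't pair up (an assumption, not verified against the merged code): PyBytes_FromString builds a brand-new bytes object on every call, so the DECREF in close() targets that fresh object rather than the one INCREF'd in the constructor. Relatedly, PyBytes_AsString returns a pointer into a bytes object's internal buffer, so once that object is collected the pointer dangles, which would match self.encoding_errors turning into b''/garbage. The "distinct object" point in pure Python:

```python
errors = "surrogatepass"

# Each conversion yields a distinct bytes object, even for equal content,
# so releasing a freshly created one cannot undo an earlier Py_INCREF
# on a different object.
b1 = errors.encode("utf-8")
b2 = errors.encode("utf-8")
assert b1 == b2 and b1 is not b2
```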

@WillAyd
Member

WillAyd commented Mar 2, 2021 via email

@jreback jreback added this to the 1.3 milestone Mar 2, 2021
if isinstance(encoding_errors, str):
    encoding_errors = encoding_errors.encode("utf-8")
Py_INCREF(encoding_errors)
self.encoding_errors = PyBytes_AsString(encoding_errors)
Contributor

can you assert the valid values here that are allowed (or if this is not validated at a higher level could raise ValueError on an illegal value). do we have tests for same?

Member Author

I will add the test in get_handle. That will fire for both the Python and C engines.

"surrogateescape",
):
raise ValueError(
f"Invalide value for `encoding_errors` ({errors}). Please see "
Member Author

Is it okay to rename get_handle's errors to encoding_errors? get_handle is technically not a private function.

Member Author

instead of enumerating all possible error codes, I could also use codecs.lookup_error and then raise a different error message. Would that be better?

Member Author

@jreback Do you have thoughts on using codecs.lookup_error(errors) and either letting it error directly (LookupError) or catching its error and raising a different one? Using codecs.lookup_error would make it more future-proof, and it is also similar to how the new XML code detects bad encodings:

codecs.lookup(self.encoding)

I could create a PR to call codecs.lookup(encoding) and codecs.lookup_error(errors) inside get_handle and if intended catch their errors and raise a more user-friendly error.
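The codecs.lookup_error approach could look roughly like this (a hypothetical helper named check_encoding_errors, sketching the idea discussed above, not the merged code):

```python
import codecs

def check_encoding_errors(errors: str) -> None:
    """Raise ValueError if `errors` is not a registered error handler."""
    try:
        codecs.lookup_error(errors)  # raises LookupError if unregistered
    except LookupError as err:
        raise ValueError(
            f"Invalid value for `encoding_errors` ({errors})"
        ) from err

check_encoding_errors("surrogatepass")  # registered handler: passes

try:
    check_encoding_errors("no-such-handler")
except ValueError:
    pass  # unregistered handlers are rejected with a friendlier message
```

Because the check consults the codecs registry, it automatically accepts any handler registered via codecs.register_error, rather than a hard-coded list.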

@jreback
Contributor

jreback commented Mar 9, 2021

lgtm. Do we need to update anything in https://pandas.pydata.org/pandas-docs/dev/user_guide/io.html#io-read-csv-table ?

cc @WillAyd if any more comments.

Member

@WillAyd WillAyd left a comment

lgtm - not an easy change so kudos for seeing this through

@jreback jreback merged commit 27e0330 into pandas-dev:master Mar 9, 2021
@jreback
Contributor

jreback commented Mar 9, 2021

thanks @twoertwein very nice!

@twoertwein
Member Author

> not an easy change so kudos for seeing this through

@WillAyd Thank you very much for your help! I would have been too frustrated without your help :)
