ENH: 'encoding_errors' argument for read_csv/json #39777

Merged (2 commits) into pandas-dev:master on Mar 9, 2021

Conversation

twoertwein
Member

@twoertwein twoertwein commented Feb 12, 2021

Should encoding_errors be added to to_csv and should errors in to_csv show a DeprecationWarning (for consistent naming)?

Contributor

@jreback jreback left a comment

lgtm. cc @WillAyd, who had some opinions on the naming

@jreback jreback added the IO CSV (read_csv, to_csv), IO JSON (read_json, to_json, json_normalize), and Unicode (Unicode strings) labels Feb 12, 2021
@WillAyd
Member

WillAyd commented Feb 12, 2021

I think errors is fine since it aligns with the stdlib. I do remember a similar conversation where I asked for something besides errors as a parameter, but I think that was scoped to very rare cases with old .xls file types which was a little different
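For context, the stdlib convention referred to here: both open() and bytes.decode() accept the same errors argument with the same handler names. A minimal sketch (not pandas code; the byte string is an assumed example):

```python
# Latin-1 encoded bytes that are invalid as UTF-8
raw = b"caf\xe9"

try:
    raw.decode("utf-8")  # errors="strict" is the default and raises here
except UnicodeDecodeError:
    pass

# The same handler names work for open(), bytes.decode(), str.encode(), ...
print(raw.decode("utf-8", errors="replace"))  # -> 'caf\ufffd'
```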

@jreback
Contributor

jreback commented Feb 12, 2021

> I think errors is fine since it aligns with the stdlib. I do remember a similar conversation where I asked for something besides errors as a parameter, but I think that was scoped to very rare cases with old .xls file types which was a little different

hmm my concern here is that errors could mean errors in the lines themselves

though we have

    error_bad_lines=True,
    warn_bad_lines=True,

for these.

@WillAyd
Member

WillAyd commented Feb 12, 2021 via email

@twoertwein twoertwein changed the title ENH: 'errors' argument for read_csv/json ENH: 'encoding_errors' argument for read_csv/json Feb 12, 2021
@twoertwein
Member Author

twoertwein commented Feb 24, 2021

I don't understand why the C engine doesn't work. I think it either a) gets the original byte sequence and cannot convert it to UTF-8, or b) it gets the corrected string ("corrected" = "surrogatepass") but doesn't like it?

edit: probably the second case as "replace" seems to work.
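To illustrate the two cases, a small sketch with a lone surrogate like the one in the test (the byte sequence below is an assumed example: the "surrogatepass" encoding of U+DF7F):

```python
# "surrogatepass" encoding of the lone surrogate U+DF7F (assumed example)
raw = b"\xed\xbd\xbf"

try:
    raw.decode("utf-8")  # strict: case (a), refuses surrogate sequences
except UnicodeDecodeError:
    pass

# case (b): "surrogatepass" yields a str containing the lone surrogate
assert raw.decode("utf-8", "surrogatepass") == "\udf7f"

# "replace" yields only U+FFFD replacement characters, no surrogates
assert set(raw.decode("utf-8", "replace")) == {"\ufffd"}
```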

@WillAyd
Member

WillAyd commented Feb 24, 2021

Is the error happening in the tokenizer perhaps? You could try to enable more verbose debugging of that if you uncomment this line and rebuild the extensions:

// #define VERBOSE

It will give a lot of detail, so you will want to have a minimal example

@twoertwein
Member Author

It doesn't print much for pytest -s -k test_encoding_surrogatepass[c_high] pandas/tests/io/parser/common/test_common_basic.py:

_tokenize_helper: Asked to tokenize 2 rows, datapos=0, datalen=0
parser_buffer_bytes self->cb_io: nbytes=262144, datalen: 13083248, status=0
free_if_not_null 0x20bef00
free_if_not_null (nil)
free_if_not_null 0x2476220
free_if_not_null 0x254dce0
free_if_not_null 0x24c3b10
free_if_not_null 0x255bd30
free_if_not_null 0x24bccc0
UnicodeEncodeError: 'utf-8' codec can't encode character '\udf7f' in position 0: surrogates not allowed

\udf7f is the string after surrogatepass. Maybe my test just uses a weird example? Based on the error it seems that the c-engine tries to encode \udf7f (convert the string back to bytes) and fails during that - but it shouldn't need to do that (at least the python engine doesn't do that).
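That re-encode failure can be reproduced outside the parser (assuming the same surrogate as in the test):

```python
s = "\udf7f"  # the lone surrogate that "surrogatepass" decoding produced

try:
    s.encode("utf-8")  # strict re-encode, as the C engine effectively does
except UnicodeEncodeError:
    pass  # "'utf-8' codec can't encode character ... surrogates not allowed"

# With the same handler, the original bytes round-trip:
assert s.encode("utf-8", "surrogatepass") == b"\xed\xbd\xbf"
```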

@WillAyd
Member

WillAyd commented Feb 25, 2021

Debugged a little bit and noticed that the C extension fails very early on, before tokenization even happens. This branch gets hit when just trying to interact with the file handle:

if (result == NULL) {

Not sure why the read() call is failing a few lines up, but providing it in case it's helpful

@twoertwein
Member Author

Thank you for looking into it! I don't "feel at home" debugging the C extensions ;) but I will try to stare at it a bit more.

I assume

args = Py_BuildValue("(i)", nbytes);
func = PyObject_GetAttrString(src->obj, "read");
result = PyObject_CallObject(func, args);

should correspond to

obj = open(...)  # done in python
obj.read(nbytes)

I'm surprised it tries to read a specific number of bytes (maybe the variable name is just misleading), since the C engine is always given a string buffer.
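For reference, the stdlib behavior this maps onto: a text-mode handle's read(n) returns n characters as a str, not n bytes, so the "nbytes" count is really a character count once the handle is text-mode. A small sketch:

```python
import io

# A text-mode handle (like what open(..., encoding=...) returns):
# read(n) counts characters and returns str, which is why the C reader
# has to encode the result back to bytes before tokenizing it.
handle = io.StringIO("a,b\n1,2\n")
chunk = handle.read(4)
assert isinstance(chunk, str)
assert chunk == "a,b\n"
```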

@WillAyd
Member

WillAyd commented Feb 26, 2021 via email

@WillAyd
Member

WillAyd commented Feb 26, 2021

Ah sorry I was wrong before. I think the issue actually stems from this line in the extension

tmp = PyUnicode_AsUTF8String(result);

According to the Python docs the error handling for that function is always "strict"

https://docs.python.org/3/c-api/unicode.html?highlight=pyunicode_asutf8string#c.PyUnicode_AsUTF8String

@WillAyd
Member

WillAyd commented Feb 26, 2021

Changing that line to tmp = PyUnicode_AsEncodedString(result, "utf-8", "surrogatepass"); at least moves the failure elsewhere (obviously wouldn't want to hard code the errors argument, but providing now as demo)

@WillAyd
Member

WillAyd commented Feb 26, 2021

OK this diff should get your test to pass:

diff --git a/pandas/_libs/parsers.pyx b/pandas/_libs/parsers.pyx
index c4d98ccb8..c403eaf18 100644
--- a/pandas/_libs/parsers.pyx
+++ b/pandas/_libs/parsers.pyx
@@ -632,7 +632,7 @@ cdef class TextReader:
             char *word
             object name, old_name
             uint64_t hr, data_line = 0
-            char *errors = "strict"
+            char *errors = "surrogatepass"
             StringPath path = _string_path(self.c_encoding)
             list header = []
             set unnamed_cols = set()
@@ -673,11 +673,8 @@ cdef class TextReader:
                 for i in range(field_count):
                     word = self.parser.words[start + i]
 
-                    if path == UTF8:
-                        name = PyUnicode_FromString(word)
-                    elif path == ENCODED:
-                        name = PyUnicode_Decode(word, strlen(word),
-                                                self.c_encoding, errors)
+                    name = PyUnicode_Decode(word, strlen(word),
+                                            self.c_encoding, errors)
 
                     # We use this later when collecting placeholder names.
                     old_name = name
diff --git a/pandas/_libs/src/parser/io.c b/pandas/_libs/src/parser/io.c
index 51504527d..6abce4ce5 100644
--- a/pandas/_libs/src/parser/io.c
+++ b/pandas/_libs/src/parser/io.c
@@ -191,7 +191,7 @@ void *buffer_rd_bytes(void *source, size_t nbytes, size_t *bytes_read,
         *status = CALLING_READ_FAILED;
         return NULL;
     } else if (!PyBytes_Check(result)) {
-        tmp = PyUnicode_AsUTF8String(result);
+      tmp = PyUnicode_AsEncodedString(result, "utf-8", "surrogatepass");
         Py_DECREF(result);
         if (tmp == NULL) {
             PyGILState_Release(state);

These are just hard coded arguments so will need to figure out a way to pass them along through the parser, but should hopefully give you the gist of what we need to do.

Separately, I think we might want to audit our uses of PyUnicode_FromString and replace them with PyUnicode_Decode if we want generic support for the errors argument. A lot of the code in the parser module seems to use the former, which I think just defaults to strict error handling
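In Python terms, the difference between the two C-API calls looks roughly like this (the byte string is an assumed example):

```python
word = b"\xed\xbd\xbf"  # bytes as the tokenizer might hand them back

# PyUnicode_FromString(word) behaves like strict UTF-8 decoding:
try:
    word.decode("utf-8")
except UnicodeDecodeError:
    pass  # strict handling rejects the surrogate bytes

# PyUnicode_Decode(word, strlen(word), encoding, errors) is the general
# form that accepts an encoding and an error handler:
assert word.decode("utf-8", "surrogatepass") == "\udf7f"
```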

@twoertwein
Member Author

thank you @WillAyd very very much! That is super helpful!

I will look into how I can best incorporate that :)

@twoertwein
Member Author

Now it should forward encoding_errors :) I replaced only enough PyUnicode_FromString calls with PyUnicode_Decode to get (most) tests working.

When running pytest -k test_encoding_surrogatepass[c_high] pandas/tests/io/parser/common/test_common_basic.py,
self.encoding_errors becomes b'' at some point after the TextReader is initialized.

@WillAyd
Member

WillAyd commented Mar 1, 2021

I think the C / Cython edits are reasonable. It looks like the test is failing due to passing keyword arguments from Python down to Cython. I think this is because encoding_errors isn't specified as an option in pandas/io/parsers/readers.py

@twoertwein
Member Author

encoding_errors is parsed as part of the kwargs. Adding print() to the Cython code shows that it is passed through to the Cython code. I think there might be something wrong with either setting the encoding_errors attribute (it has the intended value inside the constructor, but afterwards it is empty/pointing to a random array) or some other code resets it after the constructor.

@twoertwein
Member Author

could it be that the attribute is garbage collected after the constructor because no pure python code references it?

@WillAyd
Member

WillAyd commented Mar 2, 2021

Actually, looking at this again, I think you need to update the declarations in tokenizer.h that are being imported into parsers.pyx. There might be some mangling of the signature since the header files don't match what is actually being referenced

@twoertwein
Member Author

This makes it work (at least locally for me):
e19a0c6#diff-09751231cb113bbb35c3bf29ffe2302664d76cd6667aa3edf44f3d886f083bacR382-R383
but I assume this is not an elegant solution. If I had to guess why it works: Python now knows that there is still a reference to the encoding_errors object.

@twoertwein
Member Author

I think I should call Py_DECREF, but calling Py_DECREF(PyBytes_FromString(self.encoding_errors)) in close() causes a segfault the first time it is called. Printing self.encoding_errors right before the Py_DECREF still shows the expected content, though.
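A plausible explanation for why that DECREF doesn't pair up (an assumption, not verified against the merged code): PyBytes_FromString builds a brand-new bytes object on every call, so the DECREF in close() targets that fresh object rather than the one INCREF'd in the constructor. Relatedly, PyBytes_AsString returns a pointer into a bytes object's internal buffer, so once that object is collected the pointer dangles, which would match self.encoding_errors turning into b''/garbage. The "distinct object" point in pure Python:

```python
errors = "surrogatepass"

# Each conversion yields a distinct bytes object, even for equal content,
# so releasing a freshly created one cannot undo an earlier Py_INCREF
# on a different object.
b1 = errors.encode("utf-8")
b2 = errors.encode("utf-8")
assert b1 == b2 and b1 is not b2
```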

@WillAyd
Member

WillAyd commented Mar 2, 2021 via email

@jreback jreback added this to the 1.3 milestone Mar 2, 2021
if isinstance(encoding_errors, str):
    encoding_errors = encoding_errors.encode("utf-8")
Py_INCREF(encoding_errors)
self.encoding_errors = PyBytes_AsString(encoding_errors)
Contributor

can you assert the valid values here that are allowed (or if this is not validated at a higher level could raise ValueError on an illegal value). do we have tests for same?

Member Author

I will add the test in get_handle. That will fire for both the Python and C engines.

"surrogateescape",
):
raise ValueError(
f"Invalide value for `encoding_errors` ({errors}). Please see "
Member Author

Is it okay to rename get_handle's errors to encoding_errors? get_handle is technically not a private function.

Member Author

instead of enumerating all possible error codes, I could also use codecs.lookup_error and then raise a different error message. Would that be better?

Member Author

@jreback Do you have thoughts on using codecs.lookup_error(errors) and either letting it error directly (LookupError) or catching its error and raising a different one? Using codecs.lookup_error would make it more future-proof, and it is also similar to how the new XML code detects bad encodings:

codecs.lookup(self.encoding)

I could create a PR to call codecs.lookup(encoding) and codecs.lookup_error(errors) inside get_handle and if intended catch their errors and raise a more user-friendly error.
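The codecs.lookup_error approach could look roughly like this (a hypothetical helper named check_encoding_errors, sketching the idea discussed above, not the merged code):

```python
import codecs

def check_encoding_errors(errors: str) -> None:
    """Raise ValueError if `errors` is not a registered error handler."""
    try:
        codecs.lookup_error(errors)  # raises LookupError if unregistered
    except LookupError as err:
        raise ValueError(
            f"Invalid value for `encoding_errors` ({errors})"
        ) from err

check_encoding_errors("surrogatepass")  # registered handler: passes

try:
    check_encoding_errors("no-such-handler")
except ValueError:
    pass  # unregistered handlers are rejected with a friendlier message
```

Because the check consults the codecs registry, it automatically accepts any handler registered via codecs.register_error, rather than a hard-coded list.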

@jreback
Contributor

jreback commented Mar 9, 2021

lgtm. Do we need to update anything in https://pandas.pydata.org/pandas-docs/dev/user_guide/io.html#io-read-csv-table ?

cc @WillAyd if any more comments.

Member

@WillAyd WillAyd left a comment

lgtm - not an easy change so kudos for seeing this through

@jreback jreback merged commit 27e0330 into pandas-dev:master Mar 9, 2021
@jreback
Contributor

jreback commented Mar 9, 2021

thanks @twoertwein very nice!

@twoertwein
Member Author

> not an easy change so kudos for seeing this through

@WillAyd Thank you very much for your help! I would have been too frustrated without your help :)
