ENH: 'encoding_errors' argument for read_csv/json #39777
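For context, the proposed `encoding_errors` argument exposes Python's standard codec error handlers when decoding a file. A minimal pure-Python sketch of what the different handlers do (this is plain `bytes.decode` behavior, not pandas code):

```python
# Standard library behavior that `encoding_errors` would expose:
# the same `errors` values accepted by bytes.decode / str.encode.
raw = b"a,b\n1,caf\xe9"  # 0xe9 is latin-1 "é", invalid as UTF-8

# "strict" (the default) raises on the bad byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass

# "replace" substitutes U+FFFD, "ignore" drops the byte,
# "backslashreplace" keeps an escape sequence.
assert raw.decode("utf-8", errors="replace") == "a,b\n1,caf\ufffd"
assert raw.decode("utf-8", errors="ignore") == "a,b\n1,caf"
assert raw.decode("utf-8", errors="backslashreplace") == "a,b\n1,caf\\xe9"
```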
Conversation
lgtm. cc @WillAyd, who had some opinions on the naming.
I think …
hmm, my concern here is that we already have … for these.
Ah, that's a valid point. In that case I _do_ think something like `encoding_errors` is explicit, so maybe that is the best option.
I don't understand why the C-engine doesn't work. I think it either a) gets the original byte sequence and cannot convert it to UTF-8, or b) gets the corrected string ("corrected" = "surrogatepass") but doesn't like it? edit: probably the second case, as …
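The two cases above can be reproduced in plain Python, which may help narrow down which one applies. A short illustration of the "surrogateescape"/"surrogatepass" interplay (illustrative only, not pandas code):

```python
# "surrogateescape" smuggles undecodable bytes into the string as lone
# surrogates; a strict re-encode then fails, while "surrogatepass"
# and "surrogateescape" itself can handle them.
raw = b"caf\xe9"                       # invalid UTF-8 (latin-1 "é")
s = raw.decode("utf-8", "surrogateescape")
assert s == "caf\udce9"                # lone surrogate U+DCE9

try:
    s.encode("utf-8")                  # case (b): strict encoding chokes
except UnicodeEncodeError:
    pass

assert s.encode("utf-8", "surrogateescape") == raw  # round-trips
```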
Is the error happening in the tokenizer, perhaps? You could try enabling more verbose debugging of that if you uncomment this line and rebuild the extensions: …
It will give a lot of details, so you will want to use a minimal example.
It doesn't print much for …
Debugged a little bit and noticed that the C extension fails very early on, before tokenization even happens. This branch gets hit when just trying to interact with the file handle: pandas/pandas/_libs/src/parser/io.c, Line 188 in 10034a4.
Not sure why the …
thank you for looking into it! I don't "feel at home" debugging the C extensions ;) but I will try to stare a bit more at it. I assume

```c
args = Py_BuildValue("(i)", nbytes);
func = PyObject_GetAttrString(src->obj, "read");
result = PyObject_CallObject(func, args);
```

should correspond to

```python
obj = open(...)  # done in python
obj.read(nbytes)
```

I'm surprised it tries to read a specific amount of bytes (maybe the variable name is just misleading); the C-engine is always given a string buffer.
The file should already be opened at this point - this part of the C code is really just calling read on the handle. I *think* the issue is before the extension even gets called.
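On the `nbytes` question: for a text-mode handle, `read(n)` counts characters rather than bytes, so the variable name is indeed misleading when the source is a decoded string buffer. A quick illustration:

```python
import io

buf = io.StringIO("héllo")   # text stream, like a wrapped file handle
chunk = buf.read(3)          # n is a character count here, not bytes
assert chunk == "hél"
assert len(chunk.encode("utf-8")) == 4  # "é" is two bytes in UTF-8
```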
Ah, sorry, I was wrong before. I think the issue actually stems from this line in the extension: pandas/pandas/_libs/src/parser/io.c, Line 194 in 859a787.
According to the Python docs, the error handling for that function is always "strict".
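In Python terms, `PyUnicode_AsUTF8String(obj)` is equivalent to `obj.encode("utf-8")` with the default strict handler, which is why a lone surrogate produced by "surrogateescape" makes it fail. A sketch of the difference:

```python
s = "caf\udce9"  # contains a lone surrogate, e.g. from "surrogateescape"

# Equivalent of PyUnicode_AsUTF8String: always strict, so it raises.
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    pass

# Equivalent of PyUnicode_AsEncodedString(result, "utf-8", "surrogatepass").
assert s.encode("utf-8", "surrogatepass") == b"caf\xed\xb3\xa9"
```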
Changing that line to …
OK, this diff should get your test to pass:

```diff
diff --git a/pandas/_libs/parsers.pyx b/pandas/_libs/parsers.pyx
index c4d98ccb8..c403eaf18 100644
--- a/pandas/_libs/parsers.pyx
+++ b/pandas/_libs/parsers.pyx
@@ -632,7 +632,7 @@ cdef class TextReader:
         char *word
         object name, old_name
         uint64_t hr, data_line = 0
-        char *errors = "strict"
+        char *errors = "surrogatepass"
         StringPath path = _string_path(self.c_encoding)
         list header = []
         set unnamed_cols = set()
@@ -673,11 +673,8 @@ cdef class TextReader:
         for i in range(field_count):
             word = self.parser.words[start + i]
-            if path == UTF8:
-                name = PyUnicode_FromString(word)
-            elif path == ENCODED:
-                name = PyUnicode_Decode(word, strlen(word),
-                                        self.c_encoding, errors)
+            name = PyUnicode_Decode(word, strlen(word),
+                                    self.c_encoding, errors)
             # We use this later when collecting placeholder names.
             old_name = name
diff --git a/pandas/_libs/src/parser/io.c b/pandas/_libs/src/parser/io.c
index 51504527d..6abce4ce5 100644
--- a/pandas/_libs/src/parser/io.c
+++ b/pandas/_libs/src/parser/io.c
@@ -191,7 +191,7 @@ void *buffer_rd_bytes(void *source, size_t nbytes, size_t *bytes_read,
     *status = CALLING_READ_FAILED;
     return NULL;
   } else if (!PyBytes_Check(result)) {
-    tmp = PyUnicode_AsUTF8String(result);
+    tmp = PyUnicode_AsEncodedString(result, "utf-8", "surrogatepass");
     Py_DECREF(result);
     if (tmp == NULL) {
       PyGILState_Release(state);
```

These are just hard-coded arguments, so we will need to figure out a way to pass them along through the parser, but this should hopefully give you the gist of what we need to do. Separately, I think we might want to audit our uses of …
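The Cython side of the diff routes every header word through `PyUnicode_Decode(word, strlen(word), self.c_encoding, errors)`. In pure Python that call is simply `bytes.decode(encoding, errors)`, so the effect of threading the handler through can be sketched as:

```python
word = b"caf\xe9"  # a tokenized field that is not valid UTF-8

# What the old UTF8 fast path effectively did (strict decode):
try:
    word.decode("utf-8")
except UnicodeDecodeError:
    pass

# What the unified PyUnicode_Decode path does once `errors` is threaded
# through from the user-facing `encoding_errors` argument:
assert word.decode("utf-8", "replace") == "caf\ufffd"
assert word.decode("utf-8", "surrogateescape") == "caf\udce9"
```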
Thank you @WillAyd very, very much! That is super helpful! I will look into how I can best incorporate that :)
Now it should forward … When running …
I think the C / Cython edits are reasonable. It looks like the test is failing due to passing keyword arguments from Python down to Cython. I think this is because `encoding_errors` isn't specified as an option in pandas/io/parsers/readers.py.
Could it be that the attribute is garbage collected after the constructor because no pure Python code references it?
Actually, looking at this again, I think you need to update the declarations in tokenizer.h that are being imported into parsers.pyx. There might be some mangling of the signature, since the header files don't match what is actually being referenced.
This makes it work (at least locally for me): …
I think I should call `Py_DECREF`, but calling `Py_DECREF(PyBytes_FromString(self.encoding_errors))` in `close()` causes a segfault the first time `Py_DECREF` is called. Printing `self.encoding_errors` before the `Py_DECREF` prints the expected content, though.
I don't think you need to reference count methods that are cimported from Cython. Will try to look later today or tomorrow, but this is really close.
```python
if isinstance(encoding_errors, str):
    encoding_errors = encoding_errors.encode("utf-8")
Py_INCREF(encoding_errors)
self.encoding_errors = PyBytes_AsString(encoding_errors)
```
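A pure-Python sketch of why the `Py_INCREF` in the snippet above matters (hypothetical `Reader` class, not pandas code): `PyBytes_AsString` returns a pointer into the bytes object's internal buffer, so that object must stay alive for as long as the `char *` is used. Holding a reference on the instance is the pure-Python analogue.

```python
class Reader:
    """Hypothetical sketch mirroring the Cython snippet above."""

    def __init__(self, encoding_errors="strict"):
        if isinstance(encoding_errors, str):
            encoding_errors = encoding_errors.encode("utf-8")
        # Cython: Py_INCREF keeps the bytes object alive so the char *
        # returned by PyBytes_AsString stays valid. In pure Python we
        # simply store the object on self.
        self._errors_owner = encoding_errors
        self.encoding_errors = self._errors_owner  # stands in for the char *

r = Reader("replace")
assert r.encoding_errors == b"replace"
```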
Can you assert the valid values that are allowed here (or, if this is not validated at a higher level, raise ValueError on an illegal value)? Do we have tests for same?
I will add a test in `get_handle`. That will fire for the python/c-engine.
pandas/io/common.py
```python
    "surrogateescape",
):
    raise ValueError(
        f"Invalid value for `encoding_errors` ({errors}). Please see "
```
Is it okay to rename `get_handle`'s `errors` to `encoding_errors`? `get_handle` is technically not a private function.
Instead of enumerating all possible error codes, I could also use `codecs.lookup_error` and then raise a different error message. Would that be better?
@jreback Do you have thoughts about using `codecs.lookup_error(errors)` and then either letting it error directly (LookupError) or catching its error and raising a different one? Using `codecs.lookup_error` would make it more future-proof, and it is also similar to how the new XML code detects bad encodings (pandas/pandas/io/formats/xml.py, Line 176 in f290b65: `codecs.lookup(self.encoding)`).
I could create a PR to call `codecs.lookup(encoding)` and `codecs.lookup_error(errors)` inside `get_handle` and, if intended, catch their errors and raise a more user-friendly error.
lgtm. Do we need to update anything in https://pandas.pydata.org/pandas-docs/dev/user_guide/io.html#io-read-csv-table ? cc @WillAyd if any more comments.
lgtm - not an easy change so kudos for seeing this through
thanks @twoertwein, very nice!
@WillAyd Thank you very much for your help! I would have been too frustrated without it :)
Should `encoding_errors` be added to `to_csv`, and should `errors` in `to_csv` show a `DeprecationWarning` (for consistent naming)?