to_json raises 'Unhandled numpy dtype 15' with complex values. #12554
I think the fundamental problem here is that JSON has no native way to represent complex values. I suppose we could encode complex values in some way here (e.g., by converting to strings, or to dictionaries with real/imaginary keys), but I'm not sure how useful that would be.
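For concreteness, the two encodings mentioned above might look like this in plain Python (a sketch; the names `as_string` and `as_dict` are illustrative, not part of any API):

```python
z = complex(3, -1)

# Option 1: encode as a string (Python's own textual form)
as_string = str(z)                               # '(3-1j)'

# Option 2: encode as a dictionary with real/imaginary keys
as_dict = {"real": z.real, "imaginary": z.imag}  # {'real': 3.0, 'imaginary': -1.0}
```

The dictionary form is lossless and language-neutral, while the string form is compact but requires a parser to recognize it.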
this is not tested, and AFAIK not in the JSON standard. What would JSON output look like for this?
Given that JSON is, after all, JavaScript Object Notation, one choice might be to implicitly rely on a JavaScript library that handles complex numbers. Three candidates are math.js, numbers.js, and numericjs. math.js and numbers.js are Apache-licensed; numericjs is MIT-licensed. numbers.js depends on node.js. I did not find a list of dependencies for math.js, but I did find a statement that numericjs has no dependencies. math.js names complex values as Complex objects, but with a lowercase math.complex as a factory (documentation):
numbers.js names complex values as Complex objects (test file):
The documentation is sparse. numericjs names complex values as x,y coordinates (documentation):
On cursory examination of the repos, math.js seems the most mature. So my proposal would be to actually expect the output:

Once math.js has been imported, that JSON becomes interpretable in JavaScript, but that's a side concern for pandas. It also seems relatively clear for human consumption, and relatively easy to implement in Python deserialization. The major downside I see is that any naive interpreter (e.g. a JSON validator) will fail.

Given that downside, perhaps this could be implemented as an option to to_json, with a default of turning these values into strings, perhaps with a warning? E.g.

or

The first of those would be easier to DTRT with and robustly re-serialize as a complex value anyway.
(grumble... why does github put "Close and comment" next to "Comment", instead of e.g. putting "Preview" there, and putting "Close and Comment" on the other side or whatever?) Sorry.
@gregory-marton I think you are confusing the |
see #12213 for some related commentary.
@jreback not confusing; intentionally confounding, as a way of getting to a readable and sensible standard proposal.
In particular, especially in the case of stringifying the value, notating it as "math.complex(3, -1)" makes the intent very clear, and confusion with something else (e.g. a version string) unlikely. I look to JavaScript on a historical basis, and because using an existing library's notation might aid buy-in.
@gregory-marton Even in JavaScript, it's a bad idea to use |
@gregory-marton Now that I read your reply more carefully, I see that you do propose using strings. That seems fine to me. I would definitely prefer |
@shoyer fair enough. Non-string proposal withdrawn.
@shoyer the downside I see with using just '3-1j' or '3 - 1j' is that those are not crazy strings to have in other contexts. Consider a product code or version string. That gets worse when you allow '3.2-1.j' etc. It would be safer to automatically parse "math.complex(3, -1)" as a complex value in Python than it would be to parse "3-1j" that way. As for requiring spaces, I know that Python is the language of significant whitespace, but that's not a sort of whitespace I would imagine standing out to a casual reader as required.
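To illustrate the ambiguity argument: a decoder can match the `math.complex(...)` form with essentially no false positives, whereas a bare `'3-1j'` also matches plausible product codes. A minimal sketch (the `maybe_complex` helper is hypothetical, not from any library):

```python
import re

# assumption: only strings of the exact form "math.complex(re, im)" decode to complex
_COMPLEX_RE = re.compile(r"math\.complex\(\s*([-+0-9.eE]+)\s*,\s*([-+0-9.eE]+)\s*\)")

def maybe_complex(s):
    m = _COMPLEX_RE.fullmatch(s)
    if m:
        return complex(float(m.group(1)), float(m.group(2)))
    return s  # leave anything else (e.g. a version string) untouched

maybe_complex("math.complex(3, -1)")  # (3-1j)
maybe_complex("3-1j")                 # stays the string "3-1j"
```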
@gregory-marton well, that's the issue with JSON: it's type-less and you have to infer things. You can have a look at whether there are already any standards for how to represent complex numbers (for dates, for example, there are two competing ones: an integer epoch and an ISO string).
@jreback Good point! I should have looked. math.js's complex numbers documentation has a json section. They encode complex values as
That certainly seems like an option, as dictionaries are not allowed in DataFrames (right?). So then the desired output would be
@jreback well, there goes that idea, then. Other suggestions? Or is that a special case pandas would be willing to live with?
well you can certainly do this. I think the default should be to stringify (like we do for everything else that cannot be handled). The issue here is that
for each element.
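A concrete workaround along the lines of the per-element conversion described above (a sketch, not the fix itself; `Series.map` is used since it is available across pandas versions):

```python
import json
import pandas as pd

df = pd.DataFrame({'a': [complex(3, -1)], 'b': [1.5]})

# stringify complex cells before serializing; other cells pass through untouched
def stringify_complex(v):
    return str(v) if isinstance(v, complex) else v

safe = df.apply(lambda col: col.map(stringify_complex))
out = safe.to_json()
json.loads(out)["a"]["0"]  # '(3-1j)'
```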
So for json.dumps, I can provide a cls=... to specify how I want stuff converted if there is no default. That's analogous to the default_handler option to to_json. So I'm trying to write an encoder that will DTRT without my having to apply the above transformation separately. My first guess is:
But I still get
Thoughts on next steps? I understand applying the function manually is a workaround, but it would be nice if default_handler could apply when a numpy dtype happens to be unhandled. Then again, I'm somewhat new to this and may not have gotten my guess right.
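For comparison, the plain-`json` equivalent works today: `json.dumps` calls the encoder's `default` method for any type it cannot serialize natively, which is exactly the behavior being asked of pandas' default_handler. A sketch:

```python
import json

class ComplexEncoder(json.JSONEncoder):
    def default(self, obj):
        # called only for objects json can't serialize natively
        if isinstance(obj, complex):
            return str(obj)
        return super().default(obj)

json.dumps({"a": complex(3, -1)}, cls=ComplexEncoder)
# '{"a": "(3-1j)"}'
```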
this is a bit tricky. The problem is that since this is a numpy dtype, it goes down a certain path (in the c-code), then can't figure out what to do and raises an error (and the default_handler is not called). So that's a bug; when fixed, it will allow the default_handler to succeed.
Should I file that bug separately, or will this one serve?
Perhaps I should file that with numpy, rather than pandas?
This has nothing to do with numpy; the code is here
Perhaps it should not raise the RuntimeError but just return JT_INVALID, and it should be write's responsibility to check the return code and invoke the default_handler if available (which should raise its own error), or, if there is no default_handler, to raise the RuntimeError then? I think I could write that patch.
yep exactly. Rather, it should return a code that allows it to fall out of that loop, then treat it like an iterable (not 100% sure that would work, but give it a try). Once the iterable has it, it should call the defaultHandler.

I misread a bit of what you pointed to. This should ALL be in the c-code; that's where everything is handled.
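In Python terms, the dispatch being described (handle native types directly, recurse into iterables, otherwise fall back to the caller's handler) might be sketched as follows. This is an illustration of the control flow only, not the actual C code in pandas' ujson fork:

```python
def encode_value(value, default_handler=None):
    # natively serializable JSON scalars pass straight through
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    # containers recurse, so handler results are themselves re-checked
    if isinstance(value, (list, tuple)):
        return [encode_value(v, default_handler) for v in value]
    # anything else falls out to the default handler, if one was supplied
    if default_handler is not None:
        return encode_value(default_handler(value), default_handler)
    raise TypeError(f"Unhandled type: {type(value).__name__}")

encode_value([1, complex(3, -1)], default_handler=str)  # [1, '(3-1j)']
```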
Thank you for misunderstanding me well. Just removing the RuntimeError creates interesting json:
That's a json string representing a python string that contains a stringified dictionary, with the first column replaced by something very strange. But of course one needs to do more. This is as far as I got today:
I verified that the test fails the way it should without changes. With the shown set of changes, it gives me a segfault, which I would have started debugging, but El Capitan is stupid. Tomorrow I'll try on a different machine where debugging isn't quite such a pita. In the meantime, I'd love some early feedback.
So where I got with this today: I have the default handler getting called for each element of the bad-dtype column. Unfortunately it's getting called with the whole DataFrame instead of with the individual cell value. Even then, I'd like the dispatch to go to the defaultHandler only if the type is not one of the ones I can handle internally, so that remains to be done too. If one of you knows how I can get the individual value as a PyObject instead of the whole df, that'd be a great hint.
I feel like what I want is:
but I'm not sure where to get my hands on the relevant PyArrayObject instead of the whole dataframe.
I left in a huff last time after having the following conversation with gdb:
I have otherwise made no significant further progress, and I'm going to stop working on this for a while. If someone wants to jump in with an explanation of the right way to approach the desired behavior, I might take another shot, but stuff like the above is just... discouraging.

Out of curiosity: how current and how strong are the use cases that drove the decision to write this in C? In what client contexts are serialization and deserialization the most performance-sensitive tasks, more important than flexibility (and perhaps correctness, if the above is an indication)? Is there a considered design behind the structure of the C code (long functions, use of goto, non-use of Boost), and if so, where can I read about it?
@gregory-marton that's why I cc'd @Komnomnomnom; he is the original author. The style is typical C style for code this low-level. This is how numpy does lots of things, and the gotos are almost all for error handling. Further, you can't use much newer tooling (e.g. Boost); these things I believe all compile under C99. As for perf: that is of course why it's written in C; in fact this is a pretty big bottleneck in general for JSON. As for correctness, this is simply an unsupported feature (as are a couple of other things).
Correctness: the surprising overlap of memory addresses is not something I've diagnosed; it may be related to one of my attempts to implement the requested feature, and it may even be working as intended. But cryptic behavior of that sort makes it very easy to introduce bugs, if the interaction doesn't point to one already. Some of the gotos appear to be for ITERABLES processing, which I suspect could be delineated more clearly, and which is precisely where my understanding fell down. One could clarify the code without sacrificing performance by breaking large functions into more digestible pieces, with clearer flow and more naming to guide a new developer toward the intent.

Re: perf and language, I didn't realize that C99 support (and perhaps beyond) was an explicit pandas goal, sorry. I don't want to derail this ticket with further discussion. Thanks for cc'ing @Komnomnomnom; I look forward to any input.
@gregory-marton the C code is the ujson library, highly customised/subverted to iterate through numpy and pandas objects. Because it's trying to deal directly with C datatypes whilst sticking to the ujson logic, things get a little... tricky. Dealing with the numpy C data types directly at the top of that function is deliberate, and it's unwise to let them be handled further down the function. That said, it should be straightforward to dispatch to the default handler for unsupported dtypes. Happy to take a look if you're fed up with it; mind if I nab your test code?
Please do! I put my work-in-progress, including tests, into the patchfile at the bottom of comment #12554 (comment).
@gregory-marton just added some JSON fixes (#12802, #12878) which should take care of this once merged. Also FYI, have a look at:

In [31]: lst = [9, DataFrame({'a': [1, 2.3, (4, -5)], 'b': [float('nan'), None, 'N/A']}), np.array([1,2,3])]
In [32]: pandas.io.json.dumps(lst)
Out[32]: '[9,{"a":{"0":1,"1":2.3,"2":[4,-5]},"b":{"0":null,"1":null,"2":"N\\/A"}},[1,2,3]]'

Hopefully this will be exposed at the top level at some point (#9147)
Code Sample, a copy-pastable example if possible
import pandas
df = pandas.DataFrame({'a': [complex(3, -1)]})
df.to_json()
Expected Output
'{"a":{"0":3-1j}}'
Actual output: raises 'Unhandled numpy dtype 15'
output of
pd.show_versions()