df.to_json segfaults with categorical index #10317
Yep, not implemented ATM. Pull requests welcome. This would require delving into the C code, but it's not that difficult. The reason this fails is that
With respect for all the work you do, avoiding a segfault is not an enhancement. If you don't have near-term plans to support serializing DataFrames with categorical indices (my example was the result of using pandas.core.algorithms.quantile), you should add a guard clause that throws an exception indicating this fact, not let the code run through to a segfault in C. This is especially true because earlier versions of pandas (0.14 in my experience) did allow categorical indices, so a simple package update can introduce a segfault.
Looking a little deeper, I can see that there is a guard clause, but it is not working. In io/json.py, _format_axes() for FrameWriter relies on these lines:
    if not self.obj.index.is_unique and self.orient in ('index', 'columns'):
        raise ValueError(" .... ")
However, the CategoricalIndex in the example above returns True for is_unique, and because the 'split' orientation is not addressed in the guard clause, execution runs past this protection and into C, where it segfaults.
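The gap described in this comment can be sketched in plain Python. The helper name `guard_fires` is hypothetical; only the condition is taken from the line quoted above, and this is an illustration rather than the pandas source.

```python
def guard_fires(index_is_unique, orient):
    """Mirror the condition quoted from _format_axes() (hypothetical helper)."""
    return (not index_is_unique) and orient in ('index', 'columns')


# A CategoricalIndex with distinct labels reports is_unique == True,
# so the guard never fires for it, whatever the orientation.
for orient in ('split', 'index', 'columns'):
    print(orient, guard_fires(True, orient))

# Even a non-unique index passes the guard when orient='split',
# because 'split' is not in the tuple being checked.
print('split (non-unique):', guard_fires(False, 'split'))
```

The check only looks at uniqueness and orientation, never at the index type, which is why it offers no protection in this case.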
@sborgeson I marked this as a bug; the enhancement tag is because I think this may need to be restructured a bit. As I said, happy to take a patch, even if it just raises an exception. All that said, when upgrading across multiple major versions it is important to review the release notes and test. Which is how I am sure you found this :) I'm going to release 0.16.2 in the next few days, so if you'd like to put up a simple patch, it would be great to get it in.
As I seem to have already demonstrated, I'm not sure I know enough about the latest index types to reliably patch this. I was thinking the clause would be something like this (in io/json.py, _format_axes() for FrameWriter):
    if type(self.obj.index) == pandas.core.index.CategoricalIndex:
        raise ValueError( ... )
But I've looked at core/index.py and I see that there are several index types, especially MultiIndex, that might also require protection. I also noticed that the ValueError that is thrown for orient='index' comes out of the C code rather than the guard clauses in Python, so I'm no longer convinced I understand the intention of the guard clauses I was looking at vs. the error handling in C. I think I'll have to wait for the benevolent actions of a developer more familiar with the code base than I am. So thanks for wrangling the bugs.
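A type-based guard of the kind proposed here could look roughly like the sketch below. The function `check_index_for_json` and its error message are hypothetical and are not the fix that was eventually merged; the sketch uses `isinstance`, which also matches subclasses, in place of the exact `type(...) ==` comparison.

```python
import pandas as pd


def check_index_for_json(obj):
    """Hypothetical pre-serialization check; not the fix merged in pandas."""
    # isinstance also catches subclasses, unlike an exact type() comparison.
    if isinstance(obj.index, pd.CategoricalIndex):
        raise ValueError("to_json does not support a CategoricalIndex")


df = pd.DataFrame({"value": [1, 2, 3]},
                  index=pd.CategoricalIndex(["low", "mid", "high"]))

try:
    check_index_for_json(df)
except ValueError as exc:
    print("rejected:", exc)
```

A check like this turns a crash into a catchable Python exception, at the cost of having to list every index type that needs protection.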
Fixed in #10321. Note that I am going to keep this open, as this should actually be handled in the C code (maybe).
I've got a fix for the C code which passes the same tests as #10321.
@evanpw Ohh, excellent. Pull in my tests and put up your fix!
Bug in to_json causing segfault with a CategoricalIndex (GH #10317)
Closed by #10322. @sborgeson this is now merged into master and will be in 0.16.2. Thanks for the report!
This was a very efficient process. Thanks for your work on such a great set of tools.
DataFrame.to_json reliably segfaults Python when the DataFrame has an index of type CategoricalIndex.
If I call it with orient='index', I get a ValueError instead:
For what it's worth, my work around, which is acceptable in my application, is to convert my index to strings:
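The reporter's original snippet is not preserved here. A minimal sketch of such a workaround, assuming a small frame with made-up labels and a made-up `value` column, might look like this:

```python
import json

import pandas as pd

# Illustrative data only; the labels and column name are assumptions.
df = pd.DataFrame({"value": [1, 2, 3]},
                  index=pd.CategoricalIndex(["low", "mid", "high"]))

# Replace the CategoricalIndex with plain string labels before serializing.
df.index = [str(label) for label in df.index]

# Serialize with the 'split' orientation and parse the result back.
payload = json.loads(df.to_json(orient="split"))
print(payload["index"])
```

Assigning a list of strings gives the frame an ordinary index, so the categorical type never reaches the JSON writer. The categorical information itself is lost in the process.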
Windows config:
INSTALLED VERSIONS
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.16.1
nose: None
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.15.1
statsmodels: None
IPython: 2.1.0
sphinx: None
patsy: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 2.0.3
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
Linux config:
INSTALLED VERSIONS
commit: None
python: 2.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-274.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.1
nose: None
Cython: None
numpy: 1.9.2
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.2.0
numexpr: 2.4.3
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None