Pandas Corrupting PyArrow Integer Object Nulls #23786
Comments
Can you provide a complete example (including writing Pandas master supports nullable integer columns: http://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#optional-integer-na-support.
This is reproducible without
After the release of |
Slightly smaller example
I think this is a duplicate of #5541. We have a few others touching on this subject.
Closing in favor of #5541. The fix there would be to make DataFrame.replace aware of extension arrays.
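For readers landing here later, a minimal sketch of the nullable integer route mentioned above. It assumes a pandas release recent enough that `replace` handles extension arrays (which the version reported in this issue did not), and the column values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: a nullable integer column stored with the "Int64"
# extension dtype (capital "I") instead of object or float64.
df = pd.DataFrame({"data": pd.array([1, None, 3], dtype="Int64")})

# Assumption: this pandas version has an extension-array-aware replace.
fixed = df["data"].replace(1, 10)

print(fixed.dtype)            # Int64 on versions where the assumption holds
print(fixed.isna().tolist())  # the missing entry stays missing
```

If the dtype printed is anything other than `Int64`, the installed pandas predates the extension-array-aware `replace`.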
Code Sample, a copy-pastable example if possible
Problem description
When loading a parquet file through `pyarrow` and converting to `pandas`, there is an option to use integer object nulls. This works as long as no action is performed on the data via pandas.

In this example, a parquet file is loaded with a nullable int column. A DataFrame is resolved from this parquet file with the `integer_object_nulls` flag set. For all intents and purposes, the DataFrame now supports nullable ints. Pandas is used to replace `bad_data` with `good_data` in the `data` column, and in doing so, upcasts the ints to floats and replaces object nulls with NaN.

This shows pandas has the ability to hold onto nullable ints, but any subsequent operation on that DataFrame corrupts the data. Pandas should respect the `integer_object_nulls` flag.

I'm not hopeful there is an official solution (after days of searching I've only found explanations of why it works this way, but no suggestions on how to move forward other than corrupting your own data), but if there is a hidden feature, a workaround, or a hack, please let me know. We don't have the option not to use nullable ints, so if this cannot be achieved in pandas I'll need to pivot to a different library.
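The two representations described above can be shown without pyarrow. This is only a sketch with made-up values, not the original code sample: an object column can hold Python ints next to `None`, while the default NumPy-backed inference has no integer null and falls back to `float64`.

```python
import pandas as pd

values = [1, None, 3]  # hypothetical column contents

# Object dtype keeps the Python ints and None untouched, which is the
# shape of data that integer_object_nulls is described as producing.
as_object = pd.Series(values, dtype=object)
print(as_object.tolist())  # [1, None, 3]

# Letting pandas infer the dtype upcasts to float64 and turns None into NaN.
inferred = pd.Series(values)
print(inferred.dtype)      # float64
print(inferred.tolist())   # [1.0, nan, 3.0]
```

Any operation that re-infers the dtype of the object column can therefore end up in the second representation, which matches the behaviour reported for `replace` here.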
I appreciate all of the work done on pandas and apologize up front if I've come across as negative. I'm a bit frustrated.
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 18.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 39.1.0
Cython: None
numpy: 1.15.2
scipy: 1.1.0
pyarrow: 0.11.1
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.3
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None