BUG: parquet serialization/deserialization adds all dict keys into column #56842
Update: this seems to be a behavior of pyarrow, because the fastparquet example above seems to work correctly.
cc @jorisvandenbossche - any insights here?
The reason for this behaviour is that in the conversion from pandas to Arrow, the default for dictionaries is to convert them to a struct type. So the default behaviour:
This introduces None (null) values for the keys that weren't initially present. You can override the default conversion by providing the Arrow schema that the pandas.DataFrame should be converted to:
The default conversion of a map type from Arrow to pandas, however, then gives you tuples and not dicts (that's because the spec allows duplicate keys, which wouldn't be representable by dicts). This can be overridden with a keyword; see the to_pandas documentation linked below.
See https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas for the keywords that you can specify to control the conversion from Arrow to pandas.
It seems that the reason this roundtrips better with fastparquet is that they store the Python dictionaries serialized as JSON text, and then load that again as dicts.
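The principle can be illustrated without fastparquet itself: serializing each dict to JSON text trivially preserves exactly the keys each row had (a stdlib-only sketch of the idea, not fastparquet's actual code path):

```python
import json

rows = [{"a": 1}, {"b": 2}]

# Store each dict as a JSON string (what a parquet text column would hold)...
stored = [json.dumps(r) for r in rows]

# ...and load it back: each row keeps only its own keys.
loaded = [json.loads(s) for s in stored]
print(loaded)  # [{'a': 1}, {'b': 2}]
```

The trade-off is that the values are opaque text to parquet, so you lose columnar typing and predicate pushdown on the contents.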
Existing issue for this: #49236
I don't see a rationale for structs to be the default. This serialization is not reversible - I can't tell whether a key was not present or its value was just None.
Maps are more restrictive than structs in certain ways: all values should have the same type (Arrow could do a first pass through the data to check that, and then base a decision on that, but that would also make it less predictable and data dependent). My feeling is also that in general structs are more common. In the end, the problem is that pandas does not (yet) have a proper struct and map type of its own, and so you have to use an object dtype with actual Python dictionaries, and so the conversion to pyarrow always has to guess. (You can actually use the experimental ArrowDtype (https://pandas.pydata.org/docs/user_guide/pyarrow.html) to have a map dtype in pandas, but I don't know if many operations are supported already specifically for maps.)
I certainly understand that it is annoying to have to figure out that you have data loss and why, but it is not necessarily that easy to know when to warn for this (to know the intention of the user). Maybe pyarrow could warn specifically in the case where the keys from the first row are not the same in all subsequent rows (I don't know how common this would be in cases where you want the default struct, because then that warning would be annoying). But in any case, this is a discussion for pyarrow -> https://github.com/apache/arrow/issues
See the second comment in the thread.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I have a column of type dict[str, int]. If I save and load the dataframe to parquet, every entry in the column is filled with all keys. So there are two problems: 1. it does not faithfully represent what was saved; 2. it blows up in size, because there are many keys that are present in only one or two rows.
Maybe relevant (not sure): #55776
Expected Behavior
Saved and loaded dataframes are identical.
Installed Versions
INSTALLED VERSIONS
commit : a671b5a
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:28 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.4
numpy : 1.24.3
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.0
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader : None
bs4 : 4.11.1
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.0
gcsfs : None
matplotlib : 3.8.2
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2022.7
qtpy : None
pyqt5 : None