BUG: parquet serialization/deserialization adds all dict keys into column #56842

Open
2 of 3 tasks
arogozhnikov opened this issue Jan 12, 2024 · 8 comments
Labels
Arrow, IO Parquet, Usage Question

Comments

@arogozhnikov

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.DataFrame({'dictcol': [{'a': 1}, {'b': 2}, {'c': None}]}).to_parquet('/tmp/data.pqt')
pd.read_parquet('/tmp/data.pqt')
# loaded dataframe contains all keys in every row

Issue Description

I have a column of type dict[str, int]. If I save the dataframe to parquet and load it back, every entry in the column is filled with all keys.

So there are two problems: 1. the loaded data does not faithfully represent what was saved; 2. the size blows up, because there are many keys that are present in only one or two rows.

Maybe relevant (not sure): #55776

Expected Behavior

Saved and loaded dataframes are identical.

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:28 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.24.3
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.0
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader : None
bs4 : 4.11.1
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.0
gcsfs : None
matplotlib : 3.8.2
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2022.7
qtpy : None
pyqt5 : None

@arogozhnikov arogozhnikov added the Bug and Needs Triage labels Jan 12, 2024
@arogozhnikov
Author

Update: this seems to be pyarrow behavior, because the example above works correctly with fastparquet.

@rhshadrach rhshadrach added the Arrow label Jan 16, 2024
@rhshadrach
Member

cc @jorisvandenbossche - any insights here?

@rhshadrach rhshadrach added the IO Parquet label Jan 16, 2024
@jorisvandenbossche
Member

The reason for this behaviour is that in the conversion from pandas to Arrow, the default for dictionaries is to convert them to a struct type, while you probably want a map type.
(A struct has a fixed set of keys present on each row, so the conversion unifies the keys across all rows; a map has a fixed type for the key and the value (like dict[str, int]), but each row can carry different keys.)

So the default behaviour:

In [17]: df = pd.DataFrame({'dictcol': [{'a': 1}, {'b': 2}, {'c': None}]})

In [18]: df
Out[18]: 
       dictcol
0     {'a': 1}
1     {'b': 2}
2  {'c': None}

In [20]: df.to_parquet('/tmp/data.pqt')

In [21]: pd.read_parquet('/tmp/data.pqt')
Out[21]: 
                             dictcol
0   {'a': 1.0, 'b': None, 'c': None}
1   {'a': None, 'b': 2.0, 'c': None}
2  {'a': None, 'b': None, 'c': None}

Introducing all None (null) values for the keys that weren't initially present.

You can override the default conversion by providing the Arrow schema that the pandas.DataFrame should be converted to (note this requires import pyarrow as pa):

In [22]: df.to_parquet('/tmp/data2.pqt', schema=pa.schema([("dictcol", pa.map_(pa.string(), pa.int64()))]))

In [23]: pd.read_parquet('/tmp/data2.pqt')
Out[23]: 
       dictcol
0   [(a, 1.0)]
1   [(b, 2.0)]
2  [(c, None)]

The default conversion of a map type from Arrow to pandas, however, gives you tuples and not dicts (that's because the spec allows duplicate keys, which wouldn't be representable as dicts). This can be overridden with a keyword, but pandas.read_parquet currently doesn't allow passing kwargs to the pyarrow.Table.to_pandas call, only to the parquet reading (we should somehow allow this, I think). So I'm illustrating it by reading in two steps (reading the Parquet file, then converting to pandas):

In [25]: import pyarrow.parquet as pq

In [26]: table = pq.read_table("/tmp/data2.pqt")

In [27]: table.to_pandas(maps_as_pydicts="strict")
Out[27]: 
       dictcol
0   {'a': 1.0}
1   {'b': 2.0}
2  {'c': None}

See https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas for the keywords that you can specify to control the conversion from Arrow to pandas.

Update: this seems to be pyarrow behavior, because the example above works correctly with fastparquet.

It seems the reason this roundtrips better with fastparquet is that it stores the python dictionaries serialized as JSON text, and then loads that back as dicts.

@jorisvandenbossche
Member

pandas.read_parquet currently doesn't allow passing kwargs to the pyarrow.Table.to_pandas call, only to the parquet reading (we should somehow allow this, I think).

Existing issue for this: #49236

@jorisvandenbossche jorisvandenbossche added the Usage Question label and removed the Needs Triage and Bug labels Jan 17, 2024
@arogozhnikov
Author

I don't see a rationale for structs being the default.
But OK, if that should be the default for some reason, then some kind of warning should appear.

See, this serialization is not reversible: I can't tell whether a key was not present or its value was just None.
It is also subtle: I only noticed this behavior after having used parquet for serialization for a year.

@jorisvandenbossche
Copy link
Member

I don't see a rationale for structs to be default.

Maps are more restrictive than structs in certain ways: all values must have the same type (Arrow could do a first pass through the data to check that and base the decision on it, but that would also make the behavior less predictable and data dependent). My feeling is also that structs are more common in general.
Anyway, there are two options and pyarrow has to choose some default, so there is always going to be a group that wants the other default...

In the end, the problem is that pandas does not (yet) have a proper struct and map type of its own, and so you have to use an object dtype with actual python dictionaries, and so the conversion to pyarrow always has to guess.

(You can actually use the experimental ArrowDtype (https://pandas.pydata.org/docs/user_guide/pyarrow.html) to have a map dtype in pandas, but I don't know how many operations are already supported specifically for maps.)

then some kind of warning should appear.

I certainly understand that it is annoying to have to figure out that you have data loss and why, but it is not necessarily easy to know when to warn for this (i.e. to know the intention of the user). Maybe pyarrow could warn specifically when the keys of the first row are not the same in all subsequent rows (I don't know how common that would be in cases where you do want the default struct; the warning would be annoying then). But in any case, this is a discussion for pyarrow -> https://github.com/apache/arrow/issues

@shreyanshsaha

This comment has been minimized.

@arogozhnikov
Author

One workaround is to use 'fastparquet' as the engine

See the second comment in this thread.
