BUG: parquet serialization/deserialization adds all dict keys into column #56842

Open
2 of 3 tasks
arogozhnikov opened this issue Jan 12, 2024 · 8 comments
Labels
Arrow, IO Parquet, Usage Question

Comments

@arogozhnikov

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.DataFrame({'dictcol': [{'a': 1}, {'b': 2}, {'c': None}]}).to_parquet('/tmp/data.pqt')
pd.read_parquet('/tmp/data.pqt')
# loaded dataframe contains all keys in every row

Issue Description

I have a column of type dict[str, int]. If I save the dataframe to parquet and load it back, every entry in the column is filled with all keys.

So there are two problems: 1. the loaded data does not faithfully represent what was saved; 2. the size blows up, because there are many keys that are present in only one or two rows.

Maybe relevant (not sure): #55776

Expected Behavior

Saved and loaded dataframes are identical.

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:28 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.24.3
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.0
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader : None
bs4 : 4.11.1
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.0
gcsfs : None
matplotlib : 3.8.2
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2022.7
qtpy : None
pyqt5 : None

@arogozhnikov arogozhnikov added the Bug and Needs Triage labels Jan 12, 2024
@arogozhnikov
Author

Update: this seems to be pyarrow behavior, because the example above works correctly with fastparquet.

@rhshadrach rhshadrach added the Arrow label Jan 16, 2024
@rhshadrach
Member

cc @jorisvandenbossche - any insights here?

@rhshadrach rhshadrach added the IO Parquet label Jan 16, 2024
@jorisvandenbossche
Member

The reason for this behaviour is that in the conversion from pandas to Arrow, the default for dictionaries is to convert them to a struct type, while you probably want a map type.
(A struct has a fixed set of keys present on each row, so the conversion unifies the keys across all rows; a map has a fixed type for the key and the value (like dict[str, int]), but each row can carry different keys.)

So the default behaviour:

In [17]: df = pd.DataFrame({'dictcol': [{'a': 1}, {'b': 2}, {'c': None}]})

In [18]: df
Out[18]: 
       dictcol
0     {'a': 1}
1     {'b': 2}
2  {'c': None}

In [20]: df.to_parquet('/tmp/data.pqt')

In [21]: pd.read_parquet('/tmp/data.pqt')
Out[21]: 
                             dictcol
0   {'a': 1.0, 'b': None, 'c': None}
1   {'a': None, 'b': 2.0, 'c': None}
2  {'a': None, 'b': None, 'c': None}

Introducing all None (null) values for the keys that weren't initially present.

You can override the default conversion by providing the Arrow schema that the pandas.DataFrame should be converted to (note this requires import pyarrow as pa):

In [22]: df.to_parquet('/tmp/data2.pqt', schema=pa.schema([("dictcol", pa.map_(pa.string(), pa.int64()))]))

In [23]: pd.read_parquet('/tmp/data2.pqt')
Out[23]: 
       dictcol
0   [(a, 1.0)]
1   [(b, 2.0)]
2  [(c, None)]

The default conversion of a map type from Arrow to pandas, however, gives you tuples and not dicts (that's because the spec allows duplicate keys, which wouldn't be representable as dicts). This can be overridden with a keyword, but pandas.read_parquet currently doesn't allow passing kwargs to the pyarrow.Table.to_pandas call, only to the parquet reading (we should somehow allow this, I think). So I'm illustrating it by reading in two steps (reading the Parquet file, then converting to pandas):

In [25]: import pyarrow.parquet as pq

In [26]: table = pq.read_table("/tmp/data2.pqt")

In [27]: table.to_pandas(maps_as_pydicts="strict")
Out[27]: 
       dictcol
0   {'a': 1.0}
1   {'b': 2.0}
2  {'c': None}

See https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas for the keywords that you can specify to control the conversion from Arrow to pandas.

Update: this seems to be pyarrow behavior, because the example above works correctly with fastparquet.

It seems the reason this roundtrips better with fastparquet is that it stores the python dictionaries serialized as JSON text, and then loads that back as dicts.

@jorisvandenbossche
Member

pandas.read_parquet currently doesn't allow passing kwargs to the pyarrow.Table.to_pandas call, only to the parquet reading (we should somehow allow this, I think).

Existing issue for this: #49236

@jorisvandenbossche jorisvandenbossche added the Usage Question label and removed the Needs Triage and Bug labels Jan 17, 2024
@arogozhnikov
Author

I don't see a rationale for structs being the default.
But OK, if that should be the default for some reason, then some kind of warning should appear.

See, this serialization is not reversible: I can't tell whether a key was not present or its value was just None.
It is also subtle: I only noticed this behavior after having used parquet for serialization for a year.

@jorisvandenbossche
Copy link
Member

I don't see a rationale for structs to be default.

Maps are more restrictive than structs in certain ways: all values must have the same type (Arrow could do a first pass through the data to check that and base the decision on it, but that would also make the behavior less predictable and data dependent). My feeling is also that structs are more common in general.
Anyway, there are two options and pyarrow has to choose some default, so there is always going to be a group that wants the other default...

In the end, the problem is that pandas does not (yet) have a proper struct and map type of its own, and so you have to use an object dtype with actual python dictionaries, and so the conversion to pyarrow always has to guess.

(You can actually use the experimental ArrowDtype (https://pandas.pydata.org/docs/user_guide/pyarrow.html) to have a map dtype in pandas, but I don't know how many operations are already supported specifically for maps.)

then some kind of warning should appear.

I certainly understand that it is annoying to have to figure out that you have data loss and why, but it is not necessarily easy to know when to warn for this (i.e. to know the intention of the user). Maybe pyarrow could warn specifically when the keys of the first row are not the same in all subsequent rows (I don't know how common that would be in cases where you do want the default struct; the warning would be annoying then). But in any case, this is a discussion for pyarrow -> https://github.com/apache/arrow/issues

@shreyanshsaha

This comment has been minimized.

@arogozhnikov
Author

One workaround is to use 'fastparquet' as the engine

See the second comment in this thread.
