PERF: using `use_nullable_dtypes=True` in `read_parquet` slows performance on large dataframes #47345
Comments
@timlod thanks for the report! Doing a quick profile of both read_parquet calls using your reproducer, it seems that it is mostly the string column that is causing the slowdown. What is happening under the hood for that column: …
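For context, a profile like the one mentioned can be reproduced with the standard library's cProfile; a minimal sketch (the maintainer's exact tooling isn't shown in the comment):

```python
import cProfile
import pstats

import pandas as pd

# profile both variants and compare where the time goes
for nullable in (True, False):
    prof = cProfile.Profile()
    prof.enable()
    pd.read_parquet("test.parquet", use_nullable_dtypes=nullable)
    prof.disable()
    pstats.Stats(prof).sort_stats("cumulative").print_stats(15)
```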
Interesting - I hadn't thought of checking which one caused it; I guess I naively assumed it was the 'wrapping' of the NA itself that caused the overhead. If it's relatively straightforward (i.e. Python code only) I'd be happy to take a look at optimizing this. Would it be possible to just call …
To bypass the string conversion, you can pass a `types_mapper` manually, leaving out the string type:

```python
types_mapper = {
    pa.int64(): pd.Int64Dtype(),
    pa.int16(): pd.Int16Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    # add more types here, but leave out pa.string()
}.get
```
```python
df = pq.read_table("test.parquet").to_pandas(types_mapper=types_mapper)
```

Asking pyarrow to convert to …

I did one quick check and used

```python
result = lib.convert_nans_to_NA(scalars)
```

in place of

```python
result = lib.ensure_string_array(
    scalars, na_value=StringDtype.na_value, copy=copy
)
```

to mitigate string validation and only convert NaN to NA. There is some performance improvement, but still a lot slower than using …
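To get a feel for the difference between the two helpers, a rough micro-benchmark sketch (both are private pandas internals, used here only because the comment above references them; not a public API):

```python
import numpy as np
import pandas as pd
from pandas._libs import lib

scalars = np.array(["foo", "bar", np.nan] * 1_000_000, dtype=object)

# full element-wise validation plus NA conversion
%timeit lib.ensure_string_array(scalars, na_value=pd.NA, copy=True)

# only swaps NaN for pd.NA, no string validation
%timeit lib.convert_nans_to_NA(scalars)
```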
Not using the nullable string type could be a possible work-around, yes, but eventually we should simply fix this, so it shouldn't matter whether you use the nullable string type or not.
Yeah, so there are still some other places with overhead. It starts with … inside pandas/pandas/core/arrays/string_.py, line 201 in fc68a9a,
Now, as you have seen, only updating this one place still leaves a lot of the overhead, and that is because we do yet another (unnecessary) validation at the end of pandas/pandas/core/arrays/string_.py, lines 204 to 208 in fc68a9a.
This is generally needed because the pyarrow StringArray could consist of multiple chunks. However, we should 1) avoid this concat step if … If you would be interested in doing a PR, I can certainly give a helping hand.
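For illustration, the multi-chunk situation that concat step is handling (a small standalone sketch of the relevant pyarrow behaviour):

```python
import pyarrow as pa

# a string column read from a parquet file can come back as multiple chunks
chunked = pa.chunked_array([["a", "b"], ["c", None]])
print(chunked.num_chunks)  # 2

# pandas flattens the chunks into one contiguous array, e.g. via:
single = pa.concat_arrays(chunked.chunks)
print(len(single), single.null_count)  # 4 1
```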
I'd be interested in doing a PR. As mentioned above, I already tried replacing `lib.ensure_string_array` with `lib.convert_nans_to_NA`. I wasn't aware that … (see pandas/pandas/core/arrays/_mixins.py, lines 214 to 227 in fc68a9a).
But here I don't see any additional validation happening. Avoiding concatenation when … (pandas/pandas/core/arrays/numeric.py, lines 106 to 114 in fc68a9a).
Edit: OK, I figured out where it goes wrong. Indeed, the code at pandas/pandas/core/arrays/numpy_.py, lines 112 to 113 in fc68a9a, results in calling the StringArray constructor again, which validates the data unnecessarily.
In general, this seems very redundant - we create … The crux here lies in bypassing the constructor, or at least the validation happening there, in …
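To see what that redundant validation costs in isolation, a sketch using the same private internals (this is the check StringArray's constructor performs, not a proposed fix):

```python
import numpy as np
import pandas as pd
from pandas._libs import lib

values = np.array(["foo", "bar", pd.NA] * 1_000_000, dtype=object)

# StringArray's constructor runs a full scan equivalent to this check;
# doing it again on already-validated data is the redundancy described above
%timeit lib.is_string_array(values, skipna=True)
```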
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Generate data

The effect becomes visible for large `n`, i.e. > 1,000,000.
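The original snippet was stripped from this page; below is a minimal sketch of data matching the description (assumed: two nullable integer columns and two string columns with missing values, so that `use_nullable_dtypes=False` later yields [float64, float64, object, object]):

```python
import numpy as np
import pandas as pd

n = 3_000_000  # any n > 1_000_000 makes the effect visible
rng = np.random.default_rng(0)

# integer data with ~10% missing values
ints = rng.integers(0, 100, n).astype(float)
ints[rng.random(n) < 0.1] = np.nan

df = pd.DataFrame(
    {
        "a": pd.array(ints, dtype="Int64"),
        "b": pd.array(ints, dtype="Int64"),
        "c": rng.choice(np.array(["foo", "bar", None], dtype=object), n),
        "d": rng.choice(np.array(["spam", "eggs", None], dtype=object), n),
    }
)
```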
Write it out with `pyarrow`:
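A sketch of the write step (pyarrow is pandas' default parquet engine when installed):

```python
df.to_parquet("test.parquet")
```

Speed test

The timing calls behind the results below were presumably of this form:

```python
%timeit pd.read_parquet("test.parquet", use_nullable_dtypes=True)
%timeit pd.read_parquet("test.parquet", use_nullable_dtypes=False)
```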
use_nullable_dtypes=True:  997 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
use_nullable_dtypes=False: 632 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that `use_nullable_dtypes=False` returns a dataframe with the wrong types, i.e. [float64, float64, object, object]. I would expect the version with True to run faster - after all, it doesn't have to cast any types. The opposite is the case.
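For comparison, checking the resulting dtypes (illustrative; the nullable result assumes the integer/string data sketched above):

```python
df_false = pd.read_parquet("test.parquet", use_nullable_dtypes=False)
print(df_false.dtypes.tolist())  # [float64, float64, object, object] - the "wrong" types

df_true = pd.read_parquet("test.parquet", use_nullable_dtypes=True)
print(df_true.dtypes.tolist())   # [Int64, Int64, string, string] - correct, but slower to build
```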
Installed Versions
INSTALLED VERSIONS
commit : 4bfe3d0
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-9-amd64
Version : #1 SMP Debian 5.10.70-1 (2021-09-30)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : 0.8.1
fsspec : 2022.5.0
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Prior Performance
No response