PERF: using use_nullable_dtypes=True in read_parquet slows performance on large dataframes #47345

Closed
3 tasks done
timlod opened this issue Jun 14, 2022 · 5 comments · Fixed by #47781
Labels: IO Parquet · NA - MaskedArrays · Performance

Comments

@timlod (Contributor) commented Jun 14, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Generate data

The effect becomes visible for large n, i.e. n > 1,000,000.

import numpy as np
import pandas as pd

n = 10000000
df = pd.DataFrame(
    {
        "a": np.random.choice([0, 1, None], n),
        "b": np.random.choice([0, 10, 90129, None], n),
        "c": np.random.choice([True, False, None], n),
        "d": np.random.choice(["a", "b", None], n),
    },
    dtype=object,
)
df["a"] = df["a"].astype("Int8")
df = df.convert_dtypes()
# To check that types set correctly
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype  
---  ------  -----  
 0   a       Int8   
 1   b       Int64  
 2   c       boolean
 3   d       string 
dtypes: Int64(1), Int8(1), boolean(1), string(1)
memory usage: 200.3 MB

Write it out with pyarrow:

import pyarrow as pa
from pyarrow import parquet as pq

table = pa.Table.from_pandas(df)
# Remove metadata to prevent automatic use of nullable types
table = table.replace_schema_metadata()
pq.write_table(table, "test.parquet")
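
As a quick sanity check, the written file should then carry no pandas metadata, so nothing can trigger an automatic round-trip back to nullable dtypes (a small sketch, assuming pyarrow's schema.pandas_metadata accessor):

# With the schema metadata stripped before writing, the file has no
# pandas-specific metadata left; dtype choice on the read side is now
# entirely up to use_nullable_dtypes.
print(pq.read_table("test.parquet").schema.pandas_metadata)  # expected: None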

Speed test

%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=True)

997 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=False)

632 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that use_nullable_dtypes=False returns a DataFrame with the wrong dtypes, i.e. [float64, float64, object, object].
I would expect the version with use_nullable_dtypes=True to run faster, since it doesn't have to cast any types, but the opposite is the case.
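
To make the dtype difference concrete (expected output sketched below as comments; the exact fallback types depend on the pyarrow version):

pd.read_parquet("test.parquet", use_nullable_dtypes=True).dtypes
# a       Int8
# b      Int64
# c    boolean
# d     string

pd.read_parquet("test.parquet", use_nullable_dtypes=False).dtypes
# a    float64
# b    float64
# c     object
# d     object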

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-9-amd64
Version : #1 SMP Debian 5.10.70-1 (2021-09-30)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : 0.8.1
fsspec : 2022.5.0
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response

@timlod added the Needs Triage and Performance labels on Jun 14, 2022
@phofl added the IO Parquet and NA - MaskedArrays labels on Jun 14, 2022
@jorisvandenbossche removed the Needs Triage label on Jun 22, 2022
@jorisvandenbossche (Member) commented

@timlod thanks for the report!

A quick profile of both read_parquet calls using your reproducer suggests that it is mostly the string column that is causing the slowdown.
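
For instance, something along these lines (a sketch of that profiling step; the exact frame names depend on the pandas version):

import cProfile

# Profile the nullable-dtypes read; the string conversion
# (StringDtype.__from_arrow__ and what it calls) shows up high in cumtime.
cProfile.run('pd.read_parquet("test.parquet", use_nullable_dtypes=True)', sort="cumtime")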

What is happening under the hood for that column:

In [10]: arr = pa.array(np.random.choice(["a", "b", None], 10000000))

In [12]: pd.StringDtype().__from_arrow__(arr)
Out[12]: 
<StringArray>
[ 'a', <NA>, <NA>,  'a',  'a',  'a',  'a',  'a',  'b',  'a',
 ...
  'b',  'a',  'b',  'a',  'a', <NA>, <NA>,  'a',  'b', <NA>]
Length: 10000000, dtype: string

In [13]: %timeit pd.StringDtype().__from_arrow__(arr)
556 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit arr.to_pandas()
174 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This StringDtype.__from_arrow__ has a lot of overhead compared to the conversion to an object array done by pyarrow itself (the arr.to_pandas()). It should certainly be possible to optimize __from_arrow__ in this case: it seems to validate each value, as we need to do for a generic object-dtype array, whereas for a pyarrow string array we already know that it only contains strings.

@timlod (Contributor, Author) commented Jun 23, 2022

Interesting - I hadn't thought of checking which column caused it; I guess I naively assumed it was the 'wrapping' of the NA itself that caused the overhead.
Do you think this would warrant not using the nullable string type here, or would there be room to allow choosing which nullable types to use (perhaps something like the arguments in df.convert_dtypes)?
I hadn't used StringDtype before, because object with None works fine.

If it's relatively straightforward (i.e. Python code only) I'd be happy to take a look at optimizing this. Would it be possible to just call .to_pandas in this case, if arrow already handles these better?

@timlod (Contributor, Author) commented Jun 23, 2022

To bypass the StringDtype, I just use pyarrow directly now with a types_mapper:

types_mapper = {
    pa.int64(): pd.Int64Dtype(),
    pa.int16(): pd.Int8Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    # add more types here, but leave out pa.string()
}.get
df = pq.read_table("test.parquet").to_pandas(types_mapper=types_mapper)

Asking pyarrow to convert to StringDtype incurs the same performance hit we see in pandas. I can see that pandas essentially does what I'm doing here, just with the StringDtype conversion included (so my question about just calling to_pandas() can be answered with no).
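
For reference, this is roughly the mapping the pyarrow engine builds internally when use_nullable_dtypes=True (a paraphrase, abbreviated; the exact set of covered types depends on the pandas version):

# Approximate shape of the types_mapper that pandas passes to Table.to_pandas().
internal_mapping = {
    pa.int8(): pd.Int8Dtype(),
    pa.int16(): pd.Int16Dtype(),
    pa.int32(): pd.Int32Dtype(),
    pa.int64(): pd.Int64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.string(): pd.StringDtype(),  # the expensive conversion in this issue
}
to_pandas_kwargs = {"types_mapper": internal_mapping.get}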

I did one quick check and used

result = lib.convert_nans_to_NA(scalars)

in StringArray._from_sequence instead of

result = lib.ensure_string_array(
    scalars, na_value=StringDtype.na_value, copy=copy
)

to mitigate string validation and only convert None to NA.
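
To compare just those two calls in isolation, I timed something like this (a rough sketch against internal pandas APIs; convert_nans_to_NA only exists on recent main and signatures may change):

import numpy as np
import pandas as pd
from pandas._libs import lib

# object ndarray of strings and None, mimicking what __from_arrow__ receives
scalars = np.random.choice(["a", "b", None], 10_000_000)

# per-element string validation plus NA conversion
%timeit lib.ensure_string_array(scalars, na_value=pd.NA, copy=True)
# only the None/NaN -> pd.NA pass
%timeit lib.convert_nans_to_NA(scalars)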

There is some performance improvement, but still a lot slower than using object strings.
I'm not sure how I could optimize this further without knowing more about the internals here.

@jorisvandenbossche (Member) commented

Do you think this would warrant not using the nullable string type here, or would there be room to allow choosing which nullable types to use (perhaps something like the arguments in df.convert_dtypes)?

Not using the nullable string type could be a possible workaround, yes, but eventually we should simply fix this, so it shouldn't matter whether you use the nullable string type or not.
Personally I am a bit wary of adding extra arguments to choose which nullable dtypes you want and which not, as that becomes quite a lot of arguments as the number of nullable dtypes keeps growing (and ones that we would need to add to several read_... functions and other IO functions).

I did one quick check and used lib.convert_nans_to_NA in StringArray._from_sequence instead of lib.ensure_string_array to mitigate string validation and only convert None to NA. There is some performance improvement, but still a lot slower than using object strings.

Yeah, so there are still some other places with overhead. It starts inside __from_arrow__, where we call _from_sequence:

str_arr = StringArray._from_sequence(np.array(arr))

and _from_sequence then calls lib.ensure_string_array, which adds overhead.
Ideally we should have a StringArray private constructor that avoids all of this overhead (or only does the None -> pd.NA conversion with convert_nans_to_NA).
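
For illustration, such a fast path could look roughly like this (only a sketch built on internal helpers - NDArrayBacked.__init__ is the same trick _from_sequence uses to skip the validating constructor, and convert_nans_to_NA only exists on recent main):

import pandas as pd
from pandas._libs import lib
from pandas._libs.arrays import NDArrayBacked
from pandas.arrays import StringArray


def string_array_from_arrow_chunk(chunk) -> StringArray:
    # pyarrow already guarantees the chunk holds only strings/nulls, so skip
    # per-element validation and only convert None/NaN to pd.NA.
    values = lib.convert_nans_to_NA(chunk.to_numpy(zero_copy_only=False))
    result = StringArray.__new__(StringArray)
    NDArrayBacked.__init__(result, values, pd.StringDtype())
    return result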

Now, as you have seen, only updating this one place still leaves a lot of the overhead, and that is because we do yet another (unnecessary) validation at the end of __from_arrow__ when concatenating the resulting string arrays:

if results:
    return StringArray._concat_same_type(results)
else:
    return StringArray(np.array([], dtype="object"))

This is generally needed because the pyarrow StringArray could consist of multiple chunks. However, we should 1) avoid this concat step if len(results) == 1, and 2) in general there should be no need for validation when concatenating (as all the original arrays being concatenated are already string arrays).

If you would be interested in doing a PR, I can certainly give a helping hand.

@timlod (Contributor, Author) commented Jul 18, 2022

I'd be interested in doing a PR.

As mentioned above, I already tried replacing lib.ensure_string_array with convert_nans_to_NA (by the way, your reply sounds like this may not be necessary?).

I wasn't aware that ._concat_same_type might do another round of validation - in fact, I'm not sure of the exact source for StringArray here. My best guess would be that it's this:

@classmethod
@doc(ExtensionArray._concat_same_type)
def _concat_same_type(
    cls: type[NDArrayBackedExtensionArrayT],
    to_concat: Sequence[NDArrayBackedExtensionArrayT],
    axis: int = 0,
) -> NDArrayBackedExtensionArrayT:
    dtypes = {str(x.dtype) for x in to_concat}
    if len(dtypes) != 1:
        raise ValueError("to_concat must have the same dtype (tz)", dtypes)
    new_values = [x._ndarray for x in to_concat]
    new_arr = np.concatenate(new_values, axis=axis)
    return to_concat[0]._from_backing_data(new_arr)

But here I don't see any additional validation happening.

Avoiding concatenation when len(results) == 1 is straightforward enough, like here:

if not results:
    return array_class(
        np.array([], dtype=self.numpy_dtype), np.array([], dtype=np.bool_)
    )
elif len(results) == 1:
    # avoid additional copy in _concat_same_type
    return results[0]
else:
    return array_class._concat_same_type(results)

Edit:

OK, I figured out where it goes wrong. Indeed the code from _mixins.py is applied. The issue is that

def _from_backing_data(self, arr: np.ndarray) -> PandasArray:
    return type(self)(arr)

results in calling the StringArray constructor again, which validates the data unnecessarily.

In general, this seems very redundant - we create StringArrays for each chunk of data, concatenate their ndarray storage, and create a StringArray again.

The crux lies in bypassing the constructor, or at least the validation happening there, in _from_backing_data.
I'll start thinking about/implementing a fix - if you already have a suggestion, let me know!
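
One possible shape for that bypass, just as a sketch (re-wrapping the concatenated ndarray without going through the validating StringArray.__init__; not necessarily what the final fix should look like):

import numpy as np
from pandas._libs.arrays import NDArrayBacked


# Hypothetical override intended to live on StringArray.
def _from_backing_data(self, arr: np.ndarray) -> "StringArray":
    new_string_array = type(self).__new__(type(self))
    NDArrayBacked.__init__(new_string_array, arr, self.dtype)
    return new_string_array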
