PERF: using use_nullable_dtypes=True in read_parquet slows performance on large dataframes #47345

Closed
3 tasks done
timlod opened this issue Jun 14, 2022 · 5 comments · Fixed by #47781
Labels: IO Parquet · NA - MaskedArrays · Performance

Comments

@timlod (Contributor) commented Jun 14, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Generate data

The effect becomes visible for large n, i.e. n > 1,000,000.

import numpy as np
import pandas as pd

n = 10000000
df = pd.DataFrame(
    {
        "a": np.random.choice([0, 1, None], n),
        "b": np.random.choice([0, 10, 90129, None], n),
        "c": np.random.choice([True, False, None], n),
        "d": np.random.choice(["a", "b", None], n),
    },
    dtype=object,
)
df["a"] = df["a"].astype("Int8")
df = df.convert_dtypes()
# To check that types set correctly
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype  
---  ------  -----  
 0   a       Int8   
 1   b       Int64  
 2   c       boolean
 3   d       string 
dtypes: Int64(1), Int8(1), boolean(1), string(1)
memory usage: 200.3 MB

Write it out with pyarrow:

import pyarrow as pa
from pyarrow import parquet as pq

table = pa.Table.from_pandas(df)
# Remove metadata to prevent automatic use of nullable types
table = table.replace_schema_metadata()
pq.write_table(table, "test.parquet")
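
As a quick sanity check, the written file should then carry no pandas metadata, so nothing can trigger an automatic round-trip back to nullable dtypes (a small sketch, assuming pyarrow's schema.pandas_metadata accessor):

# With the schema metadata stripped before writing, the file has no
# pandas-specific metadata left; dtype choice on the read side is now
# entirely up to use_nullable_dtypes.
print(pq.read_table("test.parquet").schema.pandas_metadata)  # expected: None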

Speed test

%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=True)

997 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=False)

632 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that use_nullable_dtypes=False returns a DataFrame with the wrong dtypes, i.e. [float64, float64, object, object].
I would expect the version with use_nullable_dtypes=True to run faster, since it doesn't have to cast any types, but the opposite is the case.
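
To make the dtype difference concrete (expected output sketched below as comments; the exact fallback types depend on the pyarrow version):

pd.read_parquet("test.parquet", use_nullable_dtypes=True).dtypes
# a       Int8
# b      Int64
# c    boolean
# d     string

pd.read_parquet("test.parquet", use_nullable_dtypes=False).dtypes
# a    float64
# b    float64
# c     object
# d     object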

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-9-amd64
Version : #1 SMP Debian 5.10.70-1 (2021-09-30)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : 0.8.1
fsspec : 2022.5.0
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response

@timlod added the Needs Triage and Performance labels on Jun 14, 2022
@phofl added the IO Parquet and NA - MaskedArrays labels on Jun 14, 2022
@jorisvandenbossche removed the Needs Triage label on Jun 22, 2022
@jorisvandenbossche (Member) commented

@timlod thanks for the report!

A quick profile of both read_parquet calls using your reproducer suggests that it is mostly the string column that is causing the slowdown.
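
For instance, something along these lines (a sketch of that profiling step; the exact frame names depend on the pandas version):

import cProfile

# Profile the nullable-dtypes read; the string conversion
# (StringDtype.__from_arrow__ and what it calls) shows up high in cumtime.
cProfile.run('pd.read_parquet("test.parquet", use_nullable_dtypes=True)', sort="cumtime")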

What is happening under the hood for that column:

In [10]: arr = pa.array(np.random.choice(["a", "b", None], 10000000))

In [12]: pd.StringDtype().__from_arrow__(arr)
Out[12]: 
<StringArray>
[ 'a', <NA>, <NA>,  'a',  'a',  'a',  'a',  'a',  'b',  'a',
 ...
  'b',  'a',  'b',  'a',  'a', <NA>, <NA>,  'a',  'b', <NA>]
Length: 10000000, dtype: string

In [13]: %timeit pd.StringDtype().__from_arrow__(arr)
556 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit arr.to_pandas()
174 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This StringDtype.__from_arrow__ has a lot of overhead compared to the conversion to an object array done by pyarrow itself (the arr.to_pandas()). It should certainly be possible to optimize __from_arrow__ in this case: it seems to validate each value, as we need to do for a generic object-dtype array, whereas for a pyarrow string array we already know that it only contains strings.

@timlod (Contributor, Author) commented Jun 23, 2022

Interesting - I hadn't thought of checking which column caused it; I guess I naively assumed it was the 'wrapping' of the NA itself that caused the overhead.
Do you think this would warrant not using the nullable string type here, or would there be room to allow choosing which nullable types to use (perhaps something like the arguments in df.convert_dtypes)?
I hadn't used StringDtype before, because object with None works fine.

If it's relatively straightforward (i.e. Python code only) I'd be happy to take a look at optimizing this. Would it be possible to just call .to_pandas in this case, if arrow already handles these better?

@timlod (Contributor, Author) commented Jun 23, 2022

To bypass the StringDtype, I just use pyarrow directly now with a types_mapper:

types_mapper = {
    pa.int64(): pd.Int64Dtype(),
    pa.int16(): pd.Int8Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    # add more types here, but leave out pa.string()
}.get
df = pq.read_table("test.parquet").to_pandas(types_mapper=types_mapper)

Asking pyarrow to convert to StringDtype incurs the same performance hit we see in pandas. I can see that pandas essentially does what I'm doing here, just with the StringDtype conversion included (so my question about just calling to_pandas() can be answered with no).
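
For reference, this is roughly the mapping the pyarrow engine builds internally when use_nullable_dtypes=True (a paraphrase, abbreviated; the exact set of covered types depends on the pandas version):

# Approximate shape of the types_mapper that pandas passes to Table.to_pandas().
internal_mapping = {
    pa.int8(): pd.Int8Dtype(),
    pa.int16(): pd.Int16Dtype(),
    pa.int32(): pd.Int32Dtype(),
    pa.int64(): pd.Int64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.string(): pd.StringDtype(),  # the expensive conversion in this issue
}
to_pandas_kwargs = {"types_mapper": internal_mapping.get}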

I did one quick check and used

result = lib.convert_nans_to_NA(scalars)

in StringArray._from_sequence instead of

result = lib.ensure_string_array(
    scalars, na_value=StringDtype.na_value, copy=copy
)

to mitigate string validation and only convert None to NA.
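
To compare just those two calls in isolation, I timed something like this (a rough sketch against internal pandas APIs; convert_nans_to_NA only exists on recent main and signatures may change):

import numpy as np
import pandas as pd
from pandas._libs import lib

# object ndarray of strings and None, mimicking what __from_arrow__ receives
scalars = np.random.choice(["a", "b", None], 10_000_000)

# per-element string validation plus NA conversion
%timeit lib.ensure_string_array(scalars, na_value=pd.NA, copy=True)
# only the None/NaN -> pd.NA pass
%timeit lib.convert_nans_to_NA(scalars)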

There is some performance improvement, but still a lot slower than using object strings.
I'm not sure how I could optimize this further without knowing more about the internals here.

@jorisvandenbossche (Member) commented

Do you think this would warrant not using the nullable string type here, or would there be room to allow choosing which nullable types to use (perhaps something like the arguments in df.convert_dtypes)?

Not using the nullable string type could be a possible workaround, yes, but eventually we should simply fix this, so it shouldn't matter whether you use the nullable string type or not.
Personally I am a bit wary of adding extra arguments to choose which nullable dtypes you want and which not, as that becomes quite a lot of arguments as the number of nullable dtypes keeps growing (and ones that we would need to add to several read_... functions and other IO functions).

I did one quick check and used lib.convert_nans_to_NA in StringArray._from_sequence instead of lib.ensure_string_array to mitigate string validation and only convert None to NA. There is some performance improvement, but still a lot slower than using object strings.

Yeah, so there are still some other places with overhead. It starts inside __from_arrow__, where we call _from_sequence:

str_arr = StringArray._from_sequence(np.array(arr))

and _from_sequence then calls lib.ensure_string_array, which adds overhead.
Ideally we should have a StringArray private constructor that avoids all of this overhead (or only does the None -> pd.NA conversion with convert_nans_to_NA).
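
For illustration, such a fast path could look roughly like this (only a sketch built on internal helpers - NDArrayBacked.__init__ is the same trick _from_sequence uses to skip the validating constructor, and convert_nans_to_NA only exists on recent main):

import pandas as pd
from pandas._libs import lib
from pandas._libs.arrays import NDArrayBacked
from pandas.arrays import StringArray


def string_array_from_arrow_chunk(chunk) -> StringArray:
    # pyarrow already guarantees the chunk holds only strings/nulls, so skip
    # per-element validation and only convert None/NaN to pd.NA.
    values = lib.convert_nans_to_NA(chunk.to_numpy(zero_copy_only=False))
    result = StringArray.__new__(StringArray)
    NDArrayBacked.__init__(result, values, pd.StringDtype())
    return result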

Now, as you have seen, only updating this one place still leaves a lot of the overhead, and that is because we do yet another (unnecessary) validation at the end of __from_arrow__ when concatenating the resulting string arrays:

if results:
    return StringArray._concat_same_type(results)
else:
    return StringArray(np.array([], dtype="object"))

This is generally needed because the pyarrow StringArray could consist of multiple chunks. However, we should 1) avoid this concat step if len(results) == 1, and 2) in general there should be no need for validation when concatenating (as all the original arrays being concatenated are already string arrays).

If you would be interested in doing a PR, I can certainly give a helping hand.

@timlod (Contributor, Author) commented Jul 18, 2022

I'd be interested in doing a PR.

As mentioned above, I already tried replacing lib.ensure_string_array with convert_nans_to_NA (by the way, your reply sounds like this may not be necessary?).

I wasn't aware that ._concat_same_type might do another round of validation - in fact, I'm not sure of the exact source for StringArray here. My best guess would be that it's this:

@classmethod
@doc(ExtensionArray._concat_same_type)
def _concat_same_type(
    cls: type[NDArrayBackedExtensionArrayT],
    to_concat: Sequence[NDArrayBackedExtensionArrayT],
    axis: int = 0,
) -> NDArrayBackedExtensionArrayT:
    dtypes = {str(x.dtype) for x in to_concat}
    if len(dtypes) != 1:
        raise ValueError("to_concat must have the same dtype (tz)", dtypes)
    new_values = [x._ndarray for x in to_concat]
    new_arr = np.concatenate(new_values, axis=axis)
    return to_concat[0]._from_backing_data(new_arr)

But here I don't see any additional validation happening.

Avoiding concatenation when len(results) == 1 is straightforward enough, like here:

if not results:
    return array_class(
        np.array([], dtype=self.numpy_dtype), np.array([], dtype=np.bool_)
    )
elif len(results) == 1:
    # avoid additional copy in _concat_same_type
    return results[0]
else:
    return array_class._concat_same_type(results)

Edit:

OK, I figured out where it goes wrong. Indeed the code from _mixins.py is applied. The issue is that

def _from_backing_data(self, arr: np.ndarray) -> PandasArray:
    return type(self)(arr)

results in calling the StringArray constructor again, which validates the data unnecessarily.

In general, this seems very redundant - we create StringArrays for each chunk of data, concatenate their ndarray storage, and create a StringArray again.

The crux lies in bypassing the constructor, or at least the validation happening there, in _from_backing_data.
I'll start thinking about/implementing a fix - if you already have a suggestion, let me know!
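
One possible shape for that bypass, just as a sketch (re-wrapping the concatenated ndarray without going through the validating StringArray.__init__; not necessarily what the final fix should look like):

import numpy as np
from pandas._libs.arrays import NDArrayBacked


# Hypothetical override intended to live on StringArray.
def _from_backing_data(self, arr: np.ndarray) -> "StringArray":
    new_string_array = type(self).__new__(type(self))
    NDArrayBacked.__init__(new_string_array, arr, self.dtype)
    return new_string_array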
