PERF: pd.concat is slower for all-nan object-dtyped columns than float equivalents #52702
Comments
Why is your column not float in the first place?
Are you suggesting that for every object dtype encountered by concat, we should check whether the values are all nan and provide a specific code path to optimize this one case? When a column is object dtype, it is storing Python objects. These are inefficient in both memory and CPU and should be avoided if possible. There isn't anything pandas can do about that.
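For illustration, a minimal sketch (array size assumed, not taken from the thread) of the memory overhead an object-dtyped all-NaN column carries compared with its float64 equivalent:

```python
import numpy as np
import pandas as pd

n = 100_000  # assumed size for illustration

s_obj = pd.Series([np.nan] * n, dtype=object)   # one Python object reference per element
s_flt = pd.Series(np.full(n, np.nan))           # contiguous float64 buffer, 8 bytes per element

print(s_obj.memory_usage(deep=True))  # counts per-element references plus the referenced objects
print(s_flt.memory_usage(deep=True))  # just the raw float64 buffer (plus index)
```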
It's a string column with missing data.
It seems suspicious that concat takes pathologically longer for all-nan object columns than for normal-string-filled object columns or nan-filled float columns. I understand that object columns are less efficient in general, the above notwithstanding.
If I'm reading this right, the object-all-nan case is 19.1 ms and the object-not-all-nan case is 18.2 ms. That just doesn't look that egregious to me. That said, you're not wrong: the current behavior with all-NA entries is pathological. #52613 deprecates it.
Generally speaking, if a dataframe/series is of object dtype, slower operations are to be expected. Also, I agree with @jbrockmendel: I don't think this is that bad, given that this is an object dtype.
Agreed with @jbrockmendel and @topper-123; closing. If you think we're missing something, comment here and we can reopen. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
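A minimal sketch of the kind of comparison described below (array size, column name, and number of concatenated frames are assumed):

```python
import timeit
import numpy as np
import pandas as pd

n = 1_000_000  # assumed size, large enough to make the difference visible

# All-NaN column stored as object dtype vs. the same values as float64,
# plus an object column actually holding strings for comparison.
df_obj_nan = pd.DataFrame({"a": pd.Series([np.nan] * n, dtype=object)})
df_flt_nan = pd.DataFrame({"a": pd.Series(np.full(n, np.nan), dtype=float)})
df_obj_str = pd.DataFrame({"a": pd.Series(["x"] * n, dtype=object)})

for label, df in [("object, all NaN", df_obj_nan),
                  ("float, all NaN", df_flt_nan),
                  ("object, strings", df_obj_str)]:
    t = timeit.timeit(lambda: pd.concat([df, df]), number=10)
    print(f"{label:>16}: {t:.3f} s for 10 concats")
```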
Issue Description
Calling `pd.concat` for very large datasets can be significantly slower for all-nan object-dtyped columns than if the same columns were float dtyped. The discrepancy in the provided example is mild compared to what I'm seeing for real datasets.
Expected Behavior
It should be possible to apply an optimization under the hood for all-nan columns, regardless of dtype.
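A possible user-side workaround, sketched under the assumption that the all-NaN object columns can safely become float64 (the helper name is hypothetical, not a pandas API):

```python
import pandas as pd

def downcast_all_nan_object_cols(df: pd.DataFrame) -> pd.DataFrame:
    """Cast object columns containing only missing values to float64 (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object and out[col].isna().all():
            out[col] = out[col].astype("float64")
    return out

# Usage: downcast before concatenating many frames, e.g.
# result = pd.concat([downcast_all_nan_object_cols(df) for df in frames])
```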
Installed Versions
Note that I am on an older version of pandas because my organization has restrictions.
INSTALLED VERSIONS
commit : 66e3805
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-348.23.1.el8_5.x86_64
Version : #1 SMP Tue Apr 12 11:20:32 EDT 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.23.3
pytz : 2022.4
dateutil : 2.8.2
pip : 22.2.2
setuptools : 65.4.1
Cython : 0.29.32
pytest : 7.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
fsspec : 2022.8.2
fastparquet : None
gcsfs : None
matplotlib : 3.6.0
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 9.0.0
pyxlsb : None
s3fs : None
scipy : 1.9.1
sqlalchemy : 1.4.41
tables : 3.6.1
tabulate : 0.9.0
xarray : 2022.9.0
xlrd : 2.0.1
xlwt : None
numba : 0.56.2