Series construction: _try_cast typo, function often taking slow route? #28145

Closed
machow opened this issue Aug 26, 2019 · 6 comments · Fixed by #41597
Labels
Performance Memory or execution speed performance Series Series data structure
Comments

@machow
machow commented Aug 26, 2019

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
from pandas.core.internals.construction import _try_cast

arr = np.arange(0, 10, dtype="int64")

# should return early
_try_cast(arr, None, False, False)

Problem description

During Series construction, sanitize_array calls _try_cast to cast the input to a better type. _try_cast is fairly slow, so it tries to return early in common cases. However, apparently because of a missing "not" keyword, _try_cast runs in full for the very cases it is meant to skip (like the one above).

Here are the relevant lines of _try_cast:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/construction.py#L511-L513

Should this be not maybe_castable(arr)?
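
To make the question concrete, here is a minimal, dependency-free model of the guard under discussion. The names maybe_castable_sketch, takes_fast_path and POSSIBLY_CAST are hypothetical stand-ins, and the contents of the set are an assumption based on the behaviour described in this issue, not a copy of the pandas source. Dtype names are plain strings so the sketch runs without numpy.

```python
# Hypothetical model of the early-return guard; not the pandas source.
# Assumption: the helper reports integer and object dtypes as "not castable".
POSSIBLY_CAST = {
    "object", "int8", "uint8", "int16", "uint16",
    "int32", "uint32", "int64", "uint64",
}


def maybe_castable_sketch(dtype_name: str) -> bool:
    """Return False for dtypes this model sends down the slow path."""
    return dtype_name not in POSSIBLY_CAST


def takes_fast_path(dtype_name: str, dtype=None, copy: bool = False) -> bool:
    """Mirror the shape of the guard: castable, no copy, no requested dtype."""
    return maybe_castable_sketch(dtype_name) and not copy and dtype is None


if __name__ == "__main__":
    for name in ("int64", "float64", "object"):
        print(name, takes_fast_path(name))
```

Under these assumptions an int64 input with dtype=None does not take the early return, which matches what the profiler shows.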

It is very surprising that lines like this would intentionally create an array and then try to cast it, even when the dtype option passed is None.

https://github.com/pandas-dev/pandas/blob/master/pandas/core/construction.py#L429-L432

Expected Output

_try_cast should not run during sanitize_array for common types (e.g. int64). However, profiling with %%prun and stepping through with pdb shows that functions like maybe_cast_to_datetime are called.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.16.1
pytz : 2018.9
dateutil : 2.8.0
pip : 19.1.1
setuptools : 39.0.1
Cython : None
pytest : 4.4.2
hypothesis : None
sphinx : 2.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.4
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@WillAyd
Member

WillAyd commented Aug 26, 2019

Is there something from a user-facing perspective that you see impacted by this? The linked code you have just points to the range section which may or may not be what your question is limited to - would be helpful to clarify that

@machow
Author

machow commented Aug 26, 2019

Hey @WillAyd, thanks for your prompt response! It appears that _try_cast accounts for about 30% of Series construction time, so if this is a bug, fixing it could speed construction up considerably.

%%snakeviz
for ii in range(50000):
    arr = np.array([1,2,3], dtype = "int64")
    ser = pd.Series(arr)

[Image: snakeviz output for the loop above]
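
For anyone without snakeviz, a figure like the 30% above can be approximated with the standard library alone. The helper below is a sketch, and cumulative_share is a hypothetical name: it profiles a callable with cProfile and divides the cumulative time of every profiled function with a given name by the total profiled time.

```python
import cProfile
import pstats


def cumulative_share(func, name: str) -> float:
    """Profile func() and return the share of total time spent in `name`.

    The share is the cumulative time of every profiled function whose
    name equals `name`, divided by the total profiled time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    stats = pstats.Stats(profiler)
    total = stats.total_tt
    if total == 0:
        return 0.0
    # stats.stats maps (file, line, funcname) -> (cc, nc, tt, ct, callers)
    cumulative = sum(
        entry[3] for key, entry in stats.stats.items() if key[2] == name
    )
    return cumulative / total
```

With pandas and numpy imported, a call along the lines of cumulative_share(lambda: [pd.Series(arr) for _ in range(50000)], "_try_cast") should give a comparable number, assuming the installed pandas version still has a function with that name. Profiler overhead inflates small functions, so treat the result as a rough estimate.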

I would say there are 2 other related cases that would benefit:

  • Series constructed during simple operations, like ser + 1.
  • Internal Series construction. For example, in some_grouped_df.some_col.transform

(If this isn't a bug, my apologies! :)

@WillAyd
Member

WillAyd commented Aug 26, 2019

Cool, thanks. I'm personally not super familiar with this part of the code so can't say for certain, but it seems reasonable. You can try it and run the test suite to see whether anything breaks and, if not, submit a PR

Also would be worth confirming with an ASV. I only see one for a datetime constructor in asv_bench/benchmarks/series_methods.py which might not be applicable, but could add one if not covered

@WillAyd WillAyd added Performance Memory or execution speed performance Series Series data structure labels Aug 26, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Aug 26, 2019
@machow
Author

machow commented Aug 26, 2019

Okay, thanks for the quick feedback--I'll try adding a not and check whether any tests break. As far as I can tell, an "int64" array will never hit the early return when the dtype arg to _try_cast is None.

On the other hand, it seems like this behavior is a couple years old, so I wouldn't be surprised if I'm missing something.

@machow
Author

machow commented Aug 26, 2019

Alright, I'm having a bit of trouble running the full suite of tests (stacktrace below).

(master)*$ pytest pandas --cov=pandas -r sxX --strict
============================================================= test session starts ==============================================================
platform darwin -- Python 3.6.7, pytest-5.1.1, py-1.8.0, pluggy-0.12.0
hypothesis profile 'ci' -> deadline=timedelta(milliseconds=500), suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase('/Users/machow/repo/pandas/.hypothesis/examples')
rootdir: /Users/machow/repo/pandas, inifile: setup.cfg, testpaths: pandas
plugins: xdist-1.29.0, forked-1.0.2, hypothesis-4.34.0, cov-2.7.1, mock-1.10.4
collecting 40023 items / 1 skipped / 40022 selected
Fatal Python error: Aborted

Current thread 0x00007fffabb85380 (most recent call first):
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/llvmlite/binding/ffi.py", line 114 in call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/llvmlite/binding/transforms.py", line 88 in _populate_module_pm
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/llvmlite/binding/transforms.py", line 95 in populate
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/codegen.py", line 664 in _module_pass_manager
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/codegen.py", line 631 in _init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/codegen.py", line 612 in init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/cpu.py", line 49 in init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/compiler_lock.py", line 32 in _acquire_compile_lock
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/base.py", line 250 in init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/registry.py", line 34 in _toplevel_target_context
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/utils.py", line 381 in get
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/registry.py", line 50 in target_context
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/dispatcher.py", line 576 in init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/decorators.py", line 177 in wrapper
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/decorators.py", line 161 in jit
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/decorators.py", line 224 in njit
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/typed/typeddict.py", line 18 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/typed/init.py", line 3 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/init.py", line 45 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/fastparquet/encoding.py", line 8 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "", line 219 in _call_with_frames_removed
File "", line 1023 in _handle_fromlist
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/fastparquet/core.py", line 9 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/fastparquet/init.py", line 8 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/repo/pandas/pandas/tests/io/test_parquet.py", line 30 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/assertion/rewrite.py", line 140 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/py/_path/local.py", line 701 in pyimport
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 501 in _importtestmodule
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 433 in _getobj
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 256 in obj
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 449 in _inject_setup_module_fixture
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 436 in collect
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/runner.py", line 247 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/runner.py", line 220 in from_call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/runner.py", line 247 in pytest_make_collect_report
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 81 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/runner.py", line 363 in collect_one_node
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 701 in genitems
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 704 in genitems
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 476 in _perform_collect
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 437 in perform_collect
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 244 in pytest_collection
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 81 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 234 in _main
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 191 in wrap_session
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 228 in pytest_cmdline_main
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 81 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/config/init.py", line 77 in main
File "/Users/machow/.virtualenvs/pandas/bin/pytest", line 11 in
Abort trap: 6

However, I was able to run a subset of them:

pytest pandas/tests/{groupby,series,dtypes,internals,tseries} --cov=pandas -r sxX --strict

There were 7 failures, all related to the change stopping _try_cast from converting numpy arrays with dtype "O". It makes sense to try to cast those, and from looking through the code, it appears to be the only case in which any casting happens.

subarr = maybe_cast_to_datetime(arr, dtype)

# we only care about object dtypes
if not is_object_dtype(v):
    return value
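
The failures can be reasoned about with a small model. The sketch below assumes the helper returns False for object and integer dtypes (an assumption drawn from this thread; SLOW_PATH_DTYPES is deliberately incomplete and every name is hypothetical) and compares the existing guard with the negated one, for dtype=None and copy=False.

```python
# Hypothetical model; dtype names stand in for real arrays.
# Assumption: the helper returns False for these dtypes (illustrative list).
SLOW_PATH_DTYPES = {"object", "int64"}


def maybe_castable_sketch(dtype_name: str) -> bool:
    return dtype_name not in SLOW_PATH_DTYPES


def returns_early(dtype_name: str, negate: bool) -> bool:
    """Evaluate the guard for dtype=None and copy=False."""
    castable = maybe_castable_sketch(dtype_name)
    return (not castable) if negate else castable


def compare(dtype_names):
    """Map each dtype name to (current guard, negated guard)."""
    return {
        name: (returns_early(name, False), returns_early(name, True))
        for name in dtype_names
    }


if __name__ == "__main__":
    results = compare(["int64", "object", "float64"])
    for name, (current, negated) in results.items():
        print(f"{name:8s} current={current!s:5s} negated={negated}")
```

In this model, negating the check lets int64 return early but also makes object arrays return early, so they are never converted, while float64 loses its fast path. That suggests a plain "not" is too blunt, and that any fix would need to separate object dtypes from the integer ones.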

I'll take another pass at the tests, hopefully in the next couple days!

@jbrockmendel
Member

Changing if maybe_castable to if not maybe_castable in construction._try_cast breaks 140 tests for me. Based on skimming it looks like datetime64TZ dtypes are involved in most of the failures.

jbrockmendel added a commit to jbrockmendel/pandas that referenced this issue May 20, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 21, 2021