Series construction: _try_cast typo, function often taking slow route? #28145

machow · 2019-08-26T01:20:22Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
from pandas.core.internals.construction import _try_cast

arr = np.arange(0, 10, dtype = "int64")

# should return early
_try_cast(arr, None, False, False)

Problem description

During Series construction, sanitize_array uses _try_cast to cast the input to a better type. _try_cast is fairly slow to run, so it tries to return early without casting in common cases. However, apparently due to a missing not keyword, _try_cast takes the slow route in exactly the cases it wants to avoid (like the one above).

Here are the relevant lines of _try_cast:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/construction.py#L511-L513

Should this be not maybe_castable(arr)?
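
To make the question concrete, here is a minimal sketch of the two guard variants. It is not the pandas source: the overall guard shape (`... and not copy and dtype is None`) is an assumption, and `castable_stub` is a hypothetical stand-in for `maybe_castable` whose result for int64 is chosen so that the sketch reproduces the slow-route behaviour reported above.

```python
import numpy as np


def castable_stub(arr):
    # Hypothetical stand-in for maybe_castable, NOT the pandas implementation.
    # It returns False for integer and object arrays so that the int64 case
    # matches the behaviour reported in this issue.
    return arr.dtype.kind not in "iuO"


def returns_early_as_written(arr, dtype, copy):
    # Assumed shape of the existing guard.
    return castable_stub(arr) and not copy and dtype is None


def returns_early_with_not(arr, dtype, copy):
    # The variant proposed in this issue.
    return (not castable_stub(arr)) and not copy and dtype is None


arr = np.arange(0, 10, dtype="int64")
print(returns_early_as_written(arr, None, False))  # False
print(returns_early_with_not(arr, None, False))    # True
```

Whatever the real predicate returns, adding `not` flips the outcome for every input, so any array that currently returns early would stop doing so. That is worth keeping in mind when running the tests.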

It is very surprising that lines like this would intentionally create an array and then try to cast it, even when the dtype option passed is None.

https://github.com/pandas-dev/pandas/blob/master/pandas/core/construction.py#L429-L432

Expected Output

_try_cast should not run during sanitize_array for common types (e.g. int64). However, from profiling with %%prun and stepping through with pdb, I can see that functions like maybe_cast_to_datetime are called.
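
Outside a notebook, the same kind of check can be done with the standard-library profiler. The sketch below is a generic helper, not something from pandas; `called_function_names`, `inner_step`, and `outer_step` are hypothetical names used for illustration.

```python
import cProfile
import pstats


def called_function_names(fn):
    """Run fn() under cProfile and return the names of all functions called."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn()
    finally:
        profiler.disable()
    stats = pstats.Stats(profiler)
    # Keys of stats.stats are (filename, line number, function name) tuples.
    return {name for (_filename, _lineno, name) in stats.stats}


def inner_step():
    return sum(range(100))


def outer_step():
    return inner_step()


names = called_function_names(outer_step)
print("inner_step" in names)
```

For this issue one would pass something like `lambda: pd.Series(arr)` and check whether `"maybe_cast_to_datetime"` appears in the returned set.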

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.16.1
pytz : 2018.9
dateutil : 2.8.0
pip : 19.1.1
setuptools : 39.0.1
Cython : None
pytest : 4.4.2
hypothesis : None
sphinx : 2.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.4
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


WillAyd · 2019-08-26T01:34:33Z

Is there something from a user-facing perspective that you see impacted by this? The linked code you have just points to the range section which may or may not be what your question is limited to - would be helpful to clarify that

machow · 2019-08-26T01:45:00Z

Hey @WillAyd thanks for your prompt response! It appears that _try_cast accounts for about 30% of Series construction time, so if this is a bug, fixing it would speed up construction considerably.

%%snakeviz
for ii in range(50000):
    arr = np.array([1,2,3], dtype = "int64")
    ser = pd.Series(arr)

I would say there are 2 other related cases that will benefit:

1. Series constructed during simple operations, like ser + 1.
2. Internal Series construction. For example, in some_grouped_df.some_col.transform

(If this isn't a bug, my apologies! :)

WillAyd · 2019-08-26T02:09:16Z

Cool thanks. I'm personally not super familiar with this part of the code so can't say for certain, but it seems reasonable. You can try it and run the test suite to see if anything breaks and, if not, submit a PR

Also would be worth confirming with an ASV. I only see one for a datetime constructor in asv_bench/benchmarks/series_methods.py which might not be applicable, but could add one if not covered
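
A benchmark along these lines might look like the sketch below. The class and method names are hypothetical and not taken from asv_bench; the sketch only relies on the asv convention that `setup` runs before each `time_*` method.

```python
import numpy as np
import pandas as pd


class SeriesConstructorInt64:
    # Hypothetical benchmark class; the name is illustrative only.

    def setup(self):
        # Small array, so per-call constructor overhead dominates the timing.
        self.arr = np.array([1, 2, 3], dtype="int64")

    def time_constructor_no_dtype(self):
        # dtype is left as None, which is the case discussed in this issue.
        pd.Series(self.arr)
```

A tiny input is used on purpose: the concern here is fixed overhead inside the constructor, which a large array would mask.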

machow · 2019-08-26T02:24:17Z

Okay, thanks for the quick feedback--I'll try adding a not and check whether any tests break. As far as I can tell, an "int64" array will never make it through the early return when the dtype arg to _try_cast is None.

On the other hand, it seems like this behavior is a couple years old, so I wouldn't be surprised if I'm missing something.

machow · 2019-08-26T03:44:13Z

Alright, I'm having a bit of trouble running the full suite of tests (stacktrace below)

(master)*$ pytest pandas --cov=pandas -r sxX --strict
============================================================= test session starts ==============================================================
platform darwin -- Python 3.6.7, pytest-5.1.1, py-1.8.0, pluggy-0.12.0
hypothesis profile 'ci' -> deadline=timedelta(milliseconds=500), suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase('/Users/machow/repo/pandas/.hypothesis/examples')
rootdir: /Users/machow/repo/pandas, inifile: setup.cfg, testpaths: pandas
plugins: xdist-1.29.0, forked-1.0.2, hypothesis-4.34.0, cov-2.7.1, mock-1.10.4
collecting 40023 items / 1 skipped / 40022 selected
Fatal Python error: Aborted

Current thread 0x00007fffabb85380 (most recent call first):
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/llvmlite/binding/ffi.py", line 114 in call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/llvmlite/binding/transforms.py", line 88 in _populate_module_pm
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/llvmlite/binding/transforms.py", line 95 in populate
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/codegen.py", line 664 in _module_pass_manager
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/codegen.py", line 631 in _init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/codegen.py", line 612 in init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/cpu.py", line 49 in init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/compiler_lock.py", line 32 in _acquire_compile_lock
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/base.py", line 250 in init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/registry.py", line 34 in _toplevel_target_context
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/utils.py", line 381 in get
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/targets/registry.py", line 50 in target_context
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/dispatcher.py", line 576 in init
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/decorators.py", line 177 in wrapper
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/decorators.py", line 161 in jit
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/decorators.py", line 224 in njit
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/typed/typeddict.py", line 18 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/typed/init.py", line 3 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/numba/init.py", line 45 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/fastparquet/encoding.py", line 8 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "", line 219 in _call_with_frames_removed
File "", line 1023 in _handle_fromlist
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/fastparquet/core.py", line 9 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/fastparquet/init.py", line 8 in
File "", line 219 in _call_with_frames_removed
File "", line 678 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/repo/pandas/pandas/tests/io/test_parquet.py", line 30 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/assertion/rewrite.py", line 140 in exec_module
File "", line 665 in _load_unlocked
File "", line 955 in _find_and_load_unlocked
File "", line 971 in _find_and_load
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/py/_path/local.py", line 701 in pyimport
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 501 in _importtestmodule
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 433 in _getobj
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 256 in obj
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 449 in _inject_setup_module_fixture
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/python.py", line 436 in collect
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/runner.py", line 247 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/runner.py", line 220 in from_call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/runner.py", line 247 in pytest_make_collect_report
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 81 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/runner.py", line 363 in collect_one_node
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 701 in genitems
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 704 in genitems
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 476 in _perform_collect
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 437 in perform_collect
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 244 in pytest_collection
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 81 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 234 in _main
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 191 in wrap_session
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/main.py", line 228 in pytest_cmdline_main
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 81 in
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/manager.py", line 87 in _hookexec
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/pluggy/hooks.py", line 289 in call
File "/Users/machow/.virtualenvs/pandas/lib/python3.6/site-packages/_pytest/config/init.py", line 77 in main
File "/Users/machow/.virtualenvs/pandas/bin/pytest", line 11 in
Abort trap: 6

However, I was able to run a subset of them:

pytest pandas/tests/{groupby,series,dtypes,internals,tseries} --cov=pandas -r sxX --strict

There were 7 failures, all related to the change stopping _try_cast from converting numpy arrays with dtype "O". It makes sense to try to cast those, and looking through the code, it appears to be the only case in which any casting happens.

pandas/pandas/core/construction.py

Line 521 in 5d9fd7e

subarr = maybe_cast_to_datetime(arr, dtype)

pandas/pandas/core/dtypes/cast.py

Lines 932 to 934 in 5d9fd7e

    # we only care about object dtypes
    if not is_object_dtype(v):
        return value
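
The check quoted above can be tried in isolation with the public `pandas.api.types.is_object_dtype`. This is only a sketch of which arrays pass the object-dtype test; it does not call `maybe_cast_to_datetime` itself, and the example arrays are made up for illustration.

```python
import numpy as np
from pandas.api.types import is_object_dtype

int_arr = np.arange(3, dtype="int64")
obj_arr = np.array(["2019-08-26", "2019-08-27"], dtype=object)

print(is_object_dtype(int_arr))  # False
print(is_object_dtype(obj_arr))  # True
```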

I'll take another pass at the tests, hopefully in the next couple days!

jbrockmendel · 2019-09-18T23:42:42Z

Changing if maybe_castable to if not maybe_castable in construction._try_cast breaks 140 tests for me. Based on skimming it looks like datetime64TZ dtypes are involved in most of the failures.

WillAyd added the Performance (Memory or execution speed performance) and Series (Series data structure) labels Aug 26, 2019

WillAyd added this to the Contributions Welcome milestone Aug 26, 2019

machow mentioned this issue Aug 30, 2019

Slow (and weird) empty dataframe creation #28188

Closed

jbrockmendel added a commit to jbrockmendel/pandas that referenced this issue May 20, 2021

REF: _try_cast; go through fastpath more often, closes pandas-dev#28145

e471c49

jbrockmendel mentioned this issue May 20, 2021

REF: _try_cast; go through fastpath more often #41597

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.3 May 21, 2021

jreback closed this as completed in #41597 May 22, 2021
