-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF first-non-null #50092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF first-non-null #50092
Conversation
@@ -434,7 +434,7 @@ def first_non_null(values: ndarray) -> int: | |||
if ( | |||
isinstance(val, str) | |||
and | |||
(len(val) == 0 or val in ("now", "today", *nat_strings)) | |||
(len(val) == 0 or val in nat_strings or val in ("now", "today")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this could be faster still as
(len(val) == 0 or val in nat_strings or val in ("now", "today")) | |
((not val) or (val in ("now", "today")) or (val in nat_strings)) |
(parens added for readability and obviousness) to avoid the len
global lookup + function call + comparison-with-zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion - I tried this but it didn't make a difference
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ git diff
diff --git a/pandas/_libs/tslib.pyx b/pandas/_libs/tslib.pyx
index 6f58fecd1a..61116b8f83 100644
--- a/pandas/_libs/tslib.pyx
+++ b/pandas/_libs/tslib.pyx
@@ -434,7 +434,7 @@ def first_non_null(values: ndarray) -> int:
if (
isinstance(val, str)
and
- (len(val) == 0 or val in nat_strings or val in ("now", "today"))
+ ((not val) or (val in ("now", "today")) or (val in nat_strings))
):
continue
return i
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . single-compile-c-extensions.sh
Compiling pandas/_libs/tslib.pyx because it changed.
[1/1] Cythonizing pandas/_libs/tslib.pyx
/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
warnings.warn(msg, _BetaConfiguration)
running build_ext
building 'pandas._libs.tslib' extension
/home/marcogorelli/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marcogorelli/mambaforge/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslib.c -o build/temp.linux-x86_64-cpython-38/pandas/_libs/tslib.o
/home/marcogorelli/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marcogorelli/mambaforge/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslibs/src/datetime/np_datetime.c -o build/temp.linux-x86_64-cpython-38/pandas/_libs/tslibs/src/datetime/np_datetime.o
/home/marcogorelli/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -L/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -L/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -L/home/marcogorelli/mambaforge/envs/pandas-dev/lib -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include build/temp.linux-x86_64-cpython-38/pandas/_libs/tslib.o build/temp.linux-x86_64-cpython-38/pandas/_libs/tslibs/src/datetime/np_datetime.o -o build/lib.linux-x86_64-cpython-38/pandas/_libs/tslib.cpython-38-x86_64-linux-gnu.so
Obtaining file:///home/marcogorelli/pandas-dev
Preparing metadata (setup.py) ... done
Requirement already satisfied: python-dateutil>=2.8.2 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (2022.6)
Requirement already satisfied: numpy>=1.20.3 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (1.23.5)
Requirement already satisfied: six>=1.5 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (1.16.0)
Installing collected packages: pandas
Attempting uninstall: pandas
Found existing installation: pandas 2.0.0.dev0+859.gecc8eaaf50
Uninstalling pandas-2.0.0.dev0+859.gecc8eaaf50:
Successfully uninstalled pandas-2.0.0.dev0+859.gecc8eaaf50
Running setup.py develop for pandas
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
db-dtypes 1.0.4 requires pandas<2.0dev,>=0.24.2, but you have pandas 2.0.0.dev0+859.gecc8eaaf50.dirty which is incompatible.
Successfully installed pandas-2.0.0.dev0+859.gecc8eaaf50.dirty
2.0.0.dev0+859.gecc8eaaf50.dirty
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . repl.sh
In [1]: from pandas._libs.tslib import first_non_null
In [2]: arr = np.array([np.nan, np.nan, np.nan, '2012-01-01'], dtype=object)
In [3]: %%timeit
...: first_non_null(arr)
...:
...:
76.4 ns ± 2.51 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, alright. Thanks for checking it out.
I guess Cython is smart enough to Do The Right Thing for len(x) == 0
if it knows x
is a str
(which it must be after that isinstance
)..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with not val
cython uses PyObject_IsTrue
so it isn't doing any string-based optimization there. either way i think its just dominated by the tuple unpacking/construction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM nice catch
As intuitied by @jbrockmendel here, there was indeed a slight perf hit from #50039
It's negligible compared with the time it'd take to parse the array with
to_datetime
, but still, better to fix itTimings on my laptop (this is copy-and-pasted from my terminal, I just removed some output from compiling the C extensions)