Skip to content

PERF first-non-null #50092

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 7, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion pandas/_libs/tslib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -434,7 +434,7 @@ def first_non_null(values: ndarray) -> int:
if (
isinstance(val, str)
and
(len(val) == 0 or val in ("now", "today", *nat_strings))
(len(val) == 0 or val in nat_strings or val in ("now", "today"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this could be faster still as

Suggested change
(len(val) == 0 or val in nat_strings or val in ("now", "today"))
((not val) or (val in ("now", "today")) or (val in nat_strings))

(parens added for readability and obviousness) to avoid the len global lookup + function call + comparison-with-zero.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion - I tried this but it didn't make a difference

(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ git diff
diff --git a/pandas/_libs/tslib.pyx b/pandas/_libs/tslib.pyx
index 6f58fecd1a..61116b8f83 100644
--- a/pandas/_libs/tslib.pyx
+++ b/pandas/_libs/tslib.pyx
@@ -434,7 +434,7 @@ def first_non_null(values: ndarray) -> int:
         if (
             isinstance(val, str)
             and
-            (len(val) == 0 or val in nat_strings or val in ("now", "today"))
+            ((not val) or (val in ("now", "today")) or (val in nat_strings))
         ):
             continue
         return i
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . single-compile-c-extensions.sh 
Compiling pandas/_libs/tslib.pyx because it changed.
[1/1] Cythonizing pandas/_libs/tslib.pyx
/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
  warnings.warn(msg, _BetaConfiguration)
running build_ext
building 'pandas._libs.tslib' extension
/home/marcogorelli/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marcogorelli/mambaforge/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslib.c -o build/temp.linux-x86_64-cpython-38/pandas/_libs/tslib.o
/home/marcogorelli/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marcogorelli/mambaforge/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslibs/src/datetime/np_datetime.c -o build/temp.linux-x86_64-cpython-38/pandas/_libs/tslibs/src/datetime/np_datetime.o
/home/marcogorelli/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -L/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -L/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -L/home/marcogorelli/mambaforge/envs/pandas-dev/lib -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include build/temp.linux-x86_64-cpython-38/pandas/_libs/tslib.o build/temp.linux-x86_64-cpython-38/pandas/_libs/tslibs/src/datetime/np_datetime.o -o build/lib.linux-x86_64-cpython-38/pandas/_libs/tslib.cpython-38-x86_64-linux-gnu.so
Obtaining file:///home/marcogorelli/pandas-dev
  Preparing metadata (setup.py) ... done
Requirement already satisfied: python-dateutil>=2.8.2 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (2022.6)
Requirement already satisfied: numpy>=1.20.3 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (1.23.5)
Requirement already satisfied: six>=1.5 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (1.16.0)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.0.dev0+859.gecc8eaaf50
    Uninstalling pandas-2.0.0.dev0+859.gecc8eaaf50:
      Successfully uninstalled pandas-2.0.0.dev0+859.gecc8eaaf50
  Running setup.py develop for pandas
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
db-dtypes 1.0.4 requires pandas<2.0dev,>=0.24.2, but you have pandas 2.0.0.dev0+859.gecc8eaaf50.dirty which is incompatible.
Successfully installed pandas-2.0.0.dev0+859.gecc8eaaf50.dirty
2.0.0.dev0+859.gecc8eaaf50.dirty
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . repl.sh 

In [1]: from pandas._libs.tslib import first_non_null

In [2]: arr = np.array([np.nan, np.nan, np.nan, '2012-01-01'], dtype=object)

In [3]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
76.4 ns ± 2.51 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, alright. Thanks for checking it out.

I guess Cython is smart enough to Do The Right Thing for len(x) == 0 if it knows x is a str (which it must be after that isinstance)..?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with not val cython uses PyObject_IsTrue so it isn't doing any string-based optimization there. either way i think its just dominated by the tuple unpacking/construction

):
continue
return i
Expand Down