Skip to content

PERF first-non-null #50092

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 7, 2022
Merged

Conversation

MarcoGorelli
Copy link
Member

As intuitied by @jbrockmendel here, there was indeed a slight perf hit from #50039

It's negligible compared with the time it'd take to parse the array with to_datetime, but still, better to fix it

Timings on my laptop (this is copy-and-pasted from my terminal, I just removed some output from compiling the C extensions)

(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ git checkout main 
Switched to branch 'main'
Your branch is up-to-date with 'upstream/main'.
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . single-compile-c-extensions.sh 
Compiling pandas/_libs/tslib.pyx because it changed.
[1/1] Cythonizing pandas/_libs/tslib.pyx
/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
  warnings.warn(msg, _BetaConfiguration)
running build_ext
building 'pandas._libs.tslib' extension
Successfully installed pandas-2.0.0.dev0+858.g7c208c8907
2.0.0.dev0+858.g7c208c8907
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . repl.sh 

In [1]: from pandas._libs.tslib import first_non_null

In [2]: arr = np.array([np.nan, np.nan, np.nan, '2012-01-01'], dtype=object)

In [3]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
213 ns ± 1.03 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [4]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
215 ns ± 2.12 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [5]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
214 ns ± 3.36 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [6]:                                                                                                                                                                 
Do you really want to exit ([y]/n)? y
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ git revert 8da8743729
Auto-merging pandas/_libs/tslib.pyx
[main 0427c464b4] Revert "ENH skip 'now' and 'today' when inferring format for array (#50039)"
 2 files changed, 1 insertion(+), 7 deletions(-)
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . single-compile-c-extensions.sh 
Compiling pandas/_libs/tslib.pyx because it changed.
[1/1] Cythonizing pandas/_libs/tslib.pyx
/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
  warnings.warn(msg, _BetaConfiguration)
running build_ext
building 'pandas._libs.tslib' extension
Successfully installed pandas-2.0.0.dev0+859.g0427c464b4
2.0.0.dev0+859.g0427c464b4
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . repl.sh 

In [1]: from pandas._libs.tslib import first_non_null

In [2]: arr = np.array([np.nan, np.nan, np.nan, '2012-01-01'], dtype=object)

In [3]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
70.7 ns ± 1.71 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [4]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
73.3 ns ± 1.31 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
73.6 ns ± 3.65 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]:                                                                                                                                                                 
Do you really want to exit ([y]/n)? y
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ git checkout -
Switched to branch 'first-non-null-perf'
Your branch is up-to-date with 'origin/first-non-null-perf'.
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . single-compile-c-extensions.sh 
Compiling pandas/_libs/tslib.pyx because it changed.
[1/1] Cythonizing pandas/_libs/tslib.pyx
/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
  warnings.warn(msg, _BetaConfiguration)
running build_ext
building 'pandas._libs.tslib' extension
Successfully installed pandas-2.0.0.dev0+859.gecc8eaaf50
2.0.0.dev0+859.gecc8eaaf50
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . repl.sh 

In [1]: from pandas._libs.tslib import first_non_null

In [2]: arr = np.array([np.nan, np.nan, np.nan, '2012-01-01'], dtype=object)

In [3]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
75.2 ns ± 2.69 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [4]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
75.7 ns ± 2.56 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
79.8 ns ± 3.22 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]:                                                                                                                                                                 
Do you really want to exit ([y]/n)? y

@MarcoGorelli MarcoGorelli added the Performance Memory or execution speed performance label Dec 6, 2022
@@ -434,7 +434,7 @@ def first_non_null(values: ndarray) -> int:
if (
isinstance(val, str)
and
(len(val) == 0 or val in ("now", "today", *nat_strings))
(len(val) == 0 or val in nat_strings or val in ("now", "today"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this could be faster still as

Suggested change
(len(val) == 0 or val in nat_strings or val in ("now", "today"))
((not val) or (val in ("now", "today")) or (val in nat_strings))

(parens added for readability and obviousness) to avoid the len global lookup + function call + comparison-with-zero.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion - I tried this but it didn't make a difference

(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ git diff
diff --git a/pandas/_libs/tslib.pyx b/pandas/_libs/tslib.pyx
index 6f58fecd1a..61116b8f83 100644
--- a/pandas/_libs/tslib.pyx
+++ b/pandas/_libs/tslib.pyx
@@ -434,7 +434,7 @@ def first_non_null(values: ndarray) -> int:
         if (
             isinstance(val, str)
             and
-            (len(val) == 0 or val in nat_strings or val in ("now", "today"))
+            ((not val) or (val in ("now", "today")) or (val in nat_strings))
         ):
             continue
         return i
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . single-compile-c-extensions.sh 
Compiling pandas/_libs/tslib.pyx because it changed.
[1/1] Cythonizing pandas/_libs/tslib.pyx
/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
  warnings.warn(msg, _BetaConfiguration)
running build_ext
building 'pandas._libs.tslib' extension
/home/marcogorelli/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marcogorelli/mambaforge/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslib.c -o build/temp.linux-x86_64-cpython-38/pandas/_libs/tslib.o
/home/marcogorelli/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marcogorelli/mambaforge/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslibs/src/datetime/np_datetime.c -o build/temp.linux-x86_64-cpython-38/pandas/_libs/tslibs/src/datetime/np_datetime.o
/home/marcogorelli/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -L/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -L/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/marcogorelli/mambaforge/envs/pandas-dev/lib -L/home/marcogorelli/mambaforge/envs/pandas-dev/lib -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/marcogorelli/mambaforge/envs/pandas-dev/include build/temp.linux-x86_64-cpython-38/pandas/_libs/tslib.o build/temp.linux-x86_64-cpython-38/pandas/_libs/tslibs/src/datetime/np_datetime.o -o build/lib.linux-x86_64-cpython-38/pandas/_libs/tslib.cpython-38-x86_64-linux-gnu.so
Obtaining file:///home/marcogorelli/pandas-dev
  Preparing metadata (setup.py) ... done
Requirement already satisfied: python-dateutil>=2.8.2 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (2022.6)
Requirement already satisfied: numpy>=1.20.3 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (1.23.5)
Requirement already satisfied: six>=1.5 in /home/marcogorelli/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas==2.0.0.dev0+859.gecc8eaaf50.dirty) (1.16.0)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.0.dev0+859.gecc8eaaf50
    Uninstalling pandas-2.0.0.dev0+859.gecc8eaaf50:
      Successfully uninstalled pandas-2.0.0.dev0+859.gecc8eaaf50
  Running setup.py develop for pandas
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
db-dtypes 1.0.4 requires pandas<2.0dev,>=0.24.2, but you have pandas 2.0.0.dev0+859.gecc8eaaf50.dirty which is incompatible.
Successfully installed pandas-2.0.0.dev0+859.gecc8eaaf50.dirty
2.0.0.dev0+859.gecc8eaaf50.dirty
(pandas-dev) marcogorelli@DESKTOP-U8OKFP3:~/pandas-dev$ . repl.sh 

In [1]: from pandas._libs.tslib import first_non_null

In [2]: arr = np.array([np.nan, np.nan, np.nan, '2012-01-01'], dtype=object)

In [3]: %%timeit
   ...: first_non_null(arr)
   ...: 
   ...: 
76.4 ns ± 2.51 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, alright. Thanks for checking it out.

I guess Cython is smart enough to Do The Right Thing for len(x) == 0 if it knows x is a str (which it must be after that isinstance)..?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with not val cython uses PyObject_IsTrue so it isn't doing any string-based optimization there. either way i think its just dominated by the tuple unpacking/construction

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM nice catch

@MarcoGorelli MarcoGorelli added this to the 2.0 milestone Dec 7, 2022
@MarcoGorelli MarcoGorelli merged commit 7c0278e into pandas-dev:main Dec 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants