PERF: avoid copies in lib.infer_dtype #45057

jbrockmendel · 2021-12-24T20:54:40Z

closes API: infer_dtype with skipna=True only skip valid-for-dtype NAs #45022
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

In the skipna=True case, instead of making a filtered copy with values = values[~isnaobj(values)], just pass skipna through to the relevant validator functions. This is not a drop-in change, because those validator functions have different rules about which NA values they allow (analogous to is_valid_na_for_dtype).
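
To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It is not the Cython code in lib.pyx: `is_na`, `all_ints_with_mask` and `all_ints_skipping` are hypothetical names invented for illustration, and the NA check is deliberately simplified to None and float NaN.

```python
import numpy as np


def is_na(value):
    # Simplified NA check for this sketch only: None or a float NaN.
    return value is None or (isinstance(value, float) and value != value)


def all_ints_with_mask(values):
    # Mask-based approach: build a boolean mask, then index with it.
    # Boolean indexing allocates a new array holding the non-NA entries.
    mask = np.array([is_na(v) for v in values], dtype=bool)
    filtered = values[~mask]
    return all(isinstance(v, int) for v in filtered)


def all_ints_skipping(values, skipna=True):
    # Copy-free approach: walk the original array once and skip NA
    # entries inside the validation loop itself.
    for v in values:
        if skipna and is_na(v):
            continue
        if not isinstance(v, int):
            return False
    return True


arr = np.array([1, None, 2, np.nan, 3], dtype=object)
print(all_ints_with_mask(arr))
print(all_ints_skipping(arr, skipna=True))
print(all_ints_skipping(arr, skipna=False))
```

Both functions agree when NA entries are skipped; the second one simply never materializes the filtered array, which is where the memory saving comes from.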

I'm down to one test case failing locally. Fixing it will require changing the StringArray constructor to allow None and np.nan (xref #40839) in addition to pd.NA; AFAICT pd.array and pd.Series(... dtype="string") will not be affected. I'm fine with that, but someone more involved with the string code might want to weigh in. cc @simonjayhawkins

In [1]: import numpy as np

In [2]: arr = np.arange(10**5).astype(object)

In [3]: from pandas._libs import lib

In [4]: %timeit lib.infer_dtype(arr, skipna=True)
962 µs ± 77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <- PR
3.81 ms ± 33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- master


In [7]: %load_ext memory_profiler

In [16]: arr = arr.repeat(100)

In [18]: %memit lib.infer_dtype(arr, skipna=True)
peak memory: 169.24 MiB, increment: 0.00 MiB  # <- PR
peak memory: 265.39 MiB, increment: 95.38 MiB  # <- master

jbrockmendel · 2021-12-28T19:38:19Z

Bad news @mroeschke:

    def test_dialect_conflict_except_delimiter(all_parsers, custom_dialect, arg, value):
[...]
E           AssertionError: Caused unexpected warning(s): [('ResourceWarning', ResourceWarning('unclosed file <_io.BufferedRandom name=10>'), 'D:\\a\\1\\s\\pandas\\core\\indexes\\base.py', 665)]

mroeschke · 2021-12-28T19:41:08Z

Bad news @mroeschke:

    def test_dialect_conflict_except_delimiter(all_parsers, custom_dialect, arg, value):
[...]
E           AssertionError: Caused unexpected warning(s): [('ResourceWarning', ResourceWarning('unclosed file <_io.BufferedRandom name=10>'), 'D:\\a\\1\\s\\pandas\\core\\indexes\\base.py', 665)]

:(

This looks like it occurred on Windows? The check I added doesn't work there since I don't know the Windows equivalent of the lsof check.

jreback · 2021-12-28T22:06:10Z

wow, looks pretty good.

lithomas1 · 2021-12-29T04:31:10Z

pandas/_libs/lib.pyx

@@ -1882,7 +1889,7 @@ cdef class StringValidator(Validator):

    cdef bint is_valid_null(self, object value) except -1:
        # We deliberately exclude None / NaN here since StringArray uses NA
-        return value is C_NA
+        return value is C_NA or value is None or util.is_nan(value)
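
The hunk above changes which nulls the string validator accepts. As a rough Python-level sketch of why per-validator null rules matter once skipna is handled inside the validator, consider the following. The class names, the `NA` sentinel and the `validate` method are hypothetical stand-ins, not the actual Cython `Validator` hierarchy.

```python
import math


class _NAType:
    # Stand-in for pd.NA so the sketch has no pandas dependency.
    def __repr__(self):
        return "<NA>"


NA = _NAType()


def is_nan(value):
    return isinstance(value, float) and math.isnan(value)


class StrictStringValidator:
    # Mirrors the "before" rule in the hunk: only NA counts as a null.
    def is_value_typed(self, value):
        return isinstance(value, str)

    def is_valid_null(self, value):
        return value is NA

    def validate(self, values, skipna=True):
        for value in values:
            if self.is_value_typed(value):
                continue
            if skipna and self.is_valid_null(value):
                continue
            return False
        return True


class LenientStringValidator(StrictStringValidator):
    # Mirrors the "after" rule: None and NaN are accepted as well.
    def is_valid_null(self, value):
        return value is NA or value is None or is_nan(value)


data = ["a", None, float("nan")]
print(StrictStringValidator().validate(data))
print(LenientStringValidator().validate(data))
```

Under the strict rule, None or NaN in the data makes validation fail even with skipna=True, whereas masking out every NA beforehand would have hidden them. That gap is what the one-line change closes in this sketch.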


Quick question:
If I do pd.arrays.StringArray(np.array(["a", np.nan], dtype=object)), won't this result in a StringArray with np.nan in it? (something like this? #30966 (comment))

jbrockmendel · 2021-12-29

won't this result in a StringArray with np.nan in it?

yep. probably ought to do something about that. have something in mind?

lithomas1 · 2021-12-29

I'm trying to fix this in #41412. This PR is probably stuck in the meantime.

jbrockmendel · 2022-01-17

@lithomas1 is this un-blocked?

jreback · 2021-12-30T00:17:17Z

this WIP?

jbrockmendel · 2021-12-30T00:20:27Z

this WIP?

@lithomas1 says this needs to wait on #41412

jreback · 2022-01-17T13:47:51Z

nice!

PERF/WIP: avoid copies in lib.infer_dtype

3338619

jbrockmendel marked this pull request as draft December 27, 2021 00:31

jbrockmendel added 6 commits December 27, 2021 11:42

Merge branch 'master' into perf-ravels-2

27e0a1b

update test

3939517

Merge branch 'master' into perf-ravels-2

a0a1048

update exception message

132b6c4

Merge branch 'master' into perf-ravels-2

47aec9d

is_period_array compat

994b5cd

jreback added the Dtype Conversions (unexpected or buggy dtype conversions) and Performance (memory or execution speed performance) labels Dec 28, 2021

jbrockmendel marked this pull request as ready for review December 28, 2021 22:40

lithomas1 reviewed Dec 29, 2021

jbrockmendel mentioned this pull request Dec 31, 2021

API: allow nan-likes in StringArray constructor #41412

Closed

4 tasks

jbrockmendel changed the title ~~PERF/WIP: avoid copies in lib.infer_dtype~~ PERF: avoid copies in lib.infer_dtype Jan 7, 2022

Merge branch 'main' into perf-ravels-2

32629aa

jreback added this to the 1.5 milestone Jan 17, 2022

jreback merged commit c2fc924 into pandas-dev:main Jan 17, 2022

jbrockmendel deleted the perf-ravels-2 branch January 17, 2022 16:04
