Skip to content

PERF: avoid copies in lib.infer_dtype #45057

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jan 17, 2022

Conversation

jbrockmendel
Copy link
Member

In the skipna=True case, instead of doing values = values[~isnaobj(values)], just pass skipna to the relevant validator functions. This causes trouble because those validator functions have different rules about what NA values they allow (analogous to is_valid_na_for_dtype).

I'm down to 1 test case failing locally, fixing it will require changing the StringArray constructor to allow None and np.nan (xref #40839) in addition to just pd.NA (AFAICT pd.array and pd.Series(... dtype="string") will not be affected). I'm fine with that, but someone more involved with the string code might want to weigh in cc @simonjayhawkins

In [1]: import numpy as np

In [2]: arr = np.arange(10**5).astype(object)

In [3]: from pandas._libs import lib

In [4]: %timeit lib.infer_dtype(arr, skipna=True)
962 µs ± 77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <- PR
3.81 ms ± 33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- master


In [7]: %load_ext memory_profiler

In [16]: arr = arr.repeat(100)

In [18]: %memit lib.infer_dtype(arr, skipna=True)
peak memory: 169.24 MiB, increment: 0.00 MiB  # <- PR
peak memory: 265.39 MiB, increment: 95.38 MiB  # <- master

@jbrockmendel jbrockmendel marked this pull request as draft December 27, 2021 00:31
@jbrockmendel
Copy link
Member Author

Bad news @mroeschke:

    def test_dialect_conflict_except_delimiter(all_parsers, custom_dialect, arg, value):
[...]
E           AssertionError: Caused unexpected warning(s): [('ResourceWarning', ResourceWarning('unclosed file <_io.BufferedRandom name=10>'), 'D:\\a\\1\\s\\pandas\\core\\indexes\\base.py', 665)]

@mroeschke
Copy link
Member

Bad news @mroeschke:

    def test_dialect_conflict_except_delimiter(all_parsers, custom_dialect, arg, value):
[...]
E           AssertionError: Caused unexpected warning(s): [('ResourceWarning', ResourceWarning('unclosed file <_io.BufferedRandom name=10>'), 'D:\\a\\1\\s\\pandas\\core\\indexes\\base.py', 665)]

:(

This looks like it occurred on Windows? The check I added doesn't work there since idk the equivalent lsof check.

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Performance Memory or execution speed performance labels Dec 28, 2021
@jreback
Copy link
Contributor

jreback commented Dec 28, 2021

wow, looks pretty good.

@jbrockmendel jbrockmendel marked this pull request as ready for review December 28, 2021 22:40
@@ -1882,7 +1889,7 @@ cdef class StringValidator(Validator):

cdef bint is_valid_null(self, object value) except -1:
# We deliberately exclude None / NaN here since StringArray uses NA
return value is C_NA
return value is C_NA or value is None or util.is_nan(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question:
If I do pd.arrays.StringArray(np.array(["a", np.nan], dtype=object)), won't this result in a StringArray with np.nan in it? (something like this? #30966 (comment))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't this result in a StringArray with np.nan in it?

yep. probably ought to do something about that. have something min mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to fix this in #41412. This PR is probably stuck in the meantime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lithomas1 is this un-blocked?

@jreback
Copy link
Contributor

jreback commented Dec 30, 2021

this WIP?

@jbrockmendel
Copy link
Member Author

this WIP?

@lithomas1 says this needs to wait on #41412

@jbrockmendel jbrockmendel changed the title PERF/WIP: avoid copies in lib.infer_dtype PERF: avoid copies in lib.infer_dtype Jan 7, 2022
@jreback jreback added this to the 1.5 milestone Jan 17, 2022
@jreback jreback merged commit c2fc924 into pandas-dev:main Jan 17, 2022
@jreback
Copy link
Contributor

jreback commented Jan 17, 2022

nice!

@jbrockmendel jbrockmendel deleted the perf-ravels-2 branch January 17, 2022 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: infer_dtype with skipna=True only skip valid-for-dtype NAs
4 participants