API: Avoid Hidden numeric heuristics #53781

mroeschke · 2023-06-21T21:26:19Z

There are several places where pandas has hidden heuristics or thresholds that dictate behavior in ways that are neither obvious to the user nor configurable. IIRC, there have been bugs in rolling and to_datetime that only appeared when the data had a particular value or was a certain size, which makes them hard to diagnose.

Ideally we should:

Not change behavior based on introspection of some data characteristic
At least expose an option that lets the user control the heuristic

CSV reading tokenizer chunksize

pandas/pandas/_libs/parsers.pyx

Line 119 in bb0403b

int64_t DEFAULT_CHUNKSIZE = 256 * 1024

CSV line buffer size

pandas/pandas/_libs/parsers.pyx

Line 587 in bb0403b

heuristic = 2**20 // self.table_width
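To see how this value depends on the shape of the data, the expression can be evaluated for a few table widths. This is only a sketch: `line_buffer_heuristic` is a hypothetical helper name, and the only thing taken from the line above is the expression `2**20 // table_width`. How the parser uses the result is not shown here.

```python
def line_buffer_heuristic(table_width: int) -> int:
    """Evaluate the quoted expression for a given number of columns.

    Hypothetical helper for illustration; not a pandas function.
    """
    if table_width <= 0:
        raise ValueError("table_width must be positive")
    return 2**20 // table_width


# The result shrinks as the table gets wider.
for width in (1, 10, 1000):
    print(width, line_buffer_heuristic(width))
# prints:
# 1 1048576
# 10 104857
# 1000 1048
```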

Number of elements at which numexpr is used automatically

pandas/pandas/core/computation/expressions.py

Line 42 in bb0403b

_MIN_ELEMENTS = 1_000_000

TimedeltaArray (TDA) iteration chunk size

pandas/pandas/core/arrays/timedeltas.py

Line 387 in bb0403b

chunksize = 10000

Something pytables related

pandas/pandas/core/computation/pytables.py

Line 101 in bb0403b

_max_selectors = 31

pandas/pandas/io/pytables.py

Line 1887 in bb0403b

chunksize = 100000

Number of elements at which to_datetime automatically uses caching

pandas/pandas/core/tools/datetimes.py

Line 124 in bb0403b

start_caching_at = 50

Chunk size to use when writing csv

pandas/pandas/io/formats/csvs.py

Line 166 in bb0403b

return (100000 // (len(self.cols) or 1)) or 1
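The two `or 1` guards in this expression are easy to misread, so here is a small worked example. It is a sketch under assumptions: `rows_per_chunk` and its `n_cols` parameter are hypothetical names standing in for the method and `len(self.cols)`, and only the arithmetic is copied from the line above.

```python
def rows_per_chunk(n_cols: int) -> int:
    """Reproduce the quoted arithmetic with n_cols in place of len(self.cols).

    Hypothetical helper for illustration; not a pandas function.
    """
    # Inner `or 1`: avoids dividing by zero when there are no columns.
    # Outer `or 1`: avoids a chunk of zero rows when there are more than
    # 100000 columns.
    return (100000 // (n_cols or 1)) or 1


for n_cols in (0, 3, 10, 200_000):
    print(n_cols, rows_per_chunk(n_cols))
# prints:
# 0 100000
# 3 33333
# 10 10000
# 200000 1
```

So the number of rows written per chunk is not fixed: it depends on the width of the frame.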

Number of regexes to store when time parsing

pandas/pandas/_libs/tslibs/strptime.pyx

Line 576 in bb0403b

_CACHE_MAX_SIZE = 5 # Max number of regexes stored in _regex_cache

Rank tolerance

pandas/pandas/_libs/algos.pyx

Line 61 in bb0403b

float64_t FP_ERR = 1e-13

isin algo determination

pandas/pandas/core/algorithms.py

Line 521 in bb0403b

len(comps_array) > 1_000_000

Value formatting

pandas/pandas/io/formats/format.py

Line 1562 in bb0403b

has_large_values = (abs_vals > 1e6).any()

Number of elements to populate hash table

pandas/pandas/_libs/index.pyx

Line 99 in bb0403b

_SIZE_CUTOFF = 1_000_000


jbrockmendel · 2023-06-21T23:14:56Z

Is the idea that making these configurable will help in bug hunting? Or is it more of an "anything that can be configured should be configurable"? Because I'm wary of the latter.

mroeschke · 2023-06-22T00:10:40Z

Personally, more to help with bug hunting, but I also think it's a better user experience if behavior doesn't change based on a silent heuristic. Additionally, I've been diving into slow tests recently, and a lot of the slow tests have to generate large data to trip and test the heuristic path.
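A minimal sketch of the testing point, assuming a size-based dispatch loosely modeled on the `isin` threshold listed above. The function `membership` and its `size_cutoff` parameter are hypothetical and are not pandas API. The idea is only that, once the cutoff is an argument, a test can exercise both code paths with a handful of elements instead of generating more than a million.

```python
def membership(values, candidates, size_cutoff=1_000_000):
    """Return, for each value, whether it appears in candidates.

    Hypothetical illustration: the strategy is chosen by input size,
    and the cutoff is an explicit argument rather than a hidden constant.
    """
    if len(values) > size_cutoff:
        # "Large" path: build a hash set once, then do constant-time lookups.
        lookup = set(candidates)
        return [v in lookup for v in values]
    # "Small" path: plain linear scan over the candidates.
    return [any(v == c for c in candidates) for v in values]


values = [1, 2, 3, 4, 5]
candidates = [2, 5]

small_path = membership(values, candidates)                 # default cutoff, linear scan
large_path = membership(values, candidates, size_cutoff=0)  # forces the hash-set path

# Both paths must agree; no large input is needed to reach either one.
assert small_path == large_path == [False, True, False, False, True]
```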

mroeschke added the API Design label Jun 21, 2023

mroeschke mentioned this issue Jun 27, 2023

TST: Refactor slow tests #53891

Merged

mroeschke mentioned this issue Aug 22, 2023

BUG: numexpr 2.85 changed integer overflow handling, failing a test #54546

Open
