API: Avoid Hidden numeric heuristics #53781

mroeschke opened this issue Jun 21, 2023 · 2 comments

@mroeschke
Member

mroeschke commented Jun 21, 2023

pandas has several hidden heuristics/thresholds that dictate behavior in ways that are neither obvious to the user nor configurable. IIRC, there have been bugs in rolling and to_datetime that only appeared when the data contained a particular value or was a certain size, which makes them hard to diagnose.

Ideally we should:

  1. Not change behavior due to some data characteristic introspection
  2. At least expose the option to the user to control the heuristic
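
As a rough illustration of the second point, a hard-coded threshold can become a keyword argument whose default is the current value. The function `pick_strategy` and its parameter `size_cutoff` are hypothetical and not existing pandas API; only the 1_000_000 default mirrors constants quoted in the list below.

```python
from typing import Sequence

# Hypothetical default, mirroring constants such as _SIZE_CUTOFF = 1_000_000.
DEFAULT_SIZE_CUTOFF = 1_000_000


def pick_strategy(values: Sequence, size_cutoff: int = DEFAULT_SIZE_CUTOFF) -> str:
    """Return which (hypothetical) code path would handle ``values``.

    The cutoff is an explicit argument, so the switch is visible and overridable.
    """
    if size_cutoff < 0:
        raise ValueError("size_cutoff must be non-negative")
    return "large-data path" if len(values) > size_cutoff else "small-data path"


small = list(range(10))
print(pick_strategy(small))                 # small-data path
print(pick_strategy(small, size_cutoff=5))  # large-data path
```

With the cutoff exposed, a user hitting a size-dependent bug could force either path on the same small input and compare the results.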

CSV reading tokenizer chunksize

int64_t DEFAULT_CHUNKSIZE = 256 * 1024

CSV line buffer size

heuristic = 2**20 // self.table_width
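
To make the data dependence concrete, the sketch below only evaluates the quoted expression for a few widths. The wrapper `line_buffer_heuristic` is a hypothetical name, and how pandas consumes the result is not shown here.

```python
def line_buffer_heuristic(table_width: int) -> int:
    # Same expression as the quoted line, with table_width passed in explicitly.
    return 2**20 // table_width


for width in (1, 16, 1000):
    print(width, line_buffer_heuristic(width))
# 1 1048576
# 16 65536
# 1000 1048
```

The value shrinks as the table gets wider, so two files with the same number of rows can end up with very different buffer sizes.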

Number of elements at which numexpr is used automatically

_MIN_ELEMENTS = 1_000_000

TimedeltaArray (TDA) iteration chunk size

chunksize = 10000

Something pytables related

_max_selectors = 31

chunksize = 100000

Number of elements at which to_datetime automatically starts caching

start_caching_at = 50

Chunk size to use when writing csv

return (100000 // (len(self.cols) or 1)) or 1
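
The quoted expression can be evaluated on its own to see how the chunk size depends on the number of columns. `default_chunksize` is a hypothetical standalone wrapper that takes the column count directly instead of reading `self.cols`.

```python
def default_chunksize(n_cols: int) -> int:
    # Inner "or 1" avoids dividing by zero when there are no columns;
    # outer "or 1" keeps the result at least 1 for very wide frames.
    return (100000 // (n_cols or 1)) or 1


for n_cols in (0, 4, 300, 200_000):
    print(n_cols, default_chunksize(n_cols))
# 0 100000
# 4 25000
# 300 333
# 200000 1
```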

Number of regexes to store when parsing times

_CACHE_MAX_SIZE = 5 # Max number of regexes stored in _regex_cache

Rank tolerance

float64_t FP_ERR = 1e-13

isin algo determination

len(comps_array) > 1_000_000

Value formatting

has_large_values = (abs_vals > 1e6).any()
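
A small sketch of why this check is data dependent: a single element decides the flag for the whole set of values. The function below is a pure-Python equivalent of the quoted elementwise comparison, written without NumPy for self-containment; what pandas does with the flag afterwards is not modeled.

```python
LARGE_VALUE_CUTOFF = 1e6  # same literal as in the quoted line


def has_large_values(values):
    # Pure-Python equivalent of (abs_vals > 1e6).any()
    return any(abs(v) > LARGE_VALUE_CUTOFF for v in values)


print(has_large_values([1.5, -20.0, 999_999.0]))     # False
print(has_large_values([1.5, -20.0, 1_000_000.0]))   # False: the comparison is strict
print(has_large_values([1.5, -20.0, -1_000_000.5]))  # True: one element flips the flag
```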

Number of elements to populate hash table

_SIZE_CUTOFF = 1_000_000

@jbrockmendel
Member

Is the idea that making these configurable will help in bug hunting? Or is it more of an "anything that can be configured should be configurable"? Because I'm wary of the latter.

@mroeschke
Member Author

Personally, more to help with bug hunting, but I also think it's a better user experience if behavior doesn't change based on a silent heuristic. Additionally, I've been diving into slow tests recently, and a lot of the slow tests have to generate large data to trip and test the heuristic path.
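
A minimal sketch of the slow-test point, under assumptions: if a threshold is reachable as an attribute, a test can lower it temporarily and exercise the heuristic path with tiny inputs. `Engine` and `MIN_ELEMENTS` are hypothetical stand-ins, not pandas code; `mock.patch.object` is from the standard library.

```python
from unittest import mock


class Engine:
    """Hypothetical stand-in for code with a size-based fast path."""

    MIN_ELEMENTS = 1_000_000  # hypothetical threshold, same value as _MIN_ELEMENTS

    @classmethod
    def add(cls, a, b):
        result = [x + y for x, y in zip(a, b)]
        # Report which path was taken so a test can assert on it.
        path = "fast" if len(a) > cls.MIN_ELEMENTS else "plain"
        return path, result


def test_fast_path_with_small_data():
    # Lower the threshold only inside the with-block; it is restored afterwards.
    with mock.patch.object(Engine, "MIN_ELEMENTS", 0):
        path, result = Engine.add([1, 2], [3, 4])
    assert path == "fast"
    assert result == [4, 6]


test_fast_path_with_small_data()
```

The test covers the "large data" branch with two elements instead of more than a million.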
