
TST: Refactor slow tests #53891


Merged (5 commits, Jun 28, 2023)
Changes from 4 commits
1 change: 1 addition & 0 deletions pandas/_libs/parsers.pyi
@@ -12,6 +12,7 @@ from pandas._typing import (
)

STR_NA_VALUES: set[str]
+DEFAULT_BUFFER_HEURISTIC: int

def sanitize_objects(
    values: npt.NDArray[np.object_],
4 changes: 3 additions & 1 deletion pandas/_libs/parsers.pyx
@@ -118,6 +118,8 @@ cdef:
    float64_t NEGINF = -INF
    int64_t DEFAULT_CHUNKSIZE = 256 * 1024

+DEFAULT_BUFFER_HEURISTIC = 2 ** 20
Member: Can this be set as a property of the TextReader? It is a bit ambiguous in a pyx file, but with an all-caps name in the global namespace I would expect this to be a compile-time constant; attaching it as a property would make things clearer.

Member: Or, if this is just for testing, maybe you can patch buffer_lines directly? The naming here is a bit unclear when scoped outside of the initializer.

Member Author: Do you have the same concern about DEFAULT_CHUNKSIZE above too?

Ideally, I think these magic numbers should at least be made obvious so they can be configured or removed (#53781), and moving this into the TextReader would make that less obvious?

Member (@WillAyd, Jun 28, 2023): Unless I am mistaken about how Cython generates the code, comparing this to DEFAULT_CHUNKSIZE is exactly the problem: that one is a compile-time constant, whereas this value can be modified at runtime, yet they both look the same.
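For context on that distinction, a rough sketch of the two kinds of module-level names in a .pyx file (illustrative only, not the PR's code; the names are borrowed from above):

    # sketch.pyx -- illustrating the ambiguity discussed here
    from libc.stdint cimport int64_t

    cdef:
        # C-level global: typed in a cdef block, this compiles to a C variable
        # that is not an attribute of the module, so Python code (including
        # monkeypatch) can never see or modify it.
        int64_t DEFAULT_CHUNKSIZE = 256 * 1024

    # Python-level global: a bare assignment creates an ordinary module
    # attribute, so it can be reassigned at runtime from Python.
    DEFAULT_BUFFER_HEURISTIC = 2 ** 20

The two declarations look alike in the source even though only the second one is patchable from a test.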

Member Author: Could you suggest how I could make DEFAULT_BUFFER_HEURISTIC a property on the cdef TextReader class? I'm having trouble defining it in a way that I could monkeypatch.

Member: Hmm, does declaring it cpdef help at all? Not worth going down a rabbit hole if it's a holdup.

Member Author: Here are my unsuccessful attempts so far, just to double-check:

    cdef public:
        int64_t leading_cols, table_width, DEFAULT_BUFFER_HEURISTIC=2**20
                                          ^
------------------------------------------------------------

pandas/_libs/parsers.pyx:365:43: Cannot assign default value to fields in cdef classes, structs or unions
    cpdef DEFAULT_BUFFER_HEURISTIC=2**20
                                 ^
------------------------------------------------------------

pandas/_libs/parsers.pyx:377:34: Cannot assign default value to fields in cdef classes, structs or unions
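(For reference: Cython rejects inline defaults on cdef-class fields regardless of visibility. A minimal sketch of the usual workaround, not tried in this PR, assigns the default in __cinit__ instead; the attribute name below is hypothetical.)

    from libc.stdint cimport int64_t

    cdef class TextReader:
        # declared without a default; "cdef public" exposes it to Python
        cdef public int64_t buffer_heuristic

        def __cinit__(self):
            self.buffer_heuristic = 2 ** 20  # default assigned here instead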

Member: Cool, thanks for checking. Let's not get hung up on it for now then - I think it's just a wart in how C-level and Python-level names get expressed in the Cython global namespace. We can always come back and refactor if we establish a better pattern.

Member Author: Thanks for confirming!



cdef extern from "pandas/portable.h":
    # I *think* this is here so that strcasecmp is defined on Windows
@@ -584,7 +586,7 @@ cdef class TextReader:
            raise EmptyDataError("No columns to parse from file")

        # Compute buffer_lines as function of table width.
-        heuristic = 2**20 // self.table_width
+        heuristic = DEFAULT_BUFFER_HEURISTIC // self.table_width
        self.buffer_lines = 1
        while self.buffer_lines * 2 < heuristic:
            self.buffer_lines *= 2
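To make the heuristic concrete, a small Python sketch of the same loop with an assumed table width (values are illustrative only):

    DEFAULT_BUFFER_HEURISTIC = 2**20
    table_width = 10  # hypothetical column count

    heuristic = DEFAULT_BUFFER_HEURISTIC // table_width  # 104857
    buffer_lines = 1
    while buffer_lines * 2 < heuristic:
        buffer_lines *= 2

    # buffer_lines == 65536: the first power of two whose double reaches the
    # heuristic, so one buffer of buffer_lines rows stays under ~2**20 cells.
    assert buffer_lines == 65536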
21 changes: 8 additions & 13 deletions pandas/tests/indexes/datetimes/test_date_range.py
@@ -212,11 +212,16 @@ def test_date_range_int64_overflow_non_recoverable(self):
            date_range(end="1969-11-14", periods=106752 * 24, freq="H")

    @pytest.mark.slow
-    def test_date_range_int64_overflow_stride_endpoint_different_signs(self):
+    @pytest.mark.parametrize(
+        "s_ts, e_ts", [("2262-02-23", "1969-11-14"), ("1970-02-01", "1677-10-22")]
+    )
+    def test_date_range_int64_overflow_stride_endpoint_different_signs(
+        self, s_ts, e_ts
+    ):
        # cases where stride * periods overflow int64 and stride/endpoint
        # have different signs
-        start = Timestamp("2262-02-23")
-        end = Timestamp("1969-11-14")
+        start = Timestamp(s_ts)
+        end = Timestamp(e_ts)

        expected = date_range(start=start, end=end, freq="-1H")
        assert expected[0] == start
@@ -225,16 +230,6 @@ def test_date_range_int64_overflow_stride_endpoint_different_signs(self):
        dti = date_range(end=end, periods=len(expected), freq="-1H")
        tm.assert_index_equal(dti, expected)

-        start2 = Timestamp("1970-02-01")
-        end2 = Timestamp("1677-10-22")
-
-        expected2 = date_range(start=start2, end=end2, freq="-1H")
-        assert expected2[0] == start2
-        assert expected2[-1] == end2
-
-        dti2 = date_range(start=start2, periods=len(expected2), freq="-1H")
-        tm.assert_index_equal(dti2, expected2)

    def test_date_range_out_of_bounds(self):
        # GH#14187
        msg = "Cannot generate range"
11 changes: 8 additions & 3 deletions pandas/tests/io/parser/common/test_chunksize.py
@@ -7,6 +7,7 @@
import numpy as np
import pytest

+from pandas._libs import parsers as libparsers
from pandas.errors import DtypeWarning

from pandas import (
@@ -162,14 +163,18 @@ def test_chunk_begins_with_newline_whitespace(all_parsers):


@pytest.mark.slow
-def test_chunks_have_consistent_numerical_type(all_parsers):
+def test_chunks_have_consistent_numerical_type(all_parsers, monkeypatch):
    # mainly an issue with the C parser
+    heuristic = 2**3
    parser = all_parsers
-    integers = [str(i) for i in range(499999)]
+    integers = [str(i) for i in range(heuristic - 1)]
    data = "a\n" + "\n".join(integers + ["1.0", "2.0"] + integers)

    # Coercions should work without warnings.
    with tm.assert_produces_warning(None):
-        result = parser.read_csv(StringIO(data))
+        with monkeypatch.context() as m:
+            m.setattr(libparsers, "DEFAULT_BUFFER_HEURISTIC", heuristic)
+            result = parser.read_csv(StringIO(data))

    assert type(result.a[0]) is np.float64
    assert result.a.dtype == float
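A note on why such a tiny heuristic still exercises the original bug: under the buffer_lines loop in parsers.pyx above, the patched value keeps each read buffer to a few rows, so the float values still arrive in a different chunk than most of the integers. A quick sketch of the arithmetic (assuming the same loop):

    heuristic = 2**3 // 1  # table_width == 1 for the single-column file
    buffer_lines = 1
    while buffer_lines * 2 < heuristic:
        buffer_lines *= 2

    # buffer_lines == 4, so the 16-row file above spans several buffers and
    # the cross-chunk dtype coercion is hit with 16 rows instead of ~1,000,000.
    assert buffer_lines == 4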
13 changes: 9 additions & 4 deletions pandas/tests/io/parser/dtypes/test_categorical.py
@@ -8,6 +8,8 @@
import numpy as np
import pytest

+from pandas._libs import parsers as libparsers

from pandas.core.dtypes.dtypes import CategoricalDtype

import pandas as pd
@@ -105,13 +107,16 @@ def test_categorical_dtype_missing(all_parsers):

@xfail_pyarrow
@pytest.mark.slow
-def test_categorical_dtype_high_cardinality_numeric(all_parsers):
+def test_categorical_dtype_high_cardinality_numeric(all_parsers, monkeypatch):
    # see gh-18186
+    # was an issue with C parser, due to DEFAULT_BUFFER_HEURISTIC
    parser = all_parsers
-    data = np.sort([str(i) for i in range(524289)])
+    heuristic = 2**5
+    data = np.sort([str(i) for i in range(heuristic + 1)])
    expected = DataFrame({"a": Categorical(data, ordered=True)})

-    actual = parser.read_csv(StringIO("a\n" + "\n".join(data)), dtype="category")
+    with monkeypatch.context() as m:
+        m.setattr(libparsers, "DEFAULT_BUFFER_HEURISTIC", heuristic)
+        actual = parser.read_csv(StringIO("a\n" + "\n".join(data)), dtype="category")
    actual["a"] = actual["a"].cat.reorder_categories(
        np.sort(actual.a.cat.categories), ordered=True
    )
26 changes: 0 additions & 26 deletions pandas/tests/io/parser/test_c_parser_only.py
@@ -44,32 +44,6 @@ def test_buffer_overflow(c_parser_only, malformed):
        parser.read_csv(StringIO(malformed))


-def test_buffer_rd_bytes(c_parser_only):
Member Author: Based on the original issue, this sounded like a PY2 bug specifically, so I don't think this needs testing anymore.

-    # see gh-12098: src->buffer in the C parser can be freed twice leading
-    # to a segfault if a corrupt gzip file is read with 'read_csv', and the
-    # buffer is filled more than once before gzip raises an Exception.
-
-    data = (
-        "\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x03\xED\xC3\x41\x09"
-        "\x00\x00\x08\x00\xB1\xB7\xB6\xBA\xFE\xA5\xCC\x21\x6C\xB0"
-        "\xA6\x4D" + "\x55" * 267 + "\x7D\xF7\x00\x91\xE0\x47\x97\x14\x38\x04\x00"
-        "\x1f\x8b\x08\x00VT\x97V\x00\x03\xed]\xefO"
-    )
-    parser = c_parser_only
-
-    for _ in range(100):
-        try:
-            parser.read_csv_check_warnings(
-                RuntimeWarning,
-                "compression has no effect when passing a non-binary object as input",
-                StringIO(data),
-                compression="gzip",
-                delim_whitespace=True,
-            )
-        except Exception:
-            pass


def test_delim_whitespace_custom_terminator(c_parser_only):
    # See gh-12912
    data = "a b c~1 2 3~4 5 6~7 8 9"
47 changes: 19 additions & 28 deletions pandas/tests/io/parser/test_multi_thread.py
@@ -22,38 +22,16 @@
]


-def _construct_dataframe(num_rows):
-    """
-    Construct a DataFrame for testing.
-
-    Parameters
-    ----------
-    num_rows : int
-        The number of rows for our DataFrame.
-
-    Returns
-    -------
-    df : DataFrame
-    """
-    df = DataFrame(np.random.rand(num_rows, 5), columns=list("abcde"))
-    df["foo"] = "foo"
-    df["bar"] = "bar"
-    df["baz"] = "baz"
-    df["date"] = pd.date_range("20000101 09:00:00", periods=num_rows, freq="s")
-    df["int"] = np.arange(num_rows, dtype="int64")
-    return df


def test_multi_thread_string_io_read_csv(all_parsers):
    # see gh-11786
    parser = all_parsers
-    max_row_range = 10000
-    num_files = 100
+    max_row_range = 100
+    num_files = 10

-    bytes_to_df = [
+    bytes_to_df = (
        "\n".join([f"{i:d},{i:d},{i:d}" for i in range(max_row_range)]).encode()
        for _ in range(num_files)
-    ]
+    )

    # Read all files in many threads.
    with ExitStack() as stack:
@@ -141,11 +119,24 @@ def reader(arg):
def test_multi_thread_path_multipart_read_csv(all_parsers):
    # see gh-11786
    num_tasks = 4
-    num_rows = 100000
+    num_rows = 48

    parser = all_parsers
    file_name = "__thread_pool_reader__.csv"
-    df = _construct_dataframe(num_rows)
+    df = DataFrame(
+        {
+            "a": np.random.rand(num_rows),
+            "b": np.random.rand(num_rows),
+            "c": np.random.rand(num_rows),
+            "d": np.random.rand(num_rows),
+            "e": np.random.rand(num_rows),
+            "foo": ["foo"] * num_rows,
+            "bar": ["bar"] * num_rows,
+            "baz": ["baz"] * num_rows,
+            "date": pd.date_range("20000101 09:00:00", periods=num_rows, freq="s"),
+            "int": np.arange(num_rows, dtype="int64"),
+        }
+    )

    with tm.ensure_clean(file_name) as path:
        df.to_csv(path)
27 changes: 12 additions & 15 deletions pandas/tests/test_sorting.py
@@ -96,32 +96,29 @@ def test_int64_overflow_groupby_large_range(self):

    @pytest.mark.parametrize("agg", ["mean", "median"])
    def test_int64_overflow_groupby_large_df_shuffled(self, agg):
-        arr = np.random.randint(-1 << 12, 1 << 12, (1 << 15, 5))
-        i = np.random.choice(len(arr), len(arr) * 4)
+        rs = np.random.RandomState(42)
+        arr = rs.randint(-1 << 12, 1 << 12, (1 << 15, 5))
+        i = rs.choice(len(arr), len(arr) * 4)
        arr = np.vstack((arr, arr[i]))  # add some duplicate rows

-        i = np.random.permutation(len(arr))
+        i = rs.permutation(len(arr))
        arr = arr[i]  # shuffle rows

        df = DataFrame(arr, columns=list("abcde"))
-        df["jim"], df["joe"] = np.random.randn(2, len(df)) * 10
+        df["jim"], df["joe"] = np.zeros((2, len(df)))
        gr = df.groupby(list("abcde"))

        # verify this is testing what it is supposed to test!
        assert is_int64_overflow_possible(gr.grouper.shape)

-        # manually compute groupings
-        jim, joe = defaultdict(list), defaultdict(list)
-        for key, a, b in zip(map(tuple, arr), df["jim"], df["joe"]):
-            jim[key].append(a)
-            joe[key].append(b)

-        assert len(gr) == len(jim)
-        mi = MultiIndex.from_tuples(jim.keys(), names=list("abcde"))
+        mi = MultiIndex.from_arrays(
+            [ar.ravel() for ar in np.array_split(np.unique(arr, axis=0), 5, axis=1)],
+            names=list("abcde"),
+        )

-        f = lambda a: np.fromiter(map(getattr(np, agg), a), dtype="f8")
-        arr = np.vstack((f(jim.values()), f(joe.values()))).T
-        res = DataFrame(arr, columns=["jim", "joe"], index=mi).sort_index()
+        res = DataFrame(
+            np.zeros((len(mi), 2)), columns=["jim", "joe"], index=mi
+        ).sort_index()

        tm.assert_frame_equal(getattr(gr, agg)(), res)

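A note on why the refactored expectation holds: with jim and joe filled with zeros, every group's mean and median is exactly 0.0, so the expected frame reduces to a zero matrix indexed by the sorted unique key rows. A minimal sketch of that equivalence on toy data (values are hypothetical):

    import numpy as np
    import pandas as pd

    arr = np.array([[1, 2], [1, 2], [3, 4]])  # duplicate key row on purpose
    df = pd.DataFrame(arr, columns=["a", "b"])
    df["jim"], df["joe"] = np.zeros((2, len(df)))

    # groupby-mean of all-zero columns is zero for every group ...
    res = df.groupby(["a", "b"]).mean()
    assert (res == 0).all().all()

    # ... and the result's index is exactly the sorted unique key rows.
    mi = pd.MultiIndex.from_arrays(list(np.unique(arr, axis=0).T), names=["a", "b"])
    assert res.index.equals(mi)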