
Commit c69037c

gfyoung authored and jreback committed
BUG: Fixed grow_buffer to grow when capacity is reached
Addresses the issue in #12494 by allowing `grow_buffer` to grow the parser buffer when buffer capacity is reached. Previously the buffer grew only once capacity was exceeded, which was inconsistent with the `end_field` check later on when handling the EOF terminator, where reaching capacity was already treated as a buffer overflow.

Author: gfyoung <[email protected]>

Closes #12504 from gfyoung/read_csv_empty_header and squashes the following commits:

8ba3dd0 [gfyoung] BUG: Fixed grow_buffer to grow when capacity is reached
1 parent 9313089 · commit c69037c
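To make the off-by-one concrete, here is a minimal, self-contained C sketch of the doubling logic this commit fixes. It is not the pandas source: `demo_grow`, its parameter list, and the plain `realloc` are illustrative stand-ins for `grow_buffer`, its arguments, and `safe_realloc`.

#include <stdio.h>
#include <stdlib.h>

/* Doubling-buffer sketch modeled on grow_buffer: grow while the
 * requested space would fill the buffer to capacity (>=), not only
 * when it would exceed it (>). The spare byte kept by ">=" is what
 * the later write of the '\0' field terminator relies on. */
static char *demo_grow(char *buffer, size_t length, size_t *capacity,
                       size_t space) {
    size_t cap = *capacity;
    char *newbuffer = buffer;

    while ((length + space >= cap) && (newbuffer != NULL)) {
        cap = cap ? cap << 1 : 2;
        buffer = newbuffer;
        newbuffer = realloc(newbuffer, cap);
    }

    if (newbuffer == NULL)        /* realloc failed: keep the old block */
        return buffer;

    *capacity = cap;
    return newbuffer;
}

int main(void) {
    size_t cap = 8;
    char *buf = malloc(cap);
    if (buf == NULL)
        return 1;

    /* Request exactly 8 bytes of data for an 8-byte buffer. With the
     * old ">" comparison no growth would occur, and the terminator
     * write at buf[8] below would overflow by one byte. */
    buf = demo_grow(buf, 0, &cap, 8);
    buf[8] = '\0';                /* in bounds only because cap is now 16 */

    printf("capacity after grow: %zu\n", cap);
    free(buf);
    return 0;
}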

3 files changed: 22 additions, 1 deletion

doc/source/whatsnew/v0.18.0.txt (+1)

@@ -1199,3 +1199,4 @@ Bug Fixes
 - Bug in ``DataFrame.apply`` in which reduction was not being prevented for cases in which ``dtype`` was not a numpy dtype (:issue:`12244`)
 - Bug when initializing categorical series with a scalar value. (:issue:`12336`)
 - Bug when specifying a UTC ``DatetimeIndex`` by setting ``utc=True`` in ``.to_datetime`` (:issue:`11934`)
+- Bug when increasing the buffer size of CSV reader in ``read_csv`` (:issue:`12494`)

pandas/io/tests/test_parsers.py (+20)

@@ -2635,6 +2635,26 @@ def test_eof_states(self):
         self.assertRaises(Exception, self.read_csv,
                           StringIO(data), escapechar='\\')
 
+    def test_grow_boundary_at_cap(self):
+        # See gh-12494
+        #
+        # Cause of error was the fact that pandas
+        # was not increasing the buffer size when
+        # the desired space would fill the buffer
+        # to capacity, which later would cause a
+        # buffer overflow error when checking the
+        # EOF terminator of the CSV stream
+        def test_empty_header_read(count):
+            s = StringIO(',' * count)
+            expected = DataFrame(columns=[
+                'Unnamed: {i}'.format(i=i)
+                for i in range(count + 1)])
+            df = read_csv(s)
+            tm.assert_frame_equal(df, expected)
+
+        for count in range(1, 101):
+            test_empty_header_read(count)
+
 
 class TestPythonParser(ParserTests, tm.TestCase):
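A note on the test's design: because `grow_buffer` doubles capacity (`cap << 1`), sweeping the header from 1 to 100 commas walks the stream length across several power-of-two boundaries, so for a small initial buffer the exactly-at-capacity case from gh-12494 is guaranteed to be exercised.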

pandas/src/parser/tokenizer.c (+1, −1)

@@ -111,7 +111,7 @@ static void *grow_buffer(void *buffer, int length, int *capacity,
     void *newbuffer = buffer;
 
     // Can we fit potentially nbytes tokens (+ null terminators) in the stream?
-    while ( (length + space > cap) && (newbuffer != NULL) ){
+    while ( (length + space >= cap) && (newbuffer != NULL) ){
         cap = cap? cap << 1 : 2;
         buffer = newbuffer;
         newbuffer = safe_realloc(newbuffer, elsize * cap);
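The fix itself is a single character. Because the loop doubles capacity (`cap = cap ? cap << 1 : 2`), changing `>` to `>=` only triggers the doubling one byte earlier: an append that would fill the stream exactly to capacity still leaves room for the NUL terminator, keeping this check consistent with the `end_field` overflow check described in the commit message. The `newbuffer != NULL` guard in the loop condition stops the doubling if `safe_realloc` fails.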
