
Commit 96cac41

jeffcarey authored and jorisvandenbossche committed
BUG: Corrects stopping logic when nrows argument is supplied (#7626)
closes #7626

Subsets of tabular files with different "shapes" will now load when a valid skiprows/nrows is given as an argument.

Conditions for error:
1) There are different "shapes" within a tabular data file, i.e. different numbers of columns.
2) A "narrower" set of columns is followed by a "wider" (more columns) one, and the narrower set is laid out such that the end of a 262144-byte block occurs within it.

Issue summary: The C engine for parsing files reads in 262144 bytes at a time. Previously, the "start_lines" variable in tokenizer.c/tokenize_bytes() was set incorrectly to the first line in that chunk, rather than the overall first row requested. This led to incorrect logic on when to stop reading when nrows is supplied by the user. This always happened but only caused a crash when a wider set of columns followed in the file. In other cases, extra rows were read in but then harmlessly discarded. This pull request always uses the first requested row for comparisons, so only nrows will be parsed when supplied.

Author: Jeff Carey <[email protected]>

Closes #14747 from jeffcarey/fix/7626 and squashes the following commits:

cac1bac [Jeff Carey] Removed duplicative test
6f1965a [Jeff Carey] BUG: Corrects stopping logic when nrows argument is supplied (Fixes #7626)

(cherry picked from commit 4378f82)

Conflicts:
	pandas/io/tests/parser/c_parser_only.py
1 parent 90e1922 commit 96cac41
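For reference, the failing conditions described above can be reproduced through the public read_csv API. This is a minimal sketch adapted from the regression test added in this commit; the column headers and row counts are illustrative, chosen so that the narrow block alone exceeds one 262144-byte read:

from io import StringIO
import pandas as pd

# Narrow block (10 columns) large enough to span the C engine's 262144-byte
# read chunk, followed by a wider block (15 columns).
header_narrow = '\t'.join('COL_HEADER_' + str(i) for i in range(10)) + '\n'
data_narrow = '\t'.join('somedatasomedatasomedata1' for _ in range(10)) + '\n'
header_wide = '\t'.join('COL_HEADER_' + str(i) for i in range(15)) + '\n'
data_wide = '\t'.join('somedatasomedatasomedata2' for _ in range(15)) + '\n'

test_input = header_narrow + data_narrow * 1050 + header_wide + data_wide * 2

# Before the fix this could fail (or silently over-read) once the parser ran
# past the requested rows into the wider block; with the fix, only the
# requested 1010 rows are tokenized.
df = pd.read_csv(StringIO(test_input), sep='\t', nrows=1010, engine='c')
assert df.shape == (1010, 10)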

File tree

3 files changed: +21 −5 lines changed


doc/source/whatsnew/v0.19.2.txt (+1)
@@ -70,6 +70,7 @@ Bug Fixes
 - Bug in ``pd.read_csv()`` in which the ``dtype`` parameter was not being respected for empty data (:issue:`14712`)
+- Bug in ``pd.read_csv()`` in which the ``nrows`` parameter was not being respected for large input when using the C engine for parsing (:issue:`7626`)

pandas/io/tests/parser/c_parser_only.py (+17)
@@ -607,3 +607,20 @@ def test_empty_dtype(self):
         result = self.read_csv(StringIO(data), header=0,
                                dtype={'a': np.int32, 1: np.float64})
         tm.assert_frame_equal(result, expected)
+
+    def test_read_nrows_large(self):
+        # gh-7626 - Read only nrows of data in for large inputs (>262144b)
+        header_narrow = '\t'.join(['COL_HEADER_' + str(i)
+                                   for i in range(10)]) + '\n'
+        data_narrow = '\t'.join(['somedatasomedatasomedata1'
+                                 for i in range(10)]) + '\n'
+        header_wide = '\t'.join(['COL_HEADER_' + str(i)
+                                 for i in range(15)]) + '\n'
+        data_wide = '\t'.join(['somedatasomedatasomedata2'
+                               for i in range(15)]) + '\n'
+        test_input = (header_narrow + data_narrow * 1050 +
+                      header_wide + data_wide * 2)
+
+        df = self.read_csv(StringIO(test_input), sep='\t', nrows=1010)
+
+        self.assertTrue(df.size == 1010 * 10)

pandas/src/parser/tokenizer.c (+3 −5)
@@ -726,16 +726,14 @@ int skip_this_line(parser_t *self, int64_t rownum) {
     }
 }
 
-int tokenize_bytes(parser_t *self, size_t line_limit)
+int tokenize_bytes(parser_t *self, size_t line_limit, int start_lines)
 {
-    int i, slen, start_lines;
+    int i, slen;
     long maxstreamsize;
     char c;
     char *stream;
     char *buf = self->data + self->datapos;
 
-    start_lines = self->lines;
-
     if (make_stream_space(self, self->datalen - self->datapos) < 0) {
         self->error_msg = "out of memory";
         return -1;
@@ -1384,7 +1382,7 @@ int _tokenize_helper(parser_t *self, size_t nrows, int all) {
         TRACE(("_tokenize_helper: Trying to process %d bytes, datalen=%d, datapos= %d\n",
               self->datalen - self->datapos, self->datalen, self->datapos));
 
-        status = tokenize_bytes(self, nrows);
+        status = tokenize_bytes(self, nrows, start_lines);
 
         if (status < 0) {
             // XXX
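To illustrate why passing start_lines in (rather than re-reading self->lines at the top of every chunk) fixes the stop condition, here is a rough, simplified Python model of the chunked loop. The names and structure are hypothetical stand-ins for the C code, not the actual implementation:

# Hypothetical, simplified model of the chunked tokenizer loop; lines plays the
# role of self->lines, line_limit plays the role of nrows.
def tokenize_all(chunks, line_limit, start_lines=0):
    lines = start_lines              # lines already tokenized before this call
    rows = []
    for chunk in chunks:             # each chunk stands in for one 262144-byte read
        # Buggy variant: start_lines = lines   # baseline reset on every chunk,
        #                                      # so rows past nrows keep being read
        for row in chunk:
            if lines - start_lines >= line_limit:   # stop once nrows rows are parsed
                return rows
            rows.append(row)
            lines += 1
    return rows

# Example: two "chunks" of 600 rows each, nrows=1010 -> exactly 1010 rows kept.
chunks = [['row'] * 600, ['row'] * 600]
assert len(tokenize_all(chunks, 1010)) == 1010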
