ENH: read_{csv,table} look for index columns in row after header with C engine #7591

mcwitt · 2014-06-27T18:25:31Z

Closes #6893.

Currently the Python parser can read data with the index columns specified on the first line after the header, e.g.

In [3]: pd.__version__
Out[3]: '0.14.0-271-gf8b101c'

In [4]: text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

In [5]: pd.read_table(StringIO(text), sep='\s+', engine='python')
Out[5]: 
                           A       B       C        D       E
one two three   four                                         
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
    q   20.0000 4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30.0000 3    -0.6662 -0.5243 -0.3580  0.89145  2.5838

but the C parser fails:

In [6]: pd.read_table(StringIO(text), sep='\s+', engine='c')
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
. . .
CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 9

This PR patches the C parser to enable this feature.
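The field counts behind that error message can be checked in plain Python. This is only an illustration, not the C tokenizer itself: `field_counts` and `first_overlong_line` are hypothetical helpers, and the rule that rows shorter than the header pass while wider rows are rejected is an assumption based on the error text above.

```python
# Illustration only: count whitespace-separated fields per line, the way a
# strict tokenizer might compare each row against the header width.
text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""


def field_counts(text):
    """Return the number of whitespace-separated fields on each line."""
    return [len(line.split()) for line in text.splitlines()]


def first_overlong_line(counts):
    """Return (line_number, count) for the first row wider than the header.

    Line numbers are 1-based. Rows with fewer fields than the header are
    allowed through (assumed behaviour). Returns None if no row is too wide.
    """
    expected = counts[0]
    for lineno, n in enumerate(counts, start=1):
        if n > expected:
            return lineno, n
    return None


counts = field_counts(text)
print(counts)                       # [5, 4, 9, 9, 9]
print(first_overlong_line(counts))  # (3, 9)
```

The header has 5 fields, the index-name row has 4, and each data row has 9, which matches "Expected 5 fields in line 3, saw 9".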

mcwitt · 2014-06-27T18:39:41Z

pandas/parser.pyx

+            # temporarily set expected_fields to prevent parser from raising
+            # error if it sees extra columns
+            ex_fields = self.parser.expected_fields
+            self.parser.expected_fields = field_count


This is a bit hacky, but it works. Unless expected_fields is set, the C parser discards lines with extra fields and raises a CParserError. We don't necessarily want to discard the second line after the header just because it has extra fields; instead we want to check whether the number of fields in the header plus the number in the first line after it equals the number in the second. Temporarily setting expected_fields makes the C parser look the other way while we do that check.
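A rough sketch of the save, override, and restore pattern described here, using a toy tokenizer rather than the real one. Everything in it is hypothetical: `ToyTokenizer`, `index_names_after_header`, the `-1` sentinel, and the choice of header width plus first-line width as the temporary limit are assumptions for illustration, not what `pandas/parser.pyx` necessarily does.

```python
class ToyTokenizer:
    """Hypothetical stand-in for the C tokenizer; not pandas API."""

    def __init__(self, lines):
        self.lines = [line.split() for line in lines]
        self.expected_fields = -1  # assumed sentinel: fall back to header width

    def read_row(self, i):
        """Return row i, raising if it has more fields than the current limit."""
        limit = self.expected_fields
        if limit < 0:
            limit = len(self.lines[0])
        row = self.lines[i]
        if len(row) > limit:
            raise ValueError(
                "Expected %d fields in line %d, saw %d" % (limit, i + 1, len(row))
            )
        return row


def index_names_after_header(tok):
    """Return index names found on the line after the header, or []."""
    header = tok.read_row(0)
    first = tok.read_row(1)
    saved = tok.expected_fields
    # Temporarily relax the limit so the wider row is not rejected.
    tok.expected_fields = len(header) + len(first)
    try:
        second = tok.read_row(2)
    finally:
        # Always put the original setting back, even if read_row raised.
        tok.expected_fields = saved
    if len(header) + len(first) == len(second):
        return first
    return []


lines = [
    "A B C D E",
    "one two three four",
    "a b 10.0032 5 -0.5109 -2.3358 -0.4645 0.05076 0.3640",
]
tok = ToyTokenizer(lines)
names = index_names_after_header(tok)
print(names)                # ['one', 'two', 'three', 'four']
print(tok.expected_fields)  # -1
```

Restoring the saved value in a `finally` block keeps the override from leaking into later reads, so ordinary rows with extra fields are still rejected afterwards.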

ENH: read_{csv,table} look for index columns in row after header with C engine

jreback · 2014-06-30T19:26:53Z

@mcwitt thanks!

This reverts commit 2c603e1, reversing changes made to 49a86f1.

* commit 'v0.14.0-345-g8cd3dd6': (73 commits)
  PERF: allow slice indexers to be computed faster
  PERF: allow dst transition computations to be handled much faster if the end-points are ok (GH7633)
  Revert "Merge pull request pandas-dev#7591 from mcwitt/parse-index-cols-c"
  TST: fixes for 2.6 comparisons
  BUG: Error in rolling_var if window is larger than array, fixes pandas-dev#7297
  REGR: Add back #N/A N/A as a default NA value (regresion from 0.12) (GH5521)
  BUG: xlim on plots with shared axes (GH2960, GH3490)
  BUG: Bug in Series.get with a boolean accessor (GH7407)
  DOC: add v0.15.0.txt template
  DOC: small doc build fixes
  DOC: v0.14.1 edits
  BUG: doc example in groupby.rst (GH7559 / GH7628)
  PERF: optimize MultiIndex.from_product for large iterables
  ENH: change BlockManager pickle format to work with dup items
  BUG: {expanding,rolling}_{cov,corr} don't handle arguments with different index sets properly
  CLN/DEPR: Fix instances of 'U'/'rU' in open(...)
  CLN: Fix typo
  TST: fix groupby test on windows (related GH7580)
  COMPAT: make numpy NaT comparison use a view to avoid implicit conversions
  BUG: Bug in to_timedelta that accepted invalid units and misinterpreted m/h (GH7611, GH6423)
  ...

mcwitt reviewed Jun 27, 2014

mcwitt changed the title ~~ENH: look for index columns in row after header~~ ENH: read_{csv,table} look for index columns in row after header with C engine Jun 27, 2014

jreback added Bug labels Jun 28, 2014

jreback added this to the 0.14.1 milestone Jun 28, 2014

ENH: look for index columns in row after header

6934d0d

jreback added a commit that referenced this pull request Jun 30, 2014

Merge pull request #7591 from mcwitt/parse-index-cols-c

2c603e1

ENH: read_{csv,table} look for index columns in row after header with C engine

jreback merged commit 2c603e1 into pandas-dev:master Jun 30, 2014

mcwitt deleted the parse-index-cols-c branch June 30, 2014 21:05

jreback mentioned this pull request Jun 30, 2014

TST: failing windows parser test #7623

Closed

jreback added a commit that referenced this pull request Jul 2, 2014

Revert "Merge pull request #7591 from mcwitt/parse-index-cols-c"

31cac55

This reverts commit 2c603e1, reversing changes made to 49a86f1.

jreback mentioned this pull request Jul 2, 2014

read_table fails with MultiIndex input and delim_whitespace=True #6893

Closed
