ENH: read_{csv,table} look for index columns in row after header with C engine #7591

Merged: jreback merged 1 commit into pandas-dev:master from mcwitt:parse-index-cols-c on Jun 30, 2014

Conversation


@mcwitt mcwitt commented Jun 27, 2014

Closes #6893.

Currently the Python parser can read data with the index columns specified on the first line after the header, e.g.

In [3]: pd.__version__
Out[3]: '0.14.0-271-gf8b101c'

In [4]: text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

In [5]: pd.read_table(StringIO(text), sep='\s+', engine='python')
Out[5]: 
                           A       B       C        D       E
one two three   four                                         
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
    q   20.0000 4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30.0000 3    -0.6662 -0.5243 -0.3580  0.89145  2.5838

but the C parser fails:

In [6]: pd.read_table(StringIO(text), sep='\s+', engine='c')
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
. . .
CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 9

This PR patches the C parser to enable this feature.
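The field counts in the example explain both the error message and why the row after the header can be read as index names: the header has 5 fields, the next row has 4, and the first data row has 9. A minimal sketch of that check in plain Python follows; it is not the pandas implementation, and the helper names `split_fields` and `index_names_after_header` are hypothetical, introduced only for illustration.

```python
def split_fields(line):
    # Split on runs of whitespace, ignoring leading/trailing whitespace.
    # This only approximates what a whitespace separator implies.
    return line.split()


def index_names_after_header(lines):
    """Return the index names if the row after the header looks like it
    holds them, otherwise None.

    Assumption (hypothetical heuristic): the row after the header names
    the index columns when header fields + that row's fields equals the
    number of fields in the first data row.
    """
    header, candidate, first_data = (split_fields(l) for l in lines[:3])
    if len(header) + len(candidate) == len(first_data):
        return candidate
    return None


text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

lines = text.splitlines()
field_counts = [len(split_fields(l)) for l in lines[:3]]
index_names = index_names_after_header(lines)
```

Here `field_counts` comes out as 5, 4 and 9, which matches the "Expected 5 fields in line 3, saw 9" error from the C parser.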

# temporarily set expected_fields to prevent parser from raising
# error if it sees extra columns
ex_fields = self.parser.expected_fields
self.parser.expected_fields = field_count

This is a bit hacky, but it works. Unless expected_fields is set, the C parser discards lines with extra fields and raises a CParserError. We don't necessarily want to discard the 2nd data line if it has extra fields; instead we want to check whether the number of fields in the header plus the number in the 1st line equals the number in the 2nd. Setting expected_fields makes the C parser look the other way.
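One way to keep a temporary override like this from leaking is to restore the saved value in a `finally` block. The sketch below shows that pattern on a stand-in object; `temporarily_set` and `FakeParser` are hypothetical names, the placeholder value `-1` is an assumption, and none of this is the code from the PR.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_set(obj, name, value):
    # Save the current attribute value, override it, and always restore it,
    # even if the body of the with-block raises.
    saved = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        setattr(obj, name, saved)


class FakeParser:
    # Stand-in for the real parser object; -1 is an arbitrary placeholder.
    expected_fields = -1


parser = FakeParser()
with temporarily_set(parser, "expected_fields", 9):
    inside = parser.expected_fields  # overridden value while peeking at rows
after = parser.expected_fields       # original value is back
```

The context manager is only a convenience; a plain save, `try`, `finally`, restore sequence achieves the same thing.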

@mcwitt mcwitt changed the title ENH: look for index columns in row after header ENH: read_{csv,table} look for index columns in row after header with C engine Jun 27, 2014
@jreback jreback added this to the 0.14.1 milestone Jun 28, 2014
jreback added a commit that referenced this pull request Jun 30, 2014
ENH: read_{csv,table} look for index columns in row after header with C engine
@jreback jreback merged commit 2c603e1 into pandas-dev:master Jun 30, 2014

jreback commented Jun 30, 2014

@mcwitt thanks!

@mcwitt mcwitt deleted the parse-index-cols-c branch June 30, 2014 21:05
jreback added a commit that referenced this pull request Jul 2, 2014
This reverts commit 2c603e1, reversing
changes made to 49a86f1.
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jul 15, 2014
* commit 'v0.14.0-345-g8cd3dd6': (73 commits)
  PERF: allow slice indexers to be computed faster
  PERF: allow dst transition computations to be handled much faster if the end-points are ok (GH7633)
  Revert "Merge pull request pandas-dev#7591 from mcwitt/parse-index-cols-c"
  TST: fixes for 2.6 comparisons
  BUG: Error in rolling_var if window is larger than array, fixes pandas-dev#7297
  REGR: Add back #N/A N/A as a default NA value (regresion from 0.12) (GH5521)
  BUG: xlim on plots with shared axes (GH2960, GH3490)
  BUG: Bug in Series.get with a boolean accessor (GH7407)
  DOC: add v0.15.0.txt template
  DOC: small doc build fixes
  DOC: v0.14.1 edits
  BUG: doc example in groupby.rst (GH7559 / GH7628)
  PERF: optimize MultiIndex.from_product for large iterables
  ENH: change BlockManager pickle format to work with dup items
  BUG: {expanding,rolling}_{cov,corr} don't handle arguments with different index sets properly
  CLN/DEPR: Fix instances of 'U'/'rU' in open(...)
  CLN: Fix typo
  TST: fix groupby test on windows (related GH7580)
  COMPAT: make numpy NaT comparison use a view to avoid implicit conversions
  BUG: Bug in to_timedelta that accepted invalid units and misinterpreted m/h (GH7611, GH6423)
  ...

Successfully merging this pull request may close these issues.

read_table fails with MultiIndex input and delim_whitespace=True