BUG: fix reading multi-index data in python parser #7029

Merged
merged 1 commit into from
May 6, 2014
1 change: 1 addition & 0 deletions doc/source/release.rst
@@ -457,6 +457,7 @@ Bug Fixes
- accept ``TextFileReader`` in ``concat``, which was affecting a common user idiom (:issue:`6583`)
- Bug in C parser with leading whitespace (:issue:`3374`)
- Bug in C parser with ``delim_whitespace=True`` and ``\r``-delimited lines
+- Bug in python parser with explicit multi-index in row following column header (:issue:`6893`)
- Bug in ``Series.rank`` and ``DataFrame.rank`` that caused small floats (<1e-13) to all receive the same rank (:issue:`6886`)
- Bug in ``DataFrame.apply`` with functions that used \*args`` or \*\*kwargs and returned
an empty result (:issue:`6952`)
9 changes: 5 additions & 4 deletions pandas/io/parsers.py
@@ -1383,7 +1383,7 @@ def __init__(self, f, **kwds):
# multiple date column thing turning into a real spaghetti factory
if not self._has_complex_date_col:
(index_names,
- self.orig_names, columns_) = self._get_index_name(self.columns)
+ self.orig_names, self.columns) = self._get_index_name(self.columns)
Review comment (contributor, PR author):

Don't ignore columns here, or we will lose the columns added from the row specifying the index...
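
The point of this comment can be illustrated with a minimal sketch. It is not the pandas implementation: `get_index_name` is a hypothetical, simplified stand-in for the "case 0" branch of `_get_index_name`, and the three-row input is taken from the GH 6893 test data. The sketch assumes the helper prepends the index names to a copy of the column list, so a caller that discards the returned list keeps only the header names.

```python
def get_index_name(columns, line, next_line):
    """Hypothetical, simplified stand-in for the parser's "case 0" branch.

    columns   -- names from the header row
    line      -- the row after the header (candidate index names)
    next_line -- the first data row
    Returns (index_names, orig_names, columns).
    """
    columns = list(columns)  # work on a copy, leave the caller's list alone
    if len(next_line) == len(line) + len(columns):
        # Index names sit on their own row: prepend them to the columns.
        for name in reversed(line):
            columns.insert(0, name)
        return line, list(columns), columns
    return None, list(columns), columns


header = ["A", "B", "C"]
index_row = ["a", "b", "c"]
first_data_row = ["1", "3", "7", "0", "3", "6"]

# Pre-fix pattern: the returned column list is bound to a throwaway name,
# so the caller keeps using the header-only list.
_, _, columns_ = get_index_name(header, index_row, first_data_row)
columns_if_ignored = header

# Post-fix pattern: the caller rebinds its column list to the result.
index_names, orig_names, columns_if_used = get_index_name(
    header, index_row, first_data_row
)

print(columns_if_ignored)  # ['A', 'B', 'C']
print(columns_if_used)     # ['a', 'b', 'c', 'A', 'B', 'C']
```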

self._name_processed = True
if self.index_names is None:
self.index_names = index_names
@@ -1811,8 +1811,9 @@ def _get_index_name(self, columns):
columns.insert(0, c)

# Update list of original names to include all indices.
- self.num_original_columns = len(next_line)
- return line, columns, orig_names
+ orig_names = list(columns)
+ self.num_original_columns = len(columns)
+ return line, orig_names, columns

if implicit_first_cols > 0:
# Case 1
@@ -1824,7 +1825,7 @@ def _get_index_name(self, columns):

else:
# Case 2
- (index_name, columns,
+ (index_name, columns_,
Review comment (contributor, PR author):

This duplicates the previous behavior of ignoring the columns returned by _clean_index_names, except that we no longer ignore the columns returned by _get_index_name, which may contain index columns (see "case 0").

self.index_col) = _clean_index_names(columns, self.index_col)

return index_name, orig_names, columns
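
A rough sketch of the "case 2" pattern in this hunk follows. `clean_index_names` is a hypothetical helper, and its behavior (returning the column list with the index names removed) is an assumption made for illustration; the diff does not show what `_clean_index_names` returns. Under that assumption, binding the second return value to a throwaway name leaves the full column list, index columns included, as the value that is returned.

```python
def clean_index_names(columns, index_col):
    """Hypothetical helper: resolve index positions to names and return the
    column list with those names removed (assumed behavior)."""
    columns = list(columns)
    index_names = [columns[i] for i in index_col]
    remaining = [c for c in columns if c not in index_names]
    return index_names, remaining, index_col


columns = ["a", "b", "A", "B"]
index_col = [0, 1]

# Bind the stripped list to a throwaway name so that `columns` still
# holds every column, index columns included.
index_name, columns_, index_col = clean_index_names(columns, index_col)

print(index_name)  # ['a', 'b']
print(columns_)    # ['A', 'B']
print(columns)     # ['a', 'b', 'A', 'B']
```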
11 changes: 9 additions & 2 deletions pandas/io/tests/test_parsers.py
@@ -1569,7 +1569,7 @@ def test_converter_return_string_bug(self):

def test_read_table_buglet_4x_multiindex(self):
# GH 6607
- # Parsing multiindex columns currently causes an error in the C parser.
+ # Parsing multi-level index currently causes an error in the C parser.
# Temporarily copied to TestPythonParser.
# Here test that CParserError is raised:

@@ -2692,7 +2692,7 @@ def test_decompression_regex_sep(self):
def test_read_table_buglet_4x_multiindex(self):
# GH 6607
# This is a copy which should eventually be merged into ParserTests
- # when the issue with multiindex columns is fixed in the C parser.
+ # when the issue with multi-level index is fixed in the C parser.

text = """ A B C D E
one two three four
@@ -2704,6 +2704,13 @@ def test_read_table_buglet_4x_multiindex(self):
df = self.read_table(StringIO(text), sep='\s+')
self.assertEquals(df.index.names, ('one', 'two', 'three', 'four'))

+ # GH 6893
+ data = ' A B C\na b c\n1 3 7 0 3 6\n3 1 4 1 5 9'
+ expected = DataFrame.from_records([(1,3,7,0,3,6), (3,1,4,1,5,9)],
+ columns=list('abcABC'), index=list('abc'))
+ actual = self.read_table(StringIO(data), sep='\s+')
+ tm.assert_frame_equal(actual, expected)

class TestFwfColspaceSniffing(tm.TestCase):
def test_full_file(self):
# File with all values
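
For readers who want to try the GH 6893 case outside the test suite, here is a standalone sketch. It assumes a recent pandas release where `pd.read_table` accepts `engine="python"`; the test itself calls `self.read_table` on the Python-parser test class, so the explicit `engine` argument is an addition made for this example. The data string is the one from the test.

```python
from io import StringIO

import pandas as pd

# The header row names the data columns; the next row names the index levels.
data = " A B C\na b c\n1 3 7 0 3 6\n3 1 4 1 5 9"

# A regex separator with the Python engine mirrors the test above.
df = pd.read_table(StringIO(data), sep=r"\s+", engine="python")

print(list(df.index.names))  # ['a', 'b', 'c']
print(list(df.columns))      # ['A', 'B', 'C']
print(df.shape)              # (2, 3)
```

With the fix in place, the first three fields of each data row become the levels of the row index, and the remaining three fill columns `A`, `B`, and `C`.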