
Commit e067b61

Merge pull request #5298 from jreback/mi_csv

BUG: parser can handle a common-format multi-column index (no row index cols) (GH4702)

2 parents: 1d411b9 + 3edc336

6 files changed: +169 −31 lines

doc/source/io.rst  (+24 −7)

@@ -890,6 +890,22 @@ of tupleizing columns, specify ``tupleize_cols=True``.
 
     print(open('mi.csv').read())
     pd.read_csv('mi.csv', header=[0,1,2,3], index_col=[0,1])
 
+Starting in 0.13.0, ``read_csv`` will be able to interpret a more common format
+of multi-column indices.
+
+.. ipython:: python
+   :suppress:
+
+   data = ",a,a,a,b,c,c\n,q,r,s,t,u,v\none,1,2,3,4,5,6\ntwo,7,8,9,10,11,12"
+   fh = open('mi2.csv', 'w')
+   fh.write(data)
+   fh.close()
+
+.. ipython:: python
+
+   print(open('mi2.csv').read())
+   pd.read_csv('mi2.csv', header=[0,1], index_col=0)
+
 Note: If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it
 with ``df.to_csv(..., index=False)``), then any ``names`` on the columns index will be *lost*.

@@ -898,6 +914,7 @@ with ``df.to_csv(..., index=False)``), then any ``names`` on the columns index wi
 
    import os
    os.remove('mi.csv')
+   os.remove('mi2.csv')
 
 .. _io.sniff:

(The remaining hunks in this file only strip trailing whitespace; each ``-``/``+`` pair below differs only by a trailing space.)

@@ -1069,7 +1086,7 @@ Note ``NaN``'s, ``NaT``'s and ``None`` will be converted to ``null`` and ``datet
 Orient Options
 ++++++++++++++
 
-There are a number of different options for the format of the resulting JSON 
+There are a number of different options for the format of the resulting JSON
 file / string. Consider the following DataFrame and Series:
 
 .. ipython:: python

@@ -1080,7 +1097,7 @@ file / string. Consider the following DataFrame and Series:
     sjo = Series(dict(x=15, y=16, z=17), name='D')
     sjo
 
-**Column oriented** (the default for ``DataFrame``) serialises the data as 
+**Column oriented** (the default for ``DataFrame``) serialises the data as
 nested JSON objects with column labels acting as the primary index:
 
 .. ipython:: python

@@ -1113,7 +1130,7 @@ values only, column and index labels are not included:
     dfjo.to_json(orient="values")
     # Not available for Series
 
-**Split oriented** serialises to a JSON object containing separate entries for 
+**Split oriented** serialises to a JSON object containing separate entries for
 values, index and columns. Name is also included for ``Series``:
 
 .. ipython:: python

@@ -1123,7 +1140,7 @@ values, index and columns. Name is also included for ``Series``:
 
 .. note::
 
-  Any orient option that encodes to a JSON object will not preserve the ordering of 
+  Any orient option that encodes to a JSON object will not preserve the ordering of
   index and column labels during round-trip serialisation. If you wish to preserve
   label ordering use the `split` option as it uses ordered containers.

@@ -1351,7 +1368,7 @@ The Numpy Parameter
 
 If ``numpy=True`` is passed to ``read_json`` an attempt will be made to sniff
 an appropriate dtype during deserialisation and to subsequently decode directly
-to numpy arrays, bypassing the need for intermediate Python objects. 
+to numpy arrays, bypassing the need for intermediate Python objects.
 
 This can provide speedups if you are deserialising a large amount of numeric
 data:

@@ -1375,7 +1392,7 @@ data:
 The speedup is less noticeable for smaller datasets:
 
 .. ipython:: python
-
+
 
     jsonfloats = dffloats.head(100).to_json()
 
 .. ipython:: python

@@ -1399,7 +1416,7 @@ The speedup is less noticeable for smaller datasets:
 
 - labels are ordered. Labels are only read from the first container, it is assumed
   that each subsequent row / column has been encoded in the same order. This should be satisfied if the
-  data was encoded using ``to_json`` but may not be the case if the JSON 
+  data was encoded using ``to_json`` but may not be the case if the JSON
   is from another source.
 
 .. ipython:: python
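The new behaviour documented above is easy to try outside the docs build. A minimal sketch, assuming a pandas new enough to include this fix (the variable names are illustrative):

```python
from io import StringIO

import pandas as pd

# The "common" multi-column-index format: two header rows and no separate
# third row naming the index levels.
data = (",a,a,a,b,c,c\n"
        ",q,r,s,t,u,v\n"
        "one,1,2,3,4,5,6\n"
        "two,7,8,9,10,11,12")

df = pd.read_csv(StringIO(data), header=[0, 1], index_col=0)

print(df.columns.nlevels)         # 2
print(df.shape)                   # (2, 6)
print(df.loc['two', ('c', 'v')])  # 12
```

Note that the first data row (`one,1,...`) is no longer consumed as a row of index names, which was the bug tracked in GH4702.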

doc/source/release.rst  (+2 −0)

@@ -634,6 +634,8 @@ Bug Fixes
   - Fixed seg fault in C parser caused by passing more names than columns in
     the file. (:issue:`5156`)
   - Fix ``Series.isin`` with date/time-like dtypes (:issue:`5021`)
+  - C and Python Parser can now handle the more common multi-index column format
+    which doesn't have a row for index names (:issue:`4702`)
 
 pandas 0.12.0
 -------------

pandas/io/parsers.py  (+18 −3)

@@ -569,7 +569,6 @@ def _clean_options(self, options, engine):
         skiprows = set() if skiprows is None else set(skiprows)
 
         # put stuff back
-        result['index_col'] = index_col
         result['names'] = names
         result['converters'] = converters
         result['na_values'] = na_values

@@ -641,7 +640,7 @@ def __init__(self, kwds):
         self.orig_names = None
         self.prefix = kwds.pop('prefix', None)
 
-        self.index_col = kwds.pop('index_col', None)
+        self.index_col = kwds.get('index_col', None)
         self.index_names = None
         self.col_names = None

@@ -1455,6 +1454,7 @@ def _convert_data(self, data):
     def _infer_columns(self):
         names = self.names
         num_original_columns = 0
+        clear_buffer = True
         if self.header is not None:
             header = self.header

@@ -1473,13 +1473,15 @@ def _infer_columns(self):
             while self.pos <= hr:
                 line = self._next_line()
 
+                unnamed_count = 0
                 this_columns = []
                 for i, c in enumerate(line):
                     if c == '':
                         if have_mi_columns:
                             this_columns.append('Unnamed: %d_level_%d' % (i, level))
                         else:
                             this_columns.append('Unnamed: %d' % i)
+                        unnamed_count += 1
                     else:
                         this_columns.append(c)

@@ -1490,12 +1492,25 @@ def _infer_columns(self):
                         if cur_count > 0:
                             this_columns[i] = '%s.%d' % (col, cur_count)
                         counts[col] = cur_count + 1
+                elif have_mi_columns:
+
+                    # if we have grabbed an extra line, but it's not in our
+                    # format, save it in the buffer and create a blank extra
+                    # line for the rest of the parsing code
+                    if hr == header[-1]:
+                        lc = len(this_columns)
+                        ic = len(self.index_col) if self.index_col is not None else 0
+                        if lc != unnamed_count and lc - ic > unnamed_count:
+                            clear_buffer = False
+                            this_columns = [None] * lc
+                            self.buf = [self.buf[-1]]
 
                 columns.append(this_columns)
                 if len(columns) == 1:
                     num_original_columns = len(this_columns)
 
-        self._clear_buffer()
+        if clear_buffer:
+            self._clear_buffer()
 
         if names is not None:
             if (self.usecols is not None and len(names) != len(self.usecols)) \
pandas/io/tests/test_parsers.py  (+90 −6)

@@ -1215,29 +1215,113 @@ def test_header_multi_index(self):
 R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2
 """
 
-        df = read_csv(StringIO(data), header=[0, 2, 3, 4], index_col=[0, 1], tupleize_cols=False)
+        df = self.read_csv(StringIO(data), header=[0, 2, 3, 4], index_col=[0, 1], tupleize_cols=False)
         tm.assert_frame_equal(df, expected)
 
         # skipping lines in the header
-        df = read_csv(StringIO(data), header=[0, 2, 3, 4], index_col=[0, 1], tupleize_cols=False)
+        df = self.read_csv(StringIO(data), header=[0, 2, 3, 4], index_col=[0, 1], tupleize_cols=False)
         tm.assert_frame_equal(df, expected)
 
         #### invalid options ####
 
         # no as_recarray
-        self.assertRaises(ValueError, read_csv, StringIO(data), header=[0,1,2,3],
+        self.assertRaises(ValueError, self.read_csv, StringIO(data), header=[0,1,2,3],
                           index_col=[0,1], as_recarray=True, tupleize_cols=False)
 
         # names
-        self.assertRaises(ValueError, read_csv, StringIO(data), header=[0,1,2,3],
+        self.assertRaises(ValueError, self.read_csv, StringIO(data), header=[0,1,2,3],
                           index_col=[0,1], names=['foo','bar'], tupleize_cols=False)
         # usecols
-        self.assertRaises(ValueError, read_csv, StringIO(data), header=[0,1,2,3],
+        self.assertRaises(ValueError, self.read_csv, StringIO(data), header=[0,1,2,3],
                           index_col=[0,1], usecols=['foo','bar'], tupleize_cols=False)
         # non-numeric index_col
-        self.assertRaises(ValueError, read_csv, StringIO(data), header=[0,1,2,3],
+        self.assertRaises(ValueError, self.read_csv, StringIO(data), header=[0,1,2,3],
                           index_col=['foo','bar'], tupleize_cols=False)
 
+    def test_header_multiindex_common_format(self):
+
+        df = DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]],
+                       index=['one','two'],
+                       columns=MultiIndex.from_tuples([('a','q'),('a','r'),('a','s'),
+                                                       ('b','t'),('c','u'),('c','v')]))
+
+        # to_csv
+        data = """,a,a,a,b,c,c
+,q,r,s,t,u,v
+,,,,,,
+one,1,2,3,4,5,6
+two,7,8,9,10,11,12"""
+
+        result = self.read_csv(StringIO(data), header=[0,1], index_col=0)
+        tm.assert_frame_equal(df, result)
+
+        # common
+        data = """,a,a,a,b,c,c
+,q,r,s,t,u,v
+one,1,2,3,4,5,6
+two,7,8,9,10,11,12"""
+
+        result = self.read_csv(StringIO(data), header=[0,1], index_col=0)
+        tm.assert_frame_equal(df, result)
+
+        # common, no index_col
+        data = """a,a,a,b,c,c
+q,r,s,t,u,v
+1,2,3,4,5,6
+7,8,9,10,11,12"""
+
+        result = self.read_csv(StringIO(data), header=[0,1], index_col=None)
+        tm.assert_frame_equal(df.reset_index(drop=True), result)
+
+        # malformed case 1
+        expected = DataFrame(np.array([[2, 3, 4, 5, 6],
+                                       [8, 9, 10, 11, 12]]),
+                             index=Index([1, 7]),
+                             columns=MultiIndex(levels=[[u('a'), u('b'), u('c')],
+                                                        [u('r'), u('s'), u('t'), u('u'), u('v')]],
+                                                labels=[[0, 0, 1, 2, 2], [0, 1, 2, 3, 4]],
+                                                names=[u('a'), u('q')]))
+
+        data = """a,a,a,b,c,c
+q,r,s,t,u,v
+1,2,3,4,5,6
+7,8,9,10,11,12"""
+
+        result = self.read_csv(StringIO(data), header=[0,1], index_col=0)
+        tm.assert_frame_equal(expected, result)
+
+        # malformed case 2
+        expected = DataFrame(np.array([[2, 3, 4, 5, 6],
+                                       [8, 9, 10, 11, 12]]),
+                             index=Index([1, 7]),
+                             columns=MultiIndex(levels=[[u('a'), u('b'), u('c')],
+                                                        [u('r'), u('s'), u('t'), u('u'), u('v')]],
+                                                labels=[[0, 0, 1, 2, 2], [0, 1, 2, 3, 4]],
+                                                names=[None, u('q')]))
+
+        data = """,a,a,b,c,c
+q,r,s,t,u,v
+1,2,3,4,5,6
+7,8,9,10,11,12"""
+
+        result = self.read_csv(StringIO(data), header=[0,1], index_col=0)
+        tm.assert_frame_equal(expected, result)
+
+        # mi on columns and index (malformed)
+        expected = DataFrame(np.array([[3, 4, 5, 6],
+                                       [9, 10, 11, 12]]),
+                             index=MultiIndex(levels=[[1, 7], [2, 8]],
+                                              labels=[[0, 1], [0, 1]]),
+                             columns=MultiIndex(levels=[[u('a'), u('b'), u('c')],
+                                                        [u('s'), u('t'), u('u'), u('v')]],
+                                                labels=[[0, 1, 2, 2], [0, 1, 2, 3]],
+                                                names=[None, u('q')]))
+
+        data = """,a,a,b,c,c
+q,r,s,t,u,v
+1,2,3,4,5,6
+7,8,9,10,11,12"""
+
+        result = self.read_csv(StringIO(data), header=[0,1], index_col=[0, 1])
+        tm.assert_frame_equal(expected, result)
+
     def test_pass_names_with_index(self):
         lines = self.data1.split('\n')
         no_header = '\n'.join(lines[1:])
pandas/parser.pyx  (+20 −1)

@@ -250,6 +250,7 @@ cdef class TextReader:
         object memory_map
         object as_recarray
         object header, orig_header, names, header_start, header_end
+        object index_col
         object low_memory
         object skiprows
         object compact_ints, use_unsigned

@@ -266,6 +267,7 @@ cdef class TextReader:
                  header=0,
                  header_start=0,
                  header_end=0,
+                 index_col=None,
                  names=None,
 
                  memory_map=False,

@@ -439,6 +441,8 @@ cdef class TextReader:
         # XXX
         self.noconvert = set()
 
+        self.index_col = index_col
+
         #----------------------------------------
         # header stuff

@@ -574,7 +578,7 @@ cdef class TextReader:
         # header is now a list of lists, so field_count should use header[0]
 
         cdef:
-            size_t i, start, data_line, field_count, passed_count, hr
+            size_t i, start, data_line, field_count, passed_count, hr, unnamed_count
             char *word
             object name
             int status

@@ -606,6 +610,7 @@ cdef class TextReader:
 
             # TODO: Py3 vs. Py2
             counts = {}
+            unnamed_count = 0
            for i in range(field_count):
                word = self.parser.words[start + i]

@@ -623,6 +628,7 @@ cdef class TextReader:
                    name = 'Unnamed: %d_level_%d' % (i, level)
                else:
                    name = 'Unnamed: %d' % i
+                unnamed_count += 1
 
                count = counts.get(name, 0)
                if count > 0 and self.mangle_dupe_cols and not self.has_mi_columns:

@@ -631,6 +637,19 @@ cdef class TextReader:
                this_header.append(name)
                counts[name] = count + 1
 
+            if self.has_mi_columns:
+
+                # if we have grabbed an extra line, but it's not in our format,
+                # back up one row and emit a blank extra header line for the
+                # rest of the parsing code
+                if hr == self.header[-1]:
+                    lc = len(this_header)
+                    ic = len(self.index_col) if self.index_col is not None else 0
+                    if lc != unnamed_count and lc - ic > unnamed_count:
+                        hr -= 1
+                        self.parser_start -= 1
+                        this_header = [None] * lc
+
            data_line = hr + 1
            header.append(this_header)