Commit 860b05d

Merge pull request #3575 from jreback/mi_csv
ENH: allow to_csv to write multi-index columns, read_csv to read with header=list arg
2 parents: 8eaf19a + faf4d53

12 files changed: +609 −151 lines changed

RELEASE.rst (+15)
@@ -34,6 +34,15 @@ pandas 0.11.1
     courtesy of @cpcloud. (GH3477_)
   - Support for reading Amazon S3 files. (GH3504_)
   - Added module for reading and writing Stata files: pandas.io.stata (GH1512_)
+  - Added support for writing multi-index columns with ``to_csv`` and reading
+    them back with ``read_csv``. The ``header`` option in ``read_csv`` now accepts a
+    list of the rows from which to read the column index. Added the option
+    ``tupleize_cols`` to provide compatibility with the pre-0.11.1 behavior of
+    writing and reading multi-index columns via a list of tuples. The default in
+    0.11.1 is to write lists of tuples and *not* interpret a list of tuples as a
+    multi-index column.
+    Note: The default value will change in 0.12 to make the default *to* write and
+    read multi-index columns in the new format. (GH3571_, GH1651_, GH3141_)
 
 **Improvements to existing features**
 
@@ -180,13 +189,19 @@ pandas 0.11.1
 .. _GH3596: https://github.com/pydata/pandas/issues/3596
 .. _GH3617: https://github.com/pydata/pandas/issues/3617
 .. _GH3435: https://github.com/pydata/pandas/issues/3435
+<<<<<<< HEAD
 .. _GH3611: https://github.com/pydata/pandas/issues/3611
 .. _GH3062: https://github.com/pydata/pandas/issues/3062
 .. _GH3624: https://github.com/pydata/pandas/issues/3624
 .. _GH3626: https://github.com/pydata/pandas/issues/3626
 .. _GH3601: https://github.com/pydata/pandas/issues/3601
 .. _GH3631: https://github.com/pydata/pandas/issues/3631
 .. _GH1512: https://github.com/pydata/pandas/issues/1512
+=======
+.. _GH3571: https://github.com/pydata/pandas/issues/3571
+.. _GH1651: https://github.com/pydata/pandas/issues/1651
+.. _GH3141: https://github.com/pydata/pandas/issues/3141
+>>>>>>> DOC: updated releasenotes, v0.11.1 whatsnew, io.rst
 
 
 pandas 0.11.0

doc/source/io.rst (+43 −1)
@@ -57,7 +57,10 @@ They can take a number of arguments:
   specified, data types will be inferred.
 - ``header``: row number to use as the column names, and the start of the
   data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
-  pass ``header=0`` to be able to replace existing names.
+  pass ``header=0`` to be able to replace existing names. The header can be
+  a list of integers that specify row locations for a multi-index on the columns,
+  e.g. [0,1,3]. Intervening rows that are not specified will be skipped
+  (e.g. 2 in this example is skipped).
 - ``skiprows``: A collection of numbers for rows in the file to skip. Can
   also be an integer to skip the first ``n`` rows
 - ``index_col``: column number, column name, or list of column numbers/names,
@@ -112,6 +115,10 @@ They can take a number of arguments:
 - ``error_bad_lines``: if False then any lines causing an error will be skipped :ref:`bad lines <io.bad_lines>`
 - ``usecols``: a subset of columns to return, results in much faster parsing
   time and lower memory usage.
+- ``mangle_dupe_cols``: boolean, default True, then duplicate columns will be specified
+  as 'X.0'...'X.N', rather than 'X'...'X'
+- ``tupleize_cols``: boolean, default True, if False, convert a list of tuples
+  to a multi-index of columns, otherwise, leave the column index as a list of tuples
 
 .. ipython:: python
    :suppress:
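
For illustration, a minimal sketch of the list-valued ``header`` behaviour described above (the inline data and the Python 2 ``StringIO`` import are assumptions, mirroring the 0.11.1-era examples in this document):

    from StringIO import StringIO  # Python 2, as in the surrounding doc examples
    import pandas as pd

    data = """A,B,C
    one,two,three
    this row is not listed in header and is skipped,-,-
    x,y,z
    1,2,3
    4,5,6"""

    # rows 0, 1 and 3 become the three column levels; row 2 is skipped
    df = pd.read_csv(StringIO(data), header=[0, 1, 3], tupleize_cols=False)
    df.columns  # MultiIndex: [(A, one, x), (B, two, y), (C, three, z)]
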
@@ -762,6 +769,36 @@ column numbers to turn multiple columns into a ``MultiIndex``:
    df
    df.ix[1978]
 
+.. _io.multi_index_columns:
+
+Specifying multi-index columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By specifying a list of row locations for the ``header`` argument, you
+can read in a multi-index for the columns. Specifying non-consecutive
+rows will skip the intervening rows.
+
+.. ipython:: python
+
+   from pandas.util.testing import makeCustomDataframe as mkdf
+   df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)
+   df.to_csv('mi.csv',tupleize_cols=False)
+   print open('mi.csv').read()
+   pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)
+
+Note: The default behavior in 0.11.1 remains unchanged (``tupleize_cols=True``),
+but starting with 0.12, the default *to* write and read multi-index columns will be in the new
+format (``tupleize_cols=False``).
+
+Note: If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it
+with ``df.to_csv(..., index=False)``), then any ``names`` on the columns index will be *lost*.
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove('mi.csv')
+
 .. _io.sniff:
 
 Automatically "sniffing" the delimiter
@@ -845,6 +882,8 @@ function takes a number of arguments. Only the first is required.
 - ``sep`` : Field delimiter for the output file (default ",")
 - ``encoding``: a string representing the encoding to use if the contents are
   non-ascii, for python versions prior to 3
+- ``tupleize_cols``: boolean, default True, if True, write multi-index columns as a list
+  of tuples, otherwise write in an expanded line format suitable for ``read_csv``
 
 Writing a formatted string
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
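
To make the ``to_csv`` option concrete, a small sketch of the two output styles (the file names and the exact header rendering shown in comments are illustrative assumptions):

    import pandas as pd

    df = pd.DataFrame([[1, 2], [3, 4]],
                      columns=pd.MultiIndex.from_tuples([('a', 'one'), ('a', 'two')]))

    # 0.11.1 default: the column tuples are stringified onto a single header line,
    # roughly:  ,"('a', 'one')","('a', 'two')"
    df.to_csv('tuple_style.csv')

    # expanded format: one header row per column level, readable by read_csv
    # with header=[0, 1]; roughly:  ,a,a  then  ,one,two  then  ,,
    df.to_csv('expanded_style.csv', tupleize_cols=False)
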
@@ -876,6 +915,9 @@ The Series object also has a ``to_string`` method, but with only the ``buf``,
 which, if set to ``True``, will additionally output the length of the Series.
 
 
+HTML
+----
+
 Reading HTML format
 ~~~~~~~~~~~~~~~~~~~~~~
 
doc/source/v0.11.1.txt (+37)
@@ -73,13 +73,47 @@ Enhancements
     an index with a different frequency than the existing, or attempting
     to append an index with a different name than the existing
   - support datelike columns with a timezone as data_columns (GH2852_)
+
   - ``fillna`` methods now raise a ``TypeError`` if the ``value`` parameter is
     a list or tuple.
   - Added module for reading and writing Stata files: pandas.io.stata (GH1512_)
   - ``DataFrame.replace()`` now allows regular expressions on contained
     ``Series`` with object dtype. See the examples section in the regular docs
     :ref:`Replacing via String Expression <missing_data.replace_expression>`
 
+  - Multi-index column support for reading and writing csvs
+
+    - The ``header`` option in ``read_csv`` now accepts a
+      list of the rows from which to read the column index.
+
+    - The option ``tupleize_cols`` can now be specified in both ``to_csv`` and
+      ``read_csv``, to provide compatibility with the pre-0.11.1 behavior of
+      writing and reading multi-index columns via a list of tuples. The default in
+      0.11.1 is to write lists of tuples and *not* interpret a list of tuples as a
+      multi-index column.
+
+      Note: The default behavior in 0.11.1 remains unchanged, but starting with 0.12,
+      the default *to* write and read multi-index columns will be in the new
+      format. (GH3571_, GH1651_, GH3141_)
+
+    - If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it
+      with ``df.to_csv(..., index=False)``), then any ``names`` on the columns index will
+      be *lost*.
+
+    .. ipython:: python
+
+       from pandas.util.testing import makeCustomDataframe as mkdf
+       df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)
+       df.to_csv('mi.csv',tupleize_cols=False)
+       print open('mi.csv').read()
+       pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)
+
+    .. ipython:: python
+       :suppress:
+
+       import os
+       os.remove('mi.csv')
+
 See the `full release notes
 <https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
 on GitHub for a complete list.
@@ -96,3 +130,6 @@ on GitHub for a complete list.
 .. _GH1512: https://github.com/pydata/pandas/issues/1512
 .. _GH2285: https://github.com/pydata/pandas/issues/2285
 .. _GH3631: https://github.com/pydata/pandas/issues/3631
+.. _GH3571: https://github.com/pydata/pandas/issues/3571
+.. _GH1651: https://github.com/pydata/pandas/issues/1651
+.. _GH3141: https://github.com/pydata/pandas/issues/3141

pandas/core/format.py (+81 −37)
@@ -772,9 +772,10 @@ def grouper(x):
 class CSVFormatter(object):
 
     def __init__(self, obj, path_or_buf, sep=",", na_rep='', float_format=None,
-                 cols=None, header=True, index=True, index_label=None,
-                 mode='w', nanRep=None, encoding=None, quoting=None,
-                 line_terminator='\n', chunksize=None, engine=None):
+                 cols=None, header=True, index=True, index_label=None,
+                 mode='w', nanRep=None, encoding=None, quoting=None,
+                 line_terminator='\n', chunksize=None, engine=None,
+                 tupleize_cols=True):
 
         self.engine = engine  # remove for 0.12
 
@@ -803,6 +804,15 @@ def __init__(self, obj, path_or_buf, sep=",", na_rep='', float_format=None,
             msg= "columns.is_unique == False not supported with engine='python'"
             raise NotImplementedError(msg)
 
+        self.tupleize_cols = tupleize_cols
+        self.has_mi_columns = isinstance(obj.columns, MultiIndex
+                                         ) and not self.tupleize_cols
+
+        # validate mi options
+        if self.has_mi_columns:
+            if cols is not None:
+                raise Exception("cannot specify cols with a multi_index on the columns")
+
         if cols is not None:
             if isinstance(cols,Index):
                 cols = cols.to_native_types(na_rep=na_rep,float_format=float_format)
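
A sketch of the guard added above (the frame and file name are made-up examples): asking for a column subset while writing the expanded multi-index format is rejected.

    import pandas as pd

    df = pd.DataFrame([[1, 2]],
                      columns=pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y')]))

    # has_mi_columns is True here (MultiIndex columns and tupleize_cols=False),
    # so passing cols trips the Exception raised in __init__ above
    df.to_csv('out.csv', cols=[('a', 'x')], tupleize_cols=False)
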
@@ -958,48 +968,82 @@ def _save_header(self):
         obj = self.obj
         index_label = self.index_label
         cols = self.cols
+        has_mi_columns = self.has_mi_columns
         header = self.header
+        encoded_labels = []
 
         has_aliases = isinstance(header, (tuple, list, np.ndarray))
-        if has_aliases or self.header:
-            if self.index:
-                # should write something for index label
-                if index_label is not False:
-                    if index_label is None:
-                        if isinstance(obj.index, MultiIndex):
-                            index_label = []
-                            for i, name in enumerate(obj.index.names):
-                                if name is None:
-                                    name = ''
-                                index_label.append(name)
-                        else:
-                            index_label = obj.index.name
-                            if index_label is None:
-                                index_label = ['']
-                            else:
-                                index_label = [index_label]
-                    elif not isinstance(index_label, (list, tuple, np.ndarray)):
-                        # given a string for a DF with Index
-                        index_label = [index_label]
-
-                    encoded_labels = list(index_label)
-                else:
-                    encoded_labels = []
-
-            if has_aliases:
-                if len(header) != len(cols):
-                    raise ValueError(('Writing %d cols but got %d aliases'
-                                      % (len(cols), len(header))))
-                else:
-                    write_cols = header
-            else:
-                write_cols = cols
-            encoded_cols = list(write_cols)
-
-            writer.writerow(encoded_labels + encoded_cols)
-        else:
-            encoded_cols = list(cols)
-            writer.writerow(encoded_cols)
+        if not (has_aliases or self.header):
+            return
+
+        if self.index:
+            # should write something for index label
+            if index_label is not False:
+                if index_label is None:
+                    if isinstance(obj.index, MultiIndex):
+                        index_label = []
+                        for i, name in enumerate(obj.index.names):
+                            if name is None:
+                                name = ''
+                            index_label.append(name)
+                    else:
+                        index_label = obj.index.name
+                        if index_label is None:
+                            index_label = ['']
+                        else:
+                            index_label = [index_label]
+                elif not isinstance(index_label, (list, tuple, np.ndarray)):
+                    # given a string for a DF with Index
+                    index_label = [index_label]
+
+                encoded_labels = list(index_label)
+            else:
+                encoded_labels = []
+
+        if has_aliases:
+            if len(header) != len(cols):
+                raise ValueError(('Writing %d cols but got %d aliases'
+                                  % (len(cols), len(header))))
+            else:
+                write_cols = header
+        else:
+            write_cols = cols
+
+        if not has_mi_columns:
+            encoded_labels += list(write_cols)
+
+        else:
+
+            if not has_mi_columns:
+                encoded_labels += list(cols)
+
+        # write out the mi
+        if has_mi_columns:
+            columns = obj.columns
+
+            # write out the names for each level, then ALL of the values for each level
+            for i in range(columns.nlevels):
+
+                # we need at least 1 index column to write our col names
+                col_line = []
+                if self.index:
+
+                    # name is the first column
+                    col_line.append( columns.names[i] )
+
+                    if isinstance(index_label,list) and len(index_label)>1:
+                        col_line.extend([ '' ] * (len(index_label)-1))
+
+                col_line.extend(columns.get_level_values(i))
+
+                writer.writerow(col_line)
+
+            # add blanks for the columns, so that we
+            # have consistent seps
+            encoded_labels.extend([ '' ] * len(columns))
+
+        # write out the index label line
+        writer.writerow(encoded_labels)
 
     def _save(self):
 
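A sketch of the row layout ``_save_header`` produces for the expanded format (the level and index names are made up; the commented output assumes the default separator):

    import pandas as pd

    columns = pd.MultiIndex.from_tuples([('a', 'one'), ('a', 'two'), ('b', 'one')],
                                        names=['lvl0', 'lvl1'])
    index = pd.MultiIndex.from_tuples([('x', 1), ('x', 2)], names=['idx0', 'idx1'])
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=index, columns=columns)

    df.to_csv('mi_sketch.csv', tupleize_cols=False)

    # one row per column level: the level name, blanks for the extra index columns,
    # then the level values; finally the index-label row padded with blanks
    #
    #   lvl0,,a,a,b
    #   lvl1,,one,two,one
    #   idx0,idx1,,,
    #   x,1,1,2,3
    #   x,2,4,5,6
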
pandas/core/frame.py (+12 −4)
@@ -1250,7 +1250,7 @@ def _from_arrays(cls, arrays, columns, index, dtype=None):
 
     @classmethod
     def from_csv(cls, path, header=0, sep=',', index_col=0,
-                 parse_dates=True, encoding=None):
+                 parse_dates=True, encoding=None, tupleize_cols=False):
         """
         Read delimited file into DataFrame
 
@@ -1266,6 +1266,9 @@ def from_csv(cls, path, header=0, sep=',', index_col=0,
             is used. Different default from read_table
         parse_dates : boolean, default True
             Parse dates. Different default from read_table
+        tupleize_cols : boolean, default False
+            write multi_index columns as a list of tuples (if True)
+            or in the new (expanded) format if False
 
         Notes
         -----
@@ -1280,7 +1283,7 @@ def from_csv(cls, path, header=0, sep=',', index_col=0,
         from pandas.io.parsers import read_table
         return read_table(path, header=header, sep=sep,
                           parse_dates=parse_dates, index_col=index_col,
-                          encoding=encoding)
+                          encoding=encoding,tupleize_cols=False)
 
     @classmethod
     def from_dta(dta, path, parse_dates=True, convert_categoricals=True, encoding=None, index_col=None):
@@ -1391,7 +1394,8 @@ def to_panel(self):
     def to_csv(self, path_or_buf, sep=",", na_rep='', float_format=None,
                cols=None, header=True, index=True, index_label=None,
                mode='w', nanRep=None, encoding=None, quoting=None,
-               line_terminator='\n', chunksize=None,**kwds):
+               line_terminator='\n', chunksize=None,
+               tupleize_cols=True, **kwds):
         """
         Write DataFrame to a comma-separated values (csv) file
 
@@ -1429,6 +1433,9 @@ def to_csv(self, path_or_buf, sep=",", na_rep='', float_format=None,
         quoting : optional constant from csv module
             defaults to csv.QUOTE_MINIMAL
         chunksize : rows to write at a time
+        tupleize_cols : boolean, default True
+            write multi_index columns as a list of tuples (if True)
+            or in the new (expanded) format if False
         """
         if nanRep is not None:  # pragma: no cover
             import warnings
@@ -1445,7 +1452,8 @@ def to_csv(self, path_or_buf, sep=",", na_rep='', float_format=None,
                                    float_format=float_format, cols=cols,
                                    header=header, index=index,
                                    index_label=index_label,mode=mode,
-                                   chunksize=chunksize,engine=kwds.get("engine") )
+                                   chunksize=chunksize,engine=kwds.get("engine"),
+                                   tupleize_cols=tupleize_cols)
         formatter.save()
 
     def to_excel(self, excel_writer, sheet_name='sheet1', na_rep='',
0 commit comments

Comments
 (0)