Commit ee46b06

takluyver authored and wesm committed
Update docs on reading CSV files.
1 parent 4ea44a4 commit ee46b06

1 file changed, +38 -48 lines changed


doc/source/io.rst

@@ -19,77 +19,67 @@
 IO Tools (Text, CSV, HDF5, ...)
 *******************************

-Text files
------------
+CSV & Text files
+----------------

-The two workhorse functions for reading text (a.k.a. flat) files are
-``read_csv`` and ``read_table``. They both utilize the same parsing code for
-intelligently converting tabular data into a DataFrame object. They take a
-number of different arguments:
+The two workhorse functions for reading text files (a.k.a. flat files) are
+:func:`~pandas.io.parsers.read_csv` and :func:`~pandas.io.parsers.read_table`.
+They both utilize the same parsing code for intelligently converting tabular
+data into a DataFrame object. They take a number of different arguments:

-- ``path_or_buffer``: Either a string path to a file or any object (such as
-an open ``file`` or ``StringIO``) with a ``read`` method.
+- ``path_or_buffer``: Either a string path to a file or any object with a
+``read`` method (such as an open file or ``StringIO``).
 - ``delimiter``: For ``read_table`` only, a regular expression to split
 fields on. ``read_csv`` uses the ``csv`` module to do this and hence only
-supports comma-separated values
-- ``skiprows``: Rows in the file to skip
-- ``header``: row number to use as the columns, defaults to 0 (first row)
-- ``index_col``: integer, defaulting to 0 (the first column), instructing the
-parser to use a particular column as the ``index`` (row labels) of the
-resulting DataFrame
-- ``na_values``: optional list of strings to recognize as NA/NaN
+supports comma-separated values.
+- ``header``: row number to use as the column names, and the start of the data.
+Defaults to 0 (first row); specify None if there is no header row.
+- ``names``: List of column names to use if header is None.
+- ``skiprows``: A collection of numbers for rows in the file to skip.
+- ``index_col``: column number, or list of column numbers, to use as the
+``index`` (row labels) of the resulting DataFrame. By default, it will number
+the rows without using any column, unless there is one more data column than
+there are headers, in which case the first column is taken as the index.
+- ``parse_dates``: If True, attempt to parse the index column as dates. False
+by default.
 - ``date_parser``: function to use to parse strings into datetime
-objects. Defaults to the very robust ``dateutil.parser``
-- ``names``: optional list of column names for the data. Otherwise will be
-read from the file
+objects. If ``parse_dates`` is True, it defaults to the very robust
+``dateutil.parser``. Specifying this implicitly sets ``parse_dates`` as True.
+- ``na_values``: optional list of strings to recognize as NaN (missing values),
+in addition to a default set.
+

 .. code-block:: ipython

-In [2]: print open('foo.csv').read()
-A,B,C
+In [1]: print open('foo.csv').read()
+date,A,B,C
 20090101,a,1,2
 20090102,b,3,4
 20090103,c,4,5
+
+# A basic index is created by default:
+In [3]: read_csv('foo.csv')
+Out[3]:
+date A B C
+0 20090101 a 1 2
+1 20090102 b 3 4
+2 20090103 c 4 5

-In [3]: df = read_csv('foo.csv')
-
+# Use a column as an index, and parse it as dates.
+In [3]: df = read_csv('foo.csv', index_col=0, parse_dates=True)
+
 In [4]: df
 Out[4]:
 A B C
 2009-01-01 a 1 2
 2009-01-02 b 3 4
 2009-01-03 c 4 5

-# dates parsed to datetime
+# These are python datetime objects
 In [16]: df.index
 Out[16]: Index([2009-01-01 00:00:00, 2009-01-02 00:00:00,
 2009-01-03 00:00:00], dtype=object)

-If ``index_col=None``, the index will be a generic ``0...nrows-1``:
-
-.. code-block:: ipython
-
-In [1]: print open('foo.csv').read()
-index,A,B,C
-20090101,a,1,2
-20090102,b,3,4
-20090103,c,4,5
-
-In [2]: read_csv('foo.csv')
-Out[2]:
-A B C
-2009-01-01 a 1 2
-2009-01-02 b 3 4
-2009-01-03 c 4 5
-
-
-In [3]: read_csv('foo.csv', index_col=None)
-Out[3]:
-index A B C
-0 20090101 a 1 2
-1 20090102 b 3 4
-2 20090103 c 4 5
-

 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
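
For readers who want to try the behaviour the updated docs describe, here is a minimal sketch. It is not part of the commit, and it makes several assumptions: a current pandas release where the function is exposed as `pd.read_csv`, Python 3, and an in-memory `io.StringIO` buffer standing in for `foo.csv`. The variable names (`plain`, `dated`, `implicit`, `named`) are illustrative only. It covers four cases from the new text: the default numbered index, `index_col=0` with `parse_dates=True`, the implicit index when there is one more data column than headers, and `header=None` with `names`.

```python
import io

import pandas as pd

# Same sample data as foo.csv in the diff, held in memory instead of on disk.
CSV_TEXT = "date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5\n"

# Default: no column becomes the index; rows are numbered 0..n-1.
plain = pd.read_csv(io.StringIO(CSV_TEXT))

# Use the first column as the index and parse it as dates.
dated = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0, parse_dates=True)

# One more data column than header names: the first column is taken as the index.
IMPLICIT_TEXT = "A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5\n"
implicit = pd.read_csv(io.StringIO(IMPLICIT_TEXT))

# No header row: pass header=None and supply the column names explicitly.
NO_HEADER_TEXT = "20090101,a,1,2\n20090102,b,3,4\n"
named = pd.read_csv(
    io.StringIO(NO_HEADER_TEXT),
    header=None,
    names=["date", "A", "B", "C"],
)

print(plain.columns.tolist())
print(dated.index)
print(implicit.index.tolist())
print(named.columns.tolist())
```

In a current pandas release the parsed index comes back as a `DatetimeIndex`, not the object-dtype `Index` of Python datetimes shown in the diff's output, so the printed representation will differ from the commit's example.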
