 IO Tools (Text, CSV, HDF5, ...)
 *******************************

-Text files
-----------
+CSV & Text files
+----------------

-The two workhorse functions for reading text (a.k.a. flat) files are
-``read_csv`` and ``read_table``. They both utilize the same parsing code for
-intelligently converting tabular data into a DataFrame object. They take a
-number of different arguments:
+The two workhorse functions for reading text files (a.k.a. flat files) are
+:func:`~pandas.io.parsers.read_csv` and :func:`~pandas.io.parsers.read_table`.
+They both utilize the same parsing code for intelligently converting tabular
+data into a DataFrame object. They take a number of different arguments:

-  - ``path_or_buffer``: Either a string path to a file or any object (such as
-    an open ``file`` or ``StringIO``) with a ``read`` method.
+  - ``path_or_buffer``: Either a string path to a file or any object with a
+    ``read`` method (such as an open file or ``StringIO``).
   - ``delimiter``: For ``read_table`` only, a regular expression to split
     fields on. ``read_csv`` uses the ``csv`` module to do this and hence only
-    supports comma-separated values
-  - ``skiprows``: Rows in the file to skip
-  - ``header``: row number to use as the columns, defaults to 0 (first row)
-  - ``index_col``: integer, defaulting to 0 (the first column), instructing the
-    parser to use a particular column as the ``index`` (row labels) of the
-    resulting DataFrame
-  - ``na_values``: optional list of strings to recognize as NA/NaN
+    supports comma-separated values.
+  - ``header``: row number to use as the column names, and the start of the data.
+    Defaults to 0 (first row); specify None if there is no header row.
+  - ``names``: List of column names to use if header is None.
+  - ``skiprows``: A collection of numbers for rows in the file to skip.
+  - ``index_col``: column number, or list of column numbers, to use as the
+    ``index`` (row labels) of the resulting DataFrame. By default, it will number
+    the rows without using any column, unless there is one more data column than
+    there are headers, in which case the first column is taken as the index.
+  - ``parse_dates``: If True, attempt to parse the index column as dates. False
+    by default.
   - ``date_parser``: function to use to parse strings into datetime
-    objects. Defaults to the very robust ``dateutil.parser``
-  - ``names``: optional list of column names for the data. Otherwise will be
-    read from the file
+    objects. If ``parse_dates`` is True, it defaults to the very robust
+    ``dateutil.parser``. Specifying this implicitly sets ``parse_dates`` as True.
+  - ``na_values``: optional list of strings to recognize as NaN (missing values),
+    in addition to a default set.
+

 .. code-block:: ipython

-    In [2]: print open('foo.csv').read()
-    A,B,C
+    In [1]: print open('foo.csv').read()
+    date,A,B,C
     20090101,a,1,2
     20090102,b,3,4
     20090103,c,4,5
+
+    # A basic index is created by default:
+    In [3]: read_csv('foo.csv')
+    Out[3]:
+           date  A  B  C
+    0  20090101  a  1  2
+    1  20090102  b  3  4
+    2  20090103  c  4  5

-    In [3]: df = read_csv('foo.csv')
-
+    # Use a column as an index, and parse it as dates.
+    In [3]: df = read_csv('foo.csv', index_col=0, parse_dates=True)
+
     In [4]: df
     Out[4]:
                 A  B  C
     2009-01-01  a  1  2
     2009-01-02  b  3  4
     2009-01-03  c  4  5

-    # dates parsed to datetime
+    # These are python datetime objects
     In [16]: df.index
     Out[16]: Index([2009-01-01 00:00:00, 2009-01-02 00:00:00,
                     2009-01-03 00:00:00], dtype=object)

-If ``index_col=None``, the index will be a generic ``0...nrows-1``:
-
-.. code-block:: ipython
-
-    In [1]: print open('foo.csv').read()
-    index,A,B,C
-    20090101,a,1,2
-    20090102,b,3,4
-    20090103,c,4,5
-
-    In [2]: read_csv('foo.csv')
-    Out[2]:
-                A  B  C
-    2009-01-01  a  1  2
-    2009-01-02  b  3  4
-    2009-01-03  c  4  5
-
-
-    In [3]: read_csv('foo.csv', index_col=None)
-    Out[3]:
-          index  A  B  C
-    0  20090101  a  1  2
-    1  20090102  b  3  4
-    2  20090103  c  4  5
-

 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
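
For anyone checking the revised wording against a current install, the three index behaviors described in the added lines can be reproduced with a short script. This is only a sketch under assumptions that are not part of the diff: it uses the top-level `pd.read_csv` from a recent pandas release rather than `pandas.io.parsers.read_csv`, and it feeds the sample data through `StringIO` (which the `path_or_buffer` description allows) instead of a real `foo.csv`. On recent versions the parsed index comes back as a `DatetimeIndex` rather than the object-dtype `Index` shown in the diff.

```python
from io import StringIO

import pandas as pd

# Sample data from the diff, held in memory instead of in foo.csv (assumption).
WITH_DATE_HEADER = """date,A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5
"""

# Same rows, but with one more data column than there are headers.
ONE_EXTRA_COLUMN = """A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5
"""

# Default: no column is used as the index, so the rows are simply numbered.
plain = pd.read_csv(StringIO(WITH_DATE_HEADER))

# Use the first column as the index and parse it as dates.
dated = pd.read_csv(StringIO(WITH_DATE_HEADER), index_col=0, parse_dates=True)

# One more data column than headers: the first column becomes the index.
implicit = pd.read_csv(StringIO(ONE_EXTRA_COLUMN))

print(plain)
print(dated)
print(implicit)
```

The `StringIO` buffer keeps the sketch self-contained; passing a path string such as `'foo.csv'` should behave the same way.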