@@ -68,35 +68,53 @@ data into a DataFrame object. They can take a number of arguments:
   whitespace.
 - ``header``: row number to use as the column names, and the start of the data.
   Defaults to 0 (first row); specify None if there is no header row.
-- ``names``: List of column names to use. If passed, header will be
-  implicitly set to None.
 - ``skiprows``: A collection of numbers for rows in the file to skip. Can
   also be an integer to skip the first ``n`` rows
-- ``index_col``: column number, or list of column numbers, to use as the
-  ``index`` (row labels) of the resulting DataFrame. By default, it will number
-  the rows without using any column, unless there is one more data column than
-  there are headers, in which case the first column is taken as the index.
-- ``parse_dates``: If True, attempt to parse the index column as dates. False
-  by default.
+- ``index_col``: column number, column name, or list of column numbers/names,
+  to use as the ``index`` (row labels) of the resulting DataFrame. By default,
+  it will number the rows without using any column, unless there is one more
+  data column than there are headers, in which case the first column is taken
+  as the index.
+- ``names``: List of column names to use. If passed, header will be
+  implicitly set to None.
+- ``na_values``: optional list of strings to recognize as NaN (missing values),
+  in addition to a default set.
+- ``parse_dates``: if True then the index will be parsed as dates
+  (False by default). You can specify more complicated options to parse
+  a subset of columns or a combination of columns into a single date column
+  (list of ints or names, list of lists, or dict):
+  [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column
+  [[1, 3]] -> combine columns 1 and 3 and parse as a single date column
+  {'foo' : [1, 3]} -> parse columns 1, 3 as a date and call the result 'foo'
+- ``keep_date_col``: if True, then date component columns passed into
+  ``parse_dates`` will be retained in the output (False by default).
 - ``date_parser``: function to use to parse strings into datetime
   objects. If ``parse_dates`` is True, it defaults to the very robust
   ``dateutil.parser``. Specifying this implicitly sets ``parse_dates`` as True.
-- ``na_values``: optional list of strings to recognize as NaN (missing values),
-  in addition to a default set.
+  You can also use functions from community supported date converters from
+  date_converters.py
+- ``dayfirst``: if True then uses the DD/MM international/European date format
+  (False by default)
+- ``thousands``: specifies the thousands separator. If not None, the parser
+  will look for it in the data and parse the relevant values to integers.
+  Because this essentially requires scanning through the data again, it causes
+  a significant performance hit, so use it only if necessary.
+- ``comment``: denotes the start of a comment; the rest of the line is ignored.
+  Currently line commenting is not supported.
 - ``nrows``: Number of rows to read out of the file. Useful to only read a
   small portion of a large file
+- ``iterator``: If True, return a ``TextParser`` to enable reading a file
+  into memory piece by piece
 - ``chunksize``: The number of rows to be used to "chunk" a file into
   pieces. Will cause a ``TextParser`` object to be returned. More on this
   below in the section on :ref:`iterating and chunking <io.chunking>`
-- ``iterator``: If True, return a ``TextParser`` to enable reading a file
-  into memory piece by piece
 - ``skip_footer``: number of lines to skip at bottom of file (default 0)
 - ``converters``: a dictionary of functions for converting values in certain
   columns, where keys are either integers or column labels
 - ``encoding``: a string representing the encoding to use if the contents are
   non-ascii
-- ``verbose`` : show number of NA values inserted in non-numeric columns
-
+- ``verbose``: show number of NA values inserted in non-numeric columns
+- ``squeeze``: if True then output with only one column is turned into a Series
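Several of the keyword arguments listed above can be combined in a single call. The following is a minimal sketch (not taken from the original docs); the file contents and column names are hypothetical, and the data is fed in through ``StringIO`` so the snippet is self-contained:

```python
# Sketch: combining skiprows, na_values, and index_col in one read_csv call.
# The data and column names here are illustrative, not from the docs.
from io import StringIO
from pandas import read_csv

data = """# generated by an export tool
date,value,flag
2012-01-01,1000,ok
2012-01-02,n/a,ok
"""

df = read_csv(
    StringIO(data),
    skiprows=1,            # skip the comment line at the top of the file
    na_values=["n/a"],     # treat "n/a" as NaN, in addition to the defaults
    index_col=0,           # use the first column ("date") as the row labels
)
print(df)
```

The result is a two-row DataFrame indexed by the ``date`` column, with the second ``value`` entry read as NaN.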

 .. ipython:: python
    :suppress:
@@ -117,8 +135,22 @@ The default for `read_csv` is to create a DataFrame with simple numbered rows:

    read_csv('foo.csv')

-In the case of indexed data, you can pass the column number (or a list of
-column numbers, for a hierarchical index) you wish to use as the index.
+In the case of indexed data, you can pass the column number or column name you
+wish to use as the index:
+
+.. ipython:: python
+
+   read_csv('foo.csv', index_col=0)
+
+.. ipython:: python
+
+   read_csv('foo.csv', index_col='date')
+
+You can also use a list of columns to create a hierarchical index:
+
+.. ipython:: python
+
+   read_csv('foo.csv', index_col=[0, 'A'])

 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
@@ -127,6 +159,9 @@ columns will come through as object dtype as with the rest of pandas objects.

 .. _io.parse_dates:

+Specifying Date Columns
+~~~~~~~~~~~~~~~~~~~~~~~
+
 To better facilitate working with datetime data, :func:`~pandas.io.parsers.read_csv` and :func:`~pandas.io.parsers.read_table`
 use the keyword arguments ``parse_dates`` and ``date_parser`` to allow users
 to specify a variety of columns and date/time formats to turn the input text
@@ -139,6 +174,7 @@ The simplest case is to just pass in ``parse_dates=True``:

    # Use a column as an index, and parse it as dates.
    df = read_csv('foo.csv', index_col=0, parse_dates=True)
    df
+
    # These are python datetime objects
    df.index
@@ -184,6 +220,12 @@ to retain them via the ``keep_date_col`` keyword:
                  keep_date_col=True)
    df

+Note that if you wish to combine multiple columns into a single date column, a
+nested list must be used. In other words, ``parse_dates=[1, 2]`` indicates that
+the second and third columns should each be parsed as separate date columns,
+while ``parse_dates=[[1, 2]]`` means the two columns should be parsed into a
+single column.
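The distinction can be sketched without the parser itself. This is a rough illustration of the semantics described above, not pandas internals; with ``[[1, 2]]``, columns 1 and 2 are joined into one string before a single parse:

```python
# Rough illustration of the parse_dates nested-list semantics: join the
# two component columns first, then parse the result as one datetime.
from datetime import datetime

row = ["id1", "1999-01-27", "19:00:00"]  # columns 0, 1, 2 of one record

# parse_dates=[[1, 2]] behaves like: combine columns 1 and 2, parse once
combined = datetime.strptime(row[1] + " " + row[2], "%Y-%m-%d %H:%M:%S")
print(combined)
```

By contrast, ``parse_dates=[1, 2]`` would parse each of the two columns on its own, yielding two separate date columns.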
+
 You can also use a dict to specify custom names for the resulting columns:

 .. ipython:: python
@@ -192,6 +234,8 @@ You can also use a dict to specify custom names for the resulting columns:
    df = read_csv('tmp.csv', header=None, parse_dates=date_spec)
    df

+Date Parsing Functions
+~~~~~~~~~~~~~~~~~~~~~~
 Finally, the parser allows you to specify a custom ``date_parser`` function to
 take full advantage of the flexibility of the date parsing API:
@@ -204,7 +248,124 @@ take full advantage of the flexibility of the date parsing API:

 You can explore the date parsing functionality in ``date_converters.py`` and
 add your own. We would love to turn this module into a community supported set
-of date/time parsers.
+of date/time parsers. To get you started, ``date_converters.py`` contains
+functions to parse dual date and time columns, year/month/day columns,
+and year/month/day/hour/minute/second columns. It also contains a
+``generic_parser`` function so you can curry it with a function that deals with
+a single date rather than the entire array.
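The currying idea can be sketched as follows. This mirrors ``generic_parser`` in spirit only; it is an assumption about the shape of the helper, not the actual ``date_converters.py`` implementation (which operates on arrays):

```python
# Sketch of the generic_parser idea: take a function that builds ONE date
# from scalar components and apply it elementwise across parallel columns.
# This is illustrative, not the pandas date_converters implementation.
from datetime import datetime

def generic_parser(parse_func, *cols):
    # zip the columns into rows and parse each row with the scalar function
    return [parse_func(*row) for row in zip(*cols)]

def parse_ymd(year, month, day):
    return datetime(int(year), int(month), int(day))

dates = generic_parser(parse_ymd, ["2012", "2012"], ["1", "2"], ["15", "20"])
print(dates)
```

The benefit is that ``parse_ymd`` only has to deal with a single date, while the wrapper handles iterating over the whole set of columns.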
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('tmp.csv')
+
+.. _io.convenience:
+
+Thousand Separators
+~~~~~~~~~~~~~~~~~~~
+
+For large integers that have been written with a thousands separator, you can
+set the ``thousands`` keyword to the separator character (e.g. ``','``) so
+that integers will be parsed correctly:
+
+.. ipython:: python
+   :suppress:
+
+   data = ("ID|level|category\n"
+           "Patient1|123,000|x\n"
+           "Patient2|23,000|y\n"
+           "Patient3|1,234,018|z")
+
+   with open('tmp.csv', 'w') as fh:
+       fh.write(data)
+
+By default, integers with a thousands separator will be parsed as strings
+
+.. ipython:: python
+
+   print open('tmp.csv').read()
+   df = read_csv('tmp.csv', sep='|')
+   df
+
+   df.level.dtype
+
+The ``thousands`` keyword allows integers to be parsed correctly
+
+.. ipython:: python
+
+   print open('tmp.csv').read()
+   df = read_csv('tmp.csv', sep='|', thousands=',')
+   df
+
+   df.level.dtype
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('tmp.csv')
+
+Comments
+~~~~~~~~
+
+Sometimes comments or metadata may be included in a file:
+
+.. ipython:: python
+   :suppress:
+
+   data = ("ID,level,category\n"
+           "Patient1,123000,x # really unpleasant\n"
+           "Patient2,23000,y # wouldn't take his medicine\n"
+           "Patient3,1234018,z # awesome")
+
+   with open('tmp.csv', 'w') as fh:
+       fh.write(data)
+
+.. ipython:: python
+
+   print open('tmp.csv').read()
+
+By default, the parser includes the comments in the output:
+
+.. ipython:: python
+
+   df = read_csv('tmp.csv')
+   df
+
+We can suppress the comments using the ``comment`` keyword:
+
+.. ipython:: python
+
+   df = read_csv('tmp.csv', comment='#')
+   df
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('tmp.csv')
+
+Returning Series
+~~~~~~~~~~~~~~~~
+
+Using the ``squeeze`` keyword, the parser will return output with a single
+column as a ``Series``:
+
+.. ipython:: python
+   :suppress:
+
+   data = ("level\n"
+           "Patient1,123000\n"
+           "Patient2,23000\n"
+           "Patient3,1234018")
+
+   with open('tmp.csv', 'w') as fh:
+       fh.write(data)
+
+.. ipython:: python
+
+   print open('tmp.csv').read()
+
+   output = read_csv('tmp.csv', squeeze=True)
+   output
+
+   type(output)

 .. ipython:: python
    :suppress: