.. _io:
.. currentmodule:: pandas
.. ipython:: python
   :suppress:

   import os
   import csv
   from pandas.compat import StringIO, BytesIO
   import pandas as pd

   ExcelWriter = pd.ExcelWriter

   import numpy as np
   np.random.seed(123456)
   randn = np.random.randn
   np.set_printoptions(precision=4, suppress=True)

   import matplotlib.pyplot as plt
   plt.close('all')

   import pandas.util.testing as tm
   pd.options.display.max_rows = 15
   clipdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['p', 'q', 'r']},
                         index=['x', 'y', 'z'])
===============================
IO Tools (Text, CSV, HDF5, ...)
===============================
The pandas I/O API is a set of top level ``reader`` functions accessed like ``pd.read_csv()`` that generally return a ``pandas``
object.
* :ref:`read_csv<io.read_csv_table>`
* :ref:`read_excel<io.excel_reader>`
* :ref:`read_hdf<io.hdf5>`
* :ref:`read_feather<io.feather>`
* :ref:`read_sql<io.sql>`
* :ref:`read_json<io.json_reader>`
* :ref:`read_msgpack<io.msgpack>`
* :ref:`read_html<io.read_html>`
* :ref:`read_gbq<io.bigquery>`
* :ref:`read_stata<io.stata_reader>`
* :ref:`read_sas<io.sas_reader>`
* :ref:`read_clipboard<io.clipboard>`
* :ref:`read_pickle<io.pickle>`
The corresponding ``writer`` functions are object methods that are accessed like ``df.to_csv()``
* :ref:`to_csv<io.store_in_csv>`
* :ref:`to_excel<io.excel_writer>`
* :ref:`to_hdf<io.hdf5>`
* :ref:`to_feather<io.feather>`
* :ref:`to_sql<io.sql>`
* :ref:`to_json<io.json_writer>`
* :ref:`to_msgpack<io.msgpack>`
* :ref:`to_html<io.html>`
* :ref:`to_gbq<io.bigquery>`
* :ref:`to_stata<io.stata_writer>`
* :ref:`to_clipboard<io.clipboard>`
* :ref:`to_pickle<io.pickle>`
:ref:`Here <io.perf>` is an informal performance comparison for some of these IO methods.
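As a quick illustration of how the reader and writer functions pair up, here is a minimal sketch (the file name is arbitrary) that writes a ``DataFrame`` with ``to_csv`` and reads it back with ``read_csv``:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

   # write with the object method ...
   df.to_csv('example.csv', index=False)

   # ... and read back with the corresponding top level function
   roundtripped = pd.read_csv('example.csv')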
.. note::
For examples that use the ``StringIO`` class, make sure you import it
according to your Python version, i.e. ``from StringIO import StringIO`` for
Python 2 and ``from io import StringIO`` for Python 3.
.. _io.read_csv_table:
CSV & Text files
----------------
The two workhorse functions for reading text files (a.k.a. flat files) are
:func:`read_csv` and :func:`read_table`. They both use the same parsing code to
intelligently convert tabular data into a DataFrame object. See the
:ref:`cookbook<cookbook.csv>` for some advanced strategies.
Parsing options
'''''''''''''''
:func:`read_csv` and :func:`read_table` accept the following arguments:
Basic
+++++
filepath_or_buffer : various
Either a path to a file (a :class:`python:str`, :class:`python:pathlib.Path`,
or :class:`py:py._path.local.LocalPath`), URL (including http, ftp, and S3
locations), or any object with a ``read()`` method (such as an open file or
:class:`~python:io.StringIO`).
sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table`
Delimiter to use. If sep is ``None``,
will try to automatically determine this. Separators longer than 1 character
and different from ``'\s+'`` will be interpreted as regular expressions, will
force use of the python parsing engine and will ignore quotes in the data.
Regex example: ``'\\r\\t'``.
delimiter : str, default ``None``
Alternative argument name for sep.
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``)
will be used as the delimiter. Equivalent to setting ``sep='\s+'``.
If this option is set to True, nothing should be passed in for the
``delimiter`` parameter.
.. versionadded:: 0.18.1 support for the Python parser.
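For example, for whitespace-delimited data the two spellings below should be equivalent (a minimal sketch using inline data):

.. code-block:: python

   from pandas.compat import StringIO
   import pandas as pd

   data = 'a b c\n1 2 3\n4 5 6'

   # both calls treat runs of whitespace as the delimiter
   pd.read_csv(StringIO(data), delim_whitespace=True)
   pd.read_csv(StringIO(data), sep='\s+')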
Column and Index Locations and Names
++++++++++++++++++++++++++++++++++++
header : int or list of ints, default ``'infer'``
Row number(s) to use as the column names, and the start of the data. Default
behavior is as if ``header=0`` if no ``names`` passed, otherwise as if
``header=None``. Explicitly pass ``header=0`` to be able to replace existing
names. The header can be a list of ints that specify row locations for a
multi-index on the columns e.g. ``[0,1,3]``. Intervening rows that are not
specified will be skipped (e.g. 2 in this example is skipped). Note that
this parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so header=0 denotes the first line of data
rather than the first line of the file.
names : array-like, default ``None``
List of column names to use. If file contains no header row, then you should
explicitly pass ``header=None``. Duplicates in this list are not allowed unless
``mangle_dupe_cols=True``, which is the default.
index_col : int or sequence or ``False``, default ``None``
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end of
each line, you might consider ``index_col=False`` to force pandas to *not* use
the first column as the index (row names).
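For example, passing a sequence to ``index_col`` builds a ``MultiIndex`` from those columns (a minimal sketch using inline data):

.. code-block:: python

   from pandas.compat import StringIO
   import pandas as pd

   data = 'year,month,value\n2000,1,10\n2000,2,20'

   # columns 0 and 1 become a two-level row index
   pd.read_csv(StringIO(data), index_col=[0, 1])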
usecols : array-like or callable, default ``None``
Return a subset of the columns. If array-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid array-like
`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].
If callable, the callable function will be evaluated against the column names,
returning names where the callable function evaluates to True:
.. ipython:: python
data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
pd.read_csv(StringIO(data))
pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])
Using this parameter results in much faster parsing time and lower memory usage.
as_recarray : boolean, default ``False``
DEPRECATED: this argument will be removed in a future version. Please call
``pd.read_csv(...).to_records()`` instead.
Return a NumPy recarray instead of a DataFrame after parsing the data. If
set to ``True``, this option takes precedence over the ``squeeze`` parameter.
In addition, as row indices are not available in such a format, the ``index_col``
parameter will be ignored.
squeeze : boolean, default ``False``
If the parsed data only contains one column then return a Series.
prefix : str, default ``None``
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : boolean, default ``True``
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'.
Passing in False will cause data to be overwritten if there are duplicate
names in the columns.
General Parsing Configuration
+++++++++++++++++++++++++++++
dtype : Type name or dict of column -> type, default ``None``
Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32}``
(unsupported with ``engine='python'``). Use `str` or `object` to preserve and
not interpret dtype.
.. versionadded:: 0.20.0 support for the Python parser.
engine : {``'c'``, ``'python'``}
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
converters : dict, default ``None``
Dict of functions for converting values in certain columns. Keys can either be
integers or column labels.
true_values : list, default ``None``
Values to consider as ``True``.
false_values : list, default ``None``
Values to consider as ``False``.
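For example, a minimal sketch using inline data, where the strings 'Yes' and 'No' are mapped to booleans:

.. code-block:: python

   from pandas.compat import StringIO
   import pandas as pd

   data = 'a,b,c\n1,Yes,2\n3,No,4'

   # column 'b' is parsed as True/False
   pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])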
skipinitialspace : boolean, default ``False``
Skip spaces after delimiter.
skiprows : list-like or integer, default ``None``
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start
of the file.
If callable, the callable function will be evaluated against the row
indices, returning True if the row should be skipped and False otherwise:
.. ipython:: python
data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
pd.read_csv(StringIO(data))
pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
skipfooter : int, default ``0``
Number of lines at bottom of file to skip (unsupported with engine='c').
skip_footer : int, default ``0``
DEPRECATED: use the ``skipfooter`` parameter instead, as they are identical
nrows : int, default ``None``
Number of rows of file to read. Useful for reading pieces of large files.
low_memory : boolean, default ``True``
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed
types either set ``False``, or specify the type with the ``dtype`` parameter.
Note that the entire file is read into a single DataFrame regardless,
use the ``chunksize`` or ``iterator`` parameter to return the data in chunks.
(Only valid with C parser)
buffer_lines : int, default None
DEPRECATED: this argument will be removed in a future version because its
value is not respected by the parser
compact_ints : boolean, default False
DEPRECATED: this argument will be removed in a future version
If ``compact_ints`` is ``True``, then for any column that is of integer dtype, the
parser will attempt to cast it as the smallest integer ``dtype`` possible, either
signed or unsigned depending on the specification from the ``use_unsigned`` parameter.
use_unsigned : boolean, default False
DEPRECATED: this argument will be removed in a future version
If integer columns are being compacted (i.e. ``compact_ints=True``), specify whether
the column should be compacted to the smallest signed or unsigned integer dtype.
memory_map : boolean, default False
If a filepath is provided for ``filepath_or_buffer``, map the file object
directly onto memory and access the data directly from there. Using this
option can improve performance because there is no longer any I/O overhead.
NA and Missing Data Handling
++++++++++++++++++++++++++++
na_values : scalar, str, list-like, or dict, default ``None``
Additional strings to recognize as NA/NaN. If dict passed, specific per-column
NA values. By default the following values are interpreted as NaN:
``'-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA',
'#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''``.
keep_default_na : boolean, default ``True``
If na_values are specified and keep_default_na is ``False`` the default NaN
values are overridden, otherwise they're appended to.
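For example, per-column NA markers can be supplied as a dict (a minimal sketch with made-up sentinel values):

.. code-block:: python

   from pandas.compat import StringIO
   import pandas as pd

   data = 'a,b\n-999,1\n2,missing'

   # -999 in column 'a' and 'missing' in column 'b' are read as NaN,
   # in addition to the default NA markers
   pd.read_csv(StringIO(data), na_values={'a': [-999], 'b': ['missing']})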
na_filter : boolean, default ``True``
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing ``na_filter=False`` can improve the performance
of reading a large file.
verbose : boolean, default ``False``
Indicate number of NA values placed in non-numeric columns.
skip_blank_lines : boolean, default ``True``
If ``True``, skip over blank lines rather than interpreting as NaN values.
Datetime Handling
+++++++++++++++++
parse_dates : boolean or list of ints or names or list of lists or dict, default ``False``.
- If ``True`` -> try parsing the index.
- If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date
column.
- If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date
column.
- If ``{'foo' : [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'.
A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default ``False``
If ``True`` and parse_dates is enabled for a column, attempt to infer the
datetime format to speed up the processing.
keep_date_col : boolean, default ``False``
If ``True`` and parse_dates specifies combining multiple columns then keep the
original columns.
date_parser : function, default ``None``
Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses ``dateutil.parser.parser`` to do the
conversion. Pandas will try to call date_parser in three different ways,
advancing to the next if an exception occurs: 1) Pass one or more arrays (as
defined by parse_dates) as arguments; 2) concatenate (row-wise) the string
values from the columns defined by parse_dates into a single array and pass
that; and 3) call date_parser once for each row using one or more strings
(corresponding to the columns defined by parse_dates) as arguments.
dayfirst : boolean, default ``False``
DD/MM format dates, international and European format.
Iteration
+++++++++
iterator : boolean, default ``False``
Return `TextFileReader` object for iteration or getting chunks with
``get_chunk()``.
chunksize : int, default ``None``
Return `TextFileReader` object for iteration. See :ref:`iterating and chunking
<io.chunking>` below.
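For example, a minimal sketch of chunked iteration over inline data:

.. code-block:: python

   from pandas.compat import StringIO
   import pandas as pd

   data = 'a,b\n1,2\n3,4\n5,6\n7,8'

   # returns a TextFileReader that yields DataFrames of 2 rows each
   reader = pd.read_csv(StringIO(data), chunksize=2)
   for chunk in reader:
       print(chunk)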
Quoting, Compression, and File Format
+++++++++++++++++++++++++++++++++++++
compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``}, default ``'infer'``
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
bz2, zip, or xz if filepath_or_buffer is a string ending in '.gz', '.bz2',
'.zip', or '.xz', respectively, and no decompression otherwise. If using 'zip',
the ZIP file must contain only one data file to be read in.
Set to ``None`` for no decompression.
.. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
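For example, a minimal sketch (the file name is arbitrary) that writes a small gzip-compressed CSV with the standard library and lets ``read_csv`` infer the compression from the ``.gz`` suffix:

.. code-block:: python

   import gzip
   import pandas as pd

   # write a gzip-compressed CSV with the standard library
   with gzip.open('tmp_compressed.csv.gz', 'wb') as f:
       f.write(b'a,b\n1,2\n3,4')

   # compression='infer' (the default) picks gzip from the '.gz' extension
   pd.read_csv('tmp_compressed.csv.gz')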
thousands : str, default ``None``
Thousands separator.
decimal : str, default ``'.'``
Character to recognize as decimal point. E.g. use ``','`` for European data.
float_precision : string, default None
Specifies which converter the C engine should use for floating-point values.
The options are ``None`` for the ordinary converter, ``high`` for the
high-precision converter, and ``round_trip`` for the round-trip converter.
lineterminator : str (length 1), default ``None``
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1)
The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.
quoting : int or ``csv.QUOTE_*`` instance, default ``0``
Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
``QUOTE_MINIMAL`` (0), ``QUOTE_ALL`` (1), ``QUOTE_NONNUMERIC`` (2) or
``QUOTE_NONE`` (3).
doublequote : boolean, default ``True``
When ``quotechar`` is specified and ``quoting`` is not ``QUOTE_NONE``,
indicate whether or not to interpret two consecutive ``quotechar`` elements
**inside** a field as a single ``quotechar`` element.
escapechar : str (length 1), default ``None``
One-character string used to escape delimiter when quoting is ``QUOTE_NONE``.
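For example, quote characters escaped with a backslash inside a quoted field can be handled like this (a minimal sketch using inline data):

.. code-block:: python

   from pandas.compat import StringIO
   import pandas as pd

   data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'

   # the backslash escapes the embedded double quotes
   pd.read_csv(StringIO(data), escapechar='\\')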
comment : str, default ``None``
Indicates remainder of line should not be parsed. If found at the beginning of
a line, the line will be ignored altogether. This parameter must be a single
character. Like empty lines (as long as ``skip_blank_lines=True``), fully
commented lines are ignored by the parameter `header` but not by `skiprows`.
For example, if ``comment='#'``, parsing '#empty\\na,b,c\\n1,2,3' with
`header=0` will result in 'a,b,c' being treated as the header.
encoding : str, default ``None``
Encoding to use for UTF when reading/writing (e.g. ``'utf-8'``). `List of
Python standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_.
dialect : str or :class:`python:csv.Dialect` instance, default ``None``
If provided, this parameter will override values (default or not) for the
following parameters: `delimiter`, `doublequote`, `escapechar`,
`skipinitialspace`, `quotechar`, and `quoting`. If it is necessary to
override values, a ParserWarning will be issued. See :class:`python:csv.Dialect`
documentation for more details.
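For example, a dialect can be used to relax quoting rules for data with a stray, unclosed quote (a minimal sketch based on the standard library's ``csv.excel`` dialect):

.. code-block:: python

   import csv
   from pandas.compat import StringIO
   import pandas as pd

   data = 'label1,label2,label3\nindex1,"a,c,e\nindex2,b,d,f'

   # turn off quote handling so the stray '"' is kept as ordinary data
   dia = csv.excel()
   dia.quoting = csv.QUOTE_NONE
   pd.read_csv(StringIO(data), dialect=dia)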
tupleize_cols : boolean, default ``False``
Leave a list of tuples on columns as is (default is to convert to a MultiIndex
on the columns).
Error Handling
++++++++++++++
error_bad_lines : boolean, default ``True``
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned. If
``False``, then these "bad lines" will dropped from the DataFrame that is
returned (only valid with C parser). See :ref:`bad lines <io.bad_lines>`
below.
warn_bad_lines : boolean, default ``True``
If error_bad_lines is ``False``, and warn_bad_lines is ``True``, a warning for
each "bad line" will be output (only valid with C parser).
.. _io.dtypes:
Specifying column data types
''''''''''''''''''''''''''''
Starting with v0.10, you can indicate the data type for the whole DataFrame or
individual columns:
.. ipython:: python
data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'
print(data)
df = pd.read_csv(StringIO(data), dtype=object)
df
df['a'][0]
df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64})
df.dtypes
Fortunately, ``pandas`` offers more than one way to ensure that your column(s)
contain only one ``dtype``. If you're unfamiliar with these concepts, you can
see :ref:`here<basics.dtypes>` to learn more about dtypes, and
:ref:`here<basics.object_conversion>` to learn more about ``object`` conversion in
``pandas``.
For instance, you can use the ``converters`` argument
of :func:`~pandas.read_csv`:
.. ipython:: python
data = "col_1\n1\n2\n'A'\n4.22"
df = pd.read_csv(StringIO(data), converters={'col_1':str})
df
df['col_1'].apply(type).value_counts()
Or you can use the :func:`~pandas.to_numeric` function to coerce the
dtypes after reading in the data,
.. ipython:: python
df2 = pd.read_csv(StringIO(data))
df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
df2
df2['col_1'].apply(type).value_counts()
which would convert all valid parsing to floats, leaving the invalid parsing
as ``NaN``.
Ultimately, how you deal with reading in columns containing mixed dtypes
depends on your specific needs. In the case above, if you wanted to ``NaN`` out
the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
However, if you wanted for all the data to be coerced, no matter the type, then
using the ``converters`` argument of :func:`~pandas.read_csv` would certainly be
worth trying.
.. versionadded:: 0.20.0 support for the Python parser.

   The ``dtype`` option is supported by the 'python' engine.
.. note::
In some cases, reading in abnormal data with columns containing mixed dtypes
will result in an inconsistent dataset. If you rely on pandas to infer the
dtypes of your columns, the parsing engine will go and infer the dtypes for
different chunks of the data, rather than the whole dataset at once. Consequently,
you can end up with column(s) with mixed dtypes. For example,
.. ipython:: python
:okwarning:
df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
df.to_csv('foo.csv')
mixed_df = pd.read_csv('foo.csv')
mixed_df['col_1'].apply(type).value_counts()
mixed_df['col_1'].dtype
will result in `mixed_df` containing an ``int`` dtype for certain chunks
of the column, and ``str`` for others due to the mixed dtypes in the
data that was read in. It is important to note that the overall column will be
marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes.
.. ipython:: python
:suppress:
os.remove('foo.csv')
.. _io.categorical:
Specifying Categorical dtype
''''''''''''''''''''''''''''
.. versionadded:: 0.19.0
``Categorical`` columns can be parsed directly by specifying ``dtype='category'``
.. ipython:: python
data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
pd.read_csv(StringIO(data))
pd.read_csv(StringIO(data)).dtypes
pd.read_csv(StringIO(data), dtype='category').dtypes
Individual columns can be parsed as a ``Categorical`` using a dict specification
.. ipython:: python
pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes
.. note::
The resulting categories will always be parsed as strings (object dtype).
If the categories are numeric they can be converted using the
:func:`to_numeric` function, or as appropriate, another converter
such as :func:`to_datetime`.
.. ipython:: python
df = pd.read_csv(StringIO(data), dtype='category')
df.dtypes
df['col3']
df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
df['col3']
Naming and Using Columns
''''''''''''''''''''''''
.. _io.headers:
Handling column names
+++++++++++++++++++++
A file may or may not have a header row. pandas assumes the first row should be
used as the column names:
.. ipython:: python
data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'
print(data)
pd.read_csv(StringIO(data))
By specifying the ``names`` argument in conjunction with ``header`` you can
indicate other names to use and whether or not to throw away the header row (if
any):
.. ipython:: python
print(data)
pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=0)
pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=None)
If the header is in a row other than the first, pass the row number to
``header``. This will skip the preceding rows:
.. ipython:: python
data = 'skip this skip it\na,b,c\n1,2,3\n4,5,6\n7,8,9'
pd.read_csv(StringIO(data), header=1)
.. _io.dupe_names:
Duplicate names parsing
'''''''''''''''''''''''
If the file or header contains duplicate names, pandas by default will deduplicate
these names so as to prevent data overwrite:
.. ipython :: python
data = 'a,b,a\n0,1,2\n3,4,5'
pd.read_csv(StringIO(data))
There is no more duplicate data because ``mangle_dupe_cols=True`` by default, which modifies
a series of duplicate columns 'X'...'X' to become 'X.0'...'X.N'. If
``mangle_dupe_cols=False``, duplicate data can arise:
.. code-block:: python

   In [2]: data = 'a,b,a\n0,1,2\n3,4,5'

   In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
   Out[3]:
      a  b  a
   0  2  1  2
   1  5  4  5
To prevent users from encountering this problem with duplicate data, a ``ValueError``
exception is raised if ``mangle_dupe_cols != True``:
.. code-block:: python

   In [2]: data = 'a,b,a\n0,1,2\n3,4,5'

   In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
   ...
   ValueError: Setting mangle_dupe_cols=False is not supported yet
.. _io.usecols:
Filtering columns (``usecols``)
+++++++++++++++++++++++++++++++
The ``usecols`` argument allows you to select any subset of the columns in a
file, either using the column names, position numbers or a callable:
.. versionadded:: 0.20.0 support for callable `usecols` arguments
.. ipython:: python
data = 'a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz'
pd.read_csv(StringIO(data))
pd.read_csv(StringIO(data), usecols=['b', 'd'])
pd.read_csv(StringIO(data), usecols=[0, 2, 3])
pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['A', 'C'])
The ``usecols`` argument can also be used to specify which columns not to
use in the final result:
.. ipython:: python
pd.read_csv(StringIO(data), usecols=lambda x: x not in ['a', 'c'])
In this case, the callable is specifying that we exclude the "a" and "c"
columns from the output.
Comments and Empty Lines
''''''''''''''''''''''''
.. _io.skiplines:
Ignoring line comments and empty lines
++++++++++++++++++++++++++++++++++++++
If the ``comment`` parameter is specified, then completely commented lines will
be ignored. By default, completely blank lines will be ignored as well. Both of
these are API changes introduced in version 0.15.
.. ipython:: python
data = '\na,b,c\n \n# commented line\n1,2,3\n\n4,5,6'
print(data)
pd.read_csv(StringIO(data), comment='#')
If ``skip_blank_lines=False``, then ``read_csv`` will not ignore blank lines:
.. ipython:: python
data = 'a,b,c\n\n1,2,3\n\n\n4,5,6'
pd.read_csv(StringIO(data), skip_blank_lines=False)
.. warning::
The presence of ignored lines might create ambiguities involving line numbers;
the parameter ``header`` uses row numbers (ignoring commented/empty
lines), while ``skiprows`` uses line numbers (including commented/empty lines):
.. ipython:: python
data = '#comment\na,b,c\nA,B,C\n1,2,3'
pd.read_csv(StringIO(data), comment='#', header=1)
data = 'A,B,C\n#comment\na,b,c\n1,2,3'
pd.read_csv(StringIO(data), comment='#', skiprows=2)
If both ``header`` and ``skiprows`` are specified, ``header`` will be
relative to the end of ``skiprows``. For example:
.. ipython:: python
data = '# empty\n# second empty line\n# third empty' \
'line\nX,Y,Z\n1,2,3\nA,B,C\n1,2.,4.\n5.,NaN,10.0'
print(data)
pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)
.. _io.comments:
Comments
++++++++
Sometimes comments or meta data may be included in a file:
.. ipython:: python
:suppress:
data = ("ID,level,category\n"
"Patient1,123000,x # really unpleasant\n"
"Patient2,23000,y # wouldn't take his medicine\n"
"Patient3,1234018,z # awesome")
with open('tmp.csv', 'w') as fh:
fh.write(data)
.. ipython:: python
print(open('tmp.csv').read())
By default, the parser includes the comments in the output:
.. ipython:: python
df = pd.read_csv('tmp.csv')
df
We can suppress the comments using the ``comment`` keyword:
.. ipython:: python
df = pd.read_csv('tmp.csv', comment='#')
df
.. ipython:: python
:suppress:
os.remove('tmp.csv')
.. _io.unicode:
Dealing with Unicode Data
'''''''''''''''''''''''''
The ``encoding`` argument should be used for encoded unicode data, which will
result in byte strings being decoded to unicode in the result:
.. ipython:: python
data = b'word,length\nTr\xc3\xa4umen,7\nGr\xc3\xbc\xc3\x9fe,5'.decode('utf8').encode('latin-1')
df = pd.read_csv(BytesIO(data), encoding='latin-1')
df
df['word'][1]
Some formats which encode all characters as multiple bytes, like UTF-16, won't
parse correctly at all without specifying the encoding. `Full list of Python
standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_
.. _io.index_col:
Index columns and trailing delimiters
'''''''''''''''''''''''''''''''''''''
If a file has one more column of data than the number of column names, the
first column will be used as the DataFrame's row names:
.. ipython:: python
data = 'a,b,c\n4,apple,bat,5.7\n8,orange,cow,10'
pd.read_csv(StringIO(data))
.. ipython:: python
data = 'index,a,b,c\n4,apple,bat,5.7\n8,orange,cow,10'
pd.read_csv(StringIO(data), index_col=0)
Ordinarily, you can achieve this behavior using the ``index_col`` option.
There are some exception cases when a file has been prepared with delimiters at
the end of each data line, confusing the parser. To explicitly disable the
index column inference and discard the last column, pass ``index_col=False``:
.. ipython:: python
data = 'a,b,c\n4,apple,bat,\n8,orange,cow,'
print(data)
pd.read_csv(StringIO(data))
pd.read_csv(StringIO(data), index_col=False)
.. _io.parse_dates:
Date Handling
'''''''''''''
Specifying Date Columns
+++++++++++++++++++++++
To better facilitate working with datetime data, :func:`read_csv` and
:func:`read_table` use the keyword arguments ``parse_dates`` and ``date_parser``
to allow users to specify a variety of columns and date/time formats to turn the
input text data into ``datetime`` objects.
The simplest case is to just pass in ``parse_dates=True``:
.. ipython:: python
:suppress:
f = open('foo.csv','w')
f.write('date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5')
f.close()
.. ipython:: python
# Use a column as an index, and parse it as dates.
df = pd.read_csv('foo.csv', index_col=0, parse_dates=True)
df
# These are python datetime objects
df.index
It is often the case that we may want to store date and time data separately,
or store various date fields separately. The ``parse_dates`` keyword can be
used to specify a combination of columns to parse the dates and/or times from.
You can specify a list of column lists to ``parse_dates``, the resulting date
columns will be prepended to the output (so as to not affect the existing column
order) and the new column names will be the concatenation of the component
column names:
.. ipython:: python
:suppress:
data = ("KORD,19990127, 19:00:00, 18:56:00, 0.8100\n"
"KORD,19990127, 20:00:00, 19:56:00, 0.0100\n"
"KORD,19990127, 21:00:00, 20:56:00, -0.5900\n"
"KORD,19990127, 21:00:00, 21:18:00, -0.9900\n"
"KORD,19990127, 22:00:00, 21:56:00, -0.5900\n"
"KORD,19990127, 23:00:00, 22:56:00, -0.5900")
with open('tmp.csv', 'w') as fh:
fh.write(data)
.. ipython:: python
print(open('tmp.csv').read())
df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]])
df
By default the parser removes the component date columns, but you can choose
to retain them via the ``keep_date_col`` keyword:
.. ipython:: python
df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]],
keep_date_col=True)
df
Note that if you wish to combine multiple columns into a single date column, a
nested list must be used. In other words, ``parse_dates=[1, 2]`` indicates that
the second and third columns should each be parsed as separate date columns
while ``parse_dates=[[1, 2]]`` means the two columns should be parsed into a
single column.
You can also use a dict to specify custom name columns:
.. ipython:: python
date_spec = {'nominal': [1, 2], 'actual': [1, 3]}
df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec)
df
It is important to remember that if multiple text columns are to be parsed into
a single date column, then a new column is prepended to the data. The `index_col`
specification is based off of this new set of columns rather than the original
data columns:
.. ipython:: python
date_spec = {'nominal': [1, 2], 'actual': [1, 3]}
df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
index_col=0)  # index is the nominal column
df
.. note::
If a column or index contains an unparseable date, the entire column or
index will be returned unaltered as an object data type. For non-standard
datetime parsing, use :func:`to_datetime` after ``pd.read_csv``.
.. note::
read_csv has a fast-path for parsing datetime strings in iso8601 format,
e.g. "2000-01-01T00:01:02+00:00" and similar variations. If you can arrange
for your data to store datetimes in this format, load times will be
significantly faster; speed-ups of ~20x have been observed.
.. note::
When passing a dict as the `parse_dates` argument, the order of
the columns prepended is not guaranteed, because `dict` objects do not impose
an ordering on their keys. On Python 2.7+ you may use `collections.OrderedDict`
instead of a regular `dict` if this matters to you. Because of this, when using a
dict for 'parse_dates' in conjunction with the `index_col` argument, it's best to
specify `index_col` as a column label rather than as an index on the resulting frame.
Date Parsing Functions
++++++++++++++++++++++
Finally, the parser allows you to specify a custom ``date_parser`` function to
take full advantage of the flexibility of the date parsing API:
.. ipython:: python
import pandas.io.date_converters as conv
df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
date_parser=conv.parse_date_time)
df
Pandas will try to call the ``date_parser`` function in three different ways. If
an exception is raised, the next one is tried:
1. ``date_parser`` is first called with one or more arrays as arguments,
as defined using `parse_dates` (e.g., ``date_parser(['2013', '2013'], ['1', '2'])``)
2. If #1 fails, ``date_parser`` is called with all the columns
concatenated row-wise into a single array (e.g., ``date_parser(['2013 1', '2013 2'])``)
3. If #2 fails, ``date_parser`` is called once for every row with one or more
string arguments from the columns indicated with `parse_dates`
(e.g., ``date_parser('2013', '1')`` for the first row, ``date_parser('2013', '2')``
for the second, etc.)
Note that performance-wise, you should try these methods of parsing dates in order:
1. Try to infer the format using ``infer_datetime_format=True`` (see section below)
2. If you know the format, use ``pd.to_datetime()``:
``date_parser=lambda x: pd.to_datetime(x, format=...)``
3. If you have a really non-standard format, use a custom ``date_parser`` function.
For optimal performance, this should be vectorized, i.e., it should accept arrays
as arguments.
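For instance, a minimal sketch of option 2 above, using inline data whose format is known up front:

.. code-block:: python

   from pandas.compat import StringIO
   import pandas as pd

   data = 'date,value\n2000-01-01 09:00:00,1\n2000-01-02 09:30:00,2'

   # hand the known format to pd.to_datetime instead of relying on dateutil
   pd.read_csv(StringIO(data), parse_dates=['date'],
               date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S'))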
You can explore the date parsing functionality in ``date_converters.py`` and
add your own. We would love to turn this module into a community supported set
of date/time parsers. To get you started, ``date_converters.py`` contains
functions to parse dual date and time columns, year/month/day columns,
and year/month/day/hour/minute/second columns. It also contains a
``generic_parser`` function so you can curry it with a function that deals with
a single date rather than the entire array.
.. ipython:: python
:suppress:
os.remove('tmp.csv')
.. _io.dayfirst:
Inferring Datetime Format
+++++++++++++++++++++++++
If you have ``parse_dates`` enabled for some or all of your columns, and your
datetime strings are all formatted the same way, you may get a large speed
up by setting ``infer_datetime_format=True``. If set, pandas will attempt
to guess the format of your datetime strings, and then use a faster means
of parsing the strings. Parsing speed-ups of 5-10x have been observed. pandas
will fall back to the usual parsing if either the format cannot be guessed
or the format that was guessed cannot properly parse the entire column
of strings. So in general, ``infer_datetime_format`` should not have any
negative consequences if enabled.
Here are some examples of datetime strings that can be guessed (all
representing December 30th, 2011 at 00:00:00):
- "20111230"
- "2011/12/30"
- "20111230 00:00:00"
- "12/30/2011 00:00:00"
- "30/Dec/2011 00:00:00"
- "30/December/2011 00:00:00"
``infer_datetime_format`` is sensitive to ``dayfirst``. With
``dayfirst=True``, it will guess "01/12/2011" to be December 1st. With
``dayfirst=False`` (default) it will guess "01/12/2011" to be January 12th.
.. ipython:: python
# Try to infer the format for the index column
df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
infer_datetime_format=True)
df
.. ipython:: python
:suppress:
os.remove('foo.csv')
International Date Formats
++++++++++++++++++++++++++
While US date formats tend to be MM/DD/YYYY, many international formats use
DD/MM/YYYY instead. For convenience, a ``dayfirst`` keyword is provided:
.. ipython:: python
:suppress:
data = "date,value,cat\n1/6/2000,5,a\n2/6/2000,10,b\n3/6/2000,15,c"
with open('tmp.csv', 'w') as fh:
fh.write(data)
.. ipython:: python
print(open('tmp.csv').read())
pd.read_csv('tmp.csv', parse_dates=[0])
pd.read_csv('tmp.csv', dayfirst=True, parse_dates=[0])
.. _io.float_precision:
Specifying method for floating-point conversion
'''''''''''''''''''''''''''''''''''''''''''''''
The parameter ``float_precision`` can be specified in order to use
a specific floating-point converter during parsing with the C engine.
The options are the ordinary converter, the high-precision converter, and
the round-trip converter (which is guaranteed to round-trip values after
writing to a file). For example:
.. ipython:: python
val = '0.3066101993807095471566981359501369297504425048828125'
data = 'a,b,c\n1,2,{0}'.format(val)
abs(pd.read_csv(StringIO(data), engine='c', float_precision=None)['c'][0] - float(val))
abs(pd.read_csv(StringIO(data), engine='c', float_precision='high')['c'][0] - float(val))
abs(pd.read_csv(StringIO(data), engine='c', float_precision='round_trip')['c'][0] - float(val))
.. _io.thousands:
Thousand Separators
'''''''''''''''''''
For large numbers that have been written with a thousands separator, you can
set the ``thousands`` keyword to a string of length 1 so that integers will be parsed
correctly:
.. ipython:: python
:suppress:
data = ("ID|level|category\n"
"Patient1|123,000|x\n"
"Patient2|23,000|y\n"
"Patient3|1,234,018|z")
with open('tmp.csv', 'w') as fh:
fh.write(data)
By default, numbers with a thousands separator will be parsed as strings:
.. ipython:: python
print(open('tmp.csv').read())
df = pd.read_csv('tmp.csv', sep='|')