ENH: Add option in read_csv to infer compression type from filename #9770

Merged
merged 1 commit into from
Apr 18, 2015
2 changes: 2 additions & 0 deletions doc/source/io.rst
@@ -89,6 +89,8 @@ They can take a number of arguments:
   - ``delim_whitespace``: Parse whitespace-delimited (spaces or tabs) file
     (much faster than using a regular expression)
   - ``compression``: decompress ``'gzip'`` and ``'bz2'`` formats on the fly.
+    Set to ``'infer'`` (the default) to guess a format based on the file
+    extension.
   - ``dialect``: string or :class:`python:csv.Dialect` instance to expose more
     ways to specify the file format
   - ``dtype``: A data type name or a dict of column name to data type. If not
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.16.1.txt
100644 → 100755
@@ -80,6 +80,7 @@ API changes
 - :meth:`~pandas.DataFrame.assign` now inserts new columns in alphabetical order. Previously
   the order was arbitrary. (:issue:`9777`)

+- By default, ``read_csv`` and ``read_table`` will now try to infer the compression type based on the file extension. Set ``compression=None`` to restore the previous behavior (no decompression). (:issue:`9770`)

 .. _whatsnew_0161.performance:

23 changes: 19 additions & 4 deletions pandas/io/parsers.py
100644 → 100755
@@ -56,8 +56,11 @@ class ParserWarning(Warning):
 dtype : Type name or dict of column -> type
     Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
     (Unsupported with engine='python')
-compression : {'gzip', 'bz2', None}, default None
-    For on-the-fly decompression of on-disk data
+compression : {'gzip', 'bz2', 'infer', None}, default 'infer'
+    For on-the-fly decompression of on-disk data. If 'infer', then use gzip or
+    bz2 if filepath_or_buffer is a string ending in '.gz' or '.bz2',
+    respectively, and no decompression otherwise. Set to None for no
+    decompression.
 dialect : string or csv.Dialect instance, default None
     If None defaults to Excel dialect. Ignored if sep longer than 1 char
     See csv.Dialect documentation for more details
@@ -295,7 +298,7 @@ def _read(filepath_or_buffer, kwds):
     'verbose': False,
     'encoding': None,
     'squeeze': False,
-    'compression': None,
+    'compression': 'infer',
     'mangle_dupe_cols': True,
     'tupleize_cols': False,
     'infer_datetime_format': False,
@@ -335,7 +338,7 @@ def _make_parser_function(name, sep=','):
     def parser_f(filepath_or_buffer,
                  sep=sep,
                  dialect=None,
-                 compression=None,
+                 compression='infer',

                  doublequote=True,
                  escapechar=None,
@@ -1317,6 +1320,7 @@ def _wrap_compressed(f, compression, encoding=None):
     """
     compression = compression.lower()
     encoding = encoding or get_option('display.encoding')
+
     if compression == 'gzip':
         import gzip

@@ -1389,6 +1393,17 @@ def __init__(self, f, **kwds):
         self.comment = kwds['comment']
         self._comment_lines = []

+        if self.compression == 'infer':
+            if isinstance(f, compat.string_types):
+                if f.endswith('.gz'):
+                    self.compression = 'gzip'
+                elif f.endswith('.bz2'):
+                    self.compression = 'bz2'
+                else:
+                    self.compression = None
+            else:
+                self.compression = None
+
         if isinstance(f, compat.string_types):
             f = com._get_handle(f, 'r', encoding=self.encoding,
                                 compression=self.compression)
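
The inference rule added to the Python parser in this hunk can be sketched as a standalone function. This is a minimal illustration, not the pandas implementation: `resolve_compression` is a hypothetical name, and it checks `str` where the diff uses `compat.string_types`, which assumes Python 3.

```python
def resolve_compression(filepath_or_buffer, compression='infer'):
    """Return 'gzip', 'bz2', or None (hypothetical helper, not a pandas API)."""
    if compression != 'infer':
        # Explicit values ('gzip', 'bz2', None) are passed through unchanged.
        return compression
    if isinstance(filepath_or_buffer, str):
        if filepath_or_buffer.endswith('.gz'):
            return 'gzip'
        if filepath_or_buffer.endswith('.bz2'):
            return 'bz2'
    # Non-string inputs (open files, buffers) and any other extension
    # fall back to no decompression.
    return None


print(resolve_compression('data.csv.gz'))
print(resolve_compression('data.csv.bz2'))
print(resolve_compression('data.csv'))
print(resolve_compression('data.csv.gz', compression=None))
```

Because `endswith` is case-sensitive, a name such as `DATA.CSV.GZ` would be read without decompression under this rule.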
Binary file added pandas/io/tests/data/test1.csv.bz2
Binary file not shown.
Binary file added pandas/io/tests/data/test1.csv.gz
Binary file not shown.
15 changes: 15 additions & 0 deletions pandas/io/tests/test_parsers.py
100644 → 100755
@@ -1098,6 +1098,21 @@ def test_read_csv_no_index_name(self):
         self.assertEqual(df.ix[:, ['A', 'B', 'C', 'D']].values.dtype, np.float64)
         tm.assert_frame_equal(df, df2)

+    def test_read_csv_infer_compression(self):
+        # GH 9770
+        expected = self.read_csv(self.csv1, index_col=0, parse_dates=True)
Review comment (Contributor): add the issue number here as a comment

+        inputs = [self.csv1, self.csv1 + '.gz',
+                  self.csv1 + '.bz2', open(self.csv1)]
+
+        for f in inputs:
+            df = self.read_csv(f, index_col=0, parse_dates=True,
+                               compression='infer')
+
+            tm.assert_frame_equal(expected, df)
+
+        inputs[3].close()
+
     def test_read_table_unicode(self):
         fin = BytesIO(u('\u0141aski, Jan;1').encode('utf-8'))
         df1 = read_table(fin, sep=";", encoding="utf-8", header=None)
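
The new test feeds the same data to the reader as a plain path, a `.gz` path, a `.bz2` path, and an open file object, and expects identical frames. A rough stand-alone analogue of the three path cases, using only the standard library, is sketched below. The sample rows, the `open_text`, `read_rows`, and `roundtrip_all` helpers, and the temporary-directory layout are all assumptions made for illustration; the real test reads the pandas fixtures `test1.csv`, `test1.csv.gz`, and `test1.csv.bz2` through `read_csv`.

```python
import bz2
import csv
import gzip
import os
import tempfile

# Made-up sample rows; these are NOT the contents of the pandas fixture.
ROWS = [['index', 'A', 'B'],
        ['2000-01-03', '0.98', '0.03'],
        ['2000-01-04', '1.04', '-0.12']]


def open_text(path):
    """Open path for text reading, picking a decompressor from its extension."""
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', newline='')
    if path.endswith('.bz2'):
        return bz2.open(path, 'rt', newline='')
    return open(path, 'r', newline='')


def read_rows(path):
    """Parse the CSV at path into a list of rows."""
    with open_text(path) as handle:
        return list(csv.reader(handle))


def roundtrip_all():
    """Write ROWS as plain, gzip, and bz2 CSV files, then read each back."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, 'test1.csv')
        writers = {plain: open,
                   plain + '.gz': gzip.open,
                   plain + '.bz2': bz2.open}
        for path, opener in writers.items():
            with opener(path, 'wt', newline='') as handle:
                csv.writer(handle).writerows(ROWS)
        for path in writers:
            results[os.path.basename(path)] = read_rows(path)
    return results


results = roundtrip_all()
print(all(rows == ROWS for rows in results.values()))
```

Generating the compressed files on the fly keeps the sketch self-contained; the pull request instead checks binary fixtures into `pandas/io/tests/data/`.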
11 changes: 11 additions & 0 deletions pandas/parser.pyx
@@ -541,6 +541,17 @@ cdef class TextReader:
         self.parser.cb_io = NULL
         self.parser.cb_cleanup = NULL

+        if self.compression == 'infer':
+            if isinstance(source, basestring):
+                if source.endswith('.gz'):
+                    self.compression = 'gzip'
+                elif source.endswith('.bz2'):
+                    self.compression = 'bz2'
+                else:
+                    self.compression = None
+            else:
+                self.compression = None
+
         if self.compression:
             if self.compression == 'gzip':
                 import gzip