ENH: Added colspecs detection to read_fwf #4955


Merged
merged 1 commit into from
Sep 30, 2013
1 change: 1 addition & 0 deletions .gitignore
@@ -41,3 +41,4 @@ pandas/io/*.json

.project
.pydevproject
.settings
23 changes: 19 additions & 4 deletions doc/source/io.rst
@@ -742,10 +742,13 @@ function works with data files that have known and fixed column widths.
The function parameters to ``read_fwf`` are largely the same as `read_csv` with
two extra parameters:

- ``colspecs``: a list of pairs (tuples), giving the extents of the
fixed-width fields of each line as half-open intervals [from, to[
- ``widths``: a list of field widths, which can be used instead of
``colspecs`` if the intervals are contiguous
- ``colspecs``: A list of pairs (tuples) giving the extents of the
fixed-width fields of each line as half-open intervals (i.e., [from, to[ ).
The string value 'infer' can be used to instruct the parser to try
detecting the column specifications from the first 100 rows of the
data. The default behaviour, if not specified, is to infer.
- ``widths``: A list of field widths which can be used instead of 'colspecs'
if the intervals are contiguous.
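As an illustration of these two parameters, here is a minimal sketch (the sample data is hypothetical, not from the PR) showing that contiguous ``widths`` and explicit ``colspecs`` describe the same layout:

```python
from io import StringIO

import pandas as pd

data = "20090101a1.0\n20090102b2.0\n"

# Explicit half-open [from, to[ intervals...
df1 = pd.read_fwf(StringIO(data), colspecs=[(0, 8), (8, 9), (9, 12)],
                  header=None)
# ...or contiguous field widths, which read_fwf converts internally
# to the equivalent colspecs.
df2 = pd.read_fwf(StringIO(data), widths=[8, 1, 3], header=None)
```

Both calls should produce an identical 2x3 frame.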

.. ipython:: python
:suppress:
@@ -789,6 +792,18 @@ column widths for contiguous columns:
The parser will take care of extra white spaces around the columns
so it's ok to have extra separation between the columns in the file.

.. versionadded:: 0.13.0

By default, ``read_fwf`` will try to infer the file's ``colspecs`` by using
the first 100 rows of the file. It can only do so when the columns are
aligned and correctly separated by the provided ``delimiter`` (the default
delimiter is whitespace).

.. ipython:: python

df = pd.read_fwf('bar.csv', header=None, index_col=0)
df

.. ipython:: python
:suppress:

1 change: 1 addition & 0 deletions doc/source/release.rst
@@ -59,6 +59,7 @@ New features
- Added ``isin`` method to DataFrame (:issue:`4211`)
- Clipboard functionality now works with PySide (:issue:`4282`)
- New ``extract`` string method returns regex matches more conveniently (:issue:`4685`)
- Auto-detect field widths in read_fwf when unspecified (:issue:`4488`)

Experimental Features
~~~~~~~~~~~~~~~~~~~~~
3 changes: 3 additions & 0 deletions doc/source/v0.13.0.txt
@@ -421,6 +421,9 @@ Enhancements

can also be used.
- ``read_stata`` now accepts Stata 13 format (:issue:`4291`)
- ``read_fwf`` now infers the column specifications from the first 100 rows of
the file if the data has correctly separated and properly aligned columns
using the delimiter provided to the function (:issue:`4488`).

.. _whatsnew_0130.experimental:

98 changes: 69 additions & 29 deletions pandas/io/parsers.py
@@ -160,11 +160,15 @@
""" % (_parser_params % _table_sep)

_fwf_widths = """\
colspecs : a list of pairs (tuples), giving the extents
of the fixed-width fields of each line as half-open internals
(i.e., [from, to[ ).
widths : a list of field widths, which can be used instead of
'colspecs' if the intervals are contiguous.
colspecs : list of pairs (int, int) or 'infer', optional
A list of pairs (tuples) giving the extents of the fixed-width
fields of each line as half-open intervals (i.e., [from, to[ ).
String value 'infer' can be used to instruct the parser to try
detecting the column specifications from the first 100 rows of
the data (default='infer').
widths : list of ints, optional
A list of field widths which can be used instead of 'colspecs' if
the intervals are contiguous.
"""

_read_fwf_doc = """
@@ -184,7 +188,8 @@ def _read(filepath_or_buffer, kwds):
if skipfooter is not None:
kwds['skip_footer'] = skipfooter

filepath_or_buffer, _ = get_filepath_or_buffer(filepath_or_buffer)
filepath_or_buffer, _ = get_filepath_or_buffer(filepath_or_buffer,
encoding)

if kwds.get('date_parser', None) is not None:
if isinstance(kwds['parse_dates'], bool):
@@ -267,8 +272,8 @@ def _read(filepath_or_buffer, kwds):
}

_fwf_defaults = {
'colspecs': None,
'widths': None
'colspecs': 'infer',
'widths': None,
}

_c_unsupported = set(['skip_footer'])
@@ -412,15 +417,15 @@ def parser_f(filepath_or_buffer,


@Appender(_read_fwf_doc)
def read_fwf(filepath_or_buffer, colspecs=None, widths=None, **kwds):
def read_fwf(filepath_or_buffer, colspecs='infer', widths=None, **kwds):
# Check input arguments.
if colspecs is None and widths is None:
raise ValueError("Must specify either colspecs or widths")
elif colspecs is not None and widths is not None:
elif colspecs not in (None, 'infer') and widths is not None:
raise ValueError("You must specify only one of 'widths' and "
"'colspecs'")

# Compute 'colspec' from 'widths', if specified.
# Compute 'colspecs' from 'widths', if specified.
if widths is not None:
colspecs, col = [], 0
for w in widths:
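The conversion loop above is truncated in the diff; as a sketch (assuming it matches the merged code), the computation simply turns cumulative widths into half-open intervals:

```python
# Convert a list of field widths into half-open (from, to) column specs,
# mirroring the widths -> colspecs loop in read_fwf.
widths = [6, 10, 4]
colspecs, col = [], 0
for w in widths:
    colspecs.append((col, col + w))
    col += w

# colspecs is now [(0, 6), (6, 16), (16, 20)]
```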
@@ -519,7 +524,8 @@ def _clean_options(self, options, engine):
engine = 'python'
elif sep is not None and len(sep) > 1:
# wait until regex engine integrated
engine = 'python'
if engine not in ('python', 'python-fwf'):
engine = 'python'

# C engine not supported yet
if engine == 'c':
@@ -2012,31 +2018,65 @@ class FixedWidthReader(object):
"""
A reader of fixed-width lines.
"""
def __init__(self, f, colspecs, filler, thousands=None, encoding=None):
def __init__(self, f, colspecs, delimiter, comment):
self.f = f
self.colspecs = colspecs
self.filler = filler # Empty characters between fields.
self.thousands = thousands
if encoding is None:
encoding = get_option('display.encoding')
self.encoding = encoding

if not isinstance(colspecs, (tuple, list)):
self.buffer = None
self.delimiter = '\r\n' + delimiter if delimiter else '\n\r\t '
self.comment = comment
if colspecs == 'infer':
self.colspecs = self.detect_colspecs()
else:
self.colspecs = colspecs

if not isinstance(self.colspecs, (tuple, list)):
raise TypeError("column specifications must be a list or tuple, "
"input was a %r" % type(colspecs).__name__)

for colspec in colspecs:
for colspec in self.colspecs:
if not (isinstance(colspec, (tuple, list)) and
len(colspec) == 2 and
isinstance(colspec[0], int) and
isinstance(colspec[1], int)):
len(colspec) == 2 and
isinstance(colspec[0], (int, np.integer)) and
isinstance(colspec[1], (int, np.integer))):
raise TypeError('Each column specification must be '
'2 element tuple or list of integers')

def get_rows(self, n):
rows = []
for i, row in enumerate(self.f, 1):
rows.append(row)
if i >= n:
break
self.buffer = iter(rows)
return rows

def detect_colspecs(self, n=100):
# Regex escape the delimiters
delimiters = ''.join([r'\%s' % x for x in self.delimiter])
pattern = re.compile('([^%s]+)' % delimiters)
rows = self.get_rows(n)
max_len = max(map(len, rows))
mask = np.zeros(max_len + 1, dtype=int)
if self.comment is not None:
rows = [row.partition(self.comment)[0] for row in rows]
for row in rows:
for m in pattern.finditer(row):
mask[m.start():m.end()] = 1
shifted = np.roll(mask, 1)
shifted[0] = 0
edges = np.where((mask ^ shifted) == 1)[0]
return list(zip(edges[::2], edges[1::2]))
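A standalone sketch of the idea behind ``detect_colspecs`` (using hypothetical sample rows): mark every non-delimiter character position across the sampled rows, then XOR the mask with a shifted copy of itself so the positions where it flips become the column edges:

```python
import re

import numpy as np

rows = ['aaa  bb   cccc',
        'a    b    cc  ',
        ' aa  bbb  c   ']

# Any character that is not the delimiter (a plain space here) counts as data.
pattern = re.compile(r'([^ ]+)')
max_len = max(map(len, rows))
mask = np.zeros(max_len + 1, dtype=int)
for row in rows:
    for m in pattern.finditer(row):
        mask[m.start():m.end()] = 1

# A 0->1 or 1->0 transition in the mask marks a column boundary.
shifted = np.roll(mask, 1)
shifted[0] = 0
edges = np.where((mask ^ shifted) == 1)[0]
colspecs = list(zip(edges[::2], edges[1::2]))
# colspecs: [(0, 3), (5, 8), (10, 14)]
```

Pairing the even- and odd-indexed edges yields the half-open intervals that cover every column seen in the sample.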

def next(self):
line = next(self.f)
if self.buffer is not None:
try:
line = next(self.buffer)
except StopIteration:
self.buffer = None
line = next(self.f)
else:
line = next(self.f)
# Note: 'colspecs' is a sequence of half-open intervals.
return [line[fromm:to].strip(self.filler or ' ')
return [line[fromm:to].strip(self.delimiter)
for (fromm, to) in self.colspecs]

# Iterator protocol in Python 3 uses __next__()
@@ -2050,10 +2090,10 @@ class FixedWidthFieldParser(PythonParser):
"""
def __init__(self, f, **kwds):
# Support iterators, convert to a list.
self.colspecs = list(kwds.pop('colspecs'))
self.colspecs = kwds.pop('colspecs')

PythonParser.__init__(self, f, **kwds)

def _make_reader(self, f):
self.data = FixedWidthReader(f, self.colspecs, self.delimiter,
encoding=self.encoding)
self.comment)
111 changes: 106 additions & 5 deletions pandas/io/tests/test_parsers.py
@@ -1706,7 +1706,7 @@ def test_utf16_example(self):
self.assertEquals(len(result), 50)

def test_converters_corner_with_nas(self):
# skip aberration observed on Win64 Python 3.2.2
# skip aberration observed on Win64 Python 3.2.2
if hash(np.int64(-1)) != -2:
raise nose.SkipTest("skipping because of windows hash on Python"
" 3.2.2")
@@ -2078,19 +2078,19 @@ def test_fwf(self):
read_fwf(StringIO(data3), colspecs=colspecs, widths=[6, 10, 10, 7])

with tm.assertRaisesRegexp(ValueError, "Must specify either"):
read_fwf(StringIO(data3))
read_fwf(StringIO(data3), colspecs=None, widths=None)

def test_fwf_colspecs_is_list_or_tuple(self):
with tm.assertRaisesRegexp(TypeError,
'column specifications must be a list or '
'tuple.+'):
fwr = pd.io.parsers.FixedWidthReader(StringIO(self.data1),
{'a': 1}, ',')
pd.io.parsers.FixedWidthReader(StringIO(self.data1),
{'a': 1}, ',', '#')

def test_fwf_colspecs_is_list_or_tuple_of_two_element_tuples(self):
with tm.assertRaisesRegexp(TypeError,
'Each column specification must be.+'):
read_fwf(StringIO(self.data1), {'a': 1})
read_fwf(StringIO(self.data1), [('a', 1)])

def test_fwf_regression(self):
# GH 3594
@@ -2223,6 +2223,107 @@ def test_iteration_open_handle(self):
expected = Series(['DDD', 'EEE', 'FFF', 'GGG'])
tm.assert_series_equal(result, expected)


class TestFwfColspaceSniffing(unittest.TestCase):
def test_full_file(self):
# File with all values
test = '''index A B C
2000-01-03T00:00:00 0.980268513777 3 foo
2000-01-04T00:00:00 1.04791624281 -4 bar
2000-01-05T00:00:00 0.498580885705 73 baz
2000-01-06T00:00:00 1.12020151869 1 foo
2000-01-07T00:00:00 0.487094399463 0 bar
2000-01-10T00:00:00 0.836648671666 2 baz
2000-01-11T00:00:00 0.157160753327 34 foo'''
colspecs = ((0, 19), (21, 35), (38, 40), (42, 45))
expected = read_fwf(StringIO(test), colspecs=colspecs)
tm.assert_frame_equal(expected, read_fwf(StringIO(test)))

def test_full_file_with_missing(self):
# File with missing values
test = '''index A B C
2000-01-03T00:00:00 0.980268513777 3 foo
2000-01-04T00:00:00 1.04791624281 -4 bar
0.498580885705 73 baz
2000-01-06T00:00:00 1.12020151869 1 foo
2000-01-07T00:00:00 0 bar
2000-01-10T00:00:00 0.836648671666 2 baz
34'''
colspecs = ((0, 19), (21, 35), (38, 40), (42, 45))
expected = read_fwf(StringIO(test), colspecs=colspecs)
tm.assert_frame_equal(expected, read_fwf(StringIO(test)))

def test_full_file_with_spaces(self):
# File with spaces in columns
test = '''
Account Name Balance CreditLimit AccountCreated
101 Keanu Reeves 9315.45 10000.00 1/17/1998
312 Gerard Butler 90.00 1000.00 8/6/2003
868 Jennifer Love Hewitt 0 17000.00 5/25/1985
761 Jada Pinkett-Smith 49654.87 100000.00 12/5/2006
317 Bill Murray 789.65 5000.00 2/5/2007
'''.strip('\r\n')
colspecs = ((0, 7), (8, 28), (30, 38), (42, 53), (56, 70))
expected = read_fwf(StringIO(test), colspecs=colspecs)
tm.assert_frame_equal(expected, read_fwf(StringIO(test)))

def test_full_file_with_spaces_and_missing(self):
# File with spaces and missing values in columns
test = '''
Account Name Balance CreditLimit AccountCreated
101 10000.00 1/17/1998
312 Gerard Butler 90.00 1000.00 8/6/2003
868 5/25/1985
761 Jada Pinkett-Smith 49654.87 100000.00 12/5/2006
317 Bill Murray 789.65
'''.strip('\r\n')
colspecs = ((0, 7), (8, 28), (30, 38), (42, 53), (56, 70))
expected = read_fwf(StringIO(test), colspecs=colspecs)
tm.assert_frame_equal(expected, read_fwf(StringIO(test)))

def test_messed_up_data(self):
# Completely messed up file
test = '''
Account Name Balance Credit Limit Account Created
101 10000.00 1/17/1998
312 Gerard Butler 90.00 1000.00

761 Jada Pinkett-Smith 49654.87 100000.00 12/5/2006
317 Bill Murray 789.65
'''.strip('\r\n')
colspecs = ((2, 10), (15, 33), (37, 45), (49, 61), (64, 79))
expected = read_fwf(StringIO(test), colspecs=colspecs)
tm.assert_frame_equal(expected, read_fwf(StringIO(test)))

def test_multiple_delimiters(self):
test = r'''
col1~~~~~col2 col3++++++++++++++++++col4
~~22.....11.0+++foo~~~~~~~~~~Keanu Reeves
33+++122.33\\\bar.........Gerard Butler
++44~~~~12.01 baz~~Jennifer Love Hewitt
~~55 11+++foo++++Jada Pinkett-Smith
..66++++++.03~~~bar Bill Murray
'''.strip('\r\n')
colspecs = ((0, 4), (7, 13), (15, 19), (21, 41))
expected = read_fwf(StringIO(test), colspecs=colspecs,
delimiter=' +~.\\')
tm.assert_frame_equal(expected, read_fwf(StringIO(test),
delimiter=' +~.\\'))

def test_variable_width_unicode(self):
if not compat.PY3:
raise nose.SkipTest('Bytes-related test - only needs to work on Python 3')
test = '''
שלום שלום
ום שלל
של ום
'''.strip('\r\n')
expected = pd.read_fwf(BytesIO(test.encode('utf8')),
colspecs=[(0, 4), (5, 9)], header=None)
tm.assert_frame_equal(expected, read_fwf(BytesIO(test.encode('utf8')),
header=None))

Contributor: so this one actually needs to work on Python 2 as well. I know it's a little annoying to do that. Also, can this handle non-utf8?

Contributor (author): I just copied it from test_BytesIO_input line 2103, which also skips it.

Contributor (author): And also the problem is that in Python 2 I don't get a unicode string from the reader even if I pass the encoding. I think this is a bug in the reader :-/

Contributor: I'm sure this isn't the case, but are you sure it isn't because we've missed a function call somewhere that was supposed to get an encoding?

Contributor (author): I'm sure that in the FixedWidthReader, when I call next(self.f), I get <type 'str'> :) Is that what I should get?

Contributor (author): @jtratner I'm not familiar with the read1 issue.

And the problem with StringIO is the following:

In [2]: type(pd.compat.StringIO(u'test').read())
Out[2]: unicode

In [3]: type(pd.compat.StringIO('test').read())
Out[3]: str

If I get a StringIO object, I have no idea whether I will get bytes or strings from it. On the other hand:

In [7]: type(io.StringIO(u'test').read())
Out[7]: unicode

In [8]: type(io.StringIO('test').read())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-ec84dbdb83c4> in <module>()
----> 1 type(io.StringIO('test').read())

TypeError: initial_value must be unicode or None, not str

Contributor: not a huge deal; I've encountered it in other places. Anyway, many test cases use StringIO(some_str) to pass into functions, so we can't really deprecate support for it.

Help me out here, though: why does it actually matter to wrap in Python 2? The issue in Python 3 is that you can't use string methods on bytes, but in Python 2 that's not a problem. Anyway, we should probably put this in a separate issue so we don't get too sidetracked here.

Contributor (author): @jtratner "The issue in Python 3 is that you can't use string methods on bytes, but in Python 2 that's not a problem."
This is not true. Yes, you can call them, but they won't produce a correct result. Here is a simple example:

>>> s = u'абвгдђежз'
>>> print s.upper()
АБВГДЂЕЖЗ
>>> print s.encode('utf8').upper()
абвгдђежз
>>> print s.encode('utf8').upper().decode('utf-8')
абвгдђежз

The method won't raise an error, but the result won't be correct. That's why it's important to wrap everything as unicode. None of the Series.str methods work correctly with Python 2 byte strings if you have non-ASCII data. Example:

In [21]: s1 = pd.Series(['фоо', 'бар', 'баз'])

In [22]: s2 = pd.Series([u'фоо', u'бар', u'баз'])

In [23]: s1.str.upper()
Out[23]:
0    фоо
1    бар
2    баз
dtype: object

In [24]: s2.str.upper()
Out[24]:
0    ФОО
1    БАР
2    БАЗ
dtype: object

I encountered the same problem when I wrote the column detection for the read_fwf function. Unless the strings are unicode, I cannot determine the correct widths of the columns for variable-width encodings like utf-8.
Yes, of course I could take the encoding, check whether the thing I got is bytes, decode it, calculate what I need, and so on. But that is not DRY: you would have to do it all over the place, in every method that does something with strings.
Instead, my proposal is to fix the PythonParser.__init__ function to create a stream object (f) that always returns unicode strings, and keep the logic in one place. The same would apply to the C parser.

Contributor: Yes, that's basically what is already the case with Python 3 (you might have to put it in two places because of how things are currently set up). So are you thinking you'd always read StringIO fully (or at least one line) and then handle it appropriately (remember there are both cStringIO and StringIO in Python 2)? Then after that you could use TextIOWrapper, right?

Contributor (author): Actually there are three :D - cStringIO.StringIO, which accepts only bytes or strings that can be decoded with the ascii codec; StringIO.StringIO, which accepts both bytes and str and returns the same type it was created with; and io.StringIO, which accepts only str (I'm using the Python 3 terminology, so str == unicode).

Yes, the idea is to create something like django's safe_str method, but a SafeStringIO that will accept any of them and always emit unicode...

But I'll create a new issue and we can discuss it there; I'll reference this conversation from it.
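The unicode concern raised in the review discussion above can be seen directly: column offsets counted on utf-8 bytes disagree with the character positions a user sees, so width detection has to operate on decoded text. A small illustration:

```python
# Cyrillic letters occupy two bytes each in utf-8, so byte offsets and
# character offsets diverge as soon as the data is non-ASCII.
s = 'шлм ab'
b = s.encode('utf-8')
print(len(s))  # 6 characters
print(len(b))  # 9 bytes
```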


class TestCParserHighMemory(ParserTests, unittest.TestCase):

def read_csv(self, *args, **kwds):