ENH: Added colspecs detection to read_fwf #4955

Merged
merged 1 commit into from
Sep 30, 2013
Conversation

alefnula
Contributor

closes #4488

Implemented an algorithm that uses a bitmask to detect the gaps between the columns.
Also, the reader buffers the lines used for detection in case its input is not seekable.
Added tests.
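The bitmask idea can be sketched in plain Python (a simplified illustration, not the actual pandas implementation; the name `infer_colspecs` is hypothetical):

```python
import re

def infer_colspecs(lines, delimiters=" \t"):
    # Mark every character position occupied by a non-delimiter
    # character in ANY of the sampled lines (the "bitmask"), then
    # read off the runs of marked positions as (start, end) spans.
    width = max(len(line) for line in lines)
    mask = [False] * (width + 1)  # trailing False closes the last run
    field = re.compile(r"[^{}]+".format(re.escape(delimiters)))
    for line in lines:
        for m in field.finditer(line):
            for i in range(m.start(), m.end()):
                mask[i] = True
    colspecs, start = [], None
    for i, occupied in enumerate(mask):
        if occupied and start is None:
            start = i
        elif not occupied and start is not None:
            colspecs.append((start, i))
            start = None
    return colspecs
```

For example, `infer_colspecs(["id  name", "1   foo", "23  barbaz"])` returns `[(0, 2), (4, 10)]`: each span covers the widest extent any row's field occupies, and the untouched gap between them becomes the column boundary.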

self.f = f
self.colspecs = colspecs
self.filler = filler # Empty characters between fields.
Contributor

what happened to encoding? is thousands necessary? (I get filler can prob be removed)

Contributor Author

They were not used anywhere in this class, so I removed them. They are used in the PythonParser, not the FixedWidthReader. Also, I renamed filler to delimiter to be consistent with other functions.

Contributor

ok....encoding should stay in....it's prob not tested....can you add something for that? (see the tests for read_csv)

Contributor Author

encoding is used in the FixedWidthFieldParser, not FixedWidthReader. All tests passed. I could return it if you want, but as I said it's not used anywhere...

@jreback
Contributor

jreback commented Sep 23, 2013

@alefnula pls add an entry in release notes and a v0.13.0 example (and put the same example in io.rst). you can do these at the end....just an FYI

@alefnula
Contributor Author

@jreback Yes, I was waiting for some feedback first. :)

@jtratner
Contributor

@alefnula apparently my comment got lost: what happens if you have many many columns (e.g., 10,000) or a first row that is very long (maybe 100,000 characters or something)? Will your implementation still work?

@alefnula
Contributor Author

@jtratner It will work. But it won't be very efficient if you have 10k columns. Character lengths of the rows are not such a big deal; I allocate just one ndarray of 100,000 ints, which is 800k. That is the only structure I need... and the buffer. But I could remove the buffering if it could be guaranteed that the item passed to the reader is seekable.

Maybe the PythonParser._make_reader should take care of buffering...
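For scale, the single mask array mentioned above is cheap. A quick check of the 800k figure (a sketch assuming one 8-byte int per character position):

```python
import numpy as np

# One int64 slot per character position of a 100,000-character row:
# 100_000 positions * 8 bytes each = 800,000 bytes (~800 kB).
mask = np.zeros(100_000, dtype=np.int64)
print(mask.nbytes)  # 800000
```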

@jreback
Contributor

jreback commented Sep 23, 2013

@alefnula I think buffering is fine as well as what you are doing. The time you spend determining widths is a) much lower than human time doing it!, b) almost always less than actually reading the file....

@jreback
Contributor

jreback commented Sep 23, 2013

what about doing an API like: widths='infer' to enable this? (I understand you then don't get specific rows/number of rows....but I view that as a real edge case)

@alefnula
Contributor Author

@jreback As I said, whatever you like :) I would just use: colspecs='infer' since I'm actually inferring the colspecs...

Or maybe if either of the two is 'infer'.

@jreback
Contributor

jreback commented Sep 23, 2013

@alefnula how useful do you think it is to inspect specific rows? (as opposed to just the top 100)

@alefnula
Contributor Author

@jreback At first I thought it could be useful, but when I started writing the tests I found out that it's much more painful to determine the correct rows from which the colspecs can be inferred than to actually count the column widths. So, now that I've played with it a little, I think it's completely useless :D Except maybe in some really, really rare cases where one of the rows determines the width.

But even then it'll infer correctly using this algorithm... When I added that parameter I had another algorithm in mind, but it turned out that it wasn't correct.

@jreback
Contributor

jreback commented Sep 23, 2013

ok, in that case I would turn auto-detect on if 'infer' is passed for either widths or colspecs

@alefnula
Contributor Author

@jreback Deal. But that will have to wait until tomorrow, it's 2AM here :D

@alefnula
Contributor Author

@jreback I would maybe just change one more thing: enable only the colspecs parameter to accept 'infer'. It's duplication if both parameters accept 'infer', and it may confuse the user into thinking that we do something different if he passes widths='infer' compared to colspecs='infer'.

Maybe even let it be the default if none of the above is specified?

@jreback
Contributor

jreback commented Sep 24, 2013

@alefnula that's fine.....

@alefnula
Contributor Author

@jreback Which part? Just colspecs accepting 'infer', or auto-detection being the default fallback mechanism?

@jreback
Contributor

jreback commented Sep 24, 2013

i like only accepting 'infer' on colspecs. I would also agree that if both colspecs and widths are None then inferring is fine. (so basically colspecs then defaults to 'infer'; if for some reason the user sets colspecs to None, the existing error message can come up)

@alefnula
Contributor Author

Everything done. If I missed something or need to add something more, please tell me.

@@ -789,6 +791,19 @@ column widths for contiguous columns:
The parser will take care of extra white spaces around the columns
so it's ok to have extra separation between the columns in the file.

If your data file has correctly separated columns using the delimiter provided
to the ``read_fwf`` function - like in the case of the ``bar.csv`` file, where
Contributor

I would put a similar blurb in v0.13.0.txt....to 'announce' this new feature

@jreback
Contributor

jreback commented Sep 24, 2013

@alefnula pls rebase and squash

@cpcloud comments?

@jreback
Contributor

jreback commented Sep 28, 2013

@cpcloud @jtratner @y-p @wesm comments?

I think this is ok

@@ -789,6 +791,21 @@ column widths for contiguous columns:
The parser will take care of extra white spaces around the columns
so it's ok to have extra separation between the columns in the file.

.. versionadded:: 0.13.0

If your data file has correctly separated columns using the delimiter provided
Contributor

How about making this paragraph shorter.

By default, read_fwf will try to infer the file's colspecs by using the first 100 rows of the file.

Contributor Author

@jtratner I wanted to point out that the columns must be clearly (visually) separated. You cannot have:

12foo
23bar
34baz

And expect to get two columns using the sniffing. Also they must be a valid fwf, so this:

12 foo
12344 bar
334 baz

even though it's clearly separated, is not a valid fwf.

Contributor

By default, read_fwf will try to infer the file's colspecs by using the first 100 rows of the file, but only if the columns are delimited by whitespace and are aligned between rows.
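The behavior that doc sentence describes looks like this in practice (a minimal sketch; assumes a pandas version with the merged feature installed):

```python
import io
import pandas as pd

data = "id  name\n12  foo\n345 barbaz\n"
# colspecs defaults to 'infer': column boundaries are detected from
# whitespace-aligned fields in the first rows of the file.
df = pd.read_fwf(io.StringIO(data), colspecs="infer")
print(df.shape)  # (2, 2)
```

Because the third row's `345` extends into the gap left by `12`, the inferred first column is widened to cover it, exactly as the bitmask union implies.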

@jtratner
Contributor

@alefnula you need to add test cases for what happens if you can't or shouldn't be able to sniff the file (what happens, does it raise an error?) - there must exist files for which this doesn't work. Are there any pathological cases where this sniffing could get confused?

Also, please make sure you add a test case where you dynamically build a very long string (i.e., a very 'wide' file) and then test to make sure it can be sniffed appropriately.

@jtratner
Contributor

also what happens with different delimiters (can the delimiter be a regular expression?), and unicode (UTF-8, non-UTF-8, variable-width - you can look at some of the other tests for cases where the width of the letters vary, maybe try for some chinese characters?).

@jtratner
Contributor

also multicharacter delimiters, if those are accepted.

@alefnula
Contributor Author

@jtratner OK so to recap:

  1. Add optional to colspecs docs.

  2. Move all test*.fwf files directly to test_parser.py as strings.

  3. In the case of files that cannot be sniffed properly, you will just end up with two columns merged into one or one column separated into two. No exception will be thrown.
    For example:

    N    A          Name
    123foo   Joe Doe
    345bar   Joe Smith 
    455baz   Abe Lincoln
    

    Even though the header suggests there are three columns, the parser is not able to find the boundary between the first and the second. Also, the third column will be split into two, since first name and last name are clearly separated.
    If you have just one name with four letters:

    N    A          Name
    123foo   Joe Doe
    345bar   Joe Smith 
    455baz   Abe Lincoln
    333brb   Jack Lastname
    

    Then the parser will correctly parse the third column, because the clear separation is broken. (In this case it would be the second column since the first two cannot be separated).

  4. About the very wide files. I tested this and it works just fine. The width of the file doesn't make any difference in parsing, because I'm not counting anything; it will just create a bigger one-dimensional ndarray. I didn't think it was necessary to test something that doesn't check any edge case or do anything differently. But I'll add that if you think it's necessary. No prob :)

  5. Multi-character delimiters will work fine, but strings and regexes won't. So telling the parser something like delimiter='+~' will treat + and ~ as individual delimiter characters, but telling the parser to split only on the literal +~ sequence won't work. That didn't work even before this change. Here is an untouched line of code from before that shows why:

    return [line[fromm:to].strip(self.delimiter)
            for (fromm, to) in self.colspecs]

    There is no regex strip, at least not a trivial one, so I think that this is just an unnecessary complication and shouldn't be implemented. So I'll just add tests that check for correctness of this.

  6. Add tests that check unicode delimiters and unicode files.
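Point 5 above hinges on str.strip semantics: its argument is a set of characters to remove, not a literal sequence, so a "multi-character delimiter" only ever works character-wise. A quick illustration:

```python
# strip("+~") removes any run of '+' or '~' characters from both ends,
# in any order -- it does NOT match the literal two-character sequence "+~".
assert "~~+12ab+~".strip("+~") == "12ab"
# Interior delimiter characters are left untouched:
assert "a+~+b".strip("+~") == "a+~+b"
```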

@jtratner
Contributor

Wide files - you don't need to add that, I just wanted to check that it worked. Somehow I thought that this was setup to accept regexes, but I think I was getting confused with csv's separator.

To confirm, you're saying that read_fwf with 'infer' will never raise an Exception, it will only produce poor results?

@alefnula
Contributor Author

@jtratner Yes, that's correct. If it raises an exception it's a bug in my code.

@jtratner
Contributor

okay, that all sounds fine. I do want to make sure you test with unicode - but you're working with strings and not bytes anyway, so it's probably okay in Python 3. I don't have a great grasp on whether variable-width unicode would mess up what you're doing.

@alefnula
Contributor Author

@jtratner I'll add tests for that. So just to be clear, I should do the following:

1 - Shorten and fix docs
2 - Move files directly to tests.
5 - Add more tests for multi character delimiters.
6 - Add more tests for variable width unicode.

@jtratner
Contributor

@alefnula Happy to clarify.

  1. Yes.
  2. Yes.
  5. If you already have a test for multi-character delimiters, it's not necessary to add more. (I must have missed it.) Please do add a test that uses a variable-width unicode delimiter though.
  6. Yes.

Thanks!

@alefnula
Contributor Author

@jtratner All done.

@cpcloud
Member

cpcloud commented Sep 28, 2013

@alefnula just need a rebase :)

@alefnula
Contributor Author

@cpcloud I rebased and squashed it. It's just one commit. Or you mean something else?

@cpcloud
Member

cpcloud commented Sep 28, 2013

@alefnula Sorry. I wasn't being clear. There's a merge conflict with upstream; probably with doc/source/release.rst. Need to rebase on top of upstream/master and resolve the merge conflicts.

@@ -1945,29 +1951,63 @@ class FixedWidthReader(object):
"""
A reader of fixed-width lines.
"""
def __init__(self, f, colspecs, filler, thousands=None, encoding=None):
Contributor

btw - were filler, thousands and encoding just doing nothing here?

Contributor Author

thousands and encoding were not used, and I renamed filler to delimiter for consistency with other functions.

Contributor

are you sure that encoding wasn't used? How does it know how to handle non-utf8?

Contributor

if it's handled earlier - my bad :p. Anyway, as soon as you make that one test work for both 2 and 3, I'm good with merging this.

@alefnula
Contributor Author

@cpcloud Rebased.

Implemented an algorithm that uses a bitmask to detect the gaps between the columns.
The reader buffers the lines used for detection in case the input stream is not seekable.
@alefnula
Contributor Author

Rebased onto master (resolved merge conflicts).

@jreback
Contributor

jreback commented Sep 29, 2013

@cpcloud @jtratner

ok by me

@wesm
Member

wesm commented Sep 30, 2013

How clever. I like it. This is fine by me

jreback added a commit that referenced this pull request Sep 30, 2013
ENH: Added colspecs detection to read_fwf
@jreback jreback merged commit c8ab2dd into pandas-dev:master Sep 30, 2013
@jreback
Contributor

jreback commented Sep 30, 2013

@alefnula thanks for this!

@alefnula alefnula deleted the iss4488 branch September 30, 2013 11:07
@ghost

ghost commented Sep 30, 2013

The new detection routine fails for cases that the #4488 snippet works fine for.
Specifically, this routine requires that all rows have at least one space of separation between cols.

For example:

id  foo 
1   a
1   a
1   a
123a

This works with #4488 but fails with this code due to the last line. It's a common example of a fwf file, and so this detection routine really isn't useful for a large subset of datasets in the wild. A shame.

Also, the delimiter name change is misleading. By definition fwf files have no delimiter; the filler keyword name was the right thing, and the routine in #4488 also supported auto-detection for it.

Successfully merging this pull request may close these issues.

Auto-detect field widths in read_fwf when unspecified