ENH: Added colspecs detection to read_fwf #4955

Merged
merged 1 commit into from
Sep 30, 2013
Conversation

alefnula
Contributor

closes #4488

Implemented an algorithm that uses a bitmask to detect the gaps between the columns.
Also, the reader buffers the lines used for detection in case its input is not seekable.
Added tests.
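The bitmask idea can be sketched in plain Python (a simplified illustration, not the actual pandas implementation; the name `infer_colspecs` is hypothetical):

```python
import re

def infer_colspecs(lines, delimiters=" \t"):
    # Mark every character position occupied by a non-delimiter
    # character in ANY of the sampled lines (the "bitmask"), then
    # read off the runs of marked positions as (start, end) spans.
    width = max(len(line) for line in lines)
    mask = [False] * (width + 1)  # trailing False closes the last run
    field = re.compile(r"[^{}]+".format(re.escape(delimiters)))
    for line in lines:
        for m in field.finditer(line):
            for i in range(m.start(), m.end()):
                mask[i] = True
    colspecs, start = [], None
    for i, occupied in enumerate(mask):
        if occupied and start is None:
            start = i
        elif not occupied and start is not None:
            colspecs.append((start, i))
            start = None
    return colspecs
```

For example, `infer_colspecs(["id  name", "1   foo", "23  barbaz"])` returns `[(0, 2), (4, 10)]`: each span covers the widest extent any row's field occupies, and the untouched gap between them becomes the column boundary.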

self.f = f
self.colspecs = colspecs
self.filler = filler # Empty characters between fields.
Contributor

what happened to encoding? is thousands necessary? (I get filler can prob be removed)

Contributor Author

They were not used anywhere in this class, so I removed them. They are used in the PythonParser, not the FixedWidthReader. Also, I renamed filler to delimiter to be consistent with other functions.

Contributor

ok....encoding should stay in....it's prob not tested....can you add something for that? (see the tests for read_csv)

Contributor Author

encoding is used in the FixedWidthFieldParser, not FixedWidthReader. All tests passed. I could return it if you want, but as I said it's not used anywhere...

@jreback
Contributor

jreback commented Sep 23, 2013

@alefnula pls add an entry in release notes and a v0.13.0 example (and put the same example in io.rst). you can do these at the end....just an FYI

@alefnula
Contributor Author

@jreback Yes, I was waiting for some feedback first. :)

@jtratner
Contributor

@alefnula apparently my comment got lost: what happens if you have many many columns (e.g., 10,000) or a first row that is very long (maybe 100,000 characters or something)? Will your implementation still work?

@alefnula
Contributor Author

@jtratner It will work. But it won't be very efficient if you have 10k columns. Character lengths of the rows are not such a big deal; I allocate just one ndarray of 100,000 ints, which is 800k. That is the only structure I need... and the buffer. But I could remove the buffering if it could be guaranteed that the item passed to the reader is seekable.

Maybe the PythonParser._make_reader should take care of buffering...
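For scale, the single mask array mentioned above is cheap. A quick check of the 800k figure (a sketch assuming one 8-byte int per character position):

```python
import numpy as np

# One int64 slot per character position of a 100,000-character row:
# 100_000 positions * 8 bytes each = 800,000 bytes (~800 kB).
mask = np.zeros(100_000, dtype=np.int64)
print(mask.nbytes)  # 800000
```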

@jreback
Contributor

jreback commented Sep 23, 2013

@alefnula I think buffering is fine as well as what you are doing. The time you spend determining widths is a) much lower than human time doing it!, b) almost always less than actually reading the file....

@jreback
Contributor

jreback commented Sep 23, 2013

what about doing an API like: widths='infer' to enable this? (I understand you then don't get specific rows/number of rows....but I view that as a real edge case)

@alefnula
Contributor Author

@jreback As I said, whatever you like :) I would just use: colspecs='infer' since I'm actually inferring the colspecs...

Or maybe if either of the two is 'infer'.

@jreback
Contributor

jreback commented Sep 23, 2013

@alefnula how useful do you think it is to inspect specific rows? (as opposed to just the top 100)

@alefnula
Contributor Author

@jreback At first I thought it could be useful, but when I started writing the tests I found out that it's much more painful to determine the correct rows from which the colspecs can be inferred than to actually count the column widths. So, now that I've played with it a little, I think it's completely useless :D Except maybe in some really, really rare cases where one of the rows determines the width.

But even then it'll infer correctly using this algorithm... When I added that parameter I had another algorithm in mind, but it turned out that it wasn't correct.

@jreback
Contributor

jreback commented Sep 23, 2013

ok, in that case I would turn auto-detect on if 'infer' is passed for either widths or colspecs

@alefnula
Contributor Author

@jreback Deal. But that will have to wait until tomorrow, it's 2AM here :D

@alefnula
Contributor Author

@jreback I would maybe just change one more thing: enable only the colspecs parameter to accept 'infer'. It's duplication if both parameters accept 'infer', and it may confuse the user into thinking that we do something different if he passes widths='infer' compared to colspecs='infer'.

Maybe even let it be the default if none of the above is specified?

@jreback
Contributor

jreback commented Sep 24, 2013

@alefnula that's fine.....

@alefnula
Contributor Author

@jreback Which part? Just colspecs accepting 'infer', or auto-detection being the default fallback mechanism?

@jreback
Contributor

jreback commented Sep 24, 2013

i like only accepting 'infer' on colspecs. I would also agree that if both colspecs and widths are None then inferring is fine. (so basically colspecs then defaults to 'infer'; if for some reason the user sets colspecs to None, the existing error message can come up)

@alefnula
Contributor Author

Everything done. If I missed something or need to add something more, please tell me.

@@ -789,6 +791,19 @@ column widths for contiguous columns:
The parser will take care of extra white spaces around the columns
so it's ok to have extra separation between the columns in the file.

If your data file has correctly separated columns using the delimiter provided
to the ``read_fwf`` function - like in the case of the ``bar.csv`` file, where
Contributor

I would put a similar blurb in v0.13.0.txt....to 'announce' this new feature

@jreback
Contributor

jreback commented Sep 24, 2013

@alefnula pls rebase and squash

@cpcloud comments?

@jreback
Contributor

jreback commented Sep 28, 2013

@cpcloud @jtratner @y-p @wesm comments?

I think this is ok

@@ -789,6 +791,21 @@ column widths for contiguous columns:
The parser will take care of extra white spaces around the columns
so it's ok to have extra separation between the columns in the file.

.. versionadded:: 0.13.0

If your data file has correctly separated columns using the delimiter provided
Contributor

How about making this paragraph shorter.

By default, read_fwf will try to infer the file's colspecs by using the first 100 rows of the file.

Contributor Author

@jtratner I wanted to point out that the columns must be clearly (visually) separated. You cannot have:

12foo
23bar
34baz

And expect to get two columns using the sniffing. Also they must be a valid fwf, so this:

12 foo
12344 bar
334 baz

even though it's clearly separated, is not a valid fwf.

Contributor

By default, read_fwf will try to infer the file's colspecs by using the first 100 rows of the file, but only if the columns are delimited by whitespace and are aligned between rows.
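The behavior that doc sentence describes looks like this in practice (a minimal sketch; assumes a pandas version with the merged feature installed):

```python
import io
import pandas as pd

data = "id  name\n12  foo\n345 barbaz\n"
# colspecs defaults to 'infer': column boundaries are detected from
# whitespace-aligned fields in the first rows of the file.
df = pd.read_fwf(io.StringIO(data), colspecs="infer")
print(df.shape)  # (2, 2)
```

Because the third row's `345` extends into the gap left by `12`, the inferred first column is widened to cover it, exactly as the bitmask union implies.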

@jtratner
Contributor

@alefnula you need to add test cases for what happens if you can't or shouldn't be able to sniff the file (what happens, does it raise an error?) - there must exist files for which this doesn't work. Are there any pathological cases where this sniffing could get confused?

Also, please make sure you add a test case where you dynamically build a very long string (i.e., a very 'wide' file) and then test to make sure it can be sniffed appropriately.

@jtratner
Contributor

also what happens with different delimiters (can the delimiter be a regular expression?), and unicode (UTF-8, non-UTF-8, variable-width - you can look at some of the other tests for cases where the width of the letters vary, maybe try for some chinese characters?).

@jtratner
Contributor

also multicharacter delimiters, if those are accepted.

@alefnula
Contributor Author

@jtratner OK so to recap:

  1. Add optional to colspecs docs.

  2. Move all test*.fwf files directly to test_parser.py as strings.

  3. In the case of files that cannot be sniffed properly, you will just end up with two columns merged into one or one column separated into two. No exception will be thrown.
    For example:

    N    A          Name
    123foo   Joe Doe
    345bar   Joe Smith 
    455baz   Abe Lincoln
    

    Even though the header suggests there are three columns, the parser is not able to find the boundary between the first and the second. Also, the third column will be split into two, since first name and last name are clearly separated.
    If you have just one name with four letters:

    N    A          Name
    123foo   Joe Doe
    345bar   Joe Smith 
    455baz   Abe Lincoln
    333brb   Jack Lastname
    

    Then the parser will correctly parse the third column, because the clear separation is broken. (In this case it would be the second column since the first two cannot be separated).

  4. About the very wide files. I tested this and it works just fine. The width of the file doesn't make any difference in parsing, because I'm not counting anything; it will just create a bigger one-dimensional ndarray. I didn't think it was necessary to test something that doesn't check any edge case or do anything differently. But I'll add that if you think it's necessary. No prob :)

  5. Multi-character delimiters will work fine, but strings and regexes won't. So telling the parser something like delimiter='+~' will treat + and ~ as individual delimiter characters, but telling the parser to split only on the literal +~ sequence won't work. That didn't work even before this change. Here is an untouched line of code from before that shows why:

    return [line[fromm:to].strip(self.delimiter)
            for (fromm, to) in self.colspecs]

    There is no regex strip, at least not a trivial one, so I think that this is just an unnecessary complication and shouldn't be implemented. So I'll just add tests that check for correctness of this.

  6. Add tests that check unicode delimiters and unicode files.
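Point 5 above hinges on str.strip semantics: its argument is a set of characters to remove, not a literal sequence, so a "multi-character delimiter" only ever works character-wise. A quick illustration:

```python
# strip("+~") removes any run of '+' or '~' characters from both ends,
# in any order -- it does NOT match the literal two-character sequence "+~".
assert "~~+12ab+~".strip("+~") == "12ab"
# Interior delimiter characters are left untouched:
assert "a+~+b".strip("+~") == "a+~+b"
```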

@jtratner
Contributor

Wide files - you don't need to add that, I just wanted to check that it worked. Somehow I thought that this was setup to accept regexes, but I think I was getting confused with csv's separator.

To confirm, you're saying that read_fwf with 'infer' will never raise an Exception, it will only produce poor results?

@alefnula
Contributor Author

@jtratner Yes, that's correct. If it raises an exception it's a bug in my code.

@jtratner
Contributor

okay, that all sounds fine. I do want to make sure you test with unicode - but you're working with strings and not bytes anyway, so it's probably okay in Python 3. I don't have a great grasp on whether variable-width unicode would mess up what you're doing.

@alefnula
Contributor Author

@jtratner I'll add tests for that. So just to be clear, I should do the following:

1 - Shorten and fix docs
2 - Move files directly to tests.
5 - Add more tests for multi character delimiters.
6 - Add more tests for variable width unicode.

@jtratner
Contributor

@alefnula Happy to clarify.

  1. Yes.
  2. Yes.
  5. If you already have a test for multi-character delimiters, it's not necessary to add more. (I must have missed it.) Please do add a test that uses a variable-width unicode delimiter though.
  6. Yes.

Thanks!

@alefnula
Contributor Author

@jtratner All done.

@cpcloud
Member

cpcloud commented Sep 28, 2013

@alefnula just need a rebase :)

@alefnula
Contributor Author

@cpcloud I rebased and squashed it. It's just one commit. Or you mean something else?

@cpcloud
Member

cpcloud commented Sep 28, 2013

@alefnula Sorry. I wasn't being clear. There's a merge conflict with upstream; probably with doc/source/release.rst. Need to rebase on top of upstream/master and resolve the merge conflicts.

@@ -1945,29 +1951,63 @@ class FixedWidthReader(object):
"""
A reader of fixed-width lines.
"""
def __init__(self, f, colspecs, filler, thousands=None, encoding=None):
Contributor

btw - were filler, thousands and encoding just doing nothing here?

Contributor Author

thousands and encoding were not used, and I renamed filler to delimiter for consistency with other functions.

Contributor

are you sure that encoding wasn't used? How does it know how to handle non-utf8?

Contributor

if it's handled earlier - my bad :p. Anyway, as soon as you make that one test work for both 2 and 3, I'm good with merging this.

@alefnula
Contributor Author

@cpcloud Rebased.

Implemented an algorithm that uses a bitmask to detect the gaps between the columns.
The reader buffers the lines used for detection in case the input stream is not seekable.
@alefnula
Contributor Author

Rebased onto master (resolved merge conflicts).

@jreback
Contributor

jreback commented Sep 29, 2013

@cpcloud @jtratner

ok by me

@wesm
Member

wesm commented Sep 30, 2013

How clever. I like it. This is fine by me

jreback added a commit that referenced this pull request Sep 30, 2013
ENH: Added colspecs detection to read_fwf
@jreback jreback merged commit c8ab2dd into pandas-dev:master Sep 30, 2013
@jreback
Contributor

jreback commented Sep 30, 2013

@alefnula thanks for this!

@alefnula alefnula deleted the iss4488 branch September 30, 2013 11:07
@ghost

ghost commented Sep 30, 2013

The new detection routine fails for cases that the #4488 snippet works fine for.
Specifically, this routine requires that all rows have at least one space of separation between cols.

For example:

id  foo 
1   a
1   a
1   a
123a

This works with #4488 but fails with this code due to the last line. It's a common example of a fwf file, and so this detection routine really isn't useful for a large subset of datasets in the wild. A shame.

Also, the delimiter name change is misleading. By definition fwf files have no delimiter; the filler keyword name was the right thing, and the routine in #4488 also supported auto-detection for it.

Successfully merging this pull request may close these issues.

Auto-detect field widths in read_fwf when unspecified