read_fwf colspecs autodetection could be more flexible #5056

ghost · 2013-09-30T11:22:25Z

id foo 
1  a
1  a
1  a
123a

This works with the snippet in the orig #4488 but fails with the merged #4955 due to the last line.
It's a common example of a fwf file.

The text was updated successfully, but these errors were encountered:

jtratner · 2013-09-30T11:37:25Z

So what should happen in this case? Are you thinking that detection should be entirely based on first line?

id foo 
123a
123a
123a
123a

jtratner · 2013-09-30T11:44:33Z

Also, as an aside, this is probably a clearer example (when it's at the end, seems like it's a special footer)

id foo 
1  a
123a
1  a
1  a

ghost · 2013-09-30T11:51:09Z

Not at all. #4488 handles the first case by looking for a pattern which is common to a majority of records,
#4955 fails if there's a single record in the whole dataset that occupies the whole column.

In your example, the first row is indeed the only hint as to the width of the columns,
and it may be interpreted as a single column with a header label of "id foo".

jreback · 2013-09-30T11:59:25Z

cc @alefnula

I think that if detection is 'infer' and "doesn't work", then should raise.

not sure if you can make an accurate determination in all cases though?

ghost · 2013-09-30T12:01:30Z

True, can't make it work for all cases. I think the first example is more common then yours.
For the case where there is a full single-character column seperating the fields, as required by #4955,
#4488 will (almost) always work.

jreback · 2013-09-30T12:07:14Z

cc @alefnula

I think you can do a percentage computation on the mask that you are computing. If they are say the same in say 90% of the lines then you can go with that (or even higher per). I think that would allow for some (but not most) no-delimeter lines

alefnula · 2013-09-30T12:36:48Z

Yes, this is a problem. The current implementation cannot detect the columns properly if they are not separated by the filler. But the suggested implementation in issue #4488 didn't work for any case I tested :( For the example you just wrote here it returns [8]. And even for simple cases like:

1 2 3
4 5 6
7 8 9

It returns [2, 3]...

But this is not the primary reason I haven't tried to fix it. The much bigger problem with the splitting approach is that it cannot handle the cases where columns are strings which can contain the filler, for example:

id  Firstname and Lastname  Age
1   Douglas Adams           44
1   Arthur Conan Doyle      55

The approach I used supports these cases.

So basically these are the reason why I chose to use the bitmask instead of splitting and counting. (and also it's more memory and time efficient - but this is not that important if we read only 100 lines of data). The only drawback is that it cannot handle the columns that are glued together, but that is clearly stated in the documentation.

Maybe an approach where the outliers will be discarded can be implemented. But I'll have to find a way to do it efficiently...

jtratner · 2013-09-30T12:42:29Z

Ultimately, you can still pass colspecs directly in ambiguous cases.

alefnula · 2013-09-30T12:43:22Z

Also I don't think this is a bug. The behavior of inferring columns is clearly documented, and the function behaves as the documentation states it does...
But, as everything, it can be improved :)

jreback · 2013-09-30T12:44:20Z

@alefnula can you squash the mas column wise and compare how many equal as per of total rows? (not sure if that is really useful.....) just a thought

ghost · 2013-09-30T12:44:48Z

Not true. There may be a bug in in #4488 (or not) but the underlying approach should handle strings
such as those just fine.
The issue is that it has a free parameter (the threshold) which has no "right" value, unless you know the distribution of all datasets out there (we don't).
It seems to me that #4955 is a private case of the threshold scheme in #4488, but with a threshold of 0%,
and I'm pretty sure whatever the "best" value is, it's not that.

Documented behaviour is great, but ultimately it's preferable to support more cases rather
then document the fact that you don't, no?

ghost · 2013-09-30T13:43:17Z

Removed bug label. It's a common case that isn't handled, which is a shame but not a bug.

alefnula · 2013-09-30T13:47:59Z

@jreback I dont actually think that this problem can be solved in a way that will satisfy everyone :( since it's too ambiguous. The provided example:

id foo 
1  a
1  a
123a

Can be split in two different (both valid) ways:

 id  foo      id  foo
  1    a       1    a
  1    a  or   1    a
123    a      12   3a

And here is an even worse case.

If we try to implement any kind of heuristic that will determine where to split, there will be a case when we will get it wrong. The user will, maybe, get the correct number of columns but the data will be incorrect. So he will get a false conclusion (especially if he has a large data set that he cannot skim through) that we correctly split the columns, but the data will not be correct.
In the current implementation, all these ambiguous cases will produce wrong number of column in the beginning, so the user will immediately know that something is not right...
My opinion is that it's better to get a completely wrong number of columns in the first place, then to get the right number of columns, but have incorrect data in them because of the ambiguous cases.

jreback · 2013-09-30T13:50:00Z

@alefnula agreed....false positive is bad here. I was just wondering if you can detect the false positive and raise a helpful error message?

alefnula · 2013-09-30T14:22:57Z

@jreback In the current implementation you wont get false positives. Either the number column will be correct and you will get valid data or the number of columns wont be correct...
The thing that I may be able to implement is to "suspect" that some lines are ambiguous and issue a warning:

Line 44 is suspicious. If the inferred columns are incorrect, check the line 44 or provide a correct colspecs argument.

ghost · 2013-09-30T14:39:43Z

In all the examples you provided #4488 would do the right thing.

alefnula · 2013-09-30T14:52:16Z

@y-p The problem is that there is no right thing :)

Which of the following is right?

c     a foo 
12    1   5
23    1   6
45 6123   7

c   a foo 
12  1   5
23  1   6
456 1 237

c   a foo 
12  1   5
23  1   6
45 61 237

c    a foo 
12   1   5
23   1   6
456 12  37

c    a foo 
12   1   5
23   1   6
45 612  37

Also #4488 does not work for any case I tried :( Maybe you pasted the wrong version of the code there?

ghost · 2013-09-30T15:13:28Z

None of them, the right one is:

c   a   foo 
12  1   5
23  1   6
456 123 7

Both #4488 and your private case of it take the rightmost "filler" column as the edge of each column.
Honestly, most of the fwf datasets I've encountered would not work with #4955, that's the only
reason I'm pushing for this.

How about adding a detect_threshold kwarg and setting the defval to match the behaviour of
#4955? a conservative default is fine by me, just don't be too conservative to be useful.

alefnula · 2013-09-30T15:42:31Z

Oh, now I see what you aim for :) Correct me if I'm wrong, but your point is that most of the fwf files are aligned left. If this is the case, we had a completely different experience with fwf files, since most of the ones I saw were aligned right so I would probably interpret this one as:

Maybe a aligned=left|right|None kwarg? In case of None I'll do the thing I'm doing now, and in the case of left or rigth I can be a little bit "smarter".

ghost · 2013-09-30T15:53:29Z

No, it has nothing to do with alignment. Your code is a fine default, I'll add in a kwarg
threshold when I get a chance, preserving this default behaviour. np.

Correction: an aligned kwarg makes sense, but still won't handle the cases I'm concerned with,
unless there's a threshold argument.

jtratner · 2013-09-30T21:07:46Z

@y-p if you don't have the column headers, you can't know which one is
correct and all of the examples could be valid. Only if the column headers
are present (and you can assume that they are both left-aligned and don't
have spaces in their name) can you unambiguously decide where to split.
(especially if the data for a column has filler in it but is supposed to
stay together)

ghost · 2013-10-01T10:16:32Z

@jtratner the example @alefnula gave aligns some of the column headings differently
from the data, that's misleading. Remove the bogus headings and assume
All columns are aligned the same way (valid assumption, in my experience), Is there
still ambiguity there?

jtratner · 2013-10-01T11:21:39Z

I think you're right that the alignment should be consistent, but the
example still seems dependent upon the assumption that the first row are
column headings, which isn't always true with fwf that I've seen.

ghost · 2013-10-01T13:32:22Z

The algo I had in mind doesn't isn't dependent on col headings in any way.
We've had plenty of discussion on this at this point, I suggest we leave it
for now until I or someone else provides actual code to review.
@alefnula's code is a fine enhancement so we're ahead already. In the future,
we can do more.

pushed to 0.14, it's not a blocker. (or someday)

mroeschke · 2023-03-31T19:03:33Z

Since there hasn't been much activity for this feature since and there appears to be significant complexity for determining the correct behavior I am going to close for now, but can reopen if there's more interest

jreback modified the milestones: 0.15.0, 0.14.0 Feb 26, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015

datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018

jbrockmendel added IO Fixed Width read_fwf and removed IO Data IO issues that don't fit into a more specific label labels Dec 11, 2019

jbrockmendel removed the IO CSV read_csv, to_csv label Dec 20, 2019

mroeschke added the Enhancement label Apr 5, 2020

mroeschke removed this from the Someday milestone Oct 13, 2022

mroeschke closed this as completed Mar 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_fwf colspecs autodetection could be more flexible #5056

read_fwf colspecs autodetection could be more flexible #5056

ghost commented Sep 30, 2013

jtratner commented Sep 30, 2013

jtratner commented Sep 30, 2013

ghost commented Sep 30, 2013

jreback commented Sep 30, 2013

ghost commented Sep 30, 2013

jreback commented Sep 30, 2013

alefnula commented Sep 30, 2013

jtratner commented Sep 30, 2013

alefnula commented Sep 30, 2013

jreback commented Sep 30, 2013

ghost commented Sep 30, 2013

ghost commented Sep 30, 2013

alefnula commented Sep 30, 2013

jreback commented Sep 30, 2013

alefnula commented Sep 30, 2013

ghost commented Sep 30, 2013

alefnula commented Sep 30, 2013

ghost commented Sep 30, 2013

alefnula commented Sep 30, 2013

ghost commented Sep 30, 2013

jtratner commented Sep 30, 2013

ghost commented Oct 1, 2013

jtratner commented Oct 1, 2013

ghost commented Oct 1, 2013

mroeschke commented Mar 31, 2023

read_fwf colspecs autodetection could be more flexible #5056

read_fwf colspecs autodetection could be more flexible #5056

Comments

ghost commented Sep 30, 2013

jtratner commented Sep 30, 2013

jtratner commented Sep 30, 2013

ghost commented Sep 30, 2013

jreback commented Sep 30, 2013

ghost commented Sep 30, 2013

jreback commented Sep 30, 2013

alefnula commented Sep 30, 2013

jtratner commented Sep 30, 2013

alefnula commented Sep 30, 2013

jreback commented Sep 30, 2013

ghost commented Sep 30, 2013

ghost commented Sep 30, 2013

alefnula commented Sep 30, 2013

jreback commented Sep 30, 2013

alefnula commented Sep 30, 2013

ghost commented Sep 30, 2013

alefnula commented Sep 30, 2013

ghost commented Sep 30, 2013

alefnula commented Sep 30, 2013

ghost commented Sep 30, 2013

jtratner commented Sep 30, 2013

ghost commented Oct 1, 2013

jtratner commented Oct 1, 2013

ghost commented Oct 1, 2013

mroeschke commented Mar 31, 2023