TST: read_csv silently drops empty row #21995

dahlbaek opened this issue Jul 20, 2018 · 12 comments
Labels
Docs IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action

Comments

@dahlbaek
Contributor

While going over some tests, I noticed the following here:

df = DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
               index=['one', 'two'],
               columns=MultiIndex.from_tuples(
                   [('a', 'q'), ('a', 'r'), ('a', 's'),
                    ('b', 't'), ('c', 'u'), ('c', 'v')]))

# to_csv
data = """,a,a,a,b,c,c
,q,r,s,t,u,v
,,,,,,
one,1,2,3,4,5,6
two,7,8,9,10,11,12"""

result = self.read_csv(StringIO(data), header=[0, 1], index_col=0)
tm.assert_frame_equal(df, result)

Is it really expected behaviour that pandas.read_csv silently discards an empty row following the header when multiple rows are passed as headers?
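A minimal, self-contained reduction of the test above (my own, not from the test suite): with two header rows and index_col=0, the separator-only third line is consumed rather than parsed as an all-NaN row.

```python
from io import StringIO

import pandas as pd

# Two header rows, then a separator-only line, then one data row.
data = ",a,b\n,x,y\n,,\none,1,2\n"
df = pd.read_csv(StringIO(data), header=[0, 1], index_col=0)

# The ',,' line does not show up as an all-NaN data row; only 'one' survives.
print(df.shape)
print(list(df.index))
```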

@WillAyd
Member

WillAyd commented Jul 20, 2018

Well, there is a skip_blank_lines parameter which is True by default, but setting it to False didn't change your example, which seems off.

Investigation and PRs certainly welcome!

@WillAyd WillAyd added Bug IO CSV read_csv, to_csv labels Jul 20, 2018
@dahlbaek
Contributor Author

dahlbaek commented Jul 20, 2018

I think skip_blank_lines is related to truly blank lines, not lines that contain separator characters. Consider

from io import StringIO
import pandas as pd
filepath_or_buffer = StringIO("a,b\n\n\n1,2")
pd.read_csv(filepath_or_buffer)

as opposed to

from io import StringIO
import pandas as pd
filepath_or_buffer = StringIO("a,b\n,\n,\n1,2")
pd.read_csv(filepath_or_buffer)

or

from io import StringIO
import pandas as pd
filepath_or_buffer = StringIO("a,b\n\n\n1,2")
pd.read_csv(filepath_or_buffer, skip_blank_lines=False)
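Running the three snippets above side by side (my own check on a recent pandas, not from the thread) shows the distinction: truly blank lines are governed by skip_blank_lines, while separator-only lines always come through as all-NaN rows.

```python
from io import StringIO

import pandas as pd

# Truly blank lines: skipped when skip_blank_lines=True (the default).
blank = pd.read_csv(StringIO("a,b\n\n\n1,2"))

# Truly blank lines kept as all-NaN rows when skip_blank_lines=False.
blank_kept = pd.read_csv(StringIO("a,b\n\n\n1,2"), skip_blank_lines=False)

# Separator-only lines: kept as all-NaN rows regardless of skip_blank_lines.
seps = pd.read_csv(StringIO("a,b\n,\n,\n1,2"))

print(len(blank), len(blank_kept), len(seps))
```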

@dahlbaek
Contributor Author

I guess this is what git-blame is for! Related: #5298

As far as I can tell, this had to do with supporting the output format of pandas.DataFrame.to_csv. I dusted off an old version of pandas and ran the following:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
                  index=['one', 'two'],
                  columns=pd.MultiIndex.from_tuples(
                      [('a', 'q'), ('a', 'r'), ('a', 's'),
                       ('b', 't'), ('c', 'u'), ('c', 'v')]))
print(df.to_csv(), "\npandas version:", pd.__version__)

Output:

,a,a,a,b,c,c
,q,r,s,t,u,v
,,,,,,
one,1,2,3,4,5,6
two,7,8,9,10,11,12

pandas version: 0.14.0

When running the same code on master, I get

,a,a,a,b,c,c
,q,r,s,t,u,v
one,1,2,3,4,5,6
two,7,8,9,10,11,12

pandas version: 0.24.0.dev0+332.g1f6ddc4

Looks like this 'feature' is no longer needed, right?
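The change is easy to confirm on a current install (a small check of my own, not from the thread):

```python
import pandas as pd

# A MultiIndex header with an unnamed index, as in the example above.
df = pd.DataFrame([[1, 2]], index=['one'],
                  columns=pd.MultiIndex.from_tuples([('a', 'q'), ('a', 'r')]))

out = df.to_csv()
# Recent pandas writes no extra blank row between the header rows and the data.
print(out)
```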

@WillAyd
Member

WillAyd commented Jul 20, 2018

Thanks for the investigation. I would agree with you that the "feature" in question is not necessary. PRs welcome!

@dahlbaek
Contributor Author

Great, I'll give it a go then!

@dahlbaek
Contributor Author

dahlbaek commented Jul 21, 2018

This is not as clear-cut as I had hoped. Modifying the above example to

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
               index=pd.Index(['one', 'two'], name='index_name'),
               columns=pd.MultiIndex.from_tuples(
                   [('a', 'q'), ('a', 'r'), ('a', 's'),
                    ('b', 't'), ('c', 'u'), ('c', 'v')]))
print(df.to_csv())

yields the output

,a,a,a,b,c,c
,q,r,s,t,u,v
index_name,,,,,,
one,1,2,3,4,5,6
two,7,8,9,10,11,12

Thus, the current behaviour seems to be that DataFrame.to_csv will change the output format according to the presence of the name/names attribute of DataFrame.index. However, pandas.read_csv cannot know how to correctly parse csv files then. Consider

from io import StringIO

import numpy as np
import pandas as pd

df_nan = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
)
print(
    "dataframe:", df_nan,
    "\ncsv representation:", df_nan.to_csv(),
    "\nroundtrip dataframe:", pd.read_csv(StringIO(df_nan.to_csv()), index_col=[0, 1]),
    sep="\n",
)

which outputs

dataframe:
              0    1
first  foo  NaN  NaN
second bar  1.0  2.0

csv representation:
,,0,1
first,foo,,
second,bar,1.0,2.0


roundtrip dataframe:
                     0    1
       Unnamed: 1
first  foo         NaN  NaN
second bar         1.0  2.0

Surprisingly, the output changes if a MultiIndex is supplied for the header:

df_nan = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
    columns=pd.MultiIndex.from_tuples([
        ('a', 'baz'),
        ('b', 'baz'),
    ]),
)
print(
    "dataframe:", df_nan,
    "\ncsv representation:", df_nan.to_csv(),
    "\nroundtrip dataframe:", pd.read_csv(StringIO(df_nan.to_csv()),
                                          index_col=[0, 1],
                                          header=[0,1]),
    sep="\n",
)

outputs

dataframe:
              a    b
            baz  baz
first  foo  NaN  NaN
second bar  1.0  2.0

csv representation:
,,a,b
,,baz,baz
first,foo,,
second,bar,1.0,2.0


roundtrip dataframe:
              a    b
            baz  baz
first  foo
second bar  1.0  2.0

Note that this is still not the parsing I would expect; here pandas is interpreting the third line as a names attribute for the index multiindex.

To me, the default settings are surprising. As far as I can tell, the problem is that the csv format is not sufficiently rich to naturally encode the name/names attributes of multiindices. In view of that, the behaviour I would expect is:

When outputting to csv with a multiindex header, pandas strips name/names attributes, as if the function

def strip_level_names(df):
    # Return a copy whose index and columns carry no name/names attributes.
    # MultiIndex subclasses Index, so the MultiIndex check must come first.
    df_copy = df.copy()
    index = df_copy.index
    header = df_copy.columns
    if isinstance(index, pd.MultiIndex):
        index = pd.MultiIndex.from_tuples(list(index))
    elif isinstance(index, pd.Index):
        index = pd.Index(list(index))
    if isinstance(header, pd.MultiIndex):
        header = pd.MultiIndex.from_tuples(list(header))
    elif isinstance(header, pd.Index):
        header = pd.Index(list(header))
    return pd.DataFrame(data=df_copy.values, index=index, columns=header)

had been called before to_csv. Setting a keyword parameter level_names=True would change the behaviour so that the output includes header and index name/names by always adding an extra row for the index name/names (similar to current functionality, but explicit and tunable instead of implicit). Similarly, read_csv would support a level_names keyword, in order to determine how to parse csv files.

This way the behaviour would be completely predictable, with the level_names keyword allowing the user to opt in to pandas-specific formatting that includes index level name/names if necessary.
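For illustration, a small sketch of the name-stripping idea behind strip_level_names applied to a tiny frame (my own demo; level_names itself is a proposed keyword, not an existing one):

```python
import pandas as pd

# Build a frame whose index and columns both carry level names.
df = pd.DataFrame(
    [[1, 2]],
    index=pd.Index(['x'], name='idx'),
    columns=pd.Index(['a', 'b'], name='cols'),
)

# Rebuild the axes from plain lists, discarding the names, as
# strip_level_names above would do.
stripped = pd.DataFrame(df.values,
                        index=pd.Index(list(df.index)),
                        columns=pd.Index(list(df.columns)))
print(stripped.index.name, stripped.columns.name)
```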

@dahlbaek
Contributor Author

dahlbaek commented Jul 21, 2018

So, here's the behaviour I would expect:

Input:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
)
print(df.to_csv())

Output (no change needed):

,,0,1
first,foo,,
second,bar,1.0,2.0

Input:

df = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
        names=["lvl_0", "lvl_1"]
    ),
    columns=["a", "b"]
)
print(df.to_csv())

Output (no change needed):

lvl_0,lvl_1,a,b
first,foo,,
second,bar,1.0,2.0

Input:

df = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
    ),
    columns=pd.MultiIndex.from_tuples(
        [
            ("a", "A"),
            ("b", "A"),
        ]
    )
)
print(df.to_csv())

Output (no change needed):

,,a,b
,,A,A
first,foo,,
second,bar,1.0,2.0
Input:

print(df.to_csv(level_names=True))

Output (change needed):

,,a,b
,,A,A
,,,
first,foo,,
second,bar,1.0,2.0

Input:

df = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
        names=["idx_lvl_0", "idx_lvl_1"],
    ),
    columns=pd.MultiIndex.from_tuples(
        [
            ("a", "A"),
            ("b", "A"),
        ],
        names=["hdr_lvl_0", "hdr_lvl_1"],
    )
)
print(df.to_csv())

Output (change needed):

,,a,b
,,A,A
first,foo,,
second,bar,1.0,2.0
Input:

print(df.to_csv(level_names=True))

Output (change needed):

hdr_lvl_0,,a,b
hdr_lvl_1,,A,A
idx_lvl_0,idx_lvl_1,,
first,foo,,
second,bar,1.0,2.0

In the other direction, I would expect pd.read_csv with appropriate values for header, index_col and level_names to correctly reconstruct the original dataframes from the csv output (possibly omitting level names according to situation, but never failing to reconstruct the data itself).
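The named-MultiIndex, plain-columns case already round-trips today; a self-contained check of my own, consistent with the second expected output above:

```python
from io import StringIO

import numpy as np
import pandas as pd

# Named MultiIndex rows, plain columns: to_csv writes the level names into the
# header row, and read_csv with index_col=[0, 1] recovers them.
df = pd.DataFrame(
    [[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples([('first', 'foo'), ('second', 'bar')],
                                    names=['lvl_0', 'lvl_1']),
    columns=['a', 'b'],
)
back = pd.read_csv(StringIO(df.to_csv()), index_col=[0, 1])
print(df.equals(back))
```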

@jreback
Contributor

jreback commented Jul 24, 2018

@dahlbaek I am not sure why you are trying to make csv a high-fidelity format; it's not. You have to do the best you can with the options and data in the file. We 'write' this format, so why should we not read it?

cc @gfyoung

@gfyoung
Member

gfyoung commented Jul 24, 2018

Yeah...missing data situations are likely to introduce corner cases here and there. @dahlbaek : welcome to investigate and see, but be careful not to introduce too many special cases.

@dahlbaek
Contributor Author

dahlbaek commented Jul 25, 2018

@jreback I'm not sure what you mean by csv not being a high-fidelity format. As far as I know, it is the main workhorse format for import/export used by postgres, for instance; see COPY and Populating a Database. This happens to be a use case that is relevant to my interests, as I sometimes want to move data between pandas and postgres. What formats do you consider high-fidelity?

@jreback @gfyoung Of course it is entirely up to you which csv formats pandas should accept/support, and pandas is great whichever one(s) you pick.

However:

I do believe that the csv family is a main workhorse for a lot of data analysis today—not least because so many tools can read and write csv files (like… well… pandas). I would wager that to many users of pandas, the csv format is the main language that pandas uses to communicate with the outside world. As such, I do think there is good reason to support the more 'standard' dialects, and a great place to start seems to be the Frictionless Data Tabular Specification. Which, of course, pandas already supports to a high degree!

Quoting the Readme.md, pandas

has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

I should think that robust and predictable handling of the most classical and widespread open data format is a prerequisite for achieving that goal!

Just to make my point of view completely clear: I am not proposing that pandas should not be able to encode level names in csv files. I am proposing that header/index level names should be encoded in a consistent and predictable manner, or not at all. This is possible. The pandas specific format (with index level names in the row following the header row(s)) is consistent and predictable (and pandas does use this output format sometimes). Similarly, the more 'standard' format where level names of the header/index is omitted is consistent and predictable (and pandas does use this output format sometimes). But mixing the two formats without providing a keyword argument (or a hint in the docs) for the user inevitably leads to unexpected behaviour.

Perhaps a middle ground could just be a section in the read_csv/to_csv docstrings describing how one might ensure consistent csv input/output (without any source code changes), coupled with tests that enforce that precise consistent behaviour?

@mroeschke mroeschke added the Testing pandas testing functions or related to the test suite label Jun 21, 2021
@mroeschke mroeschke added Docs Needs Tests Unit test(s) needed to prevent regressions and removed Bug Testing pandas testing functions or related to the test suite labels Jul 1, 2022
@Michael-E-Rose

The problem arises not only when you want multiple rows as header, but also when you want no header at all.

My file:


aa
aa adopt
aa ambiguity
aa bank
aa banks
aa baseball
aa becomes

My code:

terms = pd.read_csv(TERM_FOLDER/'terms.csv', header=None, na_filter=False)
terms.head()

My output:

              0
0            aa
1      aa adopt
2  aa ambiguity
3       aa bank
4      aa banks

Interestingly, with terms = pd.read_csv(TERM_FOLDER/'terms.csv', header=None, na_filter=False, skip_blank_lines=False) I get a pandas.errors.EmptyDataError: No columns to parse from file. Is this related or must I file a new issue?
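The default-case part of this reproduces with an in-memory buffer (my own reduction; the data here stands in for the real terms.csv): a leading blank line is silently dropped.

```python
from io import StringIO

import pandas as pd

# Stand-in for the terms.csv file: a leading blank line before the data.
data = "\naa\naa adopt\naa ambiguity\n"

terms = pd.read_csv(StringIO(data), header=None, na_filter=False)
# The leading blank line is dropped by the default skip_blank_lines=True.
print(len(terms))
```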

@gfyoung
Member

gfyoung commented Nov 7, 2022

@Michael-E-Rose: since the parameters are relatively close, I would keep here for now.

@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action and removed Needs Tests Unit test(s) needed to prevent regressions labels Mar 19, 2025