TST: read_csv silently drops empty row #21995
Comments
Well there is a parameter for Investigation and PRs certainly welcome! |
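For reference, one read_csv parameter that governs blank lines is skip_blank_lines; that this is the parameter meant here is an assumption. A minimal sketch of its effect with an ordinary single-row header (the sample text is made up for illustration):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: a header, a truly blank line, then one data row.
CSV_TEXT = "a,b,c\n\n1,2,3\n"

# Default (skip_blank_lines=True): the blank line is skipped entirely.
skipped = pd.read_csv(StringIO(CSV_TEXT))

# skip_blank_lines=False: the blank line is parsed as a row of missing values.
kept = pd.read_csv(StringIO(CSV_TEXT), skip_blank_lines=False)

print(len(skipped), len(kept))
```

Note that this only covers lines that are completely empty; a line consisting of delimiters only (such as ",,") is not a blank line in this sense.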
I think
as opposed to
or
|
I guess this is what As far as I can tell, this had to do with supporting the output format of

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
                  index=['one', 'two'],
                  columns=pd.MultiIndex.from_tuples(
                      [('a', 'q'), ('a', 'r'), ('a', 's'),
                       ('b', 't'), ('c', 'u'), ('c', 'v')]))
# Print the CSV representation together with the pandas version in use.
print(df.to_csv(), "\npandas version:", pd.__version__)

Output:
When running the same code on master, I get
Looks like this 'feature' is no longer needed, right? |
Thanks for the investigation. I would agree with you that the "feature" in question is not necessary. PRs welcome! |
Great, I'll give it a go then! |
This is not as clear-cut as I had hoped. Modifying the above example to

df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
                  index=pd.Index(['one', 'two'], name='index_name'),
                  columns=pd.MultiIndex.from_tuples(
                      [('a', 'q'), ('a', 'r'), ('a', 's'),
                       ('b', 't'), ('c', 'u'), ('c', 'v')]))
print(df.to_csv())

yields the output
Thus, the current behaviour seems to be that

from io import StringIO

import numpy as np
import pandas as pd

df_nan = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
)
# Show the frame, its CSV form, and the frame read back from that CSV.
print(
    "dataframe:", df_nan,
    "\ncsv representation:", df_nan.to_csv(),
    "\nroundtrip dataframe:", pd.read_csv(StringIO(df_nan.to_csv()), index_col=[0, 1]),
    sep="\n",
)

which outputs
Surprisingly, the output changes if a multiindex is supplied for the header:

df_nan = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
    columns=pd.MultiIndex.from_tuples([
        ('a', 'baz'),
        ('b', 'baz'),
    ]),
)
# Same round trip as before, now reading two header rows back in.
print(
    "dataframe:", df_nan,
    "\ncsv representation:", df_nan.to_csv(),
    "\nroundtrip dataframe:", pd.read_csv(StringIO(df_nan.to_csv()),
                                          index_col=[0, 1],
                                          header=[0, 1]),
    sep="\n",
)

outputs
Note that this is still not the parsing I would expect; here pandas is interpreting the third line as a

To me, the default settings are surprising. As far as I can tell, the problem is that the csv format is not sufficiently rich to naturally encode the

When outputting to csv with a multiindex header, pandas strips

def strip_level_names(df):
    """Return a copy of df whose index and header carry no level names."""
    df_copy = df.copy()
    index = df_copy.index
    header = df_copy.columns
    # Rebuilding each axis from its plain labels discards the level names.
    if isinstance(index, pd.MultiIndex):
        index = pd.MultiIndex.from_tuples(list(index))
    elif isinstance(index, pd.Index):
        index = pd.Index(list(index))
    if isinstance(header, pd.MultiIndex):
        header = pd.MultiIndex.from_tuples(list(header))
    elif isinstance(header, pd.Index):
        header = pd.Index(list(header))
    return pd.DataFrame(data=df_copy.values, index=index, columns=header)

had been called before

This way the behaviour would be completely predictable, with the |
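For comparison, the same stripping can be sketched without rebuilding the frame, by resetting the names on each axis with Index.set_names. This is only an illustrative alternative to the helper above; drop_level_names and the sample frame are hypothetical names introduced here, not part of pandas or of the proposal.

```python
import pandas as pd


def drop_level_names(df):
    """Return a copy of df whose index and column levels are unnamed.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    # set_names returns a new Index/MultiIndex, so the input frame is untouched.
    out.index = out.index.set_names([None] * out.index.nlevels)
    out.columns = out.columns.set_names([None] * out.columns.nlevels)
    return out


# Made-up sample frame with named levels on both axes.
named = pd.DataFrame(
    data=[[1, 2], [3, 4]],
    index=pd.MultiIndex.from_tuples(
        [("first", "foo"), ("second", "bar")],
        names=["idx_lvl_0", "idx_lvl_1"],
    ),
    columns=pd.MultiIndex.from_tuples(
        [("a", "A"), ("b", "A")],
        names=["hdr_lvl_0", "hdr_lvl_1"],
    ),
)
stripped = drop_level_names(named)
print(list(stripped.index.names), list(stripped.columns.names))
```

Unlike the helper above, this keeps the existing dtypes and index objects and only touches the names.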
So, here's the behaviour I would expect:

Input:

df = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
)
print(df.to_csv())

Output (no change needed):
Input:

df = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
        names=["lvl_0", "lvl_1"]
    ),
    columns=["a", "b"]
)
print(df.to_csv())

Output (no change needed):
Input:

df = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
    ),
    columns=pd.MultiIndex.from_tuples(
        [
            ("a", "A"),
            ("b", "A"),
        ]
    )
)
print(df.to_csv())

Output (no change needed):

# level_names is the keyword proposed here, not an existing to_csv argument.
print(df.to_csv(level_names=True))

Output (change needed):
Input:

df = pd.DataFrame(
    data=[[np.nan, np.nan], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
        names=["idx_lvl_0", "idx_lvl_1"],
    ),
    columns=pd.MultiIndex.from_tuples(
        [
            ("a", "A"),
            ("b", "A"),
        ],
        names=["hdr_lvl_0", "hdr_lvl_1"],
    )
)
print(df.to_csv())

Output (change needed):

# level_names is the keyword proposed here, not an existing to_csv argument.
print(df.to_csv(level_names=True))

Output (change needed):
In the other direction, I would expect |
Yeah...missing data situations are likely to introduce corner cases here and there. @dahlbaek : welcome to investigate and see, but be careful not to introduce too many special cases. |
@jreback I'm not sure what you mean when you say that csv is not a high-fidelity format. As far as I know, it is the main workhorse format for import/export used by postgres, for instance; see COPY and Populating a Database. This happens to be a use case that is relevant to my interests, as I sometimes want to move data between pandas and postgres. What formats do you consider high-fidelity?
@jreback @gfyoung Of course it is entirely up to you which csv formats However: I do believe that the csv family is a main workhorse for a lot of data analysis today, not least because so many tools can read and write csv files (like… well… Quoting the
I should think that robust and predictable handling of the most classical and widespread open data format is a prerequisite in order to achieve that goal! Just to make my point of view completely clear: I am not proposing that Perhaps a middle ground could just be a section in the |
The problem is not only when you want multiple rows as header, but also when you want no header at all. My file:
My code:
My output:
Interestingly, with |
@Michael-E-Rose: since the parameters are relatively close, I would keep this here for now. |
While going over some tests, I noticed the following here:
Is this really expected behaviour, that pandas.read_csv should silently discard an empty row following the header if multiple rows are passed as headers?
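A minimal sketch of the kind of input in question, assuming a two-row header followed by a row whose fields are all empty. The sample text is made up for illustration, and how many rows survive the multi-row read may depend on the pandas version and on other arguments such as index_col, so no expected output is given.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: two header rows, one all-empty row, one data row.
CSV_TEXT = "a,a,b\nq,r,s\n,,\n1,2,3\n"

# Single header row: the second header line and the empty row are both data.
single = pd.read_csv(StringIO(CSV_TEXT), header=0)

# Two header rows: compare how many rows remain after the header.
multi = pd.read_csv(StringIO(CSV_TEXT), header=[0, 1])

print("rows with header=0:     ", len(single))
print("rows with header=[0, 1]:", len(multi))
print(multi)
```

If the second count is 1 rather than 2, the all-empty row was consumed instead of being returned as a row of missing values.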