
DataFrame.to_csv/read_csv inconsistency #4595

Closed
ghost opened this issue Aug 18, 2013 · 19 comments
Labels
IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label

Comments

@ghost

ghost commented Aug 18, 2013

This is not a bug; maybe it is more of a philosophical issue.

When I download data from Yahoo, mimicking the code on p. 139 of the book (Python for Data Analysis), it creates a data frame. The index for this data frame is a time series index: timestamps provided by Yahoo.

If I save this DataFrame to disk using to_csv and then read it back using read_csv the resulting data frame now has a 'normal' range type index [0,1,2...n] and a new column labelled 'Date' which now contains the dates.

This means that code written to analyze this data needs to be different depending on whether it has been saved or not. I can deal with it as is, but I think this is not the best design; DataFrames should be invariant under saving.

Just my $ 0.02.

Respectfully,

Greg St. George

@cpcloud
Member

cpcloud commented Aug 18, 2013

I wonder if it would be reasonable to try to convert dates automatically and leave as a string if that fails.

It also might be reasonable to set index_col=0 as the default similar to header=0 although personally I like being forced to declare what I want as an index. For more complex code bases that use pandas I think this is more maintainable. For tutorials and other interactive analyses this may not be ideal.

These are both backwards incompatible API changes though, and over here we really try hard to be backwards-compatible.

@gregstg Also realize that with pandas's flexibility it's hard to be everything to everyone. This means that some things, especially IO code, will not be invertible, i.e., after df.to_csv('csv.csv'); df2 = read_csv('csv.csv') you can end up with df2 != df.
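
A minimal sketch of the non-invertible default, written for a current pandas on Python 3. The small frame stands in for the Yahoo download; its values and the variable names are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the downloaded price data (illustrative values only).
df = pd.DataFrame(
    {"Open": [10.0, 10.5, 11.0], "Close": [10.2, 10.8, 11.1]},
    index=pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"]),
)
df.index.name = "Date"

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# Default read: no index_col, no parse_dates.
df2 = pd.read_csv(buf)

print(type(df.index).__name__)   # index type before saving
print(type(df2.index).__name__)  # index type after the default read
print(list(df2.columns))         # the dates now appear as a regular column
```

With these defaults the dates should come back as an ordinary 'Date' column and the index as a plain integer range, which is the behaviour the original report describes.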

That said, we do aim to please our users as much as possible.

@ghost
Author

ghost commented Aug 18, 2013

Firstly, I can't reproduce this with pandas 0.12(-ish):

In [63]: import pandas.io.data as web
    ...: df = web.get_data_yahoo('IBM', '1/1/2000', '1/1/2010')
    ...: print type(df.index),df.columns
    ...: df.to_csv('1.csv')
    ...: print type(df.index),df.from_csv('1.csv').columns
<class 'pandas.tseries.index.DatetimeIndex'> Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype=object)
<class 'pandas.tseries.index.DatetimeIndex'> Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype=object)

Perhaps you're using an earlier version?

Secondly, dataframes can be round-tripped "invariantly", but that's not what to_csv()/read_csv() do.
To de/serialize dataframes to disk, use df.save()/df.load(). to_csv only exports data in csv format.
Dataframes have a richer data model than csv itself supports, so it's not surprising that some
information is lost going there and back again.

Finally, note that pd.read_csv() has the index_col option to automatically rehydrate the index,
and can infer date types automatically with parse_dates.
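
A short sketch of those two options, assuming a current pandas on Python 3 and a small made-up frame in place of the Yahoo data (the values and variable names are illustrative):

```python
import io
import pandas as pd

# Stand-in for the downloaded price data (illustrative values only).
df = pd.DataFrame(
    {"Open": [10.0, 10.5, 11.0], "Close": [10.2, 10.8, 11.1]},
    index=pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"]),
)
df.index.name = "Date"

buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# index_col=0 turns the first column back into the index;
# parse_dates=True asks pandas to try parsing that index as dates.
restored = pd.read_csv(buf, index_col=0, parse_dates=True)

print(type(restored.index).__name__)
print(restored.index.name)
```

Both arguments have to be passed explicitly here; neither is the read_csv default.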

@jtratner
Contributor

@y-p in your example, the second df.index is still from the original data dump, not the result of from_csv

@jtratner
Contributor

but it still works:

In [63]: import pandas.io.data as web
    ...: df = web.get_data_yahoo('IBM', '1/1/2000', '1/1/2010')
    ...: print type(df.index),df.columns
    ...: df.to_csv('1.csv')
    ...: new_df = df.from_csv('1.csv')
    ...: print type(new_df.index),new_df.columns
<class 'pandas.tseries.index.DatetimeIndex'> Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype=object)
<class 'pandas.tseries.index.DatetimeIndex'> Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype=object)

@ghost
Author

ghost commented Aug 18, 2013

I checked things in the console first, then mis-edited the snippet manually. My bad.

@ghost ghost closed this as completed Aug 18, 2013
@cpcloud
Member

cpcloud commented Aug 18, 2013

@y-p Why the close? I think the OP has a point w.r.t. datetimes. Should I open another issue for that? Or do you not agree?

@ghost
Author

ghost commented Aug 18, 2013

What point is that? I don't see anything remaining. It's not even reproducible with the
current version.
Re your comments on changing the default behaviour here: I'm against it.

But by all means, reopen if you want to.

@jtratner
Contributor

@cpcloud note that if you load the dataset from csv, it comes back as a datetime index (maybe it doesn't on older versions of pandas) so the specific problem OP brought up has been resolved or is due to some kind of error.

@cpcloud
Member

cpcloud commented Aug 18, 2013

@jtratner No, it doesn't, which is the point I'm talking about. I'm aware of the parse_dates parameter.

In [9]: df = web.get_data_yahoo('IBM', '1/1/2000', '1/1/2010')

In [10]: df.to_csv('csv.csv')

In [11]: res = read_csv('csv.csv', index_col=0)

In [12]: type(res.index)
Out[12]: pandas.core.index.Index

In [13]: pd.__version__
Out[13]: '0.12.0-199-g4c8ad82'

@cpcloud
Member

cpcloud commented Aug 18, 2013

@y-p For the record, I'm also against making index_col=0 the default, as per my comments above.

The previous comment does show an inconsistency with from_csv, which I thought was deprecated.

In [8]: type(df.from_csv('csv.csv', index_col=0).index)
Out[8]: pandas.tseries.index.DatetimeIndex

@cpcloud
Member

cpcloud commented Aug 18, 2013

Was read_csv ever supposed to work like from_csv in this respect? I'll happily bisect it but I don't want to start any archaeology without knowing if this is on purpose, since we have the parse_dates parameter.

@jreback
Contributor

jreback commented Aug 18, 2013

IIUC from_csv is supposed to be the inverse of to_csv (and to accomplish this it does call read_csv with a couple of arguments)

however, this is quite confusing and there is an issue to deprecate from_csv entirely

and as @y-p points out, the csv format is not an invariant format (at least if you don't want to tag it with extra metadata, which makes it pretty non-generic)

so what I think we should do is this

  • deprecate from_csv

and either

  • add a flavor argument to read_csv/to_csv that allows for invertibility (which basically provides a default that has parse_dates = [ 0 ])

or

  • just note in the docs/doc string that certain options are required to maintain an invariant reproduction
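
The idea behind the second and third options can be sketched as a thin wrapper. The function name read_csv_roundtrip is hypothetical and not part of pandas; the sketch assumes a current pandas on Python 3, a frame written with a plain to_csv(), and a date index in the first column.

```python
import io
import pandas as pd


def read_csv_roundtrip(source, **kwargs):
    """Hypothetical wrapper (not a pandas API): read back a plain df.to_csv() file."""
    # to_csv() writes the index as the first column, so use it as the index again.
    kwargs.setdefault("index_col", 0)
    # Try to parse that first column as dates, as suggested in the thread.
    kwargs.setdefault("parse_dates", [0])
    return pd.read_csv(source, **kwargs)


# Illustrative data only.
df = pd.DataFrame(
    {"Adj Close": [1.0, 2.0]},
    index=pd.to_datetime(["2013-01-02", "2013-01-03"]),
)
df.index.name = "Date"

buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

restored = read_csv_roundtrip(buf)
print(type(restored.index).__name__)
```

Callers can still override either default by passing index_col or parse_dates themselves. A frame whose index does not hold dates would need different arguments.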

@cpcloud
Member

cpcloud commented Aug 18, 2013

+1 for a doc note here

@ghost
Author

ghost commented Aug 18, 2013

Both df.from_csv and pd.read_csv have parse_dates, with differing default values. Horrors. (dupe #3418)
I am still personally fine with df->csv->df mangling dataframes in this way.

I would suggest just leaving things as they are. A doc note doesn't hurt (and doesn't
help much either). If deprecate == remove, I think this is too small a wart to justify
breaking existing code.

#4191

@ghost
Author

ghost commented Aug 18, 2013

Folks:

Someone said that the phenomenon I mentioned is not reproducible in the current version.
Not so. I am sorry not to have posted an example of what I meant.

In[5]: pd.__version__
Out[5]: '0.12.0'

In[6]: all_data = {}

In[7]: for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
           all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2013', '8/1/2013')

In[8]: price = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})

In[9]: price.index[0:3]
Out[9]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-02 00:00:00, ..., 2013-01-04 00:00:00]
Length: 3, Freq: None, Timezone: None

In[10]: price.to_csv('price2.csv')

In[11]: price2 = pd.read_csv('price2.csv')

In[12]: price2.index[0:3]
Out[12]: Int64Index([0, 1, 2], dtype=int64)

This is a paste from Canopy, where I have entered the 'In' statements by hand.

This example is basically from p. 139 of Wes McKinney's text. The point is, if you create the price DataFrame and then spend several hours writing functions to filter it, they will inevitably use the index as it exists, and then when you come back the next day after saving your data, your code will not work. This was my experience. Actually, since I am far from bright about these things, it took me quite a while to figure out what was going on. So I think it is worth at least a 'note' somewhere.

It was pointed out, and I am grateful for this, that I should have used df.save() and df.load(). If you use these functions the form of the dataframe will indeed be preserved (although I got a deprecation notice when using load...). My point then is perhaps that people reading Wes's book will all have this experience. In chapter 6 on Data Loading etc. and previously, the functions used to save and read dataframes are the ones I have used. I couldn't find that df.load and df.save were even mentioned there, though I may well be wrong. So maybe my comment is more for Wes, but also maybe the issue is not worth bothering him about, so I won't.
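
For reference, a sketch of the pickle route. As far as I know, the deprecation notice on load points to to_pickle/read_pickle as the replacements for save/load; the frame, file name and temporary directory below are illustrative assumptions, and the code is written for a current pandas on Python 3.

```python
import os
import tempfile
import pandas as pd

# Illustrative data only.
price = pd.DataFrame(
    {"AAPL": [1.0, 2.0, 3.0], "IBM": [4.0, 5.0, 6.0]},
    index=pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"]),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "price.pkl")
    price.to_pickle(path)            # serialize the whole frame, index included
    price2 = pd.read_pickle(path)    # read it back unchanged

print(type(price2.index).__name__)
print(price2.equals(price))
```

Pickle files are tied to Python and are best kept for short-term local storage, so this trades portability for a faithful round trip.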

Respectfully,

Greg St. George

@jtratner
Contributor

Greg - sorry about that, I guess I was wrong about what was going on! I hope I didn't come off as too rude or dismissive as a result.

@ghost
Author

ghost commented Aug 20, 2013

Jeff,

Not at all, no offense taken.

Greg

@zrothberg

I would just like to note this is still an issue in 2018. Is there a workaround other than formatting the to_csv output?

@jorisvandenbossche
Member

And I would like to note that csv is not a format meant for perfect roundtripping, since by definition you lose type information. If you don't want to do manual formatting on reading in, there are plenty of methods that try to do better roundtripping (parquet, feather, json table schema, ...)
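
A sketch of one of those options, JSON table schema, chosen here only because it needs no extra dependencies (parquet and feather typically require an additional library such as pyarrow). It assumes a current pandas on Python 3; the frame is made up for illustration.

```python
import io
import pandas as pd

# Illustrative data only.
df = pd.DataFrame(
    {"Open": [10.0, 10.5], "Close": [10.2, 10.8]},
    index=pd.to_datetime(["2013-01-02", "2013-01-03"]),
)
df.index.name = "Date"

# orient="table" writes a schema (column types, index name) next to the data.
text = df.to_json(orient="table")

# Reading with the same orient uses that schema to restore types and the index.
restored = pd.read_json(io.StringIO(text), orient="table")

print(type(restored.index).__name__)
print(restored.index.name)
```

Because the type information travels with the data, no index_col or parse_dates arguments are needed on the way back in.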

This issue was closed.