DataFrame.to_csv/read_csv inconsistency #4595

ghost · 2013-08-18T00:07:44Z

This is not a bug; maybe it is more of a philosophical issue.

When I download data from Yahoo, mimicking the code on p. 139 of the book (Python for Data Analysis), it creates a data frame. The index for this data frame is a time series index: timestamps provided by Yahoo.

If I save this DataFrame to disk using to_csv and then read it back using read_csv, the resulting data frame now has a 'normal' range-type index [0, 1, 2, ..., n] and a new column labelled 'Date' which now contains the dates.

This means that code written to analyze this data needs to be different depending on whether it has been saved or not. I can deal with it as is, but I think this is not the best design; DataFrames should be invariant under saving.

Just my $ 0.02.

Respectfully,

Greg St. George

cpcloud · 2013-08-18T03:46:05Z

I wonder if it would be reasonable to try to convert dates automatically and leave as a string if that fails.

It also might be reasonable to set index_col=0 as the default similar to header=0 although personally I like being forced to declare what I want as an index. For more complex code bases that use pandas I think this is more maintainable. For tutorials and other interactive analyses this may not be ideal.

These are both backwards incompatible API changes though, and over here we really try hard to be backwards-compatible.

@gregstg Also realize that with pandas's flexibility it's hard to be everything to everyone. This means that some things, especially IO code, will not be invertible, i.e., df.to_csv('csv.csv'); df2 = read_csv('csv.csv'), df2 != df.

That said, we do aim to please our users as much as possible.

ghost · 2013-08-18T10:57:07Z

Firstly, I can't reproduce this with pandas 0.12(-ish):

In [63]: import pandas.io.data as web
    ...: df = web.get_data_yahoo('IBM', '1/1/2000', '1/1/2010')
    ...: print type(df.index),df.columns
    ...: df.to_csv('1.csv')
    ...: print type(df.index),df.from_csv('1.csv').columns
<class 'pandas.tseries.index.DatetimeIndex'> Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype=object)
<class 'pandas.tseries.index.DatetimeIndex'> Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype=object)

Perhaps you're using an earlier version?

Secondly, dataframes are "invariant" when round-tripped, but that's not what to_csv()/read_csv() do.
To de/serialize dataframes to disk, use df.save()/df.load(). to_csv only exports data in csv format.
Dataframes have a richer data model than csv itself supports, so it's not surprising that some information is lost going there and back again.

Finally, note that pd.read_csv() has the index_col option to automatically rehydrate the index,
and can infer date types automatically with parse_dates.
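
As a minimal sketch of those two options: the frame below is a small hand-made stand-in for the Yahoo download (an assumption made only to keep the example offline), and the file name is arbitrary. It compares the default read_csv call with one that passes index_col=0 and parse_dates=True.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the Yahoo data: a small frame with a DatetimeIndex named "Date".
df = pd.DataFrame(
    {"Open": [1.0, 2.0, 3.0], "Close": [1.5, 2.5, 3.5]},
    index=pd.DatetimeIndex(["2013-01-02", "2013-01-03", "2013-01-04"], name="Date"),
)

path = os.path.join(tempfile.mkdtemp(), "1.csv")
df.to_csv(path)

# Defaults: the dates come back as an ordinary "Date" column.
plain = pd.read_csv(path)

# index_col=0 puts the first column back in the index;
# parse_dates=True asks read_csv to parse that index as dates.
restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(type(plain.index).__name__, list(plain.columns))
print(type(restored.index).__name__, list(restored.columns))
```

With the defaults the dates end up in a regular column; with both options set, the index should come back as a DatetimeIndex.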

jtratner · 2013-08-18T12:07:48Z

@y-p in your example, the second df.index is still from the original data dump, not the result of from_csv

jtratner · 2013-08-18T12:10:03Z

but it still works:

In [63]: import pandas.io.data as web
    ...: df = web.get_data_yahoo('IBM', '1/1/2000', '1/1/2010')
    ...: print type(df.index),df.columns
    ...: df.to_csv('1.csv')
    ...: new_df = df.from_csv('1.csv')
    ...: print type(new_df.index),new_df.columns
<class 'pandas.tseries.index.DatetimeIndex'> Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype=object)
<class 'pandas.tseries.index.DatetimeIndex'> Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype=object)

ghost · 2013-08-18T12:35:27Z

I checked things in the console first, then misedited the snippet manually. my bad.

cpcloud · 2013-08-18T14:47:43Z

@y-p Why the close? I think the OP has a point w.r.t. datetimes. Should I open another issue for that? Or do you not agree?

ghost · 2013-08-18T15:24:37Z

What point is that? I don't see anything remaining. It's not even reproducible with the current version.
Re your comments on changing default behaviour here: I'm against.

But by all means, reopen if you want to.

jtratner · 2013-08-18T15:25:55Z

@cpcloud note that if you load the dataset from csv, it comes back as a datetime index (maybe it doesn't on older versions of pandas) so the specific problem OP brought up has been resolved or is due to some kind of error.

cpcloud · 2013-08-18T15:28:34Z

@jtratner No, it doesn't, which is the point I'm talking about. I'm aware of the parse_dates parameter.

In [9]: df = web.get_data_yahoo('IBM', '1/1/2000', '1/1/2010')

In [10]: df.to_csv('csv.csv')

In [11]: res = read_csv('csv.csv', index_col=0)

In [12]: type(res.index)
Out[12]: pandas.core.index.Index

In [13]: pd.__version__
Out[13]: '0.12.0-199-g4c8ad82'

cpcloud · 2013-08-18T15:33:25Z

@y-p For the record, I'm also against making index_col=0 the default, as per my comments above.

The previous comment does show an inconsistency with from_csv, which I thought was deprecated.

In [8]: type(df.from_csv('csv.csv', index_col=0).index)
Out[8]: pandas.tseries.index.DatetimeIndex

cpcloud · 2013-08-18T15:35:27Z

Was read_csv ever supposed to work like from_csv in this respect? I'll happily bisect it but I don't want to start any archaeology without knowing if this is on purpose, since we have the parse_dates parameter.

jreback · 2013-08-18T15:59:19Z

IIUC from_csv is supposed to be the inverse of to_csv (and to accomplish this it does call read_csv with a couple of arguments)

however, this is quite confusing and there is an issue to deprecate from_csv entirely

and as @y-p points out, the csv format is not an invariant format (at least if you don't want to tag it with extra metadata, which makes it pretty non-generic)

so what I think we should do is this

deprecate from_csv

and either

add a flavor argument to read_csv/to_csv that allows for invertibility (which basically provides a default that has parse_dates=[0])

or

just note in the docs/doc string that certain options are required to maintain an invariant reproduction
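
To make the two options above more concrete, here is a rough sketch of what an "invertible" read could look like. The helper name read_csv_roundtrip is hypothetical (it is not a pandas function); it only bundles the index_col and parse_dates options discussed in this thread, and it assumes the CSV was written by to_csv with a single-level index in the first column.

```python
import io

import pandas as pd


def read_csv_roundtrip(source, **kwargs):
    """Hypothetical helper, not part of pandas: read a CSV written by
    DataFrame.to_csv, treating column 0 as a (possibly date-like) index."""
    # Caller-supplied values win; these are only defaults.
    kwargs.setdefault("index_col", 0)
    kwargs.setdefault("parse_dates", True)
    return pd.read_csv(source, **kwargs)


df = pd.DataFrame(
    {"Adj Close": [10.0, 11.0, 12.0]},
    index=pd.DatetimeIndex(["2013-01-02", "2013-01-03", "2013-01-04"], name="Date"),
)

# Round-trip through an in-memory CSV instead of a file on disk.
buf = io.StringIO(df.to_csv())
res = read_csv_roundtrip(buf)
print(type(res.index).__name__, res.index.name)
```

A doc note would amount to telling users to pass the same two keyword arguments themselves; the wrapper just shows how little is involved.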

cpcloud · 2013-08-18T16:01:21Z

+1 for a doc note here

ghost · 2013-08-18T16:10:38Z

Both df.from_csv and pd.read_csv have parse_dates, with differing default values. horrors. (dupe #3418)
Am still personally fine with df->csv->df mangling dataframes in this way.

I would suggest just leaving things as they are. A doc note doesn't hurt (and doesn't help much either). If deprecate == remove, I think this is too small a wart to justify breaking existing code.

#4191

ghost · 2013-08-18T20:50:33Z

Folks:

Someone said that the phenomenon I mentioned is not reproducible in the current version. Not so. I am sorry not to have posted an example of what I meant.

In[5]: pd.__version__
Out[5]: '0.12.0'

In[6]: all_data = {}

In[7]: for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
  ...:     all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2013', '8/1/2013')

In[8]: price = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})

In[9]: price.index[0:3]
Out[9]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-02 00:00:00, ..., 2013-01-04 00:00:00]
Length: 3, Freq: None, Timezone: None

In[10]: price.to_csv('price2.csv')

In[11]: price2 = pd.read_csv('price2.csv')

In[12]: price2.index[0:3]
Out[12]: Int64Index([0, 1, 2], dtype=int64)

This is a paste from Canopy, where I have entered the 'In' statements by hand.

This example is basically from p. 139 of Wes McKinney's text. The point is, if you create the price DataFrame and then spend several hours writing functions to filter it, they will inevitably use the index as it exists, and then when you come back the next day after saving your data, your code will not work. This was my experience. Since I am far from bright about these things, it took me quite a while to figure out what was going on. So I think it is worth at least a 'note' somewhere.

It was pointed out, and I am grateful for this, that I should have used df.save() and df.load(). If you use these functions the form of the dataframe will indeed be preserved (although I got a deprecated notice when using load...). My point then is perhaps that people reading Wes's book will all have this experience. In Chapter 6 on Data Loading etc. and previously, the functions used to save and read dataframes are the ones I have used. I couldn't find that df.load and df.save were even mentioned there, though I may well be wrong. So maybe my comment is more for Wes, but also maybe the issue is not worth bothering him about, so I won't.

Respectfully,

Greg St. George


jtratner · 2013-08-18T21:22:58Z

Greg - sorry about that, I guess I was wrong about what was going on! I hope I didn't come off as too rude or dismissive as a result.

ghost · 2013-08-20T03:48:13Z

Jeff,

Not at all, no offense taken.

Greg


zrothberg · 2018-02-03T00:01:41Z

I would just like to note this is still an issue in 2018. Is there a workaround other than formatting the to_csv output?

jorisvandenbossche · 2018-02-03T10:09:13Z

And I would like to note that csv is not a format meant for perfect roundtripping, as you lose type information by definition. If you don't want to do manual formatting on reading in, there are plenty of methods that try to do better roundtripping (parquet, feather, json table schema, ...).
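
As a rough illustration of a type-preserving round trip, the sketch below uses to_pickle/read_pickle. Pickle is not one of the formats listed above; it is swapped in here only because it needs no optional dependency (parquet and feather need an engine such as pyarrow), and it is the replacement for the deprecated df.save()/df.load() mentioned earlier in the thread. The frame and file name are made up for the example.

```python
import os
import tempfile

import pandas as pd
from pandas.testing import assert_frame_equal

# Small stand-in for the "price" frame from earlier in the thread.
price = pd.DataFrame(
    {"AAPL": [1.0, 2.0, 3.0], "IBM": [4.0, 5.0, 6.0]},
    index=pd.DatetimeIndex(["2013-01-02", "2013-01-03", "2013-01-04"], name="Date"),
)

path = os.path.join(tempfile.mkdtemp(), "price.pkl")
price.to_pickle(path)
price2 = pd.read_pickle(path)

# Raises AssertionError if the index type, dtypes, or values differ.
assert_frame_equal(price, price2)
print(type(price2.index).__name__)
```

Pickle files are specific to Python and should only be loaded from trusted sources, so for data exchanged with other tools one of the formats named above is likely the better fit.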

ghost closed this as completed Aug 18, 2013

jreback mentioned this issue May 4, 2021

ENH: df.to_csv() and pd.read_csv() defaults do not give back original data #41311

Closed

jreback mentioned this issue Nov 27, 2021

ENH: Different behavior of pandas when saving and restoring from a CSV file #44639

Closed

This issue was closed.
