Feature request: write and read dtypes in csv #19378

Closed
dkapitan opened this issue Jan 24, 2018 · 4 comments

@dkapitan

So here's something I would like. As an avid pandas user, I'd like to be able to write a DataFrame to CSV and read it back, including the dtype of each column.

Reading up on pandas, I thought this does the trick in the most Pythonic way:

import ast
import pandas as pd

# dataframe as example
df = pd.DataFrame(data={'int': [1, 2, 3],
                        'float': [1.0, 2.0, 3.0],
                        'bool': [True, False, True],
                        'date': ['2018-03-01', '1973-09-09', '2009-05-20',]},)
df.date = df.date.astype('datetime64[ns]')

# write .csv with comment that lists dtypes
with open('test.csv', 'w') as f:
    f.write('#' + str(df.dtypes.apply(lambda x: x.name).to_dict()) + '\n')
    df.to_csv(f, index=False, )

# read .csv with comment line to parse dates and dtypes
with open('test.csv', 'r') as f:
    type_header = f.readline()
    dtypes = ast.literal_eval(type_header[type_header.index('#') + 1:type_header.index('}\n') + 1])
    # datetime columns (including tz-aware ones) go to parse_dates, not dtype
    parse_dates = [k for k, v in dtypes.items() if v.startswith('datetime64')]
    dtypes = {k: v for k,v in dtypes.items() if k not in parse_dates}
    foo = pd.read_csv(f, comment='#', dtype=dtypes, parse_dates=parse_dates)

# compare the dtypes column by column
(foo.dtypes == df.dtypes).all()

Is this something worth including, or is it not generic enough, in which case I should just hack my own extension on the DataFrame class?

@chris-b1
Contributor

Really, I think your best bet is a storage format that knows types, like HDF5 or Parquet. Anything layered on top of CSV is going to be inherently ad hoc, and those formats also bring performance benefits.
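
For illustration, a minimal sketch of a Parquet round trip on a frame like the one in the original post. It assumes a Parquet engine such as pyarrow is installed; the helper name `parquet_roundtrip` is hypothetical, and it returns None when no engine is available rather than failing.

```python
import io

import pandas as pd

df = pd.DataFrame({'int': [1, 2, 3],
                   'float': [1.0, 2.0, 3.0],
                   'bool': [True, False, True],
                   'date': ['2018-03-01', '1973-09-09', '2009-05-20']})
df['date'] = pd.to_datetime(df['date'])


def parquet_roundtrip(frame):
    """Write frame to an in-memory Parquet buffer and read it back.

    Hypothetical helper; returns None if no Parquet engine is installed.
    """
    buf = io.BytesIO()
    try:
        frame.to_parquet(buf, index=False)
    except ImportError:
        return None  # neither pyarrow nor fastparquet is available
    buf.seek(0)
    return pd.read_parquet(buf)


restored = parquet_roundtrip(df)
if restored is not None:
    # dtypes come from the file's own schema, no dtype= or parse_dates= needed
    print(restored.dtypes)
```

The types travel inside the file itself, so the reader needs no extra arguments to recover them.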

@chris-b1
Contributor

If you need a human-readable format with type descriptions, the newer JSON Table Schema support would also be worth a look:
http://pandas.pydata.org/pandas-docs/stable/io.html#table-schema
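
As a rough sketch of what that looks like in practice, `to_json` and `read_json` both accept `orient='table'`, which embeds a schema next to the data. The frame below mirrors the one in the original post; reading back with `orient='table'` assumes a pandas version that supports it.

```python
import io

import pandas as pd

df = pd.DataFrame({'int': [1, 2, 3],
                   'float': [1.0, 2.0, 3.0],
                   'bool': [True, False, True],
                   'date': ['2018-03-01', '1973-09-09', '2009-05-20']})
df['date'] = pd.to_datetime(df['date'])

# serialize with an embedded Table Schema describing each field's type
text = df.to_json(orient='table')

# read it back; column types are taken from the embedded schema
restored = pd.read_json(io.StringIO(text), orient='table')
print(restored.dtypes)
```

The output stays human-readable text, and the type information lives in a `schema` key instead of a comment line.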

@TomAugspurger
Contributor

Agreed with @chris-b1, you might also look at http://csvy.org/ if you're unable to avoid CSVs.

In general, while pandas should be able to read all kinds of messy CSV formats, I don't think we should write invalid CSVs, which your example does with the "comment".

@dkapitan
Author

Thanks @TomAugspurger @chris-b1. Definitely better solutions to my needs.
