BUG: Don't overflow PeriodIndex in to_csv #15984

Merged
merged 2 commits into from
Apr 13, 2017
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -1344,6 +1344,7 @@ I/O
- Bug in ``pd.read_csv()`` in which invalid values for ``nrows`` and ``chunksize`` were allowed (:issue:`15767`)
- Bug in ``pd.read_csv()`` for the Python engine in which unhelpful error messages were being raised when parsing errors occurred (:issue:`15910`)
- Bug in ``pd.read_csv()`` in which the ``skipfooter`` parameter was not being properly validated (:issue:`15925`)
- Bug in ``pd.to_csv()`` in which there was numeric overflow when a timestamp index was being written (:issue:`15982`)
Contributor

timestamp -> Period I think.

Member Author

@gfyoung gfyoung Apr 12, 2017

I used timestamp because that seemed more clear (to the everyday user) than PeriodIndex. What do you think?

- Bug in ``pd.tools.hashing.hash_pandas_object()`` in which hashing of categoricals depended on the ordering of categories, instead of just their values. (:issue:`15143`)
- Bug in ``.to_json()`` where ``lines=True`` and contents (keys or values) contain escaped characters (:issue:`15096`)
- Bug in ``.to_json()`` causing single byte ascii characters to be expanded to four byte unicode (:issue:`15344`)
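
To illustrate the ``to_csv`` entry above, here is a minimal sketch of the kind of frame that triggered the overflow, assuming a recent pandas release. The variable names are illustrative and not taken from the PR. Dates later than `pd.Timestamp.max` cannot be stored as nanosecond timestamps, so a `PeriodIndex` holding such dates cannot safely be converted to timestamps before it is written.

```python
import pandas as pd

# Daily periods; the last one lies far beyond the nanosecond Timestamp range.
index = pd.PeriodIndex(["1990-01-01", "2000-01-01", "3005-01-01"], freq="D")
df = pd.DataFrame([4, 5, 6], index=index)

# Year of the largest nanosecond-resolution timestamp pandas can represent.
timestamp_limit_year = pd.Timestamp.max.year

# With the fix in place, the period labels are written as-is.
# splitlines() keeps the check independent of the platform line terminator.
csv_rows = df.to_csv().splitlines()
for row in csv_rows:
    print(row)
```

With the fix, the year-3005 row should be written with its original label instead of a wrapped-around date.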
5 changes: 1 addition & 4 deletions pandas/formats/format.py
@@ -1564,10 +1564,7 @@ def __init__(self, obj, path_or_buf=None, sep=",", na_rep='',
         self.chunksize = int(chunksize)

         self.data_index = obj.index
-        if isinstance(obj.index, PeriodIndex):
-            self.data_index = obj.index.to_timestamp()
-
-        if (isinstance(self.data_index, DatetimeIndex) and
+        if (isinstance(self.data_index, (DatetimeIndex, PeriodIndex)) and
                 date_format is not None):
             self.data_index = Index([x.strftime(date_format) if notnull(x) else
                                      '' for x in self.data_index])
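
The effect of this hunk can be sketched outside of pandas internals. The helper below, `format_index_labels`, is a hypothetical stand-in (it is not part of pandas or of this PR) that applies the same per-element `strftime` to a `PeriodIndex` directly, with no intermediate `to_timestamp()` conversion. It assumes a recent pandas release and uses the public `pd.notna` in place of the internal `notnull` import.

```python
import pandas as pd


def format_index_labels(index, date_format):
    """Hypothetical helper mirroring the list comprehension in the diff.

    Each non-null element is formatted with its own strftime method;
    null elements (pd.NaT) become empty strings.
    """
    return pd.Index(
        [x.strftime(date_format) if pd.notna(x) else "" for x in index]
    )


# Period objects carry their own strftime, so a date past the Timestamp
# range never has to be converted to a nanosecond timestamp.
periods = pd.PeriodIndex(["1990-01-01", pd.NaT, "3005-01-01"], freq="D")
labels = format_index_labels(periods, "%m-%d-%Y")
print(list(labels))
```

Because `Period` and `Timestamp` both expose `strftime`, one branch can serve `DatetimeIndex` and `PeriodIndex` alike, which is what the widened `isinstance` check relies on.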
21 changes: 20 additions & 1 deletion pandas/indexes/base.py
@@ -1820,7 +1820,26 @@ def _format_with_header(self, header, na_rep='NaN', **kwargs):
         return header + result

     def to_native_types(self, slicer=None, **kwargs):
-        """ slice and dice then format """
+        """
+        Format specified values of `self` and return them.
+        Parameters
+        ----------
+        slicer : int, array-like
+            An indexer into `self` that specifies which values
+            are used in the formatting process.
+        kwargs : dict
+            Options for specifying how the values should be formatted.
Contributor

I think it would be OK to actually list these options (with their defaults) in the signature itself. I don't recall why I didn't do this originally. Follow-up PR for this, though.

Member Author

@gfyoung gfyoung Apr 13, 2017

Derived classes change the signature with different defaults. That's why **kwargs is provided in the base implementation.

+            These options include the following:
+            1) na_rep : str
+                The value that serves as a placeholder for NULL values
+            2) quoting : bool or None
+                Whether or not there are quoted values in `self`
+            3) date_format : str
+                The format used to represent date-like values
+        """
+
         values = self
         if slicer is not None:
             values = values[slicer]
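
The slice-then-format behaviour documented in this docstring can be sketched with a small standalone function. `native_strings` below is a hypothetical illustration, not the pandas implementation: it covers only the `slicer`, `na_rep` and `date_format` options, ignores `quoting`, and assumes a `PeriodIndex` input on a recent pandas release.

```python
import numpy as np
import pandas as pd


def native_strings(index, slicer=None, na_rep="NaT", date_format=None):
    """Hypothetical sketch: slice `index`, then format each value as a string."""
    values = index if slicer is None else index[slicer]
    out = []
    for x in values:
        if pd.isna(x):
            # Null entries are replaced by the placeholder.
            out.append(na_rep)
        elif date_format is not None:
            out.append(x.strftime(date_format))
        else:
            out.append(str(x))
    return np.array(out, dtype=object)


idx = pd.PeriodIndex(["2017-01-01", pd.NaT, "2017-01-03"], freq="D")
print(native_strings(idx))
print(native_strings(idx, na_rep="pandas"))
print(native_strings(idx, slicer=[0, 2], date_format="%m-%Y-%d"))
```

As the author notes in the review thread, derived index classes override the method with their own defaults, so this sketch fixes one set of defaults purely for illustration.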
28 changes: 28 additions & 0 deletions pandas/tests/frame/test_to_csv.py
@@ -1143,3 +1143,31 @@ def test_to_csv_quoting(self):
        df = df.set_index(['a', 'b'])
        expected = '"a","b","c"\n"1","3","5"\n"2","4","6"\n'
        self.assertEqual(df.to_csv(quoting=csv.QUOTE_ALL), expected)

    def test_period_index_date_overflow(self):
Contributor

Can you also test NaT? I think we have very few tests for outputting a PI.

Member Author

Yep, done.

        # see gh-15982

        dates = ["1990-01-01", "2000-01-01", "3005-01-01"]
        index = pd.PeriodIndex(dates, freq="D")

        df = pd.DataFrame([4, 5, 6], index=index)
        result = df.to_csv()

        expected = ',0\n1990-01-01,4\n2000-01-01,5\n3005-01-01,6\n'
        assert result == expected

        date_format = "%m-%d-%Y"
        result = df.to_csv(date_format=date_format)

        expected = ',0\n01-01-1990,4\n01-01-2000,5\n01-01-3005,6\n'
        assert result == expected

        # Overflow with pd.NaT
        dates = ["1990-01-01", pd.NaT, "3005-01-01"]
        index = pd.PeriodIndex(dates, freq="D")

        df = pd.DataFrame([4, 5, 6], index=index)
        result = df.to_csv()

        expected = ',0\n1990-01-01,4\n,5\n3005-01-01,6\n'
        assert result == expected
47 changes: 47 additions & 0 deletions pandas/tests/indexes/datetimes/test_formats.py
@@ -0,0 +1,47 @@
from pandas import DatetimeIndex

import numpy as np

import pandas.util.testing as tm
import pandas as pd


def test_to_native_types():
    index = DatetimeIndex(freq='1D', periods=3, start='2017-01-01')

    # First, with no arguments.
    expected = np.array(['2017-01-01', '2017-01-02',
                         '2017-01-03'], dtype=object)

    result = index.to_native_types()
    tm.assert_numpy_array_equal(result, expected)

    # No NaN values, so na_rep has no effect
    result = index.to_native_types(na_rep='pandas')
    tm.assert_numpy_array_equal(result, expected)

    # Make sure slicing works
    expected = np.array(['2017-01-01', '2017-01-03'], dtype=object)

    result = index.to_native_types([0, 2])
    tm.assert_numpy_array_equal(result, expected)

    # Make sure date formatting works
    expected = np.array(['01-2017-01', '01-2017-02',
                         '01-2017-03'], dtype=object)

    result = index.to_native_types(date_format='%m-%Y-%d')
    tm.assert_numpy_array_equal(result, expected)

    # NULL object handling should work
    index = DatetimeIndex(['2017-01-01', pd.NaT, '2017-01-03'])
    expected = np.array(['2017-01-01', 'NaT', '2017-01-03'], dtype=object)

    result = index.to_native_types()
    tm.assert_numpy_array_equal(result, expected)

    expected = np.array(['2017-01-01', 'pandas',
                         '2017-01-03'], dtype=object)

    result = index.to_native_types(na_rep='pandas')
    tm.assert_numpy_array_equal(result, expected)
48 changes: 48 additions & 0 deletions pandas/tests/indexes/period/test_formats.py
@@ -0,0 +1,48 @@
from pandas import PeriodIndex

import numpy as np

import pandas.util.testing as tm
import pandas as pd


def test_to_native_types():
    index = PeriodIndex(['2017-01-01', '2017-01-02',
                         '2017-01-03'], freq='D')

    # First, with no arguments.
    expected = np.array(['2017-01-01', '2017-01-02',
                         '2017-01-03'], dtype='<U10')

    result = index.to_native_types()
    tm.assert_numpy_array_equal(result, expected)

    # No NaN values, so na_rep has no effect
    result = index.to_native_types(na_rep='pandas')
    tm.assert_numpy_array_equal(result, expected)

    # Make sure slicing works
    expected = np.array(['2017-01-01', '2017-01-03'], dtype='<U10')

    result = index.to_native_types([0, 2])
    tm.assert_numpy_array_equal(result, expected)

    # Make sure date formatting works
    expected = np.array(['01-2017-01', '01-2017-02',
                         '01-2017-03'], dtype='<U10')

    result = index.to_native_types(date_format='%m-%Y-%d')
    tm.assert_numpy_array_equal(result, expected)

    # NULL object handling should work
    index = PeriodIndex(['2017-01-01', pd.NaT, '2017-01-03'], freq='D')
    expected = np.array(['2017-01-01', 'NaT', '2017-01-03'], dtype=object)

    result = index.to_native_types()
    tm.assert_numpy_array_equal(result, expected)

    expected = np.array(['2017-01-01', 'pandas',
                         '2017-01-03'], dtype=object)

    result = index.to_native_types(na_rep='pandas')
    tm.assert_numpy_array_equal(result, expected)