Skip to content

BUG: Don't overflow PeriodIndex in to_csv #15984

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 13, 2017

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Apr 12, 2017

Don't overflow time-stamp objects if being used as index in to_csv.

Closes #15982.

@@ -1260,6 +1260,7 @@ I/O
- Bug in ``pd.read_csv()`` in which invalid values for ``nrows`` and ``chunksize`` were allowed (:issue:`15767`)
- Bug in ``pd.read_csv()`` for the Python engine in which unhelpful error messages were being raised when parsing errors occurred (:issue:`15910`)
- Bug in ``pd.read_csv()`` in which the ``skipfooter`` parameter was not being properly validated (:issue:`15925`)
- Bug in ``pd.to_csv()`` in which there was numeric overflow when a timestamp index was being written (:issue:`15982`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timestamp -> Period I think.

Copy link
Member Author

@gfyoung gfyoung Apr 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used timestamp because that seemed more clear (to the everyday user) than PeriodIndex. What do you think?

@codecov
Copy link

codecov bot commented Apr 12, 2017

Codecov Report

Merging #15984 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15984      +/-   ##
==========================================
+ Coverage      91%      91%   +<.01%     
==========================================
  Files         145      145              
  Lines       49641    49639       -2     
==========================================
  Hits        45175    45175              
+ Misses       4466     4464       -2
Flag Coverage Δ
#multiple 88.77% <100%> (ø) ⬆️
#single 40.53% <100%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/indexes/base.py 96.09% <ø> (ø) ⬆️
pandas/formats/format.py 95.4% <100%> (+0.05%) ⬆️
pandas/core/common.py 91.03% <0%> (+0.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f2ed595...5939a24. Read the comment docs.

@@ -1143,3 +1143,23 @@ def test_to_csv_quoting(self):
df = df.set_index(['a', 'b'])
expected = '"a","b","c"\n"1","3","5"\n"2","4","6"\n'
self.assertEqual(df.to_csv(quoting=csv.QUOTE_ALL), expected)

def test_period_index_date_overflow(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also test NaT. I think we have very few tests for outputing a PI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

@jreback jreback added IO CSV read_csv, to_csv Period Period data type labels Apr 12, 2017
@jreback
Copy link
Contributor

jreback commented Apr 12, 2017

can you also tests .to_native_types() for a PI (and DTI if not already tested).

@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2017

can you also tests .to_native_types() for a PI (and DTI if not already tested).

Slightly outside of the scope of this PR, but I can squeeze that in with another commit.

@jreback
Copy link
Contributor

jreback commented Apr 12, 2017

Slightly outside of the scope of this PR, but I can squeeze that in with another commit.

Ha, tests are never out of scope!

@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2017

Ha, tests are never out of scope!

Well, I was focusing to_csv, not to_native_types 😄

@gfyoung gfyoung force-pushed the period-datetime-overflow branch 3 times, most recently from 20cdf1e to 06cf781 Compare April 12, 2017 22:00
@gfyoung
Copy link
Member Author

gfyoung commented Apr 13, 2017

@jreback , @TomAugspurger : All comments have been addressed, and everything is green.

An indexer into `self` that specifies which values
are used in the formatting process.
kwargs : dict
Options for specifying how the values should be formatted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think would be ok to actually list these options (with there defautls) in the signature itself. I don't recall why I didn't do this originally. followup PR for this though.

Copy link
Member Author

@gfyoung gfyoung Apr 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derived classes change the signature with different defaults. That's why **kwargs is provided in the base implementation.

df = pd.DataFrame([4, 5, 6], index=index)

buf = StringIO()
df.to_csv(buf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are looking for something to clean :>

there is a bit of a dichotomy between this test file and pandas/formats/test_formats.py where this tests 'higher' level functionaility and test_formats somewhat lower, but this is probably not consistent, and certainly not documented (maybe just a comment at the top of the file would be enough).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...I'm not sure I really see any issue with this right now. What sort of comment (and which file) were you thinking exactly?

expected = np.array(['01-2017-01', '01-2017-02',
'01-2017-03'], dtype=object)

result = index.to_native_types(date_format='%m-%Y-%d')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, can you move these to test_formats.py (new file), in this directory, same with the period ones. I don't these are actually tested anywhere :>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they aren't, but they are now with this PR 😄

index = PeriodIndex(['2017-01-01', '2017-01-02',
'2017-01-03'], freq='D')

# First, with no arguments.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

@jreback
Copy link
Contributor

jreback commented Apr 13, 2017

mostly about putting tests in better places, otherwise lgtm.

index = pd.PeriodIndex(dates, freq="D")
df = pd.DataFrame([4, 5, 6], index=index)

buf = StringIO()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering outloud whether this particular testing idiom could be improved somehow. a possibilitiy

with tm.assert_format_equal(expected) as buf:
     df.to_csv(buf)

we do this a LOT in testing output formatting (suggestions on name welcome), certainly a separate PR (to do changes in the rest of the codebase), though could be introduced here if you'd like

Copy link
Member Author

@gfyoung gfyoung Apr 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now that I look at this, since df.to_csv DOES return if buf is None, complicating the testing with writing to a buffer seems unnecessary. In the future, we should encourage any tests of to_csv to do the following:

expected = "foo"
result = df.to_csv()
assert result == expected

That makes your point somewhat moot, as I think this way is much cleaner.

@gfyoung gfyoung force-pushed the period-datetime-overflow branch 3 times, most recently from ac7e911 to 7165fdb Compare April 13, 2017 15:41
@gfyoung
Copy link
Member Author

gfyoung commented Apr 13, 2017

@jreback : Build machines running into SSL errors when cloning repository. [ci skip] for now.

@gfyoung gfyoung force-pushed the period-datetime-overflow branch from 7165fdb to 5939a24 Compare April 13, 2017 16:36
@gfyoung
Copy link
Member Author

gfyoung commented Apr 13, 2017

Addressed all comments, and everything is green.

@jreback jreback added this to the 0.20.0 milestone Apr 13, 2017
@jreback jreback merged commit 7ee73ff into pandas-dev:master Apr 13, 2017
@jreback
Copy link
Contributor

jreback commented Apr 13, 2017

thanks @gfyoung !

@gfyoung gfyoung deleted the period-datetime-overflow branch April 13, 2017 22:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants