Support writing CSV to GCS #22704

Merged 2 commits on Oct 12, 2018
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.24.0.txt
@@ -174,7 +174,7 @@ Other Enhancements
 - :func:`to_csv` now supports ``compression`` keyword when a file handle is passed. (:issue:`21227`)
 - :meth:`Index.droplevel` is now implemented also for flat indexes, for compatibility with :class:`MultiIndex` (:issue:`21115`)
 - :meth:`Series.droplevel` and :meth:`DataFrame.droplevel` are now implemented (:issue:`20342`)
-- Added support for reading from Google Cloud Storage via the ``gcsfs`` library (:issue:`19454`)
+- Added support for reading from/writing to Google Cloud Storage via the ``gcsfs`` library (:issue:`19454`, :issue:`23094`)
 - :func:`to_gbq` and :func:`read_gbq` signature and documentation updated to
   reflect changes from the `Pandas-GBQ library version 0.6.0
   <https://pandas-gbq.readthedocs.io/en/latest/changelog.html#changelog-0-6-0>`__.
7 changes: 4 additions & 3 deletions pandas/io/formats/csvs.py
@@ -22,10 +22,9 @@
     ABCMultiIndex, ABCPeriodIndex, ABCDatetimeIndex, ABCIndexClass)

 from pandas.io.common import (
-    _expand_user,
     _get_handle,
     _infer_compression,
-    _stringify_path,
+    get_filepath_or_buffer,
     UnicodeWriter,
 )

@@ -45,7 +44,9 @@ def __init__(self, obj, path_or_buf=None, sep=",", na_rep='',
         if path_or_buf is None:
             path_or_buf = StringIO()

-        self.path_or_buf = _expand_user(_stringify_path(path_or_buf))
+        self.path_or_buf, _, _, _ = get_filepath_or_buffer(
+            path_or_buf, encoding=encoding, compression=compression, mode=mode
+        )
         self.sep = sep
         self.na_rep = na_rep
         self.float_format = float_format
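
The change above swaps plain path expansion for a helper that can also turn a remote URL into a file-like object. The following is only a rough sketch of that dispatch idea, not pandas's actual implementation: `resolve_target`, its `opener` parameter, and `fake_opener` are hypothetical names invented for illustration, and the sketch handles only the `gs://` prefix used in this PR's test.

```python
import io
import os


def resolve_target(path_or_buf, opener=None):
    """Hypothetical stand-in for the idea behind get_filepath_or_buffer.

    Remote URLs become file-like objects, local paths are expanded,
    and existing buffers pass through untouched.
    """
    if isinstance(path_or_buf, str) and path_or_buf.startswith("gs://"):
        if opener is None:
            raise ValueError("this sketch needs an opener for remote URLs")
        # A real implementation would hand the URL to a GCS client here.
        return opener(path_or_buf, "w")
    if isinstance(path_or_buf, (str, os.PathLike)):
        # Roughly the old behaviour: stringify the path and expand "~".
        return os.path.expanduser(os.fspath(path_or_buf))
    # Already a buffer (for example StringIO): use it as-is.
    return path_or_buf


# Usage with an in-memory opener standing in for a remote filesystem.
store = {}


def fake_opener(url, mode):
    buf = io.StringIO()
    store[url] = buf
    return buf


handle = resolve_target("gs://test/test.csv", opener=fake_opener)
handle.write("a,b\n1,2\n")
local = resolve_target("~/out.csv")
```

The point of the design is that the writer no longer needs to know where the bytes go: it receives either a path or an open handle and proceeds the same way.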
15 changes: 15 additions & 0 deletions pandas/tests/io/test_gcs.py
@@ -26,6 +26,21 @@ def test_read_csv_gcs(mock):
     assert_frame_equal(df1, df2)


+@td.skip_if_no('gcsfs')
+def test_to_csv_gcs(mock):
+    df1 = DataFrame({'int': [1, 3], 'float': [2.0, np.nan], 'str': ['t', 's'],
+                     'dt': date_range('2018-06-18', periods=2)})
+    with mock.patch('gcsfs.GCSFileSystem') as MockFileSystem:
Member:

I might be missing the point, but if you are patching this, what is actually being tested for GCS?

Contributor Author @bnaul (Sep 15, 2018):

Yeah, it's kind of the same problem that we discussed in #20729. This does at least test the logic that I touched here; I think ultimately what the mocks assume is that gcsfs.GCSFileSystem can read/write strings, and everything else uses the real pandas methods.
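
To make the exchange above concrete, here is a small standard-library sketch of the same mocking pattern. It is not the PR's test: `RemoteFileSystem`, `write_rows`, and `run_mocked_write` are hypothetical names, and the sketch assumes only that the filesystem object's `open` returns something that accepts strings. Everything downstream of that call (here, `csv.writer`) runs for real, which is the part the mock leaves under test.

```python
import csv
import io
from unittest import mock


class RemoteFileSystem:
    """Hypothetical remote client; a real one would talk to the network."""

    def open(self, path, mode="r"):
        raise RuntimeError("no network access in this sketch")


def write_rows(url, rows):
    """Write rows as CSV through whatever RemoteFileSystem() returns."""
    fs = RemoteFileSystem()
    handle = fs.open(url, "w")
    csv.writer(handle, lineterminator="\n").writerows(rows)


def run_mocked_write():
    sink = io.StringIO()
    MockFileSystem = mock.MagicMock()
    # The only assumption made by the mock: open() yields a string sink.
    MockFileSystem.return_value.open.return_value = sink
    # Swap the class in the globals that write_rows looks it up in.
    with mock.patch.dict(globals(), {"RemoteFileSystem": MockFileSystem}):
        write_rows("gs://test/test.csv", [["a", "b"], [1, 2]])
    MockFileSystem.return_value.open.assert_called_once_with(
        "gs://test/test.csv", "w"
    )
    return sink.getvalue()


written = run_mocked_write()
```

What such a test cannot show is that the real remote client behaves like the mock, which is the limitation the reviewer raises.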

+        s = StringIO()
+        instance = MockFileSystem.return_value
+        instance.open.return_value = s

+        df1.to_csv('gs://test/test.csv', index=True)
Member:

Any particular reason you are explicitly stating index=True here instead of using the default?

Contributor Author:

df.to_csv(f) and pd.read_csv(f) handle the index differently, so I wanted to be extra clear that the index is also being checked in the round trip.
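
The asymmetry mentioned in the reply can be seen with an in-memory buffer. This is a minimal sketch assuming pandas is installed; the frame is a cut-down version of the one in the test, and the variable names `naive` and `round_tripped` are only illustrative.

```python
from io import StringIO

import pandas as pd

df1 = pd.DataFrame({"int": [1, 3], "str": ["t", "s"]})

buf = StringIO()
# index=True is already the default for to_csv; it is spelled out for clarity.
df1.to_csv(buf, index=True)

# Without index_col, the written index comes back as an ordinary column.
naive = pd.read_csv(StringIO(buf.getvalue()))

# With index_col=0, the first column is restored as the index.
round_tripped = pd.read_csv(StringIO(buf.getvalue()), index_col=0)

print(list(naive.columns))
print(list(round_tripped.columns))
```

Only the second read gives back a frame with the same columns as `df1`, which is why the test pairs `index=True` on the write with `index_col=0` on the read.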

+        df2 = read_csv(StringIO(s.getvalue()), parse_dates=['dt'], index_col=0)
Member:

Related to above comment

Contributor Author:

See above


+    assert_frame_equal(df1, df2)


 @td.skip_if_no('gcsfs')
 def test_gcs_get_filepath_or_buffer(mock):
     df1 = DataFrame({'int': [1, 3], 'float': [2.0, np.nan], 'str': ['t', 's'],