ENH: add fsspec support #34266

Merged
39 commits
94e717f
Add remote file io using fsspec.
Apr 14, 2020
fd7e072
Attempt refactor and clean
May 19, 2020
302ba13
Merge branch 'master' into feature/add-fsspec-support
May 20, 2020
9e6d3b2
readd and adapt s3/gcs tests
May 21, 2020
4564c8d
remove gc from test
May 21, 2020
0654537
Simpler is_fsspec
May 21, 2020
8d45cbb
add test
May 21, 2020
006e736
Answered most points
May 28, 2020
724ebd8
Implemented suggestions
May 28, 2020
9da1689
lint
May 28, 2020
a595411
Add versions info
May 29, 2020
6dd1e92
Update some deps
May 29, 2020
6e13df7
issue link syntax
May 29, 2020
3262063
More specific test versions
Jun 2, 2020
4bc2411
Account for alternate S3 protocols, and ignore type error
Jun 2, 2020
68644ab
Add comment to mypy ignore instruction
Jun 2, 2020
32bc586
more mypy
Jun 2, 2020
037ef2c
more black
Jun 2, 2020
c3c3075
Make storage_options a dict rather than swallowing kwargs
Jun 3, 2020
85d6452
More requested changes
Jun 5, 2020
263dd3b
Remove fsspec from locale tests
Jun 10, 2020
d0afbc3
tweak
Jun 10, 2020
6a587a5
Merge branch 'master' into feature/add-fsspec-support
Jun 10, 2020
b2992c1
Merge branch 'master' into feature/add-fsspec-support
Jun 11, 2020
9c03745
requested changes
Jun 11, 2020
7982e7b
add gcsfs to environment.yml
Jun 12, 2020
946297b
rerun deps script
Jun 12, 2020
145306e
Merge branch 'master' into feature/add-fsspec-support
Jun 12, 2020
06e5a3a
account for passed filesystem again
Jun 12, 2020
8f3854c
specify should_close
Jun 12, 2020
50c08c8
lint
Jun 12, 2020
9b20dc6
Except http passed to fsspec in parquet
Jun 12, 2020
eb90fe8
lint
Jun 12, 2020
b3e2cd2
Merge branch 'master' into feature/add-fsspec-support
Jun 16, 2020
4977a00
redo whatsnew
Jun 16, 2020
29a9785
simplify parquet write
Jun 18, 2020
565031b
Retry S3 file probe with timeout, in test_to_s3
Jun 18, 2020
606ce11
expand user in non-fsspec paths for parquet; add test for this
Jun 19, 2020
60b80a6
reorder imports!
Jun 19, 2020
2 changes: 0 additions & 2 deletions ci/deps/azure-36-locale.yaml
@@ -15,7 +15,6 @@ dependencies:

# pandas dependencies
- beautifulsoup4
- gcsfs
- html5lib
- ipython
- jinja2
@@ -31,7 +30,6 @@ dependencies:
- pytables
- python-dateutil
- pytz
- s3fs
- scipy
- xarray
- xlrd
1 change: 0 additions & 1 deletion ci/deps/azure-37-locale.yaml
@@ -27,7 +27,6 @@ dependencies:
- pytables
- python-dateutil
- pytz
- s3fs
- scipy
- xarray
- xlrd
5 changes: 3 additions & 2 deletions ci/deps/azure-windows-37.yaml
@@ -15,7 +15,8 @@ dependencies:
# pandas dependencies
- beautifulsoup4
- bottleneck
- gcsfs
- fsspec>=0.7.4
- gcsfs>=0.6.0
- html5lib
- jinja2
- lxml
@@ -28,7 +29,7 @@ dependencies:
- pytables
- python-dateutil
- pytz
- s3fs
- s3fs>=0.4.0
- scipy
- sqlalchemy
- xlrd
5 changes: 3 additions & 2 deletions ci/deps/travis-36-cov.yaml
@@ -18,7 +18,8 @@ dependencies:
- cython>=0.29.16
- dask
- fastparquet>=0.3.2
- gcsfs
- fsspec>=0.7.4
- gcsfs>=0.6.0
- geopandas
- html5lib
- matplotlib
@@ -35,7 +36,7 @@ dependencies:
- pytables
- python-snappy
- pytz
- s3fs
- s3fs>=0.4.0
- scikit-learn
- scipy
- sqlalchemy
2 changes: 0 additions & 2 deletions ci/deps/travis-36-locale.yaml
@@ -16,7 +16,6 @@ dependencies:
- blosc=1.14.3
- python-blosc
- fastparquet=0.3.2
- gcsfs=0.2.2
- html5lib
- ipython
- jinja2
@@ -33,7 +32,6 @@ dependencies:
- pytables
- python-dateutil
- pytz
- s3fs=0.3.0
- scipy
- sqlalchemy=1.1.4
- xarray=0.10
3 changes: 2 additions & 1 deletion ci/deps/travis-36-slow.yaml
@@ -13,6 +13,7 @@ dependencies:

# pandas dependencies
- beautifulsoup4
- fsspec>=0.7.4
- html5lib
- lxml
- matplotlib
@@ -25,7 +26,7 @@ dependencies:
- pytables
- python-dateutil
- pytz
- s3fs
- s3fs>=0.4.0
- scipy
- sqlalchemy
- xlrd
3 changes: 2 additions & 1 deletion ci/deps/travis-37.yaml
@@ -13,12 +13,13 @@ dependencies:

# pandas dependencies
- botocore>=1.11
- fsspec>=0.7.4
- numpy
- python-dateutil
- nomkl
- pyarrow
- pytz
- s3fs
- s3fs>=0.4.0
- tabulate
- pyreadstat
- pip
5 changes: 3 additions & 2 deletions doc/source/getting_started/install.rst
@@ -267,8 +267,9 @@ SQLAlchemy 1.1.4 SQL support for databases other tha
SciPy 0.19.0 Miscellaneous statistical functions
XLsxWriter 0.9.8 Excel writing
blosc Compression for HDF5
fsspec 0.7.4 Handling files aside from local and HTTP
fastparquet 0.3.2 Parquet reading / writing
gcsfs 0.2.2 Google Cloud Storage access
gcsfs 0.6.0 Google Cloud Storage access
html5lib HTML parser for read_html (see :ref:`note <optional_html>`)
lxml 3.8.0 HTML parser for read_html (see :ref:`note <optional_html>`)
matplotlib 2.2.2 Visualization
@@ -282,7 +283,7 @@ pyreadstat SPSS files (.sav) reading
pytables 3.4.3 HDF5 reading / writing
pyxlsb 1.0.6 Reading for xlsb files
qtpy Clipboard I/O
s3fs 0.3.0 Amazon S3 access
s3fs 0.4.0 Amazon S3 access
tabulate 0.8.3 Printing in Markdown-friendly format (see `tabulate`_)
xarray 0.8.2 pandas-like API for N-dimensional data
xclip Clipboard I/O on linux
126 changes: 125 additions & 1 deletion doc/source/whatsnew/v1.1.0.rst
@@ -246,6 +246,23 @@ If needed you can adjust the bins with the argument ``offset`` (a Timedelta) tha
For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.


fsspec now used for filesystem handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Apart from local files and reads over HTTP(S), all file reading and writing is now
dispatched through the optional dependency ``fsspec`` (:issue:`33452`).
S3 and GCS storage, which were already supported, behave as before because ``fsspec``
still brings in the same packages. In addition, this adds
support for several other storage implementations such as `Azure Datalake and Blob`_,
SSH, FTP, Dropbox and GitHub. For docs and capabilities, see the `fsspec docs`_.

.. _Azure Datalake and Blob: https://github.com/dask/adlfs

.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/
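
The dispatch rule in this release note can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code added by this PR: the function name `classify_path` and its three labels are hypothetical, and the sketch assumes that any string containing `://` with a scheme other than `http` or `https` is handed to fsspec.

```python
# Hypothetical helper illustrating the rule in the release note.
# Assumption: a path with no "://" is local, http(s) URLs keep their own
# handling, and every other scheme is dispatched to fsspec.
HTTP_SCHEMES = {"http", "https"}


def classify_path(path: str) -> str:
    """Return 'local', 'http' or 'fsspec' for a path or URL string."""
    if "://" not in path:
        # Plain filesystem paths, including Windows drive paths.
        return "local"
    scheme = path.split("://", 1)[0].lower()
    if scheme in HTTP_SCHEMES:
        return "http"
    # s3://, gcs://, ftp://, and any other protocol.
    return "fsspec"
```

Under these assumptions, `classify_path("s3://bucket/key.csv")` yields `"fsspec"`, while `classify_path("data.csv")` yields `"local"`.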

.. _whatsnew_110.enhancements.other:

Other enhancements
@@ -297,10 +314,117 @@ Other enhancements

.. ---------------------------------------------------------------------------

.. _whatsnew_110.api:
Increased minimum versions for dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some minimum supported versions of dependencies were updated (:issue:`33718`, :issue:`29766`, :issue:`29723`, pytables >= 3.4.3).
If installed, we now require:

+-----------------+-----------------+----------+---------+
| Package         | Minimum Version | Required | Changed |
+=================+=================+==========+=========+
| numpy           | 1.15.4          |    X     |    X    |
+-----------------+-----------------+----------+---------+
| pytz            | 2015.4          |    X     |         |
+-----------------+-----------------+----------+---------+
| python-dateutil | 2.7.3           |    X     |    X    |
+-----------------+-----------------+----------+---------+
| bottleneck      | 1.2.1           |          |         |
+-----------------+-----------------+----------+---------+
| numexpr         | 2.6.2           |          |         |
+-----------------+-----------------+----------+---------+
| pytest (dev)    | 4.0.2           |          |         |
+-----------------+-----------------+----------+---------+

For `optional libraries <https://dev.pandas.io/docs/install.html#dependencies>`_ the general recommendation is to use the latest version.
The following table lists the lowest version per library that is currently being tested throughout the development of pandas.
Optional libraries below the lowest tested version may still work, but are not considered supported.

+-----------------+-----------------+---------+
| Package         | Minimum Version | Changed |
+=================+=================+=========+
| beautifulsoup4  | 4.6.0           |         |
+-----------------+-----------------+---------+
| fastparquet     | 0.3.2           |         |
+-----------------+-----------------+---------+
| fsspec          | 0.7.4           |    X    |
+-----------------+-----------------+---------+
| gcsfs           | 0.6.0           |    X    |
+-----------------+-----------------+---------+
| lxml            | 3.8.0           |         |
+-----------------+-----------------+---------+
| matplotlib      | 2.2.2           |         |
+-----------------+-----------------+---------+
| numba           | 0.46.0          |         |
+-----------------+-----------------+---------+
| openpyxl        | 2.5.7           |         |
+-----------------+-----------------+---------+
| pyarrow         | 0.13.0          |         |
+-----------------+-----------------+---------+
| pymysql         | 0.7.1           |         |
+-----------------+-----------------+---------+
| pytables        | 3.4.3           |    X    |
+-----------------+-----------------+---------+
| s3fs            | 0.4.0           |    X    |
+-----------------+-----------------+---------+
| scipy           | 1.2.0           |    X    |
+-----------------+-----------------+---------+
| sqlalchemy      | 1.1.4           |         |
+-----------------+-----------------+---------+
| xarray          | 0.8.2           |         |
+-----------------+-----------------+---------+
| xlrd            | 1.1.0           |         |
+-----------------+-----------------+---------+
| xlsxwriter      | 0.9.8           |         |
+-----------------+-----------------+---------+
| xlwt            | 1.2.0           |         |
+-----------------+-----------------+---------+
| pandas-gbq      | 1.2.0           |    X    |
+-----------------+-----------------+---------+

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Development Changes
^^^^^^^^^^^^^^^^^^^

- The minimum version of Cython is now the most recent bug-fix version (0.29.16) (:issue:`33334`).

.. _whatsnew_110.api.other:

Other API changes
^^^^^^^^^^^^^^^^^

- :meth:`Series.describe` will now show distribution percentiles for ``datetime`` dtypes, statistics ``first`` and ``last``
will now be ``min`` and ``max`` to match with numeric dtypes in :meth:`DataFrame.describe` (:issue:`30164`)
- Added :meth:`DataFrame.value_counts` (:issue:`5377`)
- :meth:`Groupby.groups` now returns an abbreviated representation when called on large dataframes (:issue:`1135`)
- ``loc`` lookups with an object-dtype :class:`Index` and an integer key will now raise ``KeyError`` instead of ``TypeError`` when key is missing (:issue:`31905`)
- Using a :func:`pandas.api.indexers.BaseIndexer` with ``count``, ``min``, ``max``, ``median``, ``skew``, ``cov``, ``corr`` will now return correct results for any monotonic :func:`pandas.api.indexers.BaseIndexer` descendant (:issue:`32865`)
- Added a :func:`pandas.api.indexers.FixedForwardWindowIndexer` class to support forward-looking windows during ``rolling`` operations.
-

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- :meth:`DataFrame.swaplevel` now raises a ``TypeError`` if the axis is not a :class:`MultiIndex`.
Previously an ``AttributeError`` was raised (:issue:`31126`)
- :meth:`DataFrame.xs` now raises a ``TypeError`` if a ``level`` keyword is supplied and the axis is not a :class:`MultiIndex`.
Previously an ``AttributeError`` was raised (:issue:`33610`)
- :meth:`DataFrameGroupby.mean` and :meth:`SeriesGroupby.mean` (and similarly for :meth:`~DataFrameGroupby.median`, :meth:`~DataFrameGroupby.std` and :meth:`~DataFrameGroupby.var`)
now raise a ``TypeError`` if a not-accepted keyword argument is passed into it.
Previously an ``UnsupportedFunctionCall`` was raised (``AssertionError`` if ``min_count`` was passed into :meth:`~DataFrameGroupby.median`) (:issue:`31485`)
- :meth:`DataFrame.at` and :meth:`Series.at` will raise a ``TypeError`` instead of a ``ValueError`` if an incompatible key is passed, and ``KeyError`` if a missing key is passed, matching the behavior of ``.loc[]`` (:issue:`31722`)
- Passing an integer dtype other than ``int64`` to ``np.array(period_index, dtype=...)`` will now raise ``TypeError`` instead of incorrectly using ``int64`` (:issue:`32255`)
- Passing an invalid ``fill_value`` to :meth:`Categorical.take` raises a ``ValueError`` instead of ``TypeError`` (:issue:`33660`)
- Combining a ``Categorical`` with integer categories and which contains missing values
with a float dtype column in operations such as :func:`concat` or :meth:`~DataFrame.append`
will now result in a float column instead of an object dtyped column (:issue:`33607`)
- :meth:`Series.to_timestamp` now raises a ``TypeError`` if the axis is not a :class:`PeriodIndex`. Previously an ``AttributeError`` was raised (:issue:`33327`)
- :meth:`Series.to_period` now raises a ``TypeError`` if the axis is not a :class:`DatetimeIndex`. Previously an ``AttributeError`` was raised (:issue:`33327`)
- :func:`pandas.api.dtypes.is_string_dtype` no longer incorrectly identifies categorical series as string.
- :func:`read_excel` no longer takes ``**kwds`` arguments. This means that passing in keyword ``chunksize`` now raises a ``TypeError``
(previously raised a ``NotImplementedError``), while passing in keyword ``encoding`` now raises a ``TypeError`` (:issue:`34464`)
- :func:`merge` now checks that the ``suffixes`` parameter is a ``tuple`` and raises ``TypeError`` otherwise; previously a ``list`` or ``set`` was accepted, and a ``set`` could produce unexpected results (:issue:`33740`)
- :class:`Period` no longer accepts tuples for the ``freq`` argument (:issue:`34658`)

``MultiIndex.get_indexer`` interprets `method` argument differently
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4 changes: 3 additions & 1 deletion environment.yml
@@ -98,7 +98,9 @@ dependencies:

- pyqt>=5.9.2 # pandas.read_clipboard
- pytables>=3.4.3 # pandas.read_hdf, DataFrame.to_hdf
- s3fs # pandas.read_csv... when using 's3://...' path
- s3fs>=0.4.0 # file IO when using 's3://...' path
- fsspec>=0.7.4 # for generic remote file operations
- gcsfs>=0.6.0 # file IO when using 'gcs://...' path
- sqlalchemy # pandas.read_sql, DataFrame.to_sql
- xarray # DataFrame.to_xarray
- cftime # Needed for downstream xarray.CFTimeIndex test
5 changes: 3 additions & 2 deletions pandas/compat/_optional.py
@@ -8,8 +8,9 @@
VERSIONS = {
"bs4": "4.6.0",
"bottleneck": "1.2.1",
"fsspec": "0.7.4",
"fastparquet": "0.3.2",
"gcsfs": "0.2.2",
"gcsfs": "0.6.0",
"lxml.etree": "3.8.0",
"matplotlib": "2.2.2",
"numexpr": "2.6.2",
@@ -20,7 +21,7 @@
"pytables": "3.4.3",
"pytest": "5.0.1",
"pyxlsb": "1.0.6",
"s3fs": "0.3.0",
"s3fs": "0.4.0",
"scipy": "1.2.0",
"sqlalchemy": "1.1.4",
"tables": "3.4.3",
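
A minimum-version mapping like `VERSIONS` is useful only if installed versions are compared against it. The sketch below shows one way such a check could work, using the three minimums touched in this diff. The names `MINIMUMS`, `parse_version` and `meets_minimum` are hypothetical and this is not the pandas implementation; it also assumes purely numeric dotted version strings, so suffixes such as `rc1` are out of scope.

```python
from typing import Dict, Tuple

# Minimums copied from the VERSIONS entries added or changed in this diff.
MINIMUMS: Dict[str, str] = {"fsspec": "0.7.4", "gcsfs": "0.6.0", "s3fs": "0.4.0"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a numeric dotted version such as '0.7.4' into (0, 7, 4)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(name: str, installed: str) -> bool:
    """Return True if `installed` is at least the minimum recorded for `name`."""
    return parse_version(installed) >= parse_version(MINIMUMS[name])
```

Comparing integer tuples rather than raw strings matters here: as text, `"0.10.0"` sorts before `"0.7.4"`, whereas `(0, 10, 0)` correctly compares greater than `(0, 7, 4)`.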