[WIP] ENH: add Pyarrow csv engine #38370

Closed · wants to merge 119 commits
Changes from 60 commits

119 commits
f22ff46
add arrow engine to read_csv
lithomas1 Feb 9, 2020
8ae43e4
fix failing test
lithomas1 Feb 9, 2020
09074df
formatting and revert unnecessary change
lithomas1 Feb 9, 2020
6be276d
remove bloat and more formatting changes
lithomas1 Feb 9, 2020
df4fa7e
Whatsnew
lithomas1 Feb 9, 2020
9cd9a6f
Merge remote-tracking branch 'upstream/master' into add-arrow-engine
lithomas1 Feb 9, 2020
ecaf3fd
Get tests up and running
lithomas1 Feb 10, 2020
b3c3287
Some fixes
lithomas1 Feb 10, 2020
474baf4
Add asvs and xfail some tests
lithomas1 Feb 11, 2020
2cd9937
address comments
lithomas1 Feb 20, 2020
48ff255
Merge branch 'master' into add-arrow-engine
lithomas1 Feb 20, 2020
3d15a56
fix typo
lithomas1 Feb 20, 2020
c969373
Merge branch 'add-arrow-engine' of github-other.com:lithomas1/pandas …
lithomas1 Feb 20, 2020
98aa134
some fixes
lithomas1 Feb 29, 2020
b9c6d2c
Fix bug
lithomas1 Apr 5, 2020
67c5db6
Fix merge conflicts
lithomas1 Apr 5, 2020
7f891a6
New benchmark and fix more tests
lithomas1 Apr 10, 2020
11fc737
Merge branch 'master' into add-arrow-engine
lithomas1 Apr 10, 2020
23425f7
More cleanups
lithomas1 Apr 10, 2020
d9b7a1f
Merge master
lithomas1 Apr 10, 2020
b8adf3c
Merge branch 'add-arrow-engine' of github-other.com:lithomas1/pandas …
lithomas1 Apr 11, 2020
01c0394
Formatting fixes and typo correction
lithomas1 Apr 11, 2020
ba5620f
skip pyarrow tests if not installed
lithomas1 Apr 12, 2020
2570c82
Address comments
lithomas1 Apr 12, 2020
b3a1f66
Get some more tests to pass
lithomas1 Apr 14, 2020
d46ceed
Fix some bugs and cleanups
lithomas1 Apr 17, 2020
d67925c
Merge branch 'master' into add-arrow-engine
lithomas1 Apr 17, 2020
6378459
Perform version checks for submodule imports too
lithomas1 May 20, 2020
9d64882
Refresh with newer pyarrow
lithomas1 May 20, 2020
852ecf9
Merge branch 'master' into add-arrow-engine
lithomas1 May 20, 2020
93382b4
Start xfailing tests
lithomas1 May 21, 2020
f1bb4e2
Get all tests to run & some fixes
lithomas1 May 27, 2020
14c13ab
Merge branch 'master' into add-arrow-engine
lithomas1 May 27, 2020
7876b4e
Lint and CI
lithomas1 May 29, 2020
4426642
Merge branch 'master' into add-arrow-engine
lithomas1 May 29, 2020
008acab
parse_dates support and fixups of some tests
lithomas1 Jun 3, 2020
2dddae7
Date parsing fixes and address comments
lithomas1 Jun 13, 2020
261ef6a
Merge branch 'master' into add-arrow-engine
lithomas1 Jun 13, 2020
88e200a
Clean/Address comments/Update docs
lithomas1 Jun 29, 2020
bf063ab
Merge branch 'master' into add-arrow-engine
lithomas1 Jun 29, 2020
ede2799
Fix typo
lithomas1 Jun 29, 2020
e8eff08
Fix doc failures
lithomas1 Jul 8, 2020
87cfcf5
Merge remote-tracking branch 'upstream/master' into add-arrow-engine
simonjayhawkins Oct 22, 2020
55139ee
wip
simonjayhawkins Oct 22, 2020
c1aeecf
more xfails and skips
simonjayhawkins Oct 22, 2020
62fc9d6
Merge branch 'master' into add-arrow-engine
lithomas1 Oct 28, 2020
b53a620
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 28, 2020
f13113d
Fix typos
lithomas1 Oct 28, 2020
f9ce2e4
Doc fixes and more typo fixes
lithomas1 Oct 28, 2020
4158d6a
Green?
lithomas1 Nov 2, 2020
d34e75f
Merge branch 'master' into add-arrow-engine
lithomas1 Nov 17, 2020
6a37695
merge master
arw2019 Dec 8, 2020
10be581
xfail tests
arw2019 Dec 8, 2020
fcc7e04
xfail test
arw2019 Dec 8, 2020
d7959a1
fix import
arw2019 Dec 8, 2020
e37d126
xfail tests
arw2019 Dec 8, 2020
3bc4775
skip tests
arw2019 Dec 9, 2020
7097bcb
merge
arw2019 Dec 9, 2020
17a502d
skip tests
arw2019 Dec 9, 2020
e27d7ef
C408 failure
arw2019 Dec 9, 2020
4e638e9
skip tests
arw2019 Dec 9, 2020
4f7ebd0
simplify import_optional_dependency code
arw2019 Dec 10, 2020
69b3b42
move whatsnew to 1.3
arw2019 Dec 10, 2020
9d5cf24
clean _get_options_with_defaults
arw2019 Dec 10, 2020
2d4a0aa
clean _clean_options
arw2019 Dec 10, 2020
e46b95d
clean _read
arw2019 Dec 10, 2020
1844a6c
extract kwd validation from __init__
arw2019 Dec 10, 2020
94178e4
revert mistaken refactor
arw2019 Dec 11, 2020
13a2488
typing
arw2019 Dec 11, 2020
a98cffd
REF: ArrowParserWrapper.read
arw2019 Dec 11, 2020
a32e3a5
REF: ArrowParserWrapper.read
arw2019 Dec 11, 2020
89416cc
remove optional dependency code
arw2019 Dec 12, 2020
a1bd010
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Dec 12, 2020
9687990
REF: ArrowParserWrapper.read
arw2019 Dec 12, 2020
98f2061
REF: ArrowParserWrapper.read
arw2019 Dec 12, 2020
ec01fad
REF: ArrowParserWrapper.read
arw2019 Dec 12, 2020
7b9572b
rewrite docs
arw2019 Dec 12, 2020
6773a71
rewrite docs
arw2019 Dec 12, 2020
d63f5d0
remove datetime hadling
arw2019 Dec 12, 2020
9ff95ad
skiprows cannot be None
arw2019 Dec 12, 2020
6133a4c
REF: ArrowParserWrapper.read
arw2019 Dec 12, 2020
454892f
REF: ArrowParserWrapper.read
arw2019 Dec 12, 2020
e050394
skip all pyarrow csv datetime tests
arw2019 Dec 12, 2020
09fca60
rewrite benchmarks
arw2019 Dec 12, 2020
7aa5378
merge master
arw2019 Dec 16, 2020
ac3cf7d
merge master
arw2019 Dec 19, 2020
f9bf5f1
typo
arw2019 Dec 19, 2020
922bf4f
typo
arw2019 Dec 19, 2020
1252a05
test reorg
arw2019 Dec 19, 2020
0af7291
test reorg
arw2019 Dec 19, 2020
361aab6
test reorg
arw2019 Dec 19, 2020
75de071
test reorg
arw2019 Dec 19, 2020
a1dfcb2
test reorg
arw2019 Dec 19, 2020
2433170
test reorg
arw2019 Dec 19, 2020
1a9f185
test reorg
arw2019 Dec 19, 2020
16d37db
test reorg
arw2019 Dec 19, 2020
e124df0
pyarrow_xfail->pyarrow_skip
arw2019 Dec 31, 2020
75d099b
merge master
arw2019 Dec 31, 2020
fe253ba
pyarrow_xfail->pyarrow_skip
arw2019 Dec 31, 2020
72c7c44
pyarrow_xfail->pyarrow_skip
arw2019 Dec 31, 2020
2671007
xfail more tests
arw2019 Dec 31, 2020
73ca5d4
xfail more tests
arw2019 Dec 31, 2020
0666042
merge master
arw2019 Jan 1, 2021
639ca28
update refactoredt tests
arw2019 Jan 1, 2021
1994fad
float precision tests
arw2019 Jan 2, 2021
566f1b4
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
cd9b300
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
4a7dc0f
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
3b24fe7
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
c33bf46
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
dc9530b
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
d83b2e0
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
6205bed
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
c4b3bb7
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
a77b33e
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
04c8d21
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
8bb6959
TST/REF: io/parsers/test_common.py
arw2019 Jan 2, 2021
d9478d6
TST/REF: remove test_common.py
arw2019 Jan 2, 2021
565f71f
merge master
arw2019 Jan 4, 2021
67 changes: 49 additions & 18 deletions asv_bench/benchmarks/io/csv.py
@@ -1,4 +1,4 @@
from io import StringIO
from io import BytesIO, StringIO
import random
import string

@@ -146,10 +146,10 @@ def time_read_csv(self, bad_date_value):
class ReadCSVSkipRows(BaseIO):

fname = "__test__.csv"
params = [None, 10000]
param_names = ["skiprows"]
params = ([None, 10000], ["c", "pyarrow"])
param_names = ["skiprows", "engine"]

def setup(self, skiprows):
def setup(self, skiprows, engine):
N = 20000
index = tm.makeStringIndex(N)
df = DataFrame(
@@ -164,8 +164,8 @@ def setup(self, skiprows):
)
df.to_csv(self.fname)

def time_skipprows(self, skiprows):
read_csv(self.fname, skiprows=skiprows)
def time_skipprows(self, skiprows, engine):
read_csv(self.fname, skiprows=skiprows, engine=engine)


class ReadUint64Integers(StringIORewind):
@@ -254,9 +254,33 @@ def time_read_csv_python_engine(self, sep, decimal, float_precision):
names=list("abc"),
)

def time_read_csv_arrow(self, sep, decimal, float_precision):
read_csv(
self.data(self.StringIO_input),
sep=sep,
header=None,
engine="pyarrow",
names=list("abc"),
)

class ReadCSVCategorical(BaseIO):

class ReadCSVEngine(StringIORewind):
params = ["c", "python", "pyarrow"]
param_names = ["engine"]

def setup(self, engine):
data = ["A,B,C,D,E"] + (["1,2,3,4,5"] * 100000)
self.StringIO_input = StringIO("\n".join(data))
# simulate reading from file
self.BytesIO_input = BytesIO(self.StringIO_input.read().encode("utf-8"))

def time_read_stringcsv(self, engine):
read_csv(self.data(self.StringIO_input), engine=engine)

def time_read_bytescsv(self, engine):
read_csv(self.data(self.BytesIO_input), engine=engine)


class ReadCSVCategorical(BaseIO):
fname = "__test__.csv"

def setup(self):
@@ -273,7 +297,10 @@ def time_convert_direct(self):


class ReadCSVParseDates(StringIORewind):
def setup(self):
params = ["c", "python"]
param_names = ["engine"]

def setup(self, engine):
data = """{},19:00:00,18:56:00,0.8100,2.8100,7.2000,0.0000,280.0000\n
{},20:00:00,19:56:00,0.0100,2.2100,7.2000,0.0000,260.0000\n
{},21:00:00,20:56:00,-0.5900,2.2100,5.7000,0.0000,280.0000\n
@@ -284,18 +311,20 @@ def setup(self):
data = data.format(*two_cols)
self.StringIO_input = StringIO(data)

def time_multiple_date(self):
def time_multiple_date(self, engine):
read_csv(
self.data(self.StringIO_input),
engine=engine,
sep=",",
header=None,
names=list(string.digits[:9]),
parse_dates=[[1, 2], [1, 3]],
)

def time_baseline(self):
def time_baseline(self, engine):
read_csv(
self.data(self.StringIO_input),
engine=engine,
sep=",",
header=None,
parse_dates=[1],
@@ -304,17 +333,18 @@ def time_baseline(self):


class ReadCSVCachedParseDates(StringIORewind):
params = ([True, False],)
param_names = ["do_cache"]
params = ([True, False], ["c", "pyarrow", "python"])
param_names = ["do_cache", "engine"]

def setup(self, do_cache):
def setup(self, do_cache, engine):
data = ("\n".join(f"10/{year}" for year in range(2000, 2100)) + "\n") * 10
self.StringIO_input = StringIO(data)

def time_read_csv_cached(self, do_cache):
def time_read_csv_cached(self, do_cache, engine):
try:
read_csv(
self.data(self.StringIO_input),
engine=engine,
header=None,
parse_dates=[0],
cache_dates=do_cache,
@@ -344,22 +374,23 @@ def mem_parser_chunks(self):


class ReadCSVParseSpecialDate(StringIORewind):
params = (["mY", "mdY", "hm"],)
param_names = ["value"]
params = (["mY", "mdY", "hm"], ["c", "pyarrow", "python"])
param_names = ["value", "engine"]
objects = {
"mY": "01-2019\n10-2019\n02/2000\n",
"mdY": "12/02/2010\n",
"hm": "21:34\n",
}

def setup(self, value):
def setup(self, value, engine):
count_elem = 10000
data = self.objects[value] * count_elem
self.StringIO_input = StringIO(data)

def time_read_special_date(self, value):
def time_read_special_date(self, value, engine):
read_csv(
self.data(self.StringIO_input),
engine=engine,
sep=",",
header=None,
names=["Date"],
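The benchmark changes above parameterize the existing asv classes over an ``engine`` argument. As a rough standalone illustration of what ReadCSVEngine measures (outside the asv harness; it assumes a pandas build that includes this PR and pyarrow >= 0.15 installed):

# Sketch only: time the same in-memory CSV through each parser engine.
from io import BytesIO, StringIO
from timeit import timeit

from pandas import read_csv

data = "\n".join(["A,B,C,D,E"] + ["1,2,3,4,5"] * 100000)

for engine in ["c", "python", "pyarrow"]:
    # Build a fresh buffer on every call, since read_csv consumes it.
    elapsed = timeit(lambda: read_csv(StringIO(data), engine=engine), number=5)
    print(f"{engine:>8}: {elapsed / 5:.3f} s per read")

# Reading from a bytes buffer mirrors ReadCSVEngine.time_read_bytescsv,
# which simulates reading from a file on disk.
read_csv(BytesIO(data.encode("utf-8")), engine="c")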
25 changes: 17 additions & 8 deletions doc/source/user_guide/io.rst
@@ -158,9 +158,11 @@ dtype : Type name or dict of column -> type, default ``None``
(unsupported with ``engine='python'``). Use ``str`` or ``object`` together
with suitable ``na_values`` settings to preserve and
not interpret dtype.
engine : {``'c'``, ``'python'``}
Parser engine to use. The C engine is faster while the Python engine is
currently more feature-complete.
engine : {``'c'``, ``'pyarrow'``, ``'python'``}
Contributor review comment: versionadded 1.3

Parser engine to use. In terms of performance, the pyarrow engine,
which requires ``pyarrow`` >= 0.15.0, is faster than the C engine, which
in turn is faster than the Python engine. However, the pyarrow and C engines
are currently less feature-complete than their Python counterpart.
converters : dict, default ``None``
Dict of functions for converting values in certain columns. Keys can either be
integers or column labels.
@@ -1602,11 +1604,18 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object:
Specifying the parser engine
''''''''''''''''''''''''''''

Under the hood pandas uses a fast and efficient parser implemented in C as well
as a Python implementation which is currently more feature-complete. Where
possible pandas uses the C parser (specified as ``engine='c'``), but may fall
back to Python if C-unsupported options are specified. Currently, C-unsupported
options include:
Currently, pandas supports three engines: the C engine, the Python engine,
and an optional pyarrow engine (requires ``pyarrow`` >= 0.15.0). In terms of performance,
the pyarrow engine is fastest, followed by the C and Python engines. However,
the pyarrow engine is much less robust than the C engine, which in turn lacks a
few features present in the Python parser.

Where possible pandas uses the C parser (specified as ``engine='c'``), but may fall
back to Python if C-unsupported options are specified. If pyarrow-unsupported options are
specified while using ``engine='pyarrow'``, the parser raises an error rather than falling back
(a full list of unsupported options is available at ``pandas.io.parsers._pyarrow_unsupported``).

Currently, C-unsupported options include:

* ``sep`` other than a single character (e.g. regex separators)
* ``skipfooter``
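To make the engine selection described above concrete, here is a minimal sketch, assuming a pandas build that contains this PR; the use of ``skipfooter`` as a pyarrow-unsupported option is an assumption based on the C-unsupported list above:

from io import StringIO

import pandas as pd

csv_data = "a,b,c\n1,2,3\n4,5,6"

# Pick an engine explicitly; by default pandas chooses between "c" and "python".
df_c = pd.read_csv(StringIO(csv_data), engine="c")
df_pa = pd.read_csv(StringIO(csv_data), engine="pyarrow")

# With engine="pyarrow", an unsupported option raises instead of silently
# falling back to another parser.
try:
    pd.read_csv(StringIO(csv_data), engine="pyarrow", skipfooter=1)
except ValueError as err:
    print(err)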
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -252,6 +252,7 @@ If needed you can adjust the bins with the argument ``offset`` (a :class:`Timede

For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.


fsspec now used for filesystem handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

5 changes: 5 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -249,6 +249,11 @@ example where the index name is preserved:
The same is true for :class:`MultiIndex`, but the logic is applied separately on a
level-by-level basis.

read_csv() now accepts pyarrow as an engine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:func:`pandas.read_csv` now accepts ``engine="pyarrow"`` as an argument, allowing for faster CSV parsing on multicore machines
with ``pyarrow`` >= 0.15 installed. See the :doc:`I/O docs </user_guide/io>` for more info (:issue:`23697`).

.. _whatsnew_120.groupby_ewm:

Groupby supports EWM operations directly
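For context on where the speedup in the whatsnew entry comes from: the new engine hands the actual parsing to ``pyarrow.csv``, which reads in parallel. The snippet below is only a rough sketch of the kind of translation the PR's ArrowParserWrapper performs; the option mapping is illustrative, not the PR's exact code:

from io import BytesIO

from pyarrow import csv as pa_csv

buf = BytesIO(b"a,b,c\n1,2,3\n4,5,6\n")

table = pa_csv.read_csv(
    buf,
    read_options=pa_csv.ReadOptions(
        use_threads=True,  # multithreaded parsing is the source of the speedup
        skip_rows=0,       # roughly read_csv(skiprows=...)
    ),
    parse_options=pa_csv.ParseOptions(delimiter=","),  # roughly read_csv(sep=...)
    convert_options=pa_csv.ConvertOptions(
        include_columns=["a", "b"],  # roughly read_csv(usecols=...)
    ),
)
df = table.to_pandas()  # hand the parsed Table back to pandas as a DataFrame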
26 changes: 20 additions & 6 deletions pandas/compat/_optional.py
@@ -1,6 +1,8 @@
import distutils.version
import importlib
import sys
import types
from typing import Optional
import warnings

# Update install.rst when updating versions!
@@ -43,6 +45,7 @@
"pandas_gbq": "pandas-gbq",
"sqlalchemy": "SQLAlchemy",
"jinja2": "Jinja2",
"pyarrow.csv": "pyarrow",
}


@@ -58,7 +61,11 @@ def _get_version(module: types.ModuleType) -> str:


def import_optional_dependency(
name: str, extra: str = "", raise_on_missing: bool = True, on_version: str = "raise"
name: str,
extra: str = "",
raise_on_missing: bool = True,
on_version: str = "raise",
min_version: Optional[str] = None,
):
"""
Import an optional dependency.
@@ -70,8 +77,7 @@ def import_optional_dependency(
Parameters
----------
name : str
The module name. This should be top-level only, so that the
version may be checked.
The module name.
extra : str
Additional text to include in the ImportError message.
raise_on_missing : bool, default True
@@ -85,6 +91,8 @@ def import_optional_dependency(
* ignore: Return the module, even if the version is too old.
It's expected that users validate the version locally when
using ``on_version="ignore"`` (see ``io/html.py``)
min_version : str, optional
Specify a minimum version that overrides the default lookup in ``VERSIONS``.

Returns
-------
@@ -109,10 +117,16 @@
raise ImportError(msg) from None
else:
return None

minimum_version = VERSIONS.get(name)
# Handle submodules: if we have a submodule, grab the parent module from sys.modules
parent = name.split(".")[0]
if parent != name:
install_name = parent
module_to_get = sys.modules[install_name]
else:
module_to_get = module
minimum_version = min_version if min_version is not None else VERSIONS.get(name)
if minimum_version:
version = _get_version(module)
version = _get_version(module_to_get)
if distutils.version.LooseVersion(version) < minimum_version:
assert on_version in {"warn", "raise", "ignore"}
msg = (
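The hunk above makes import_optional_dependency work for dotted names such as "pyarrow.csv": the version check is performed on the parent package found in sys.modules, and the new min_version argument overrides the global VERSIONS lookup. A self-contained sketch of that logic, assuming the package exposes __version__ (the real helper goes through _get_version):

import importlib
import sys
from distutils.version import LooseVersion
from typing import Optional

# Trimmed-down copy of the VERSIONS table, for illustration only.
VERSIONS = {"pyarrow": "0.15.0"}


def check_optional(name: str, min_version: Optional[str] = None):
    module = importlib.import_module(name)  # e.g. "pyarrow.csv"
    parent = name.split(".")[0]             # -> "pyarrow"
    # For a submodule, the version lives on the parent package, which the
    # import above has already placed in sys.modules.
    module_to_check = sys.modules[parent] if parent != name else module
    minimum = min_version if min_version is not None else VERSIONS.get(name)
    if minimum and LooseVersion(module_to_check.__version__) < LooseVersion(minimum):
        raise ImportError(
            f"pandas requires {parent}>={minimum}, found {module_to_check.__version__}"
        )
    return module


pa_csv = check_optional("pyarrow.csv", min_version="0.15.0")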