Commit df87e14

BUG: Slice Arrow buffer before passing it to numpy (#40896)
Merge branch 'master' into issue-40896
2 parents ff85a80 + 3513f59 commit df87e14

95 files changed

+1797
-540
lines changed

.pre-commit-config.yaml

+2-2
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@ repos:
3535
exclude: ^pandas/_libs/src/(klib|headers)/
3636
args: [--quiet, '--extensions=c,h', '--headers=h', --recursive, '--filter=-readability/casting,-runtime/int,-build/include_subdir']
3737
- repo: https://gitlab.com/pycqa/flake8
38-
rev: 3.9.0
38+
rev: 3.9.1
3939
hooks:
4040
- id: flake8
4141
additional_dependencies:
@@ -75,7 +75,7 @@ repos:
7575
hooks:
7676
- id: yesqa
7777
additional_dependencies:
78-
- flake8==3.9.0
78+
- flake8==3.9.1
7979
- flake8-comprehensions==3.1.0
8080
- flake8-bugbear==21.3.2
8181
- pandas-dev-flaker==0.2.0

LICENSE

+1-1
@@ -3,7 +3,7 @@ BSD 3-Clause License
33
Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
44
All rights reserved.
55

6-
Copyright (c) 2011-2020, Open source contributors.
6+
Copyright (c) 2011-2021, Open source contributors.
77

88
Redistribution and use in source and binary forms, with or without
99
modification, are permitted provided that the following conditions are met:

asv_bench/benchmarks/groupby.py

+28
@@ -505,6 +505,34 @@ def time_frame_agg(self, dtype, method):
505505
self.df.groupby("key").agg(method)
506506

507507

508+
class CumminMax:
509+
param_names = ["dtype", "method"]
510+
params = [
511+
["float64", "int64", "Float64", "Int64"],
512+
["cummin", "cummax"],
513+
]
514+
515+
def setup(self, dtype, method):
516+
N = 500_000
517+
vals = np.random.randint(-10, 10, (N, 5))
518+
null_vals = vals.astype(float, copy=True)
519+
null_vals[::2, :] = np.nan
520+
null_vals[::3, :] = np.nan
521+
df = DataFrame(vals, columns=list("abcde"), dtype=dtype)
522+
null_df = DataFrame(null_vals, columns=list("abcde"), dtype=dtype)
523+
keys = np.random.randint(0, 100, size=N)
524+
df["key"] = keys
525+
null_df["key"] = keys
526+
self.df = df
527+
self.null_df = null_df
528+
529+
def time_frame_transform(self, dtype, method):
530+
self.df.groupby("key").transform(method)
531+
532+
def time_frame_transform_many_nulls(self, dtype, method):
533+
self.null_df.groupby("key").transform(method)
534+
535+
508536
class RankWithTies:
509537
# GH 21237
510538
param_names = ["dtype", "tie_method"]
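Note on the `CumminMax` benchmark above: it times `groupby(...).transform("cummin")` and `transform("cummax")` on both NumPy and nullable dtypes. A minimal sketch of the operation being timed, using a small hand-made frame (the values here are illustrative assumptions, not the benchmark's random data):

```python
import pandas as pd

# Tiny hand-made frame; the benchmark uses 500_000 random rows instead.
df = pd.DataFrame(
    {
        "key": [0, 0, 1, 0, 1],
        "a": pd.array([3, 1, 5, 2, 4], dtype="Int64"),
    }
)

# transform() returns one row per input row, computed within each "key" group.
running_min = df.groupby("key").transform("cummin")
running_max = df.groupby("key").transform("cummax")

print(running_min["a"].tolist())
print(running_max["a"].tolist())
```

Rows with `key == 0` and rows with `key == 1` are accumulated independently, which is why the result keeps the original row order.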

doc/source/_static/style/hq_ax1.png

5.95 KB
5.96 KB

doc/source/_static/style/hq_props.png

6.09 KB

doc/source/development/roadmap.rst

+2-2
@@ -71,8 +71,8 @@ instead of comparing as False).
7171

7272
Long term, we want to introduce consistent missing data handling for all data
7373
types. This includes consistent behavior in all operations (indexing, arithmetic
74-
operations, comparisons, etc.). We want to eventually make the new semantics the
75-
default.
74+
operations, comparisons, etc.). There has been discussion of eventually making
75+
the new semantics the default.
7676

7777
This has been discussed at
7878
`github #28095 <https://github.com/pandas-dev/pandas/issues/28095>`__ (and

doc/source/getting_started/install.rst

+15
@@ -362,6 +362,21 @@ pyarrow 0.15.0 Parquet, ORC, and feather reading /
362362
pyreadstat SPSS files (.sav) reading
363363
========================= ================== =============================================================
364364

365+
.. _install.warn_orc:
366+
367+
.. warning::
368+
369+
* If you want to use :func:`~pandas.read_orc`, it is highly recommended to install pyarrow using conda.
370+
The following table summarizes the environments in which :func:`~pandas.read_orc` works.
371+
372+
========================= ================== =============================================================
373+
System Conda PyPI
374+
========================= ================== =============================================================
375+
Linux Successful Failed(pyarrow==3.0 Successful)
376+
macOS Successful Failed
377+
Windows Failed Failed
378+
========================= ================== =============================================================
379+
365380
Access data in the cloud
366381
^^^^^^^^^^^^^^^^^^^^^^^^
367382

doc/source/getting_started/intro_tutorials/01_table_oriented.rst

+1-1
@@ -176,7 +176,7 @@ these are by default not taken into account by the :func:`~DataFrame.describe` m
176176

177177
Many pandas operations return a ``DataFrame`` or a ``Series``. The
178178
:func:`~DataFrame.describe` method is an example of a pandas operation returning a
179-
pandas ``Series``.
179+
pandas ``Series`` or a pandas ``DataFrame``.
180180

181181
.. raw:: html
182182

doc/source/user_guide/io.rst

+5
@@ -5443,6 +5443,11 @@ Similar to the :ref:`parquet <io.parquet>` format, the `ORC Format <https://orc.
54435443
for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the
54445444
ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library.
54455445

5446+
.. warning::
5447+
5448+
* It is *highly recommended* to install pyarrow using conda due to some issues caused by pyarrow.
5449+
* :func:`~pandas.read_orc` is not supported on Windows yet; you can find valid environments in :ref:`install optional dependencies <install.warn_orc>`.
5450+
54465451
.. _io.sql:
54475452

54485453
SQL queries
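Note on the ORC warning above: code that calls `read_orc` may want to guard against the unsupported environments listed in the install table. A minimal sketch, assuming that "not Windows and `pyarrow.orc` imports" is a good enough proxy for a working reader; the helper name `orc_reader_available` and the file name in the usage comment are hypothetical:

```python
import platform


def orc_reader_available() -> bool:
    """Best-effort check for a usable ORC reader (hypothetical helper)."""
    # The warning above states that read_orc is not supported on Windows yet.
    if platform.system() == "Windows":
        return False
    try:
        # Assumption: a successful import means pyarrow was built with ORC.
        import pyarrow.orc  # noqa: F401
    except ImportError:
        return False
    return True


# Usage (the path is a placeholder):
# if orc_reader_available():
#     df = pd.read_orc("example.orc")
print(orc_reader_available())
```

This only checks that the dependency imports; it does not prove that a particular ORC file can be read.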

doc/source/user_guide/style.ipynb

+61-16
@@ -1006,7 +1006,30 @@
10061006
"cell_type": "markdown",
10071007
"metadata": {},
10081008
"source": [
1009-
"We expect certain styling functions to be common enough that we've included a few \"built-in\" to the `Styler`, so you don't have to write them yourself."
1009+
"Some styling functions are common enough that we've \"built them in\" to the `Styler`, so you don't have to write them and apply them yourself. The current list of such functions is:\n",
1010+
"\n",
1011+
" - [.highlight_null][nullfunc]: for use with identifying missing data. \n",
1012+
" - [.highlight_min][minfunc] and [.highlight_max][maxfunc]: for use with identifying extremities in data.\n",
1013+
" - [.highlight_between][betweenfunc] and [.highlight_quantile][quantilefunc]: for use with identifying classes within data.\n",
1014+
" - [.background_gradient][bgfunc]: a flexible method for highlighting cells based on their, or other, values on a numeric scale.\n",
1015+
" - [.bar][barfunc]: to display mini-charts within cell backgrounds.\n",
1016+
" \n",
1017+
"The individual documentation on each function often gives more examples of their arguments.\n",
1018+
"\n",
1019+
"[nullfunc]: ../reference/api/pandas.io.formats.style.Styler.highlight_null.rst\n",
1020+
"[minfunc]: ../reference/api/pandas.io.formats.style.Styler.highlight_min.rst\n",
1021+
"[maxfunc]: ../reference/api/pandas.io.formats.style.Styler.highlight_max.rst\n",
1022+
"[betweenfunc]: ../reference/api/pandas.io.formats.style.Styler.highlight_between.rst\n",
1023+
"[quantilefunc]: ../reference/api/pandas.io.formats.style.Styler.highlight_quantile.rst\n",
1024+
"[bgfunc]: ../reference/api/pandas.io.formats.style.Styler.background_gradient.rst\n",
1025+
"[barfunc]: ../reference/api/pandas.io.formats.style.Styler.bar.rst"
1026+
]
1027+
},
1028+
{
1029+
"cell_type": "markdown",
1030+
"metadata": {},
1031+
"source": [
1032+
"### Highlight Null"
10101033
]
10111034
},
10121035
{
@@ -1017,14 +1040,14 @@
10171040
"source": [
10181041
"df2.iloc[0,2] = np.nan\n",
10191042
"df2.iloc[4,3] = np.nan\n",
1020-
"df2.loc[:4].style.highlight_null(null_color='red')"
1043+
"df2.loc[:4].style.highlight_null(null_color='yellow')"
10211044
]
10221045
},
10231046
{
10241047
"cell_type": "markdown",
10251048
"metadata": {},
10261049
"source": [
1027-
"You can create \"heatmaps\" with the `background_gradient` method. These require matplotlib, and we'll use [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) to get a nice colormap."
1050+
"### Highlight Min or Max"
10281051
]
10291052
},
10301053
{
@@ -1033,17 +1056,15 @@
10331056
"metadata": {},
10341057
"outputs": [],
10351058
"source": [
1036-
"import seaborn as sns\n",
1037-
"cm = sns.light_palette(\"green\", as_cmap=True)\n",
1038-
"\n",
1039-
"df2.style.background_gradient(cmap=cm)"
1059+
"df2.loc[:4].style.highlight_max(axis=1, props='color:white; font-weight:bold; background-color:darkblue;')"
10401060
]
10411061
},
10421062
{
10431063
"cell_type": "markdown",
10441064
"metadata": {},
10451065
"source": [
1046-
"`Styler.background_gradient` takes the keyword arguments `low` and `high`. Roughly speaking these extend the range of your data by `low` and `high` percent so that when we convert the colors, the colormap's entire range isn't used. This is useful so that you can actually read the text still."
1066+
"### Highlight Between\n",
1067+
"This method accepts ranges as float, or NumPy arrays or Series provided the indexes match."
10471068
]
10481069
},
10491070
{
@@ -1052,8 +1073,16 @@
10521073
"metadata": {},
10531074
"outputs": [],
10541075
"source": [
1055-
"# Uses the full color range\n",
1056-
"df2.loc[:4].style.background_gradient(cmap='viridis')"
1076+
"left = pd.Series([1.0, 0.0, 1.0], index=[\"A\", \"B\", \"D\"])\n",
1077+
"df2.loc[:4].style.highlight_between(left=left, right=1.5, axis=1, props='color:white; background-color:purple;')"
1078+
]
1079+
},
1080+
{
1081+
"cell_type": "markdown",
1082+
"metadata": {},
1083+
"source": [
1084+
"### Highlight Quantile\n",
1085+
"Useful for detecting the highest or lowest percentile values"
10571086
]
10581087
},
10591088
{
@@ -1062,17 +1091,21 @@
10621091
"metadata": {},
10631092
"outputs": [],
10641093
"source": [
1065-
"# Compress the color range\n",
1066-
"df2.loc[:4].style\\\n",
1067-
" .background_gradient(cmap='viridis', low=.5, high=0)\\\n",
1068-
" .highlight_null('red')"
1094+
"df2.loc[:4].style.highlight_quantile(q_left=0.85, axis=None, color='yellow')"
1095+
]
1096+
},
1097+
{
1098+
"cell_type": "markdown",
1099+
"metadata": {},
1100+
"source": [
1101+
"### Background Gradient"
10691102
]
10701103
},
10711104
{
10721105
"cell_type": "markdown",
10731106
"metadata": {},
10741107
"source": [
1075-
"There's also `.highlight_min` and `.highlight_max`, which is almost identical to the user defined version we created above, and also a `.highlight_null` method. "
1108+
"You can create \"heatmaps\" with the `background_gradient` method. These require matplotlib, and we'll use [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) to get a nice colormap."
10761109
]
10771110
},
10781111
{
@@ -1081,7 +1114,19 @@
10811114
"metadata": {},
10821115
"outputs": [],
10831116
"source": [
1084-
"df2.loc[:4].style.highlight_max(axis=0)"
1117+
"import seaborn as sns\n",
1118+
"cm = sns.light_palette(\"green\", as_cmap=True)\n",
1119+
"\n",
1120+
"df2.style.background_gradient(cmap=cm)"
1121+
]
1122+
},
1123+
{
1124+
"cell_type": "markdown",
1125+
"metadata": {},
1126+
"source": [
1127+
"[.background_gradient][bgfunc] has a number of keyword arguments to customise the gradients and colors. See its documentation.\n",
1128+
"\n",
1129+
"[bgfunc]: ../reference/api/pandas.io.formats.style.Styler.background_gradient.rst"
10851130
]
10861131
},
10871132
{
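Note on the "Highlight Quantile" cell above: `highlight_quantile(q_left=0.85, axis=None, ...)` marks the cells in the top part of the whole frame. The sketch below approximates that selection as a boolean mask with plain NumPy. It is an illustration under the assumption of linear quantile interpolation with an inclusive bound, not the `Styler` implementation, and the frame is made up for the example:

```python
import numpy as np
import pandas as pd

# Made-up data: values 0..19 with one missing cell.
df = pd.DataFrame(np.arange(20, dtype=float).reshape(5, 4), columns=list("ABCD"))
df.iloc[0, 2] = np.nan

# axis=None in the notebook call means one quantile over every cell.
threshold = np.nanquantile(df.to_numpy(), 0.85)

# Cells at or above the threshold; missing cells compare as False.
mask = df >= threshold
print(threshold, int(mask.to_numpy().sum()))

# The notebook call from the diff would then be (requires jinja2):
# df.style.highlight_quantile(q_left=0.85, axis=None, color="yellow")
```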

doc/source/user_guide/visualization.rst

+26-6
@@ -1458,25 +1458,23 @@ Horizontal and vertical error bars can be supplied to the ``xerr`` and ``yerr``
14581458
* As a ``str`` indicating which of the columns of plotting :class:`DataFrame` contain the error values.
14591459
* As raw values (``list``, ``tuple``, or ``np.ndarray``). Must be the same length as the plotting :class:`DataFrame`/:class:`Series`.
14601460

1461-
Asymmetrical error bars are also supported, however raw error values must be provided in this case. For a ``N`` length :class:`Series`, a ``2xN`` array should be provided indicating lower and upper (or left and right) errors. For a ``MxN`` :class:`DataFrame`, asymmetrical errors should be in a ``Mx2xN`` array.
1462-
14631461
Here is an example of one way to easily plot group means with standard deviations from the raw data.
14641462

14651463
.. ipython:: python
14661464
14671465
# Generate the data
14681466
ix3 = pd.MultiIndex.from_arrays(
14691467
[
1470-
["a", "a", "a", "a", "b", "b", "b", "b"],
1471-
["foo", "foo", "bar", "bar", "foo", "foo", "bar", "bar"],
1468+
["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"],
1469+
["foo", "foo", "foo", "bar", "bar", "foo", "foo", "bar", "bar", "bar"],
14721470
],
14731471
names=["letter", "word"],
14741472
)
14751473
14761474
df3 = pd.DataFrame(
14771475
{
1478-
"data1": [3, 2, 4, 3, 2, 4, 3, 2],
1479-
"data2": [6, 5, 7, 5, 4, 5, 6, 5],
1476+
"data1": [9, 3, 2, 4, 3, 2, 4, 6, 3, 2],
1477+
"data2": [9, 6, 5, 7, 5, 4, 5, 6, 5, 1],
14801478
},
14811479
index=ix3,
14821480
)
@@ -1499,6 +1497,28 @@ Here is an example of one way to easily plot group means with standard deviation
14991497
15001498
plt.close("all")
15011499
1500+
Asymmetrical error bars are also supported; however, raw error values must be provided in this case. For an ``N``-length :class:`Series`, a ``2xN`` array should be provided indicating lower and upper (or left and right) errors. For an ``MxN`` :class:`DataFrame`, asymmetrical errors should be in an ``Mx2xN`` array.
1501+
1502+
Here is an example of one way to plot the min/max range using asymmetrical error bars.
1503+
1504+
.. ipython:: python
1505+
1506+
mins = gp3.min()
1507+
maxs = gp3.max()
1508+
1509+
# errors should be positive, and defined in the order of lower, upper
1510+
errors = [[means[c] - mins[c], maxs[c] - means[c]] for c in df3.columns]
1511+
1512+
# Plot
1513+
fig, ax = plt.subplots()
1514+
@savefig errorbar_asymmetrical_example.png
1515+
means.plot.bar(yerr=errors, ax=ax, capsize=4, rot=0);
1516+
1517+
.. ipython:: python
1518+
:suppress:
1519+
1520+
plt.close("all")
1521+
15021522
.. _visualization.table:
15031523

15041524
Plotting tables
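Note on the asymmetrical error bar example above: the `errors` list is expected to have the ``Mx2xN`` layout described in the added paragraph. The sketch below rebuilds it from the `df3` data in this diff and checks the shape. The definitions of `gp3` and `means` sit in a part of the file not shown in this hunk, so the `groupby(level=...)` call here is an assumption:

```python
import numpy as np
import pandas as pd

ix3 = pd.MultiIndex.from_arrays(
    [
        ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"],
        ["foo", "foo", "foo", "bar", "bar", "foo", "foo", "bar", "bar", "bar"],
    ],
    names=["letter", "word"],
)
df3 = pd.DataFrame(
    {
        "data1": [9, 3, 2, 4, 3, 2, 4, 6, 3, 2],
        "data2": [9, 6, 5, 7, 5, 4, 5, 6, 5, 1],
    },
    index=ix3,
)

# Assumption: gp3 groups by both index levels, as the variable names suggest.
gp3 = df3.groupby(level=["letter", "word"])
means, mins, maxs = gp3.mean(), gp3.min(), gp3.max()

# One [lower, upper] pair of length-N arrays per column: shape (M, 2, N).
errors = np.array(
    [
        [(means[c] - mins[c]).to_numpy(), (maxs[c] - means[c]).to_numpy()]
        for c in df3.columns
    ]
)
print(errors.shape)
```

Here M is the number of plotted columns (2) and N the number of groups (4). Both distances are measured from the mean, so every entry is non-negative, which matches the comment in the diff that errors should be positive.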
