
Commit a831aa4

merging master

2 parents: ec67841 + 4ec6925

109 files changed: +2706 −1660 lines changed

asv_bench/benchmarks/strings.py (+11 −6)

@@ -230,16 +230,21 @@ def time_contains(self, dtype, regex):
 
 class Split:
 
-    params = [True, False]
-    param_names = ["expand"]
+    params = (["str", "string", "arrow_string"], [True, False])
+    param_names = ["dtype", "expand"]
+
+    def setup(self, dtype, expand):
+        from pandas.core.arrays.string_arrow import ArrowStringDtype  # noqa: F401
 
-    def setup(self, expand):
-        self.s = Series(tm.makeStringIndex(10 ** 5)).str.join("--")
+        try:
+            self.s = Series(tm.makeStringIndex(10 ** 5), dtype=dtype).str.join("--")
+        except ImportError:
+            raise NotImplementedError
 
-    def time_split(self, expand):
+    def time_split(self, dtype, expand):
         self.s.str.split("--", expand=expand)
 
-    def time_rsplit(self, expand):
+    def time_rsplit(self, dtype, expand):
         self.s.str.rsplit("--", expand=expand)
 
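
For orientation, the new ``dtype`` axis makes the split benchmarks exercise object, nullable-string, and Arrow-backed string data. A minimal sketch of what one parameter combination does (plain pandas; the tiny Series stands in for ``tm.makeStringIndex(10 ** 5)``):

    import pandas as pd

    # .str.join("--") interleaves "--" between the characters of each element,
    # which the split benchmarks then undo.
    s = pd.Series(["abcde", "fghij"], dtype="string").str.join("--")
    s.str.split("--", expand=True)    # DataFrame with one column per piece
    s.str.rsplit("--", expand=False)  # Series of lists, split from the right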

doc/source/ecosystem.rst (+29)

@@ -405,6 +405,35 @@ Blaze provides a standard API for doing computations with various
 in-memory and on-disk backends: NumPy, pandas, SQLAlchemy, MongoDB, PyTables,
 PySpark.
 
+`Cylon <https://cylondata.org/>`__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Cylon is a fast, scalable, distributed-memory parallel runtime with a
+pandas-like Python DataFrame API. "Core Cylon" is implemented in C++ and uses
+the Apache Arrow format to represent data in memory. The Cylon DataFrame API
+implements most of the core operators of pandas, such as merge, filter, join,
+concat, group-by, and drop_duplicates. These operators are designed to work
+across thousands of cores to scale applications. Cylon can interoperate with
+pandas DataFrames by reading data from pandas or converting data to pandas, so
+users can selectively scale parts of their pandas DataFrame applications.
+
+.. code:: python
+
+    from pycylon import read_csv, DataFrame, CylonEnv
+    from pycylon.net import MPIConfig
+
+    # Initialize the Cylon distributed environment
+    config: MPIConfig = MPIConfig()
+    env: CylonEnv = CylonEnv(config=config, distributed=True)
+
+    df1: DataFrame = read_csv('/tmp/csv1.csv')
+    df2: DataFrame = read_csv('/tmp/csv2.csv')
+
+    # Use thousands of cores across the cluster to compute the join
+    df3: DataFrame = df1.join(other=df2, on=[0], algorithm="hash", env=env)
+
+    print(df3)
+
 `Dask <https://dask.readthedocs.io/en/latest/>`__
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

doc/source/user_guide/basics.rst (−2)

@@ -1184,11 +1184,9 @@ a single value and returning a single value. For example:
 
    df4
 
-
    def f(x):
        return len(str(x))
 
-
    df4["one"].map(f)
    df4.applymap(f)

doc/source/user_guide/cookbook.rst (−13)

@@ -494,15 +494,12 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
 
    S = pd.Series([i / 100.0 for i in range(1, 11)])
 
-
    def cum_ret(x, y):
        return x * (1 + y)
 
-
    def red(x):
        return functools.reduce(cum_ret, x, 1.0)
 
-
    S.expanding().apply(red, raw=True)
 
@@ -514,12 +511,10 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
    df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, -1, 1, 2]})
    gb = df.groupby("A")
 
-
    def replace(g):
        mask = g < 0
        return g.where(mask, g[~mask].mean())
 
-
    gb.transform(replace)
 
 `Sort groups by aggregated data
@@ -551,13 +546,11 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
    rng = pd.date_range(start="2014-10-07", periods=10, freq="2min")
    ts = pd.Series(data=list(range(10)), index=rng)
 
-
    def MyCust(x):
        if len(x) > 2:
            return x[1] * 1.234
        return pd.NaT
 
-
    mhc = {"Mean": np.mean, "Max": np.max, "Custom": MyCust}
    ts.resample("5min").apply(mhc)
    ts
@@ -803,11 +796,9 @@ Apply
        index=["I", "II", "III"],
    )
 
-
    def SeriesFromSubList(aList):
        return pd.Series(aList)
 
-
    df_orgz = pd.concat(
        {ind: row.apply(SeriesFromSubList) for ind, row in df.iterrows()}
    )
@@ -827,12 +818,10 @@ Rolling Apply to multiple columns where function calculates a Series before a Sc
    )
    df
 
-
    def gm(df, const):
        v = ((((df["A"] + df["B"]) + 1).cumprod()) - 1) * const
        return v.iloc[-1]
 
-
    s = pd.Series(
        {
            df.index[i]: gm(df.iloc[i: min(i + 51, len(df) - 1)], 5)
@@ -859,11 +848,9 @@ Rolling Apply to multiple columns where function returns a Scalar (Volume Weight
    )
    df
 
-
    def vwap(bars):
        return (bars.Close * bars.Volume).sum() / bars.Volume.sum()
 
-
    window = 5
    s = pd.concat(
        [

doc/source/user_guide/groupby.rst (−2)

@@ -1617,12 +1617,10 @@ column index name will be used as the name of the inserted column:
        }
    )
 
-
    def compute_metrics(x):
        result = {"b_sum": x["b"].sum(), "c_mean": x["c"].mean()}
        return pd.Series(result, name="metrics")
 
-
    result = df.groupby("a").apply(compute_metrics)
 
    result

doc/source/user_guide/io.rst (−2)

@@ -4648,11 +4648,9 @@ chunks.
 
    store.append("dfeq", dfeq, data_columns=["number"])
 
-
    def chunks(l, n):
        return [l[i: i + n] for i in range(0, len(l), n)]
 
-
    evens = [2, 4, 6, 8, 10]
    coordinates = store.select_as_coordinates("dfeq", "number=evens")
    for c in chunks(coordinates, 2):
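
The ``chunks`` helper in this hunk is plain list slicing, independent of HDFStore; a quick standalone check (sketch):

    def chunks(l, n):
        # Slice l into consecutive pieces of length n (the last may be shorter).
        return [l[i: i + n] for i in range(0, len(l), n)]

    chunks([2, 4, 6, 8, 10], 2)  # [[2, 4], [6, 8], [10]]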

doc/source/user_guide/merging.rst (+1)

@@ -1578,4 +1578,5 @@ to ``True``.
 You may also keep all the original values even if they are equal.
 
 .. ipython:: python
+
    df.compare(df2, keep_shape=True, keep_equal=True)
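
The missing blank line added above is what lets the directive render its body. For reference, a self-contained sketch of the ``compare`` call with invented frames:

    import pandas as pd

    df = pd.DataFrame({"col1": ["a", "b"], "col2": [1.0, 2.0]})
    df2 = df.copy()
    df2.loc[0, "col1"] = "c"

    # keep_shape=True retains all rows/columns; keep_equal=True shows equal
    # values instead of masking them with NaN.
    df.compare(df2, keep_shape=True, keep_equal=True)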

doc/source/user_guide/reshaping.rst (−2)

@@ -18,7 +18,6 @@ Reshaping by pivoting DataFrame objects
 
    import pandas._testing as tm
 
-
    def unpivot(frame):
        N, K = frame.shape
        data = {
@@ -29,7 +28,6 @@ Reshaping by pivoting DataFrame objects
        columns = ["date", "variable", "value"]
        return pd.DataFrame(data, columns=columns)
 
-
    df = unpivot(tm.makeTimeDataFrame(3))
 
 Data is often stored in so-called "stacked" or "record" format:

doc/source/user_guide/scale.rst (+1)

@@ -345,6 +345,7 @@ we need to supply the divisions manually.
 Now we can do things like fast random access with ``.loc``.
 
 .. ipython:: python
+   :okwarning:
 
    ddf.loc["2002-01-01 12:01":"2002-01-01 12:05"].compute()
doc/source/user_guide/sparse.rst (−1)

@@ -325,7 +325,6 @@ In the example below, we transform the ``Series`` to a sparse representation of
        row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
    )
 
-
    A
    A.todense()
    rows

doc/source/user_guide/text.rst (−5)

@@ -297,24 +297,19 @@ positional argument (a regex object) and return a string.
    # Reverse every lowercase alphabetic word
    pat = r"[a-z]+"
 
-
    def repl(m):
        return m.group(0)[::-1]
 
-
    pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
        pat, repl, regex=True
    )
 
-
    # Using regex groups
    pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
 
-
    def repl(m):
        return m.group("two").swapcase()
 
-
    pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
        pat, repl, regex=True
    )

doc/source/user_guide/timeseries.rst (−6)

@@ -1422,7 +1422,6 @@ An example of how holidays and holiday calendars are defined:
        MO,
    )
 
-
    class ExampleCalendar(AbstractHolidayCalendar):
        rules = [
            USMemorialDay,
@@ -1435,7 +1434,6 @@ An example of how holidays and holiday calendars are defined:
            ),
        ]
 
-
    cal = ExampleCalendar()
    cal.holidays(datetime.datetime(2012, 1, 1), datetime.datetime(2012, 12, 31))
 
@@ -1707,13 +1705,11 @@ We can instead only resample those groups where we have points as follows:
    from functools import partial
    from pandas.tseries.frequencies import to_offset
 
-
    def round(t, freq):
        # round a Timestamp to a specified freq
        freq = to_offset(freq)
        return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)
 
-
    ts.groupby(partial(round, freq="3T")).sum()
 
 .. _timeseries.aggregate:
@@ -2255,11 +2251,9 @@ To convert from an ``int64`` based YYYYMMDD representation.
    s = pd.Series([20121231, 20141130, 99991231])
    s
 
-
    def conv(x):
        return pd.Period(year=x // 10000, month=x // 100 % 100, day=x % 100, freq="D")
 
-
    s.apply(conv)
    s.apply(conv)[2]
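
As a worked check of the ``conv`` helper in the last hunk, the integer arithmetic peels the date apart digit by digit (a sketch, not part of the commit):

    import pandas as pd

    x = 20121231
    x // 10000      # 2012 (year)
    x // 100 % 100  # 12   (month)
    x % 100         # 31   (day)
    pd.Period(year=x // 10000, month=x // 100 % 100, day=x % 100, freq="D")
    # Period('2012-12-31', 'D')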

doc/source/user_guide/window.rst (+1 −4)

@@ -212,7 +212,6 @@ from present information back to past information. This allows the rolling windo
 
    df
 
-
 .. _window.custom_rolling_window:
 
 Custom window rolling
@@ -294,13 +293,12 @@ conditions. In these cases it can be useful to perform forward-looking rolling w
 This :func:`BaseIndexer <pandas.api.indexers.BaseIndexer>` subclass implements a closed fixed-width
 forward-looking rolling window, and we can use it as follows:
 
-.. ipython:: ipython
+.. ipython:: python
 
    from pandas.api.indexers import FixedForwardWindowIndexer
    indexer = FixedForwardWindowIndexer(window_size=2)
    df.rolling(indexer, min_periods=1).sum()
 
-
 .. _window.rolling_apply:
 
 Rolling apply
@@ -319,7 +317,6 @@ the windows are cast as :class:`Series` objects (``raw=False``) or ndarray objec
    s = pd.Series(range(10))
    s.rolling(window=4).apply(mad, raw=True)
 
-
 .. _window.numba_engine:
 
 Numba engine
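
Besides removing blank lines, the middle hunk fixes the directive name (``.. ipython:: ipython`` → ``.. ipython:: python``) so the forward-looking example actually executes. A runnable sketch of that example, with an assumed frame:

    import pandas as pd
    from pandas.api.indexers import FixedForwardWindowIndexer

    df = pd.DataFrame({"B": [0, 1, 2, None, 4]})
    indexer = FixedForwardWindowIndexer(window_size=2)
    # Each window spans the current row and the next one (forward-looking).
    df.rolling(indexer, min_periods=1).sum()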

doc/source/whatsnew/v1.3.0.rst (+10 −1)

@@ -226,6 +226,7 @@ Other enhancements
 - :meth:`.GroupBy.any` and :meth:`.GroupBy.all` return a ``BooleanDtype`` for columns with nullable data types (:issue:`33449`)
 - Constructing a :class:`DataFrame` or :class:`Series` with the ``data`` argument being a Python iterable that is *not* a NumPy ``ndarray`` consisting of NumPy scalars will now result in a dtype with a precision the maximum of the NumPy scalars; this was already the case when ``data`` is a NumPy ``ndarray`` (:issue:`40908`)
 - Add keyword ``sort`` to :func:`pivot_table` to allow non-sorting of the result (:issue:`39143`)
+- Add keyword ``dropna`` to :meth:`DataFrame.value_counts` to allow counting rows that include ``NA`` values (:issue:`41325`)
 -
 
 .. ---------------------------------------------------------------------------
@@ -643,6 +644,7 @@ Deprecations
 - Deprecated the ``level`` keyword for :class:`DataFrame` and :class:`Series` aggregations; use groupby instead (:issue:`39983`)
 - The ``inplace`` parameter of :meth:`Categorical.remove_categories`, :meth:`Categorical.add_categories`, :meth:`Categorical.reorder_categories`, :meth:`Categorical.rename_categories`, :meth:`Categorical.set_categories` is deprecated and will be removed in a future version (:issue:`37643`)
 - Deprecated :func:`merge` producing duplicated columns through the ``suffixes`` keyword and already existing columns (:issue:`22818`)
+- Deprecated setting :attr:`Categorical._codes`; create a new :class:`Categorical` with the desired codes instead (:issue:`40606`)
 
 .. ---------------------------------------------------------------------------
@@ -747,7 +749,7 @@ Strings
 ^^^^^^^
 
 - Bug in the conversion from ``pyarrow.ChunkedArray`` to :class:`~arrays.StringArray` when the original had zero chunks (:issue:`41040`)
--
+- Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` ignoring replacements with ``regex=True`` for ``StringDtype`` data (:issue:`41333`, :issue:`35977`)
 
 Interval
 ^^^^^^^^
@@ -787,9 +789,11 @@ Indexing
 - Bug in setting ``numpy.timedelta64`` values into an object-dtype :class:`Series` using a boolean indexer (:issue:`39488`)
 - Bug in setting numeric values into a boolean-dtype :class:`Series` using ``at`` or ``iat`` failing to cast to object-dtype (:issue:`39582`)
 - Bug in :meth:`DataFrame.__setitem__` and :meth:`DataFrame.iloc.__setitem__` raising ``ValueError`` when trying to index with a row-slice and setting a list as values (:issue:`40440`)
+- Bug in :meth:`DataFrame.loc` not raising ``KeyError`` when a key was not found in a :class:`MultiIndex` whose levels contain more values than are used (:issue:`41170`)
 - Bug in :meth:`DataFrame.loc.__setitem__` when setting-with-expansion incorrectly raising when the index in the expanding axis contains duplicates (:issue:`40096`)
 - Bug in :meth:`DataFrame.loc` incorrectly matching non-boolean index elements (:issue:`20432`)
 - Bug in :meth:`Series.__delitem__` with ``ExtensionDtype`` incorrectly casting to ``ndarray`` (:issue:`40386`)
+- Bug in :meth:`DataFrame.__setitem__` raising ``TypeError`` when using a str subclass as the column name with a :class:`DatetimeIndex` (:issue:`37366`)
 - Bug in :meth:`Index.get_indexer_non_unique` when the index contains multiple ``np.nan`` (:issue:`35392`)
 
 Missing
@@ -807,6 +811,7 @@ MultiIndex
 - Bug in :meth:`MultiIndex.intersection` duplicating ``NaN`` in the result (:issue:`38623`)
 - Bug in :meth:`MultiIndex.equals` incorrectly returning ``True`` when a :class:`MultiIndex` contains ``NaN``, even when they are differently ordered (:issue:`38439`)
 - Bug in :meth:`MultiIndex.intersection` always returning an empty result when intersecting with :class:`CategoricalIndex` (:issue:`38653`)
+- Bug in :meth:`MultiIndex.reindex` raising ``ValueError`` with an empty :class:`MultiIndex` when indexing only a specific level (:issue:`41170`)
 
 I/O
 ^^^
@@ -836,6 +841,7 @@ I/O
 - Bug in :func:`read_excel` raising ``AttributeError`` with ``MultiIndex`` header followed by two empty rows and no index, and bug affecting :func:`read_excel`, :func:`read_csv`, :func:`read_table`, :func:`read_fwf`, and :func:`read_clipboard` where one blank row after a ``MultiIndex`` header with no index would be dropped (:issue:`40442`)
 - Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
 - Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
+- Bug in :func:`read_csv` and :func:`read_excel` not respecting the dtype for a duplicated column name when ``mangle_dupe_cols`` is set to ``True`` (:issue:`35211`)
 - Bug in :func:`read_csv` and :func:`read_table` misinterpreting arguments when ``sys.setprofile`` had been previously called (:issue:`41069`)
 - Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of the dtype size (:issue:`40896`)
@@ -852,6 +858,7 @@ Plotting
 - Prevent warnings when matplotlib's ``constrained_layout`` is enabled (:issue:`25261`)
 - Bug in :func:`DataFrame.plot` was showing the wrong colors in the legend if the function was called repeatedly and some calls used ``yerr`` while others didn't (partial fix of :issue:`39522`)
 - Bug in :func:`DataFrame.plot` was showing the wrong colors in the legend if the function was called repeatedly and some calls used ``secondary_y`` and others used ``legend=False`` (:issue:`40044`)
+- Bug in :meth:`DataFrame.plot.box` where caps and min/max markers were not visible when the ``dark_background`` theme was selected (:issue:`40769`)
 
 
 Groupby/resample/rolling
@@ -893,6 +900,8 @@ Groupby/resample/rolling
 - Bug in :meth:`SeriesGroupBy.agg` failing to retain an ordered :class:`CategoricalDtype` on order-preserving aggregations (:issue:`41147`)
 - Bug in :meth:`DataFrameGroupBy.min` and :meth:`DataFrameGroupBy.max` with multiple object-dtype columns and ``numeric_only=False`` incorrectly raising ``ValueError`` (:issue:`41111`)
 - Bug in :meth:`DataFrameGroupBy.rank` with the GroupBy object's ``axis=0`` and the ``rank`` method's keyword ``axis=1`` (:issue:`41320`)
+- Bug in :meth:`DataFrameGroupBy.__getitem__` with non-unique columns incorrectly returning a malformed :class:`SeriesGroupBy` instead of :class:`DataFrameGroupBy` (:issue:`41427`)
+- Bug in :meth:`DataFrameGroupBy.transform` with non-unique columns incorrectly raising ``AttributeError`` (:issue:`41427`)
 
 Reshaping
 ^^^^^^^^^
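To make the first enhancement bullet in the whatsnew diff concrete, a sketch of the new ``dropna`` keyword on ``DataFrame.value_counts`` (data invented for illustration):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1, 1, np.nan], "b": ["x", "x", "y"]})
    df.value_counts()              # the row containing NaN is dropped by default
    df.value_counts(dropna=False)  # also counts the (NaN, "y") row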