
Commit 88b05e8

Merge remote-tracking branch 'upstream/master' into nullable_string_dtype
2 parents c095cd4 + 36502e9

File tree

97 files changed, +2009 -974 lines changed


codecov.yml (+1 -1)

@@ -8,7 +8,7 @@ coverage:
   status:
     project:
       default:
-        target: '82'
+        target: '72'
     patch:
       default:
         target: '50'

doc/cheatsheet/Pandas_Cheat_Sheet.pdf (binary)

9.56 KB -> 9.14 KB (binary file not shown)

doc/source/ecosystem.rst (+10 -1)

@@ -98,7 +98,8 @@ which can be used for a wide variety of time series data mining tasks.
 Visualization
 -------------
 
-While :ref:`pandas has built-in support for data visualization with matplotlib <visualization>`,
+`Pandas has its own Styler class for table visualization <user_guide/style.ipynb>`_, and while
+:ref:`pandas also has built-in support for data visualization through charts with matplotlib <visualization>`,
 there are a number of other pandas-compatible libraries.
 
 `Altair <https://altair-viz.github.io/>`__

@@ -368,6 +369,14 @@ far exceeding the performance of the native ``df.to_sql`` method. Internally, it
 Microsoft's BCP utility, but the complexity is fully abstracted away from the end user.
 Rigorously tested, it is a complete replacement for ``df.to_sql``.
 
+`Deltalake <https://pypi.org/project/deltalake>`__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Deltalake python package lets you access tables stored in
+`Delta Lake <https://delta.io/>`__ natively in Python without the need to use Spark or
+JVM. It provides the ``delta_table.to_pyarrow_table().to_pandas()`` method to convert
+any Delta table into Pandas dataframe.
+
 
 .. _ecosystem.out-of-core:

doc/source/user_guide/index.rst (+1 -1)

@@ -38,12 +38,12 @@ Further information on any specific method can be obtained in the
    integer_na
    boolean
    visualization
+   style
    computation
    groupby
    window
    timeseries
    timedeltas
-   style
    options
    enhancingperf
    scale

doc/source/user_guide/style.ipynb (+794 -404)

Large diffs are not rendered by default.

doc/source/user_guide/visualization.rst (+6 -3)

@@ -2,9 +2,12 @@
 
 {{ header }}
 
-*************
-Visualization
-*************
+*******************
+Chart Visualization
+*******************
+
+This section demonstrates visualization through charting. For information on
+visualization of tabular data please see the section on `Table Visualization <style.ipynb>`_.
 
 We use the standard convention for referencing the matplotlib API:

doc/source/whatsnew/v1.3.0.rst (+26 -0)

@@ -110,6 +110,30 @@ both XPath 1.0 and XSLT 1.0 is available. (:issue:`27554`)
 
 For more, see :ref:`io.xml` in the user guide on IO tools.
 
+.. _whatsnew_130.dataframe_honors_copy_with_dict:
+
+DataFrame constructor honors ``copy=False`` with dict
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When passing a dictionary to :class:`DataFrame` with ``copy=False``,
+a copy will no longer be made (:issue:`32960`)
+
+.. ipython:: python
+
+    arr = np.array([1, 2, 3])
+    df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)
+    df
+
+``df["A"]`` remains a view on ``arr``:
+
+.. ipython:: python
+
+    arr[0] = 0
+    assert df.iloc[0, 0] == 0
+
+The default behavior when not passing ``copy`` will remain unchanged, i.e.
+a copy will be made.
+
 .. _whatsnew_130.enhancements.other:
 
 Other enhancements
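The ``copy=False`` behavior added in this whatsnew entry can also be exercised as a plain script, outside the rendered ``.. ipython::`` blocks. This sketch assumes pandas 1.3 or later (with the classic, non-Copy-on-Write semantics), where a dict passed with ``copy=False`` is honored.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])

# copy=False: column "A" can share memory with arr (pandas >= 1.3),
# while "B" is built from an independent copy taken up front.
df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)

arr[0] = 0  # mutate the source ndarray

# On pandas >= 1.3 without Copy-on-Write, "A" reflects the mutation
# because it is a view on arr; "B" stays [1, 2, 3] regardless.
print(df["A"].tolist(), df["B"].tolist())
```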
@@ -546,6 +570,8 @@ Conversion
 - Bug in creating a :class:`DataFrame` from an empty ``np.recarray`` not retaining the original dtypes (:issue:`40121`)
 - Bug in :class:`DataFrame` failing to raise ``TypeError`` when constructing from a ``frozenset`` (:issue:`40163`)
 - Bug in :class:`Index` construction silently ignoring a passed ``dtype`` when the data cannot be cast to that dtype (:issue:`21311`)
+- Bug in :class:`DataFrame` construction with a dictionary containing an arraylike with ``ExtensionDtype`` and ``copy=True`` failing to make a copy (:issue:`38939`)
+-
 
 Strings
 ^^^^^^^

pandas/_libs/algos.pyx (+4 -58)

@@ -794,68 +794,14 @@ def backfill(ndarray[algos_t] old, ndarray[algos_t] new, limit=None) -> ndarray:
     return indexer
 
 
-@cython.boundscheck(False)
-@cython.wraparound(False)
 def backfill_inplace(algos_t[:] values, uint8_t[:] mask, limit=None):
-    cdef:
-        Py_ssize_t i, N
-        algos_t val
-        uint8_t prev_mask
-        int lim, fill_count = 0
-
-    N = len(values)
-
-    # GH#2778
-    if N == 0:
-        return
-
-    lim = validate_limit(N, limit)
-
-    val = values[N - 1]
-    prev_mask = mask[N - 1]
-    for i in range(N - 1, -1, -1):
-        if mask[i]:
-            if fill_count >= lim:
-                continue
-            fill_count += 1
-            values[i] = val
-            mask[i] = prev_mask
-        else:
-            fill_count = 0
-            val = values[i]
-            prev_mask = mask[i]
+    pad_inplace(values[::-1], mask[::-1], limit=limit)
 
 
-@cython.boundscheck(False)
-@cython.wraparound(False)
 def backfill_2d_inplace(algos_t[:, :] values,
                         const uint8_t[:, :] mask,
                         limit=None):
-    cdef:
-        Py_ssize_t i, j, N, K
-        algos_t val
-        int lim, fill_count = 0
-
-    K, N = (<object>values).shape
-
-    # GH#2778
-    if N == 0:
-        return
-
-    lim = validate_limit(N, limit)
-
-    for j in range(K):
-        fill_count = 0
-        val = values[j, N - 1]
-        for i in range(N - 1, -1, -1):
-            if mask[j, i]:
-                if fill_count >= lim:
-                    continue
-                fill_count += 1
-                values[j, i] = val
-            else:
-                fill_count = 0
-                val = values[j, i]
+    pad_2d_inplace(values[:, ::-1], mask[:, ::-1], limit)
 
 
 @cython.boundscheck(False)

@@ -987,10 +933,10 @@ def rank_1d(
     * max: highest rank in group
     * first: ranks assigned in order they appear in the array
     * dense: like 'min', but rank always increases by 1 between groups
-    ascending : boolean, default True
+    ascending : bool, default True
        False for ranks by high (1) to low (N)
     na_option : {'keep', 'top', 'bottom'}, default 'keep'
-    pct : boolean, default False
+    pct : bool, default False
        Compute percentage rank of data within each group
     na_option : {'keep', 'top', 'bottom'}, default 'keep'
     * keep: leave NA values where they are
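The refactor above replaces the hand-written backward loops with forward fill (pad) applied to reversed views: because a reversed memoryview shares memory with the original, padding it backfills the original in place. A minimal NumPy sketch of the same trick, with a pure-Python ``pad_inplace`` standing in for the Cython one:

```python
import numpy as np

def pad_inplace(values, mask, limit=None):
    """Forward-fill masked (missing) positions in place; `mask`
    marks missing entries, mirroring the Cython pad_inplace."""
    n = len(values)
    if n == 0:
        return
    lim = n if limit is None else limit
    fill_count = 0
    val, prev_mask = values[0], mask[0]
    for i in range(n):
        if mask[i]:
            if fill_count >= lim:
                continue
            fill_count += 1
            values[i] = val
            mask[i] = prev_mask
        else:
            fill_count = 0
            val, prev_mask = values[i], mask[i]

def backfill_inplace(values, mask, limit=None):
    # Backfill is forward fill run over reversed views; values[::-1]
    # is a NumPy view, so the writes land in the original arrays.
    pad_inplace(values[::-1], mask[::-1], limit=limit)

vals = np.array([1.0, 0.0, 0.0, 4.0])
mask = np.array([False, True, True, False])
backfill_inplace(vals, mask)  # the two masked slots take the value 4.0
```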

pandas/_libs/groupby.pyx (+25 -43)

@@ -402,9 +402,9 @@ def group_any_all(uint8_t[::1] out,
         ordering matching up to the corresponding record in `values`
     values : array containing the truth value of each element
     mask : array indicating whether a value is na or not
-    val_test : str {'any', 'all'}
+    val_test : {'any', 'all'}
         String object dictating whether to use any or all truth testing
-    skipna : boolean
+    skipna : bool
         Flag to ignore nan values during truth testing
 
     Notes

@@ -455,11 +455,11 @@ ctypedef fused complexfloating_t:
 
 @cython.wraparound(False)
 @cython.boundscheck(False)
-def _group_add(complexfloating_t[:, ::1] out,
-               int64_t[::1] counts,
-               ndarray[complexfloating_t, ndim=2] values,
-               const intp_t[:] labels,
-               Py_ssize_t min_count=0):
+def group_add(complexfloating_t[:, ::1] out,
+              int64_t[::1] counts,
+              ndarray[complexfloating_t, ndim=2] values,
+              const intp_t[:] labels,
+              Py_ssize_t min_count=0):
     """
     Only aggregates on axis=0 using Kahan summation
     """

@@ -506,19 +506,13 @@ def _group_add(complexfloating_t[:, ::1] out,
                 out[i, j] = sumx[i, j]
 
 
-group_add_float32 = _group_add['float32_t']
-group_add_float64 = _group_add['float64_t']
-group_add_complex64 = _group_add['float complex']
-group_add_complex128 = _group_add['double complex']
-
-
 @cython.wraparound(False)
 @cython.boundscheck(False)
-def _group_prod(floating[:, ::1] out,
-                int64_t[::1] counts,
-                ndarray[floating, ndim=2] values,
-                const intp_t[:] labels,
-                Py_ssize_t min_count=0):
+def group_prod(floating[:, ::1] out,
+               int64_t[::1] counts,
+               ndarray[floating, ndim=2] values,
+               const intp_t[:] labels,
+               Py_ssize_t min_count=0):
     """
     Only aggregates on axis=0
     """

@@ -560,19 +554,15 @@ def _group_prod(floating[:, ::1] out,
                 out[i, j] = prodx[i, j]
 
 
-group_prod_float32 = _group_prod['float']
-group_prod_float64 = _group_prod['double']
-
-
 @cython.wraparound(False)
 @cython.boundscheck(False)
 @cython.cdivision(True)
-def _group_var(floating[:, ::1] out,
-               int64_t[::1] counts,
-               ndarray[floating, ndim=2] values,
-               const intp_t[:] labels,
-               Py_ssize_t min_count=-1,
-               int64_t ddof=1):
+def group_var(floating[:, ::1] out,
+              int64_t[::1] counts,
+              ndarray[floating, ndim=2] values,
+              const intp_t[:] labels,
+              Py_ssize_t min_count=-1,
+              int64_t ddof=1):
     cdef:
         Py_ssize_t i, j, N, K, lab, ncounts = len(counts)
         floating val, ct, oldmean

@@ -619,17 +609,13 @@ def _group_var(floating[:, ::1] out,
                 out[i, j] /= (ct - ddof)
 
 
-group_var_float32 = _group_var['float']
-group_var_float64 = _group_var['double']
-
-
 @cython.wraparound(False)
 @cython.boundscheck(False)
-def _group_mean(floating[:, ::1] out,
-                int64_t[::1] counts,
-                ndarray[floating, ndim=2] values,
-                const intp_t[::1] labels,
-                Py_ssize_t min_count=-1):
+def group_mean(floating[:, ::1] out,
+               int64_t[::1] counts,
+               ndarray[floating, ndim=2] values,
+               const intp_t[::1] labels,
+               Py_ssize_t min_count=-1):
     cdef:
         Py_ssize_t i, j, N, K, lab, ncounts = len(counts)
         floating val, count, y, t

@@ -675,10 +661,6 @@ def _group_mean(floating[:, ::1] out,
                 out[i, j] = sumx[i, j] / count
 
 
-group_mean_float32 = _group_mean['float']
-group_mean_float64 = _group_mean['double']
-
-
 @cython.wraparound(False)
 @cython.boundscheck(False)
 def group_ohlc(floating[:, ::1] out,

@@ -1083,10 +1065,10 @@ def group_rank(float64_t[:, ::1] out,
     * max: highest rank in group
     * first: ranks assigned in order they appear in the array
     * dense: like 'min', but rank always increases by 1 between groups
-    ascending : boolean, default True
+    ascending : bool, default True
        False for ranks by high (1) to low (N)
     na_option : {'keep', 'top', 'bottom'}, default 'keep'
-    pct : boolean, default False
+    pct : bool, default False
        Compute percentage rank of data within each group
     na_option : {'keep', 'top', 'bottom'}, default 'keep'
     * keep: leave NA values where they are
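The renamed ``group_add`` is documented as aggregating along axis 0 with Kahan summation. A plain-Python sketch of per-group Kahan-compensated sums (the names and 1-D shape here are illustrative, not the Cython signatures):

```python
import numpy as np

def group_add(values, labels, ngroups):
    """Per-group sums with Kahan compensation: each addition
    tracks the low-order bits lost to rounding and feeds them
    back into the next addition for that group."""
    out = np.zeros(ngroups)
    comp = np.zeros(ngroups)  # running compensation per group
    for val, lab in zip(values, labels):
        if lab < 0:
            continue  # negative label = NA group, skipped
        y = val - comp[lab]          # re-inject the lost low bits
        t = out[lab] + y             # one (possibly rounding) add
        comp[lab] = (t - out[lab]) - y  # recover what rounding dropped
        out[lab] = t
    return out

sums = group_add(np.array([1.5, 2.25, 0.5, 3.0]),
                 np.array([0, 1, 0, 1]), ngroups=2)
```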

pandas/_libs/hashtable_class_helper.pxi.in (+9 -9)

@@ -523,15 +523,15 @@ cdef class {{name}}HashTable(HashTable):
         any value "val" satisfying val != val is considered missing.
         If na_value is not None, then _additionally_, any value "val"
         satisfying val == na_value is considered missing.
-    ignore_na : boolean, default False
+    ignore_na : bool, default False
         Whether NA-values should be ignored for calculating the uniques. If
         True, the labels corresponding to missing values will be set to
         na_sentinel.
     mask : ndarray[bool], optional
         If not None, the mask is used as indicator for missing values
         (True = missing, False = valid) instead of `na_value` or
         condition "val != val".
-    return_inverse : boolean, default False
+    return_inverse : bool, default False
         Whether the mapping of the original array values to their location
         in the vector of uniques should be returned.

@@ -625,7 +625,7 @@ cdef class {{name}}HashTable(HashTable):
     ----------
     values : ndarray[{{dtype}}]
         Array of values of which unique will be calculated
-    return_inverse : boolean, default False
+    return_inverse : bool, default False
         Whether the mapping of the original array values to their location
         in the vector of uniques should be returned.

@@ -906,11 +906,11 @@ cdef class StringHashTable(HashTable):
         that is not a string is considered missing. If na_value is
         not None, then _additionally_ any value "val" satisfying
         val == na_value is considered missing.
-    ignore_na : boolean, default False
+    ignore_na : bool, default False
         Whether NA-values should be ignored for calculating the uniques. If
         True, the labels corresponding to missing values will be set to
         na_sentinel.
-    return_inverse : boolean, default False
+    return_inverse : bool, default False
         Whether the mapping of the original array values to their location
         in the vector of uniques should be returned.

@@ -998,7 +998,7 @@ cdef class StringHashTable(HashTable):
     ----------
     values : ndarray[object]
         Array of values of which unique will be calculated
-    return_inverse : boolean, default False
+    return_inverse : bool, default False
        Whether the mapping of the original array values to their location
        in the vector of uniques should be returned.

@@ -1181,11 +1181,11 @@ cdef class PyObjectHashTable(HashTable):
         any value "val" satisfying val != val is considered missing.
         If na_value is not None, then _additionally_, any value "val"
         satisfying val == na_value is considered missing.
-    ignore_na : boolean, default False
+    ignore_na : bool, default False
         Whether NA-values should be ignored for calculating the uniques. If
         True, the labels corresponding to missing values will be set to
         na_sentinel.
-    return_inverse : boolean, default False
+    return_inverse : bool, default False
         Whether the mapping of the original array values to their location
         in the vector of uniques should be returned.

@@ -1251,7 +1251,7 @@ cdef class PyObjectHashTable(HashTable):
     ----------
     values : ndarray[object]
         Array of values of which unique will be calculated
-    return_inverse : boolean, default False
+    return_inverse : bool, default False
         Whether the mapping of the original array values to their location
         in the vector of uniques should be returned.
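The docstrings touched in this file all describe the same hash-table factorize contract: a vector of uniques plus labels mapping each value back into that vector, with missing values (``val != val``) mapped to ``na_sentinel`` when ``ignore_na`` is set. A dict-based Python sketch of that contract (illustrative only, not the khash implementation):

```python
import math

def factorize(values, na_sentinel=-1, ignore_na=True):
    """Return (uniques, labels): labels[i] is the position of
    values[i] in uniques, or na_sentinel for missing values."""
    table = {}     # value -> position in uniques
    uniques = []
    labels = []
    for val in values:
        # "val != val" missing-value rule: only NaN fails self-equality
        if ignore_na and isinstance(val, float) and math.isnan(val):
            labels.append(na_sentinel)
            continue
        if val not in table:
            table[val] = len(uniques)
            uniques.append(val)
        labels.append(table[val])
    return uniques, labels

uniques, labels = factorize(["a", float("nan"), "b", "a"])
```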

0 commit comments
