diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet.pdf b/doc/cheatsheet/Pandas_Cheat_Sheet.pdf index fb71f869ba22f..3582e0c0dabf9 100644 Binary files a/doc/cheatsheet/Pandas_Cheat_Sheet.pdf and b/doc/cheatsheet/Pandas_Cheat_Sheet.pdf differ diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet.pptx b/doc/cheatsheet/Pandas_Cheat_Sheet.pptx index fd3d699d09f7b..746f508516964 100644 Binary files a/doc/cheatsheet/Pandas_Cheat_Sheet.pptx and b/doc/cheatsheet/Pandas_Cheat_Sheet.pptx differ diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst index e72a9d86daeaf..3b1a3c5e380d3 100644 --- a/doc/source/ecosystem.rst +++ b/doc/source/ecosystem.rst @@ -98,7 +98,8 @@ which can be used for a wide variety of time series data mining tasks. Visualization ------------- -While :ref:`pandas has built-in support for data visualization with matplotlib `, +`Pandas has its own Styler class for table visualization `_, and while +:ref:`pandas also has built-in support for data visualization through charts with matplotlib `, there are a number of other pandas-compatible libraries. 
`Altair `__ diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst index 901f42097b911..6b6e212cde635 100644 --- a/doc/source/user_guide/index.rst +++ b/doc/source/user_guide/index.rst @@ -38,12 +38,12 @@ Further information on any specific method can be obtained in the integer_na boolean visualization + style computation groupby window timeseries timedeltas - style options enhancingperf scale diff --git a/doc/source/user_guide/style.ipynb b/doc/source/user_guide/style.ipynb index a67bac0c65462..b8119477407c0 100644 --- a/doc/source/user_guide/style.ipynb +++ b/doc/source/user_guide/style.ipynb @@ -4,30 +4,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Styling\n", + "# Table Visualization\n", "\n", - "This document is written as a Jupyter Notebook, and can be viewed or downloaded [here](https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/user_guide/style.ipynb).\n", + "This section demonstrates visualization of tabular data using the [Styler][styler]\n", + "class. For information on visualization with charting please see [Chart Visualization][viz]. 
This document is written as a Jupyter Notebook, and can be viewed or downloaded [here][download].\n", "\n", - "You can apply **conditional formatting**, the visual styling of a DataFrame\n", - "depending on the data within, by using the ``DataFrame.style`` property.\n", - "This is a property that returns a ``Styler`` object, which has\n", - "useful methods for formatting and displaying DataFrames.\n", - "\n", - "The styling is accomplished using CSS.\n", - "You write \"style functions\" that take scalars, `DataFrame`s or `Series`, and return *like-indexed* DataFrames or Series with CSS `\"attribute: value\"` pairs for the values.\n", - "These functions can be incrementally passed to the `Styler` which collects the styles before rendering.\n", - "\n", - "CSS is a flexible language and as such there may be multiple ways of achieving the same result, with potential\n", - "advantages or disadvantages, which we try to illustrate." + "[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n", + "[viz]: visualization.rst\n", + "[download]: https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/user_guide/style.ipynb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Styler Object\n", + "## Styler Object and HTML \n", + "\n", + "Styling should be performed after the data in a DataFrame has been processed. The [Styler][styler] creates an HTML `<table>` and leverages the CSS styling language to manipulate many parameters including colors, fonts, borders, background, etc. See [here][w3schools] for more information on styling HTML tables. This allows a lot of flexibility out of the box, and even enables web developers to integrate DataFrames into their existing user interface designs.\n", + " \n", + "The `DataFrame.style` attribute is a property that returns a [Styler][styler] object. 
It has a `_repr_html_` method defined on it so it is rendered automatically in Jupyter Notebook.\n", "\n", - "The `DataFrame.style` attribute is a property that returns a `Styler` object. `Styler` has a `_repr_html_` method defined on it so they are rendered automatically. If you want the actual HTML back for further processing or for writing to file call the `.render()` method which returns a string." + "[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n", + "[w3schools]: https://www.w3schools.com/html/html_tables.asp" ] }, { @@ -52,12 +50,9 @@ "import pandas as pd\n", "import numpy as np\n", "\n", - "np.random.seed(24)\n", - "df = pd.DataFrame({'A': np.linspace(1, 10, 10)})\n", - "df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],\n", - " axis=1)\n", - "df.iloc[3, 3] = np.nan\n", - "df.iloc[0, 2] = np.nan\n", + "df = pd.DataFrame([[38.0, 2.0, 18.0, 22.0, 21, np.nan],[19, 439, 6, 452, 226,232]], \n", + " index=pd.Index(['Tumour (Positive)', 'Non-Tumour (Negative)'], name='Actual Label:'), \n", + " columns=pd.MultiIndex.from_product([['Decision Tree', 'Regression', 'Random'],['Tumour', 'Non-Tumour']], names=['Model:', 'Predicted:']))\n", "df.style" ] }, @@ -65,49 +60,105 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The above output looks very similar to the standard DataFrame HTML representation. But we've done some work behind the scenes to attach CSS classes to each cell. We can view these by calling the `.render` method." + "The above output looks very similar to the standard DataFrame HTML representation. But the HTML here has already attached some CSS classes to each cell, even if we haven't yet created any styles. We can view these by calling the [.render()][render] method, which returns the raw HTML as a string, useful for further processing or adding to a file - read on in [More about CSS and HTML](#More-About-CSS-and-HTML). 
Below we will show how we can use these to format the DataFrame to be more communicative. For example how we can build `s`:\n", + "\n", + "[render]: ../reference/api/pandas.io.formats.style.Styler.render.rst" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "nbsphinx": "hidden" + }, "outputs": [], "source": [ - "df.style.render().split('\\n')[:10]" + "# Hidden cell to just create the below example: code is covered throughout the guide.\n", + "s = df.style\\\n", + " .hide_columns([('Random', 'Tumour'), ('Random', 'Non-Tumour')])\\\n", + " .format('{:.0f}')\\\n", + " .set_table_styles([{\n", + " 'selector': '',\n", + " 'props': 'border-collapse: separate;'\n", + " },{\n", + " 'selector': 'caption',\n", + " 'props': 'caption-side: bottom; font-size:1.3em;'\n", + " },{\n", + " 'selector': '.index_name',\n", + " 'props': 'font-style: italic; color: darkgrey; font-weight:normal;'\n", + " },{\n", + " 'selector': 'th:not(.index_name)',\n", + " 'props': 'background-color: #000066; color: white;'\n", + " },{\n", + " 'selector': 'th.col_heading',\n", + " 'props': 'text-align: center;'\n", + " },{\n", + " 'selector': 'th.col_heading.level0',\n", + " 'props': 'font-size: 1.5em;'\n", + " },{\n", + " 'selector': 'th.col2',\n", + " 'props': 'border-left: 1px solid white;'\n", + " },{\n", + " 'selector': '.col2',\n", + " 'props': 'border-left: 1px solid #000066;'\n", + " },{\n", + " 'selector': 'td',\n", + " 'props': 'text-align: center; font-weight:bold;'\n", + " },{\n", + " 'selector': '.true',\n", + " 'props': 'background-color: #e6ffe6;'\n", + " },{\n", + " 'selector': '.false',\n", + " 'props': 'background-color: #ffe6e6;'\n", + " },{\n", + " 'selector': '.border-red',\n", + " 'props': 'border: 2px dashed red;'\n", + " },{\n", + " 'selector': '.border-green',\n", + " 'props': 'border: 2px dashed green;'\n", + " },{\n", + " 'selector': 'td:hover',\n", + " 'props': 'background-color: #ffffb3;'\n", + " }])\\\n", + " 
.set_td_classes(pd.DataFrame([['true border-green', 'false', 'true', 'false border-red', '', ''],\n", + " ['false', 'true', 'false', 'true', '', '']], \n", + " index=df.index, columns=df.columns))\\\n", + " .set_caption(\"Confusion matrix for multiple cancer prediction models.\")\\\n", + " .set_tooltips(pd.DataFrame([['This model has a very strong true positive rate', '', '', \"This model's total number of false negatives is too high\", '', ''],\n", + " ['', '', '', '', '', '']], \n", + " index=df.index, columns=df.columns),\n", + " css_class='pd-tt', props=\n", + " 'visibility: hidden; position: absolute; z-index: 1; border: 1px solid #000066;'\n", + " 'background-color: white; color: #000066; font-size: 0.8em;' \n", + " 'transform: translate(0px, -24px); padding: 0.6em; border-radius: 0.5em;')\n" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "The `row0_col2` is the identifier for that particular cell. We've also prepended each row/column identifier with a UUID unique to each DataFrame so that the style from one doesn't collide with the styling from another within the same notebook or page (you can set the `uuid` if you'd like to tie together the styling of two DataFrames, or remove it if you want to optimise HTML transfer for larger tables)." + "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Building styles\n", + "## Formatting the Display\n", "\n", - "There are 3 primary methods of adding custom styles to DataFrames using CSS and matching it to cells:\n", + "### Formatting Values\n", "\n", - "- Directly linking external CSS classes to your individual cells using `Styler.set_td_classes`.\n", - "- Using `table_styles` to control broader areas of the DataFrame with internal CSS.\n", - "- Using the `Styler.apply` and `Styler.applymap` functions for more specific control with internal CSS. 
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Linking External CSS\n", + "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value. To control the display value, the text is printed in each cell, and we can use the [.format()][formatfunc] method to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table or for individual columns. \n", "\n", - "*New in version 1.2.0*\n", + "Additionally, the format function has a **precision** argument to specifically help formatting floats, an **na_rep** argument to display missing data, and an **escape** argument to help displaying safe-HTML. The default formatter is configured to adopt pandas' regular `display.precision` option, controllable using `with pd.option_context('display.precision', 2):`\n", "\n", - "If you have designed a website then it is likely you will already have an external CSS file that controls the styling of table and cell objects within your website.\n", + "Here is an example of using the multiple options to control the formatting generally and with specific column formatters.\n", "\n", - "For example, suppose we have an external CSS which controls table properties and has some additional classes to style individual elements (here we manually add one to this notebook):" + "[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n", + "[format]: https://docs.python.org/3/library/string.html#format-specification-mini-language\n", + "[formatfunc]: ../reference/api/pandas.io.formats.style.Styler.format.rst" ] }, { @@ -116,22 +167,28 @@ "metadata": {}, "outputs": [], "source": [ - "from IPython.display import HTML\n", - "style = \\\n", - "\"\"\n", - "HTML(style)" + "df.style.format(precision=0, na_rep='MISSING', \n", + " formatter={('Decision Tree', 'Tumour'): \"{:.2f}\",\n", + " 
('Regression', 'Non-Tumour'): lambda x: \"$ {:,.1f}\".format(x*-1e3)\n", + " })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now we can manually link these to our DataFrame using the `Styler.set_table_attributes` and `Styler.set_td_classes` methods (note that table level 'table-cls' is overwritten here by Jupyters own CSS, but in HTML the default text color will be grey)." + "### Hiding Data\n", + "\n", + "The index can be hidden from rendering by calling [.hide_index()][hideidx], which might be useful if your index is integer based.\n", + "\n", + "Columns can be hidden from rendering by calling [.hide_columns()][hidecols] and passing in the name of a column, or a slice of columns.\n", + "\n", + "Hiding does not change the integer arrangement of CSS classes, e.g. hiding the first two columns of a DataFrame means the column class indexing will start at `col2`, since `col0` and `col1` are simply ignored.\n", + "\n", + "We can update our `Styler` object to hide some data and format the values.\n", + "\n", + "[hideidx]: ../reference/api/pandas.io.formats.style.Styler.hide_index.rst\n", + "[hidecols]: ../reference/api/pandas.io.formats.style.Styler.hide_columns.rst" ] }, { @@ -140,33 +197,65 @@ "metadata": {}, "outputs": [], "source": [ - "css_classes = pd.DataFrame(data=[['cls1', None], ['cls3', 'cls2 cls3']], index=[0,2], columns=['A', 'C'])\n", - "df.style.\\\n", - " set_table_attributes('class=\"table-cls\"').\\\n", - " set_td_classes(css_classes)" + "s = df.style.format('{:.0f}').hide_columns([('Random', 'Tumour'), ('Random', 'Non-Tumour')])\n", + "s" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "# Hidden cell to avoid CSS clashes and latter code upcoding previous formatting \n", + "s.set_uuid('after_hide')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The **advantage** of linking to external CSS is that it can be applied very easily. 
One can build a DataFrame of (multiple) CSS classes to add to each cell dynamically using traditional `DataFrame.apply` and `DataFrame.applymap` methods, or otherwise, and then add those to the Styler. It will integrate with your website's existing CSS styling.\n", + "## Methods to Add Styles\n", + "\n", + "There are **3 primary methods of adding custom CSS styles** to [Styler][styler]:\n", "\n", - "The **disadvantage** of this approach is that it is not easy to transmit files standalone. For example the external CSS must be included or the styling will simply be lost. It is also, as this example shows, not well suited (at a table level) for Jupyter Notebooks. Also this method cannot be used for exporting to Excel, for example, since the external CSS cannot be referenced either by the exporters or by Excel itself." + "- Using [.set_table_styles()][table] to control broader areas of the table with specified internal CSS. Although table styles allow the flexibility to add CSS selectors and properties controlling all individual parts of the table, they are unwieldy for individual cell specifications. Also, note that table styles cannot be exported to Excel. \n", + "- Using [.set_td_classes()][td_class] to directly link either external CSS classes to your data cells or link the internal CSS classes created by [.set_table_styles()][table]. See [here](#Setting-Classes-and-Linking-to-External-CSS). These cannot be used on column header rows or indexes, and also won't export to Excel. \n", + "- Using the [.apply()][apply] and [.applymap()][applymap] functions to add direct internal CSS to specific data cells. See [here](#Styler-Functions). These cannot be used on column header rows or indexes, but only these methods add styles that will export to Excel. 
These methods work in a similar way to [DataFrame.apply()][dfapply] and [DataFrame.applymap()][dfapplymap].\n", + "\n", + "[table]: ../reference/api/pandas.io.formats.style.Styler.set_table_styles.rst\n", + "[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n", + "[td_class]: ../reference/api/pandas.io.formats.style.Styler.set_td_classes.rst\n", + "[apply]: ../reference/api/pandas.io.formats.style.Styler.apply.rst\n", + "[applymap]: ../reference/api/pandas.io.formats.style.Styler.applymap.rst\n", + "[dfapply]: ../reference/api/pandas.DataFrame.apply.rst\n", + "[dfapplymap]: ../reference/api/pandas.DataFrame.applymap.rst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Using Table Styles\n", + "## Table Styles\n", + "\n", + "Table styles are flexible enough to control all individual parts of the table, including column headers and indexes. \n", + "However, they can be unwieldy to type for individual data cells or for any kind of conditional formatting, so we recommend that table styles are used for broad styling, such as entire rows or columns at a time.\n", "\n", - "Table styles allow you to control broader areas of the DataFrame, i.e. the whole table or specific columns or rows, with minimal HTML transfer. Much of the functionality of `Styler` uses individual HTML id tags to manipulate the output, which may be inefficient for very large tables. Using `table_styles` and otherwise avoiding using id tags in data cells can greatly reduce the rendered HTML.\n", + "Table styles are also used to control features which can apply to the whole table at once such as creating a generic hover functionality. The `:hover` pseudo-selector, as well as other pseudo-selectors, can only be used this way.\n", "\n", - "Table styles are also used to control features which can apply to the whole table at once such as greating a generic hover functionality. 
This `:hover` pseudo-selectors, as well as others, can only be used this way.\n", + "To replicate the normal format of CSS selectors and properties (attribute value pairs), e.g. \n", "\n", - "`table_styles` are extremely flexible, but not as fun to type out by hand.\n", - "We hope to collect some useful ones either in pandas, or preferable in a new package that [builds on top](#Extensibility) the tools here." + "```\n", + "tr:hover {\n", + " background-color: #ffff99;\n", + "}\n", + "```\n", + "\n", + "the necessary format to pass styles to [.set_table_styles()][table] is as a list of dicts, each with a CSS-selector tag and CSS-properties. Properties can either be a list of 2-tuples, or a regular CSS-string, for example:\n", + "\n", + "[table]: ../reference/api/pandas.io.formats.style.Styler.set_table_styles.rst" ] }, { @@ -175,23 +264,38 @@ "metadata": {}, "outputs": [], "source": [ - "def hover(hover_color=\"#ffff99\"):\n", - " return {'selector': \"tr:hover\",\n", - " 'props': [(\"background-color\", \"%s\" % hover_color)]}\n", - "\n", - "styles = [\n", - " hover(),\n", - " {'selector': \"th\", 'props': [(\"font-size\", \"150%\"), (\"text-align\", \"center\")]}\n", - "]\n", - "\n", - "df.style.set_table_styles(styles)" + "cell_hover = { # for row hover use <tr> instead of <td>
\n", + " 'selector': 'td:hover',\n", + " 'props': [('background-color', '#ffffb3')]\n", + "}\n", + "index_names = {\n", + " 'selector': '.index_name',\n", + " 'props': 'font-style: italic; color: darkgrey; font-weight:normal;'\n", + "}\n", + "headers = {\n", + " 'selector': 'th:not(.index_name)',\n", + " 'props': 'background-color: #000066; color: white;'\n", + "}\n", + "s.set_table_styles([cell_hover, index_names, headers])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "# Hidden cell to avoid CSS clashes and latter code upcoding previous formatting \n", + "s.set_uuid('after_tab_styles1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "If `table_styles` is given as a dictionary each key should be a specified column or index value and this will map to specific class CSS selectors of the given column or row." + "Next we just add a couple more styling artifacts targeting specific parts of the table, and we add some internally defined CSS classes that we need for the next section. Be careful here, since we are *chaining methods* we need to explicitly instruct the method **not to** ``overwrite`` the existing styles." 
] }, { @@ -200,31 +304,37 @@ "metadata": {}, "outputs": [], "source": [ - "df.style.set_table_styles({\n", - " 'A': [{'selector': '',\n", - " 'props': [('color', 'red')]}],\n", - " 'B': [{'selector': 'td',\n", - " 'props': [('color', 'blue')]}]\n", - "}, axis=0)" + "s.set_table_styles([\n", + " {'selector': 'th.col_heading', 'props': 'text-align: center;'},\n", + " {'selector': 'th.col_heading.level0', 'props': 'font-size: 1.5em;'},\n", + " {'selector': 'td', 'props': 'text-align: center; font-weight: bold;'},\n", + " # internal CSS classes\n", + " {'selector': '.true', 'props': 'background-color: #e6ffe6;'},\n", + " {'selector': '.false', 'props': 'background-color: #ffe6e6;'},\n", + " {'selector': '.border-red', 'props': 'border: 2px dashed red;'},\n", + " {'selector': '.border-green', 'props': 'border: 2px dashed green;'},\n", + "], overwrite=False)" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "nbsphinx": "hidden" + }, "outputs": [], "source": [ - "df.style.set_table_styles({\n", - " 3: [{'selector': 'td',\n", - " 'props': [('color', 'green')]}]\n", - "}, axis=1)" + "# Hidden cell to avoid CSS clashes and latter code upcoding previous formatting \n", + "s.set_uuid('after_tab_styles2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We can also chain all of the above by setting the `overwrite` argument to `False` so that it preserves previous settings. We also show the CSS string input rather than the list of tuples." + "As a convenience method (*since version 1.2.0*) we can also pass a **dict** to [.set_table_styles()][table] which contains row or column keys. 
Behind the scenes Styler just indexes the keys and adds relevant `.col<m>` or `.row<n>` classes as necessary to the given CSS selectors.\n", + "\n", + "[table]: ../reference/api/pandas.io.formats.style.Styler.set_table_styles.rst" ] }, { @@ -233,27 +343,37 @@ "metadata": {}, "outputs": [], "source": [ - "from pandas.io.formats.style import Styler\n", - "s = Styler(df, cell_ids=False, uuid_len=0).\\\n", - " set_table_styles(styles).\\\n", - " set_table_styles({\n", - " 'A': [{'selector': '',\n", - " 'props': 'color:red;'}],\n", - " 'B': [{'selector': 'td',\n", - " 'props': 'color:blue;'}]\n", - " }, axis=0, overwrite=False).\\\n", - " set_table_styles({\n", - " 3: [{'selector': 'td',\n", - " 'props': 'color:green;font-weight:bold;'}]\n", - " }, axis=1, overwrite=False)\n", - "s" + "s.set_table_styles({\n", + " ('Regression', 'Tumour'): [{'selector': 'th', 'props': 'border-left: 1px solid white'},\n", + " {'selector': 'td', 'props': 'border-left: 1px solid #000066'}]\n", + "}, overwrite=False, axis=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "# Hidden cell to avoid CSS clashes and latter code upcoding previous formatting \n", "s.set_uuid('xyz01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "By using these `table_styles` and the additional `Styler` arguments to optimize the HTML we have compressed these styles to only a few lines withing the \\ tags and none of the \\ cells require any `id` attributes. " + "## Setting Classes and Linking to External CSS\n", + "\n", + "If you have designed a website then it is likely you will already have an external CSS file that controls the styling of table and cell objects within it. 
You may want to use these native files rather than duplicate all the CSS in python (and duplicate any maintenance work).\n", + "\n", + "### Table Attributes\n", + "\n", + "It is very easy to add a `class` to the main `<table>` using [.set_table_attributes()][tableatt]. This method can also attach inline styles - read more in [CSS Hierarchies](#CSS-Hierarchies).\n", + "\n", + "[tableatt]: ../reference/api/pandas.io.formats.style.Styler.set_table_attributes.rst" ] }, { @@ -262,50 +382,114 @@ "metadata": {}, "outputs": [], "source": [ - "s.render().split('\\n')[:16]" + "out = s.set_table_attributes('class=\"my-table-cls\"').render()\n", + "print(out[out.find('<table'):])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Data Cell CSS Classes\n", + "\n", + "*New in version 1.2.0*\n", + "\n", + "The [.set_td_classes()][tdclass] method accepts a DataFrame with matching indices and columns to the underlying [Styler][styler]'s DataFrame, containing strings as css-classes to add to individual data cells: the `<td>` elements of the `<table>
`. Here we add our `.true` and `.false` classes that we created previously. We will save adding the borders until the [section on tooltips](#Tooltips).\n", + "\n", + "[tdclass]: ../reference/api/pandas.io.formats.style.Styler.set_td_classes.rst\n", + "[styler]: ../reference/api/pandas.io.formats.style.Styler.rst" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cell_color = pd.DataFrame([['true ', 'false ', 'true ', 'false '], \n", + " ['false ', 'true ', 'false ', 'true ']], \n", + " index=df.index, \n", + " columns=df.columns[:4])\n", + "s.set_td_classes(cell_color)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "# Hidden cell to avoid CSS clashes and latter code upcoding previous formatting \n", + "s.set_uuid('after_classes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Styler Functions\n", - "\n", - "Thirdly we can use the method to pass your style functions into one of the following methods:\n", - "\n", - "- ``Styler.applymap``: elementwise\n", - "- ``Styler.apply``: column-/row-/table-wise\n", - "\n", - "Both of those methods take a function (and some other keyword arguments) and applies your function to the DataFrame in a certain way.\n", - "`Styler.applymap` works through the DataFrame elementwise.\n", - "`Styler.apply` passes each column or row into your DataFrame one-at-a-time or the entire table at once, depending on the `axis` keyword argument.\n", - "For columnwise use `axis=0`, rowwise use `axis=1`, and for the entire table at once use `axis=None`.\n", - "\n", - "For `Styler.applymap` your function should take a scalar and return a single string with the CSS attribute-value pair.\n", + "## Styler Functions\n", "\n", - "For `Styler.apply` your function should take a Series or DataFrame (depending on the axis parameter), and return a Series or DataFrame with an identical 
shape where each value is a string with a CSS attribute-value pair.\n", + "We use the following methods to pass your style functions. Both of those methods take a function (and some other keyword arguments) and apply it to the DataFrame in a certain way, rendering CSS styles.\n", "\n", - "The **advantage** of this method is that there is full granular control and the output is isolated and easily transferrable, especially in Jupyter Notebooks.\n", + "- [.applymap()][applymap] (elementwise): accepts a function that takes a single value and returns a string with the CSS attribute-value pair.\n", + "- [.apply()][apply] (column-/row-/table-wise): accepts a function that takes a Series or DataFrame and returns a Series, DataFrame, or numpy array with an identical shape where each element is a string with a CSS attribute-value pair. This method passes each column or row of your DataFrame one-at-a-time or the entire table at once, depending on the `axis` keyword argument. For columnwise use `axis=0`, rowwise use `axis=1`, and for the entire table at once use `axis=None`.\n", "\n", - "The **disadvantage** is that the HTML/CSS required to produce this needs to be directly generated from the Python code and it can lead to inefficient data transfer for large tables.\n", + "This method is powerful for applying multiple, complex logic to data cells. We create a new DataFrame to demonstrate this.\n", "\n", - "Let's see some examples." 
+ "[apply]: ../reference/api/pandas.io.formats.style.Styler.apply.rst\n", + "[applymap]: ../reference/api/pandas.io.formats.style.Styler.applymap.rst" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "np.random.seed(0)\n", + "df2 = pd.DataFrame(np.random.randn(10,4), columns=['A','B','C','D'])\n", + "df2.style" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For example we can build a function that colors text if it is negative, and chain this with a function that partially fades cells of negligible value. Since this looks at each element in turn we use ``applymap``." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def style_negative(v, props=''):\n", + " return props if v < 0 else None\n", + "s2 = df2.style.applymap(style_negative, props='color:red;')\\\n", + " .applymap(lambda v: 'opacity: 20%;' if (v < 0.3) and (v > -0.3) else None)\n", + "s2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "# Hidden cell to avoid CSS clashes and latter code upcoding previous formatting \n", + "s2.set_uuid('after_applymap')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's write a simple style function that will color negative numbers red and positive numbers black." + "We can also build a function that highlights the maximum value across rows, cols, and the DataFrame all at once. In this case we use ``apply``. Below we highlight the maximum in a column." 
] }, { @@ -314,19 +498,28 @@ "metadata": {}, "outputs": [], "source": [ - "def color_negative_red(val):\n", - " \"\"\"Color negative scalars red.\"\"\"\n", - " css = 'color: red;'\n", - " if val < 0: return css\n", - " return None" + "def highlight_max(s, props=''):\n", + " return np.where(s == np.nanmax(s.values), props, '')\n", + "s2.apply(highlight_max, props='color:white;background-color:darkblue', axis=0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "# Hidden cell to avoid CSS clashes and latter code upcoding previous formatting \n", + "s2.set_uuid('after_apply')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In this case, the cell's style depends only on its own value.\n", - "That means we should use the `Styler.applymap` method which works elementwise." + "We can use the same function across the different axes, highlighting here the DataFrame maximum in purple, and row maximums in pink." ] }, { @@ -335,28 +528,34 @@ "metadata": {}, "outputs": [], "source": [ - "s = df.style.applymap(color_negative_red)\n", - "s" + "s2.apply(highlight_max, props='color:white;background-color:pink;', axis=1)\\\n", + " .apply(highlight_max, props='color:white;background-color:purple', axis=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Notice the similarity with the standard `df.applymap`, which operates on DataFrames elementwise. We want you to be able to reuse your existing knowledge of how to interact with DataFrames.\n", + "This last example shows how some styles have been overwritten by others. In general the most recent style applied is active but you can read more in the [section on CSS hierarchies](#CSS-Hierarchies). 
You can also apply these styles to more granular parts of the DataFrame - read more in section on [subset slicing](#Finer-Control-with-Slicing).\n", "\n", - "Notice also that our function returned a string containing the CSS attribute and value, separated by a colon just like in a `'.format(css))" + "# HTML(''.format(css))" ] } ], diff --git a/doc/source/user_guide/visualization.rst b/doc/source/user_guide/visualization.rst index 8b41cc24829c5..7b2c8478e71af 100644 --- a/doc/source/user_guide/visualization.rst +++ b/doc/source/user_guide/visualization.rst @@ -2,9 +2,12 @@ {{ header }} -************* -Visualization -************* +******************* +Chart Visualization +******************* + +This section demonstrates visualization through charting. For information on +visualization of tabular data please see the section on `Table Visualization `_. We use the standard convention for referencing the matplotlib API: diff --git a/doc/source/user_guide/window.rst b/doc/source/user_guide/window.rst index d09c1ab9a1409..be9c04ae5d4f3 100644 --- a/doc/source/user_guide/window.rst +++ b/doc/source/user_guide/window.rst @@ -101,7 +101,7 @@ be calculated with :meth:`~Rolling.apply` by specifying a separate column of wei All windowing operations support a ``min_periods`` argument that dictates the minimum amount of non-``np.nan`` values a window must have; otherwise, the resulting value is ``np.nan``. -``min_peridos`` defaults to 1 for time-based windows and ``window`` for fixed windows +``min_periods`` defaults to 1 for time-based windows and ``window`` for fixed windows .. ipython:: python diff --git a/doc/source/whatsnew/v1.3.0.rst b/doc/source/whatsnew/v1.3.0.rst index 92efb225682b7..1e723493a4cc8 100644 --- a/doc/source/whatsnew/v1.3.0.rst +++ b/doc/source/whatsnew/v1.3.0.rst @@ -302,6 +302,38 @@ cast to ``dtype=object`` (:issue:`38709`) ser2 +.. 
_whatsnew_130.notable_bug_fixes.rolling_groupby_column: + +GroupBy.rolling no longer returns grouped-by column in values +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The group-by column will now be dropped from the result of a +``groupby.rolling`` operation (:issue:`32262`) + +.. ipython:: python + + df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]}) + df + +*Previous behavior*: + +.. code-block:: ipython + + In [1]: df.groupby("A").rolling(2).sum() + Out[1]: + A B + A + 1 0 NaN NaN + 1 2.0 1.0 + 2 2 NaN NaN + 3 3 NaN NaN + +*New behavior*: + +.. ipython:: python + + df.groupby("A").rolling(2).sum() + .. _whatsnew_130.notable_bug_fixes.rolling_var_precision: Removed artificial truncation in rolling variance and standard deviation @@ -501,6 +533,7 @@ Numeric - Bug in :meth:`DataFrame.mode` and :meth:`Series.mode` not keeping consistent integer :class:`Index` for empty input (:issue:`33321`) - Bug in :meth:`DataFrame.rank` with ``np.inf`` and mixture of ``np.nan`` and ``np.inf`` (:issue:`32593`) - Bug in :meth:`DataFrame.rank` with ``axis=0`` and columns holding incomparable types raising ``IndexError`` (:issue:`38932`) +- Bug in ``rank`` method for :class:`Series`, :class:`DataFrame`, :class:`DataFrameGroupBy`, and :class:`SeriesGroupBy` treating the most negative ``int64`` value as missing (:issue:`32859`) - Bug in :func:`select_dtypes` different behavior between Windows and Linux with ``include="int"`` (:issue:`36569`) - Bug in :meth:`DataFrame.apply` and :meth:`DataFrame.agg` when passed argument ``func="size"`` would operate on the entire ``DataFrame`` instead of rows or columns (:issue:`39934`) - Bug in :meth:`DataFrame.transform` would raise ``SpecificationError`` when passed a dictionary and columns were missing; will now raise a ``KeyError`` instead (:issue:`40004`) diff --git a/pandas/_libs/algos.pyx b/pandas/_libs/algos.pyx index 495160e65eec3..a4bc2443e0eeb 100644 --- a/pandas/_libs/algos.pyx +++ b/pandas/_libs/algos.pyx @@ 
-962,6 +962,7 @@ ctypedef fused rank_t: def rank_1d( ndarray[rank_t, ndim=1] values, const intp_t[:] labels, + bint is_datetimelike=False, ties_method="average", bint ascending=True, bint pct=False, @@ -977,6 +978,8 @@ def rank_1d( Array containing unique label for each group, with its ordering matching up to the corresponding record in `values`. If not called from a groupby operation, will be an array of 0's + is_datetimelike : bool, default False + True if `values` contains datetime-like entries. ties_method : {'average', 'min', 'max', 'first', 'dense'}, default 'average' * average: average rank of group @@ -1032,7 +1035,7 @@ def rank_1d( if rank_t is object: mask = missing.isnaobj(masked_vals) - elif rank_t is int64_t: + elif rank_t is int64_t and is_datetimelike: mask = (masked_vals == NPY_NAT).astype(np.uint8) elif rank_t is float64_t: mask = np.isnan(masked_vals).astype(np.uint8) @@ -1059,7 +1062,7 @@ def rank_1d( if rank_t is object: nan_fill_val = NegInfinity() elif rank_t is int64_t: - nan_fill_val = np.iinfo(np.int64).min + nan_fill_val = NPY_NAT elif rank_t is uint64_t: nan_fill_val = 0 else: @@ -1275,6 +1278,7 @@ def rank_1d( def rank_2d( ndarray[rank_t, ndim=2] in_arr, int axis=0, + bint is_datetimelike=False, ties_method="average", bint ascending=True, na_option="keep", @@ -1299,7 +1303,9 @@ def rank_2d( tiebreak = tiebreakers[ties_method] keep_na = na_option == 'keep' - check_mask = rank_t is not uint64_t + + # For cases where a mask is not possible, we can avoid mask checks + check_mask = not (rank_t is uint64_t or (rank_t is int64_t and not is_datetimelike)) if axis == 0: values = np.asarray(in_arr).T.copy() @@ -1310,13 +1316,15 @@ def rank_2d( if values.dtype != np.object_: values = values.astype('O') - if rank_t is not uint64_t: + if check_mask: if ascending ^ (na_option == 'top'): if rank_t is object: nan_value = Infinity() elif rank_t is float64_t: nan_value = np.inf - elif rank_t is int64_t: + + # int64 and datetimelike + else: nan_value = 
np.iinfo(np.int64).max else: @@ -1324,14 +1332,18 @@ def rank_2d( nan_value = NegInfinity() elif rank_t is float64_t: nan_value = -np.inf - elif rank_t is int64_t: + + # int64 and datetimelike + else: nan_value = NPY_NAT if rank_t is object: mask = missing.isnaobj2d(values) elif rank_t is float64_t: mask = np.isnan(values) - elif rank_t is int64_t: + + # int64 and datetimelike + else: mask = values == NPY_NAT np.putmask(values, mask, nan_value) diff --git a/pandas/_libs/groupby.pyx b/pandas/_libs/groupby.pyx index 7ddc087df9b11..e23fa9b82f12e 100644 --- a/pandas/_libs/groupby.pyx +++ b/pandas/_libs/groupby.pyx @@ -681,18 +681,17 @@ group_mean_float64 = _group_mean['double'] @cython.wraparound(False) @cython.boundscheck(False) -def _group_ohlc(floating[:, ::1] out, - int64_t[::1] counts, - ndarray[floating, ndim=2] values, - const intp_t[:] labels, - Py_ssize_t min_count=-1): +def group_ohlc(floating[:, ::1] out, + int64_t[::1] counts, + ndarray[floating, ndim=2] values, + const intp_t[:] labels, + Py_ssize_t min_count=-1): """ Only aggregates on axis=0 """ cdef: Py_ssize_t i, j, N, K, lab - floating val, count - Py_ssize_t ngroups = len(counts) + floating val assert min_count == -1, "'min_count' only used in add and prod" @@ -727,10 +726,6 @@ def _group_ohlc(floating[:, ::1] out, out[lab, 3] = val -group_ohlc_float32 = _group_ohlc['float'] -group_ohlc_float64 = _group_ohlc['double'] - - @cython.boundscheck(False) @cython.wraparound(False) def group_quantile(ndarray[float64_t] out, @@ -1079,9 +1074,8 @@ def group_rank(float64_t[:, ::1] out, ngroups : int This parameter is not used, is needed to match signatures of other groupby functions. - is_datetimelike : bool, default False - unused in this method but provided for call compatibility with other - Cython transformations + is_datetimelike : bool + True if `values` contains datetime-like entries. 
ties_method : {'average', 'min', 'max', 'first', 'dense'}, default 'average' * average: average rank of group @@ -1109,6 +1103,7 @@ def group_rank(float64_t[:, ::1] out, result = rank_1d( values=values[:, 0], labels=labels, + is_datetimelike=is_datetimelike, ties_method=ties_method, ascending=ascending, pct=pct, diff --git a/pandas/_libs/internals.pyi b/pandas/_libs/internals.pyi new file mode 100644 index 0000000000000..446ee299698c5 --- /dev/null +++ b/pandas/_libs/internals.pyi @@ -0,0 +1,58 @@ +from typing import ( + Iterator, + Sequence, + overload, +) + +import numpy as np + +from pandas._typing import ArrayLike + +def slice_len(slc: slice, objlen: int = ...) -> int: ... + + +def get_blkno_indexers( + blknos: np.ndarray, # int64_t[:] + group: bool = ..., +) -> list[tuple[int, slice | np.ndarray]]: ... + + +def get_blkno_placements( + blknos: np.ndarray, + group: bool = ..., +) -> Iterator[tuple[int, BlockPlacement]]: ... + + +class BlockPlacement: + def __init__(self, val: int | slice | np.ndarray): ... + + @property + def indexer(self) -> np.ndarray | slice: ... + + @property + def as_array(self) -> np.ndarray: ... + + @property + def is_slice_like(self) -> bool: ... + + @overload + def __getitem__(self, loc: slice | Sequence[int]) -> BlockPlacement: ... + + @overload + def __getitem__(self, loc: int) -> int: ... + + def __iter__(self) -> Iterator[int]: ... + + def __len__(self) -> int: ... + + def delete(self, loc) -> BlockPlacement: ... + + def append(self, others: list[BlockPlacement]) -> BlockPlacement: ... + + +class Block: + _mgr_locs: BlockPlacement + ndim: int + values: ArrayLike + + def __init__(self, values: ArrayLike, placement: BlockPlacement, ndim: int): ... diff --git a/pandas/_libs/testing.pyi b/pandas/_libs/testing.pyi new file mode 100644 index 0000000000000..ac0c772780c5c --- /dev/null +++ b/pandas/_libs/testing.pyi @@ -0,0 +1,8 @@ + + +def assert_dict_equal(a, b, compare_keys: bool = ...): ... 
+ +def assert_almost_equal(a, b, + rtol: float = ..., atol: float = ..., + check_dtype: bool = ..., + obj=..., lobj=..., robj=..., index_values=...): ... diff --git a/pandas/_testing/asserters.py b/pandas/_testing/asserters.py index 2adc70438cce7..62205b9203bf0 100644 --- a/pandas/_testing/asserters.py +++ b/pandas/_testing/asserters.py @@ -154,6 +154,9 @@ def assert_almost_equal( else: obj = "Input" assert_class_equal(left, right, obj=obj) + + # if we have "equiv", this becomes True + check_dtype = bool(check_dtype) _testing.assert_almost_equal( left, right, check_dtype=check_dtype, rtol=rtol, atol=atol, **kwargs ) @@ -388,12 +391,15 @@ def _get_ilevel_values(index, level): msg = f"{obj} values are different ({np.round(diff, 5)} %)" raise_assert_detail(obj, msg, left, right) else: + + # if we have "equiv", this becomes True + exact_bool = bool(exact) _testing.assert_almost_equal( left.values, right.values, rtol=rtol, atol=atol, - check_dtype=exact, + check_dtype=exact_bool, obj=obj, lobj=left, robj=right, diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py index 77b5a0148905e..f52aff424eb0b 100644 --- a/pandas/core/algorithms.py +++ b/pandas/core/algorithms.py @@ -1031,21 +1031,23 @@ def rank( Whether or not to display the returned rankings in integer form (e.g. 1, 2, 3) or in percentile form (e.g. 0.333..., 0.666..., 1). 
""" + is_datetimelike = needs_i8_conversion(values.dtype) + values = _get_values_for_rank(values) if values.ndim == 1: - values = _get_values_for_rank(values) ranks = algos.rank_1d( values, labels=np.zeros(len(values), dtype=np.intp), + is_datetimelike=is_datetimelike, ties_method=method, ascending=ascending, na_option=na_option, pct=pct, ) elif values.ndim == 2: - values = _get_values_for_rank(values) ranks = algos.rank_2d( values, axis=axis, + is_datetimelike=is_datetimelike, ties_method=method, ascending=ascending, na_option=na_option, diff --git a/pandas/core/frame.py b/pandas/core/frame.py index 62341045413a7..46501c97cf38a 100644 --- a/pandas/core/frame.py +++ b/pandas/core/frame.py @@ -528,7 +528,7 @@ class DataFrame(NDFrame, OpsMixin): >>> from dataclasses import make_dataclass >>> Point = make_dataclass("Point", [("x", int), ("y", int)]) >>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)]) - x y + x y 0 0 0 1 0 3 2 2 3 diff --git a/pandas/core/groupby/ops.py b/pandas/core/groupby/ops.py index 1350848741ad1..99b9aea4f82df 100644 --- a/pandas/core/groupby/ops.py +++ b/pandas/core/groupby/ops.py @@ -486,6 +486,12 @@ def _get_cython_func_and_vals( func = _get_cython_function(kind, how, values.dtype, is_numeric) else: raise + else: + if values.dtype.kind in ["i", "u"]: + if how in ["ohlc"]: + # The output may still include nans, so we have to cast + values = ensure_float64(values) + return func, values @final diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py index 29175d0b20f92..09e214237b736 100644 --- a/pandas/core/internals/blocks.py +++ b/pandas/core/internals/blocks.py @@ -36,6 +36,7 @@ Shape, final, ) +from pandas.util._decorators import cache_readonly from pandas.util._validators import validate_bool_kwarg from pandas.core.dtypes.cast import ( @@ -165,7 +166,7 @@ class Block(libinternals.Block, PandasObject): _validate_ndim = True @final - @property + @cache_readonly def _consolidate_key(self): return 
self._can_consolidate, self.dtype.name @@ -188,7 +189,7 @@ def _can_hold_na(self) -> bool: return values._can_hold_na @final - @property + @cache_readonly def is_categorical(self) -> bool: warnings.warn( "Block.is_categorical is deprecated and will be removed in a " @@ -217,6 +218,7 @@ def internal_values(self): """ return self.values + @property def array_values(self) -> ExtensionArray: """ The array that Series.array returns. Always an ExtensionArray. @@ -245,7 +247,7 @@ def get_block_values_for_json(self) -> np.ndarray: return np.asarray(self.values).reshape(self.shape) @final - @property + @cache_readonly def fill_value(self): # Used in reindex_indexer return na_value_for_dtype(self.dtype, compat=False) @@ -353,7 +355,7 @@ def shape(self) -> Shape: return self.values.shape @final - @property + @cache_readonly def dtype(self) -> DtypeObj: return self.values.dtype @@ -378,6 +380,11 @@ def delete(self, loc) -> None: """ self.values = np.delete(self.values, loc, 0) self.mgr_locs = self._mgr_locs.delete(loc) + try: + self._cache.clear() + except AttributeError: + # _cache not yet initialized + pass @final def apply(self, func, **kwargs) -> List[Block]: @@ -592,7 +599,7 @@ def astype(self, dtype, copy: bool = False, errors: str = "raise"): """ values = self.values if values.dtype.kind in ["m", "M"]: - values = self.array_values() + values = self.array_values new_values = astype_array_safe(values, dtype, copy=copy, errors=errors) @@ -931,7 +938,7 @@ def setitem(self, indexer, value): return self.coerce_to_target_dtype(value).setitem(indexer, value) if self.dtype.kind in ["m", "M"]: - arr = self.array_values().T + arr = self.array_values.T arr[indexer] = value return self @@ -1445,7 +1452,7 @@ class ExtensionBlock(Block): values: ExtensionArray - @property + @cache_readonly def shape(self) -> Shape: # TODO(EA2D): override unnecessary with 2D EAs if self.ndim == 1: @@ -1476,6 +1483,12 @@ def set_inplace(self, locs, values): # see GH#33457 assert locs.tolist() == [0] 
self.values = values + try: + # TODO(GH33457) this can be removed + self._cache.clear() + except AttributeError: + # _cache not yet initialized + pass def putmask(self, mask, new) -> List[Block]: """ @@ -1500,7 +1513,7 @@ def is_view(self) -> bool: """Extension arrays are never treated as views.""" return False - @property + @cache_readonly def is_numeric(self): return self.values.dtype._is_numeric @@ -1549,6 +1562,7 @@ def get_values(self, dtype: Optional[DtypeObj] = None) -> np.ndarray: # TODO(EA2D): reshape not needed with 2D EAs return np.asarray(self.values).reshape(self.shape) + @cache_readonly def array_values(self) -> ExtensionArray: return self.values @@ -1675,10 +1689,7 @@ def where(self, other, cond, errors="raise") -> List[Block]: # The default `other` for Series / Frame is np.nan # we want to replace that with the correct NA value # for the type - - # error: Item "dtype[Any]" of "Union[dtype[Any], ExtensionDtype]" has no - # attribute "na_value" - other = self.dtype.na_value # type: ignore[union-attr] + other = self.dtype.na_value if is_sparse(self.values): # TODO(SparseArray.__setitem__): remove this if condition @@ -1739,10 +1750,11 @@ class HybridMixin: array_values: Callable def _can_hold_element(self, element: Any) -> bool: - values = self.array_values() + values = self.array_values try: - values._validate_setitem_value(element) + # error: "Callable[..., Any]" has no attribute "_validate_setitem_value" + values._validate_setitem_value(element) # type: ignore[attr-defined] return True except (ValueError, TypeError): return False @@ -1768,9 +1780,7 @@ def _can_hold_element(self, element: Any) -> bool: if isinstance(element, (IntegerArray, FloatingArray)): if element._mask.any(): return False - # error: Argument 1 to "can_hold_element" has incompatible type - # "Union[dtype[Any], ExtensionDtype]"; expected "dtype[Any]" - return can_hold_element(self.dtype, element) # type: ignore[arg-type] + return can_hold_element(self.dtype, element) class 
NDArrayBackedExtensionBlock(HybridMixin, Block): @@ -1780,23 +1790,25 @@ class NDArrayBackedExtensionBlock(HybridMixin, Block): def internal_values(self): # Override to return DatetimeArray and TimedeltaArray - return self.array_values() + return self.array_values def get_values(self, dtype: Optional[DtypeObj] = None) -> np.ndarray: """ return object dtype as boxed values, such as Timestamps/Timedelta """ - values = self.array_values() + values = self.array_values if is_object_dtype(dtype): # DTA/TDA constructor and astype can handle 2D - values = values.astype(object) + # error: "Callable[..., Any]" has no attribute "astype" + values = values.astype(object) # type: ignore[attr-defined] # TODO(EA2D): reshape not needed with 2D EAs return np.asarray(values).reshape(self.shape) def iget(self, key): # GH#31649 we need to wrap scalars in Timestamp/Timedelta # TODO(EA2D): this can be removed if we ever have 2D EA - return self.array_values().reshape(self.shape)[key] + # error: "Callable[..., Any]" has no attribute "reshape" + return self.array_values.reshape(self.shape)[key] # type: ignore[attr-defined] def putmask(self, mask, new) -> List[Block]: mask = extract_bool_array(mask) @@ -1805,14 +1817,16 @@ def putmask(self, mask, new) -> List[Block]: return self.astype(object).putmask(mask, new) # TODO(EA2D): reshape unnecessary with 2D EAs - arr = self.array_values().reshape(self.shape) + # error: "Callable[..., Any]" has no attribute "reshape" + arr = self.array_values.reshape(self.shape) # type: ignore[attr-defined] arr = cast("NDArrayBackedExtensionArray", arr) arr.T.putmask(mask, new) return [self] def where(self, other, cond, errors="raise") -> List[Block]: # TODO(EA2D): reshape unnecessary with 2D EAs - arr = self.array_values().reshape(self.shape) + # error: "Callable[..., Any]" has no attribute "reshape" + arr = self.array_values.reshape(self.shape) # type: ignore[attr-defined] cond = extract_bool_array(cond) @@ -1848,15 +1862,17 @@ def diff(self, n: int, axis: int 
= 0) -> List[Block]: by apply. """ # TODO(EA2D): reshape not necessary with 2D EAs - values = self.array_values().reshape(self.shape) + # error: "Callable[..., Any]" has no attribute "reshape" + values = self.array_values.reshape(self.shape) # type: ignore[attr-defined] new_values = values - values.shift(n, axis=axis) new_values = maybe_coerce_values(new_values) return [self.make_block(new_values)] def shift(self, periods: int, axis: int = 0, fill_value: Any = None) -> List[Block]: - # TODO(EA2D) this is unnecessary if these blocks are backed by 2D EAs - values = self.array_values().reshape(self.shape) + # TODO(EA2D) this is unnecessary if these blocks are backed by 2D EA + # error: "Callable[..., Any]" has no attribute "reshape" + values = self.array_values.reshape(self.shape) # type: ignore[attr-defined] new_values = values.shift(periods, fill_value=fill_value, axis=axis) new_values = maybe_coerce_values(new_values) return [self.make_block_same_class(new_values)] @@ -1871,9 +1887,13 @@ def fillna( # TODO: don't special-case td64 return self.astype(object).fillna(value, limit, inplace, downcast) - values = self.array_values() - values = values if inplace else values.copy() - new_values = values.fillna(value=value, limit=limit) + values = self.array_values + # error: "Callable[..., Any]" has no attribute "copy" + values = values if inplace else values.copy() # type: ignore[attr-defined] + # error: "Callable[..., Any]" has no attribute "fillna" + new_values = values.fillna( # type: ignore[attr-defined] + value=value, limit=limit + ) new_values = maybe_coerce_values(new_values) return [self.make_block_same_class(values=new_values)] @@ -1883,6 +1903,7 @@ class DatetimeLikeBlockMixin(NDArrayBackedExtensionBlock): is_numeric = False + @cache_readonly def array_values(self): return ensure_wrapped_if_datetimelike(self.values) diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py index 14fa994631623..28151a43d1dac 100644 --- 
a/pandas/core/internals/managers.py +++ b/pandas/core/internals/managers.py @@ -1668,7 +1668,7 @@ def internal_values(self): def array_values(self): """The array that Series.array returns""" - return self._block.array_values() + return self._block.array_values @property def _can_hold_na(self) -> bool: diff --git a/pandas/core/missing.py b/pandas/core/missing.py index 41d7fed66469d..feaecec382704 100644 --- a/pandas/core/missing.py +++ b/pandas/core/missing.py @@ -861,7 +861,4 @@ def _rolling_window(a: np.ndarray, window: int): # https://stackoverflow.com/a/6811241 shape = a.shape[:-1] + (a.shape[-1] - window + 1, window) strides = a.strides + (a.strides[-1],) - # error: Module has no attribute "stride_tricks" - return np.lib.stride_tricks.as_strided( # type: ignore[attr-defined] - a, shape=shape, strides=strides - ) + return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides) diff --git a/pandas/core/strings/accessor.py b/pandas/core/strings/accessor.py index 73a5ef9345fec..1eda06dbbb1c4 100644 --- a/pandas/core/strings/accessor.py +++ b/pandas/core/strings/accessor.py @@ -1925,13 +1925,13 @@ def get_dummies(self, sep="|"): Examples -------- >>> pd.Series(['a|b', 'a', 'a|c']).str.get_dummies() - a b c + a b c 0 1 1 0 1 1 0 0 2 1 0 1 >>> pd.Series(['a|b', np.nan, 'a|c']).str.get_dummies() - a b c + a b c 0 1 1 0 1 0 0 0 2 1 0 1 diff --git a/pandas/core/window/rolling.py b/pandas/core/window/rolling.py index 0fa49dccda573..b482934dd25d2 100644 --- a/pandas/core/window/rolling.py +++ b/pandas/core/window/rolling.py @@ -558,6 +558,10 @@ def __init__( if _grouper is None: raise ValueError("Must pass a Grouper object.") self._grouper = _grouper + # GH 32262: It's convention to keep the grouping column in + # groupby.<agg>, but unexpected to users in + # groupby.rolling.<agg> 
+ obj = obj.drop(columns=self._grouper.names, errors="ignore") super().__init__(obj, *args, **kwargs) def _apply( diff --git a/pandas/tests/arrays/string_/test_string.py b/pandas/tests/arrays/string_/test_string.py index 0574061a6a544..8b84a510c01e6 100644 --- a/pandas/tests/arrays/string_/test_string.py +++ b/pandas/tests/arrays/string_/test_string.py @@ -42,23 +42,16 @@ def cls(request): return request.param -def test_repr(dtype, request): - if dtype == "arrow_string": - reason = ( - "AssertionError: assert ' A\n0 a\n1 None\n2 b' " - "== ' A\n0 a\n1 \n2 b'" - ) - mark = pytest.mark.xfail(reason=reason) - request.node.add_marker(mark) - +def test_repr(dtype): df = pd.DataFrame({"A": pd.array(["a", pd.NA, "b"], dtype=dtype)}) expected = " A\n0 a\n1 \n2 b" assert repr(df) == expected - expected = "0 a\n1 \n2 b\nName: A, dtype: string" + expected = f"0 a\n1 \n2 b\nName: A, dtype: {dtype}" assert repr(df.A) == expected - expected = "\n['a', , 'b']\nLength: 3, dtype: string" + arr_name = "ArrowStringArray" if dtype == "arrow_string" else "StringArray" + expected = f"<{arr_name}>\n['a', , 'b']\nLength: 3, dtype: {dtype}" assert repr(df.A.array) == expected @@ -371,9 +364,20 @@ def test_astype_int(dtype, request): tm.assert_extension_array_equal(result, expected) -def test_astype_float(any_float_allowed_nullable_dtype): +def test_astype_float(dtype, any_float_allowed_nullable_dtype, request): # Don't compare arrays (37974) - ser = pd.Series(["1.1", pd.NA, "3.3"], dtype="string") + + if dtype == "arrow_string": + if any_float_allowed_nullable_dtype in {"Float32", "Float64"}: + reason = "TypeError: Cannot interpret 'Float32Dtype()' as a data type" + else: + reason = ( + "TypeError: float() argument must be a string or a number, not 'NAType'" + ) + mark = pytest.mark.xfail(reason=reason) + request.node.add_marker(mark) + + ser = pd.Series(["1.1", pd.NA, "3.3"], dtype=dtype) result = ser.astype(any_float_allowed_nullable_dtype) expected = pd.Series([1.1, np.nan, 3.3], 
dtype=any_float_allowed_nullable_dtype) @@ -436,17 +440,25 @@ def test_reduce_missing(skipna, dtype): assert pd.isna(result) -def test_fillna_args(): +def test_fillna_args(dtype, request): # GH 37987 - arr = pd.array(["a", pd.NA], dtype="string") + if dtype == "arrow_string": + reason = ( + "AssertionError: Regex pattern \"Cannot set non-string value '1' into " + "a StringArray.\" does not match 'Scalar must be NA or str'" + ) + mark = pytest.mark.xfail(reason=reason) + request.node.add_marker(mark) + + arr = pd.array(["a", pd.NA], dtype=dtype) res = arr.fillna(value="b") - expected = pd.array(["a", "b"], dtype="string") + expected = pd.array(["a", "b"], dtype=dtype) tm.assert_extension_array_equal(res, expected) res = arr.fillna(value=np.str_("b")) - expected = pd.array(["a", "b"], dtype="string") + expected = pd.array(["a", "b"], dtype=dtype) tm.assert_extension_array_equal(res, expected) msg = "Cannot set non-string value '1' into a StringArray." diff --git a/pandas/tests/frame/methods/test_rank.py b/pandas/tests/frame/methods/test_rank.py index ce46d1d8b1869..6538eda8cdeff 100644 --- a/pandas/tests/frame/methods/test_rank.py +++ b/pandas/tests/frame/methods/test_rank.py @@ -6,7 +6,6 @@ import numpy as np import pytest -from pandas._libs import iNaT from pandas._libs.algos import ( Infinity, NegInfinity, @@ -382,7 +381,7 @@ def test_pct_max_many_rows(self): "float32", ), ([np.iinfo(np.uint8).min, 1, 2, 100, np.iinfo(np.uint8).max], "uint8"), - pytest.param( + ( [ np.iinfo(np.int64).min, -100, @@ -394,20 +393,20 @@ def test_pct_max_many_rows(self): np.iinfo(np.int64).max, ], "int64", - marks=pytest.mark.xfail( - reason="iNaT is equivalent to minimum value of dtype" - "int64 pending issue GH#16674" - ), ), ([NegInfinity(), "1", "A", "BA", "Ba", "C", Infinity()], "object"), + ( + [datetime(2001, 1, 1), datetime(2001, 1, 2), datetime(2001, 1, 5)], + "datetime64", + ), ], ) def test_rank_inf_and_nan(self, contents, dtype, frame_or_series): dtype_na_map = { "float64": 
np.nan, "float32": np.nan, - "int64": iNaT, "object": None, + "datetime64": np.datetime64("nat"), } # Insert nans at random positions if underlying dtype has missing # value. Then adjust the expected order by adding nans accordingly diff --git a/pandas/tests/groupby/test_libgroupby.py b/pandas/tests/groupby/test_libgroupby.py index febc12edf0b32..d776c34f5b5ec 100644 --- a/pandas/tests/groupby/test_libgroupby.py +++ b/pandas/tests/groupby/test_libgroupby.py @@ -138,7 +138,7 @@ def _check(dtype): counts = np.zeros(len(out), dtype=np.int64) labels = ensure_platform_int(np.repeat(np.arange(3), np.diff(np.r_[0, bins]))) - func = getattr(libgroupby, f"group_ohlc_{dtype}") + func = libgroupby.group_ohlc func(out, counts, obj[:, None], labels) def _ohlc(group): diff --git a/pandas/tests/groupby/test_rank.py b/pandas/tests/groupby/test_rank.py index 6116703ebd174..00641effac08d 100644 --- a/pandas/tests/groupby/test_rank.py +++ b/pandas/tests/groupby/test_rank.py @@ -1,9 +1,12 @@ +from datetime import datetime + import numpy as np import pytest import pandas as pd from pandas import ( DataFrame, + NaT, Series, concat, ) @@ -517,3 +520,25 @@ def test_rank_zero_div(input_key, input_value, output_value): result = df.groupby("A").rank(method="dense", pct=True) expected = DataFrame({"B": output_value}) tm.assert_frame_equal(result, expected) + + +def test_rank_min_int(): + # GH-32859 + df = DataFrame( + { + "grp": [1, 1, 2], + "int_col": [ + np.iinfo(np.int64).min, + np.iinfo(np.int64).max, + np.iinfo(np.int64).min, + ], + "datetimelike": [NaT, datetime(2001, 1, 1), NaT], + } + ) + + result = df.groupby("grp").rank() + expected = DataFrame( + {"int_col": [1.0, 2.0, 1.0], "datetimelike": [np.NaN, 1.0, np.NaN]} + ) + + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/window/test_groupby.py b/pandas/tests/window/test_groupby.py index c3c5bbe460134..5c2f69a9247e9 100644 --- a/pandas/tests/window/test_groupby.py +++ b/pandas/tests/window/test_groupby.py @@ -83,6 
+83,8 @@ def test_rolling(self, f): result = getattr(r, f)() expected = g.apply(lambda x: getattr(x.rolling(4), f)()) + # groupby.apply doesn't drop the grouped-by column + expected = expected.drop("A", axis=1) # GH 39732 expected_index = MultiIndex.from_arrays([self.frame["A"], range(40)]) expected.index = expected_index @@ -95,6 +97,8 @@ def test_rolling_ddof(self, f): result = getattr(r, f)(ddof=1) expected = g.apply(lambda x: getattr(x.rolling(4), f)(ddof=1)) + # groupby.apply doesn't drop the grouped-by column + expected = expected.drop("A", axis=1) # GH 39732 expected_index = MultiIndex.from_arrays([self.frame["A"], range(40)]) expected.index = expected_index @@ -111,6 +115,8 @@ def test_rolling_quantile(self, interpolation): expected = g.apply( lambda x: x.rolling(4).quantile(0.4, interpolation=interpolation) ) + # groupby.apply doesn't drop the grouped-by column + expected = expected.drop("A", axis=1) # GH 39732 expected_index = MultiIndex.from_arrays([self.frame["A"], range(40)]) expected.index = expected_index @@ -147,6 +153,8 @@ def test_rolling_apply(self, raw): # reduction result = r.apply(lambda x: x.sum(), raw=raw) expected = g.apply(lambda x: x.rolling(4).apply(lambda y: y.sum(), raw=raw)) + # groupby.apply doesn't drop the grouped-by column + expected = expected.drop("A", axis=1) # GH 39732 expected_index = MultiIndex.from_arrays([self.frame["A"], range(40)]) expected.index = expected_index @@ -442,6 +450,8 @@ def test_groupby_rolling_empty_frame(self): # GH 36197 expected = DataFrame({"s1": []}) result = expected.groupby("s1").rolling(window=1).sum() + # GH 32262 + expected = expected.drop(columns="s1") # GH-38057 from_tuples gives empty object dtype, we now get float/int levels # expected.index = MultiIndex.from_tuples([], names=["s1", None]) expected.index = MultiIndex.from_product( @@ -451,6 +461,8 @@ def test_groupby_rolling_empty_frame(self): expected = DataFrame({"s1": [], "s2": []}) result = expected.groupby(["s1", 
"s2"]).rolling(window=1).sum() + # GH 32262 + expected = expected.drop(columns=["s1", "s2"]) expected.index = MultiIndex.from_product( [ Index([], dtype="float64"), @@ -503,6 +515,8 @@ def test_groupby_rolling_no_sort(self): columns=["foo", "bar"], index=MultiIndex.from_tuples([(2, 0), (1, 1)], names=["foo", None]), ) + # GH 32262 + expected = expected.drop(columns="foo") tm.assert_frame_equal(result, expected) def test_groupby_rolling_count_closed_on(self): @@ -553,6 +567,8 @@ def test_groupby_rolling_sem(self, func, kwargs): [("a", 0), ("a", 1), ("b", 2), ("b", 3), ("b", 4)], names=["a", None] ), ) + # GH 32262 + expected = expected.drop(columns="a") tm.assert_frame_equal(result, expected) @pytest.mark.parametrize( @@ -666,6 +682,19 @@ def test_groupby_rolling_object_doesnt_affect_groupby_apply(self): assert not g.mutated assert not g.grouper.mutated + @pytest.mark.parametrize( + "columns", [MultiIndex.from_tuples([("A", ""), ("B", "C")]), ["A", "B"]] + ) + def test_by_column_not_in_values(self, columns): + # GH 32262 + df = DataFrame([[1, 0]] * 20 + [[2, 0]] * 12 + [[3, 0]] * 8, columns=columns) + g = df.groupby("A") + original_obj = g.obj.copy(deep=True) + r = g.rolling(4) + result = r.sum() + assert "A" not in result.columns + tm.assert_frame_equal(g.obj, original_obj) + class TestExpanding: def setup_method(self): @@ -680,6 +709,8 @@ def test_expanding(self, f): result = getattr(r, f)() expected = g.apply(lambda x: getattr(x.expanding(), f)()) + # groupby.apply doesn't drop the grouped-by column + expected = expected.drop("A", axis=1) # GH 39732 expected_index = MultiIndex.from_arrays([self.frame["A"], range(40)]) expected.index = expected_index @@ -692,6 +723,8 @@ def test_expanding_ddof(self, f): result = getattr(r, f)(ddof=0) expected = g.apply(lambda x: getattr(x.expanding(), f)(ddof=0)) + # groupby.apply doesn't drop the grouped-by column + expected = expected.drop("A", axis=1) # GH 39732 expected_index = MultiIndex.from_arrays([self.frame["A"], 
range(40)]) expected.index = expected_index @@ -708,6 +741,8 @@ def test_expanding_quantile(self, interpolation): expected = g.apply( lambda x: x.expanding().quantile(0.4, interpolation=interpolation) ) + # groupby.apply doesn't drop the grouped-by column + expected = expected.drop("A", axis=1) # GH 39732 expected_index = MultiIndex.from_arrays([self.frame["A"], range(40)]) expected.index = expected_index @@ -748,6 +783,8 @@ def test_expanding_apply(self, raw): # reduction result = r.apply(lambda x: x.sum(), raw=raw) expected = g.apply(lambda x: x.expanding().apply(lambda y: y.sum(), raw=raw)) + # groupby.apply doesn't drop the grouped-by column + expected = expected.drop("A", axis=1) # GH 39732 expected_index = MultiIndex.from_arrays([self.frame["A"], range(40)]) expected.index = expected_index diff --git a/pandas/tests/window/test_rolling.py b/pandas/tests/window/test_rolling.py index 0af0bba5f5f8c..cfd09d0842418 100644 --- a/pandas/tests/window/test_rolling.py +++ b/pandas/tests/window/test_rolling.py @@ -719,6 +719,9 @@ def scaled_sum(*args): df = DataFrame(data={"X": range(5)}, index=[0, 0, 1, 1, 1]) expected = DataFrame(data={"X": [0.0, 0.5, 1.0, 1.5, 2.0]}, index=_index) + # GH 40341 + if "by" in grouping: + expected = expected.drop(columns="X", errors="ignore") result = df.groupby(**grouping).rolling(1).apply(scaled_sum, raw=raw, args=(2,)) tm.assert_frame_equal(result, expected)
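The ``rank`` fix in this diff (GH 32859) turns on the ``int64`` missing-value mask only when ``is_datetimelike`` is set, so the most negative ``int64`` value ranks normally for plain integers. A minimal pure-Python sketch of that idea — the function name and simplified logic here are ours, not the Cython internals, and only ``ties_method='first'`` with ``na_option='keep'`` is covered:

```python
# Illustration of the GH 32859 fix: the int64 sentinel -2**63 marks a
# missing value (NaT) only for datetime-like data; for plain integers it
# is a legitimate, smallest-possible value.
NAT = -(2**63)

def rank_1d(values, is_datetimelike=False, ascending=True):
    """Rank with ties_method='first' and na_option='keep' semantics only."""
    # The fix: build the missing-value mask from the NaT sentinel only when
    # the values are datetime-like, instead of for every int64 input.
    mask = [is_datetimelike and v == NAT for v in values]
    order = sorted(
        (i for i in range(len(values)) if not mask[i]),
        key=lambda i: values[i],
        reverse=not ascending,
    )
    ranks = [float("nan")] * len(values)  # masked entries stay NaN ("keep")
    for rank, i in enumerate(order, start=1):
        ranks[i] = float(rank)
    return ranks
```

With ``is_datetimelike=False``, ``rank_1d([-(2**63), 0, 5])`` ranks the minimum as ``1.0`` instead of treating it as missing; with ``is_datetimelike=True`` the same sentinel is kept as NaN, mirroring NaT handling in the new tests above.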