diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index d01956bb79e11..de030482ffc59 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -53,15 +53,6 @@ repos: plotting\.rst| 10min\.rst| basics\.rst| - categorical\.rst| - contributing\.rst| - contributing_docstring\.rst| - extending\.rst| - ecosystem\.rst| - comparison_with_sql\.rst| - install\.rst| - calculate_statistics\.rst| - combine_dataframes\.rst| v0\.| v1\.0\.| v1\.1\.[012]) diff --git a/doc/source/development/contributing.rst b/doc/source/development/contributing.rst index bb13fbed09677..97df86bdf92a6 100644 --- a/doc/source/development/contributing.rst +++ b/doc/source/development/contributing.rst @@ -16,11 +16,11 @@ All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome. If you are brand new to pandas or open-source development, we recommend going -through the `GitHub "issues" tab `_ -to find issues that interest you. There are a number of issues listed under `Docs -`_ -and `good first issue -`_ +through the ``GitHub "issues" tab ``_ +to find issues that interest you. There are a number of issues listed under ``Docs +``_ +and ``good first issue +``_ where you could start out. Once you've found an interesting issue, you can return here to get your development environment setup. @@ -31,13 +31,13 @@ comment letting others know they are working on an issue. While this is ok, you check each issue individually, and it's not possible to find the unassigned ones. For this reason, we implemented a workaround consisting of adding a comment with the exact -text `take`. When you do it, a GitHub action will automatically assign you the issue +text ``take``. When you do it, a GitHub action will automatically assign you the issue (this will take seconds, and may require refreshing the page to see it). By doing this, it's possible to filter the list of issues and find only the unassigned ones. So, a good way to find an issue to start contributing to pandas is to check the list of -`unassigned good first issues `_ -and assign yourself one you like by writing a comment with the exact text `take`. +``unassigned good first issues ``_ +and assign yourself one you like by writing a comment with the exact text ``take``. If for whatever reason you are not able to continue working with the issue, please try to unassign it, so other people know it's available again. You can check the list of @@ -45,8 +45,8 @@ assigned issues, since people may not be working in them anymore. If you want to that is assigned, feel free to kindly ask the current assignee if you can take it (please allow at least a week of inactivity before considering work in the issue discontinued). -Feel free to ask questions on the `mailing list -`_ or on `Gitter`_. +Feel free to ask questions on the ``mailing list +``_ or on ``Gitter``_. .. _contributing.bug_reports: @@ -55,8 +55,8 @@ Bug reports and enhancement requests Bug reports are an important part of making pandas more stable. Having a complete bug report will allow others to reproduce the bug and provide insight into fixing. See -`this stackoverflow article `_ and -`this blogpost `_ +``this stackoverflow article ``_ and +``this blogpost ``_ for tips on writing a good bug report. Trying the bug-producing code out on the *master* branch is often a worthwhile exercise @@ -66,8 +66,8 @@ to see if the issue has already been reported and/or fixed. Bug reports must: #. Include a short, self-contained Python snippet reproducing the problem. 
- You can format the code nicely by using `GitHub Flavored Markdown - `_:: + You can format the code nicely by using ``GitHub Flavored Markdown + ``_:: ```python >>> from pandas import DataFrame @@ -102,21 +102,21 @@ It can very quickly become overwhelming, but sticking to the guidelines below wi straightforward and mostly trouble free. As always, if you are having difficulties please feel free to ask for help. -The code is hosted on `GitHub `_. To -contribute you will need to sign up for a `free GitHub account -`_. We use `Git `_ for +The code is hosted on ``GitHub ``_. To +contribute you will need to sign up for a ``free GitHub account +``_. We use ``Git ``_ for version control to allow many people to work together on the project. Some great resources for learning Git: -* the `GitHub help pages `_. -* the `NumPy's documentation `_. -* Matthew Brett's `Pydagogue `_. +* the ``GitHub help pages ``_. +* the ``NumPy's documentation ``_. +* Matthew Brett's ``Pydagogue ``_. Getting started with Git ------------------------ -`GitHub has instructions `__ for installing git, +``GitHub has instructions ``__ for installing git, setting up your SSH key, and configuring git. All these steps need to be completed before you can work seamlessly between your local repository and GitHub. @@ -125,15 +125,15 @@ you can work seamlessly between your local repository and GitHub. Forking ------- -You will need your own fork to work on the code. Go to the `pandas project -page `_ and hit the ``Fork`` button. You will +You will need your own fork to work on the code. Go to the ``pandas project +page ``_ and hit the ``Fork`` button. You will want to clone your fork to your machine:: git clone https://github.com/your-user-name/pandas.git pandas-yourname cd pandas-yourname git remote add upstream https://github.com/pandas-dev/pandas.git -This creates the directory `pandas-yourname` and connects your repository to +This creates the directory ``pandas-yourname`` and connects your repository to the upstream (main project) *pandas* repository. Note that performing a shallow clone (with ``--depth==N``, for some ``N`` greater @@ -147,20 +147,20 @@ Creating a development environment To test out code changes, you'll need to build pandas from source, which requires a C compiler and Python environment. If you're making documentation -changes, you can skip to :ref:`contributing.documentation` but you won't be able +changes, you can skip to :ref:``contributing.documentation`` but you won't be able to build the documentation locally before pushing your changes. Using a Docker container ~~~~~~~~~~~~~~~~~~~~~~~~ -Instead of manually setting up a development environment, you can use `Docker -`_ to automatically create the environment with just several -commands. Pandas provides a `DockerFile` in the root directory to build a Docker image +Instead of manually setting up a development environment, you can use ``Docker +``_ to automatically create the environment with just several +commands. Pandas provides a ``DockerFile`` in the root directory to build a Docker image with a full pandas development environment. **Docker Commands** -Pass your GitHub username in the `DockerFile` to use your own fork:: +Pass your GitHub username in the ``DockerFile`` to use your own fork:: # Build the image pandas-yourname-env docker build --tag pandas-yourname-env . 
@@ -172,7 +172,7 @@ Even easier, you can integrate Docker with the following IDEs: **Visual Studio Code** You can use the DockerFile to launch a remote session with Visual Studio Code, -a popular free IDE, using the `.devcontainer.json` file. +a popular free IDE, using the ``.devcontainer.json`` file. See https://code.visualstudio.com/docs/remote/containers for details. **PyCharm (Professional)** @@ -197,8 +197,8 @@ platform you're using. **Windows** -You will need `Build Tools for Visual Studio 2017 -`_. +You will need ``Build Tools for Visual Studio 2017 +``_. .. warning:: You DO NOT need to install Visual Studio 2019. @@ -221,7 +221,7 @@ which compilers (and versions) are installed on your system:: # for Red Hat/RHEL/CentOS/Fedora: yum list installed | grep -i --color compiler -`GCC (GNU Compiler Collection) `_, is a widely used +``GCC (GNU Compiler Collection) ``_, is a widely used compiler, which supports C and a number of other languages. If GCC is listed as an installed compiler nothing more is required. If no C compiler is installed (or you wish to install a newer version) you can install a compiler @@ -236,7 +236,7 @@ For other Linux distributions, consult your favourite search engine for compiler installation instructions. Let us know if you have any difficulties by opening an issue or reaching out on -`Gitter`_. +``Gitter``_. .. _contributing.dev_python: @@ -246,10 +246,10 @@ Creating a Python environment Now that you have a C compiler, create an isolated pandas development environment: -* Install either `Anaconda `_ or `miniconda - `_ +* Install either ``Anaconda ``_ or ``miniconda + ``_ * Make sure your conda is up to date (``conda update conda``) -* Make sure that you have :ref:`cloned the repository ` +* Make sure that you have :ref:``cloned the repository `` * ``cd`` to the pandas source directory We'll now kick off a three-step process: @@ -289,7 +289,7 @@ To return to your root environment:: conda deactivate -See the full conda docs `here `__. +See the full conda docs ``here ``__. .. _contributing.pip: @@ -320,7 +320,7 @@ You'll need to have at least Python 3.6.1 installed on your system. **Unix**/**Mac OS with pyenv** -Consult the docs for setting up pyenv `here `__. +Consult the docs for setting up pyenv ``here ``__. .. code-block:: bash @@ -346,7 +346,7 @@ Consult the docs for setting up pyenv `here `__. Below is a brief overview on how to set-up a virtual environment with Powershell under Windows. For details please refer to the -`official virtualenv user guide `__ +``official virtualenv user guide ``__ Use an ENV_DIR of your choice. We'll use ~\\virtualenvs\\pandas-dev where '~' is the folder pointed to by either $env:USERPROFILE (Powershell) or @@ -395,7 +395,7 @@ can do:: When you want to update the feature branch with changes in master after you created the branch, check the section on -:ref:`updating a PR `. +:ref:``updating a PR ``. .. _contributing.documentation: @@ -418,9 +418,9 @@ About the pandas documentation -------------------------------- The documentation is written in **reStructuredText**, which is almost like writing -in plain English, and built using `Sphinx `__. The -Sphinx Documentation has an excellent `introduction to reST -`__. Review the Sphinx docs to perform more +in plain English, and built using ``Sphinx ``__. The +Sphinx Documentation has an excellent ``introduction to reST +``__. Review the Sphinx docs to perform more complex changes to the documentation as well. 
Some other important things to know about the docs: @@ -434,7 +434,7 @@ Some other important things to know about the docs: installation, etc). * The docstrings follow a pandas convention, based on the **Numpy Docstring - Standard**. Follow the :ref:`pandas docstring guide ` for detailed + Standard**. Follow the :ref:``pandas docstring guide `` for detailed instructions on how to write a correct docstring. .. toctree:: @@ -442,8 +442,8 @@ Some other important things to know about the docs: contributing_docstring.rst -* The tutorials make heavy use of the `ipython directive - `_ sphinx extension. +* The tutorials make heavy use of the ``ipython directive + ``_ sphinx extension. This directive lets you put code in the documentation which will be run during the doc build. For example:: @@ -490,7 +490,7 @@ Some other important things to know about the docs: The ``.rst`` files are used to automatically generate Markdown and HTML versions of the docs. For this reason, please do not edit ``CONTRIBUTING.md`` directly, but instead make any changes to ``doc/source/development/contributing.rst``. Then, to - generate ``CONTRIBUTING.md``, use `pandoc `_ + generate ``CONTRIBUTING.md``, use ``pandoc ``_ with the following command:: pandoc doc/source/development/contributing.rst -t markdown_github > CONTRIBUTING.md @@ -499,7 +499,7 @@ The utility script ``scripts/validate_docstrings.py`` can be used to get a csv summary of the API documentation. And also validate common errors in the docstring of a specific class, function or method. The summary also compares the list of methods documented in the files in ``doc/source/reference`` (which is used to generate -the `API Reference `_ page) +the ``API Reference ``_ page) and the actual public methods. This will identify methods documented in ``doc/source/reference`` that are not actually class methods, and existing methods that are not documented in ``doc/source/reference``. @@ -516,14 +516,14 @@ However, there is a script that checks a docstring (for example for the ``DataFr This script will indicate some formatting errors if present, and will also run and test the examples included in the docstring. -Check the :ref:`pandas docstring guide ` for a detailed guide +Check the :ref:``pandas docstring guide `` for a detailed guide on how to format the docstring. The examples in the docstring ('doctests') must be valid Python code, that in a deterministic way returns the presented output, and that can be copied and run by users. This can be checked with the script above, and is also tested on Travis. A failing doctest will be a blocker for merging a PR. -Check the :ref:`examples ` section in the docstring guide +Check the :ref:``examples `` section in the docstring guide for some tips and tricks to get the doctests passing. When doing a PR with a docstring update, it is good to post the @@ -537,7 +537,7 @@ Requirements ~~~~~~~~~~~~ First, you need to have a development environment to be able to build pandas -(see the docs on :ref:`creating a development environment above `). +(see the docs on :ref:``creating a development environment above ``). Building the documentation ~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -597,9 +597,9 @@ Building master branch documentation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When pull requests are merged into the pandas ``master`` branch, the main parts of -the documentation are also built by Travis-CI. These docs are then hosted `here -`__, see also -the :ref:`Continuous Integration ` section. +the documentation are also built by Travis-CI. 
These docs are then hosted ``here +``__, see also +the :ref:``Continuous Integration `` section. .. _contributing.code: @@ -613,7 +613,7 @@ Code standards -------------- Writing good code is not just about what you write. It is also about *how* you -write it. During :ref:`Continuous Integration ` testing, several +write it. During :ref:``Continuous Integration `` testing, several tools will be run to check your code for stylistic errors. Generating any warnings will cause the test to fail. Thus, good style is a requirement for submitting code to pandas. @@ -635,10 +635,10 @@ a lot of user code as a result, that is, we need it to be as *backwards compatib as possible to avoid mass breakages. In addition to ``./ci/code_checks.sh``, some extra checks are run by -``pre-commit`` - see :ref:`here ` for how to +``pre-commit`` - see :ref:``here `` for how to run them. -Additional standards are outlined on the :ref:`pandas code style guide ` +Additional standards are outlined on the :ref:``pandas code style guide `` Optional dependencies --------------------- @@ -652,22 +652,22 @@ All methods using an optional dependency should include a test asserting that an should be skipped if the library is present. All optional dependencies should be documented in -:ref:`install.optional_dependencies` and the minimum required version should be +:ref:``install.optional_dependencies`` and the minimum required version should be set in the ``pandas.compat._optional.VERSIONS`` dict. C (cpplint) ~~~~~~~~~~~ -pandas uses the `Google `_ +pandas uses the ``Google ``_ standard. Google provides an open source style checker called ``cpplint``, but we -use a fork of it that can be found `here `__. +use a fork of it that can be found ``here ``__. Here are *some* of the more common ``cpplint`` issues: * we restrict line-length to 80 characters to promote readability * every header file must include a header guard to avoid name collisions if re-included -:ref:`Continuous Integration ` will run the -`cpplint `_ tool +:ref:``Continuous Integration `` will run the +``cpplint ``_ tool and report any stylistic errors in your code. Therefore, it is helpful before submitting code to run the check yourself:: @@ -678,8 +678,8 @@ You can also run this command on an entire directory if necessary:: cpplint --extensions=c,h --headers=h --filter=-readability/casting,-runtime/int,-build/include_subdir --recursive modified-c-directory To make your commits compliant with this standard, you can install the -`ClangFormat `_ tool, which can be -downloaded `here `__. To configure, in your home directory, +``ClangFormat ``_ tool, which can be +downloaded ``here ``__. To configure, in your home directory, run the following command:: clang-format style=google -dump-config > .clang-format @@ -709,12 +709,12 @@ fixes manually. Python (PEP8 / black) ~~~~~~~~~~~~~~~~~~~~~ -pandas follows the `PEP8 `_ standard -and uses `Black `_ and -`Flake8 `_ to ensure a consistent code +pandas follows the ``PEP8 ``_ standard +and uses ``Black ``_ and +``Flake8 ``_ to ensure a consistent code format throughout the project. -:ref:`Continuous Integration ` will run those tools and +:ref:``Continuous Integration `` will run those tools and report any stylistic errors in your code. Therefore, it is helpful before submitting code to run the check yourself:: @@ -728,7 +728,7 @@ You should use a ``black`` version 20.8b1 as previous versions are not compatibl with the pandas codebase. 
If you wish to run these checks automatically, we encourage you to use -:ref:`pre-commits ` instead. +:ref:``pre-commits `` instead. One caveat about ``git diff upstream/master -u -- "*.py" | flake8 --diff``: this command will catch any stylistic errors in your changes specifically, but @@ -746,7 +746,7 @@ run this slightly modified command:: git diff upstream/master --name-only -- "*.py" | xargs flake8 Windows does not support the ``xargs`` command (unless installed for example -via the `MinGW `__ toolchain), but one can imitate the +via the ``MinGW ``__ toolchain), but one can imitate the behaviour as follows:: for /f %i in ('git diff upstream/master --name-only -- "*.py"') do flake8 %i @@ -760,10 +760,10 @@ Note that these commands can be run analogously with ``black``. Import formatting ~~~~~~~~~~~~~~~~~ -pandas uses `isort `__ to standardise import +pandas uses ``isort ``__ to standardise import formatting across the codebase. -A guide to import layout as per pep8 can be found `here `__. +A guide to import layout as per pep8 can be found ``here ``__. A summary of our current import sections ( in order ): @@ -778,13 +778,13 @@ A summary of our current import sections ( in order ): Imports are alphabetically sorted within these sections. -As part of :ref:`Continuous Integration ` checks we run:: +As part of :ref:``Continuous Integration `` checks we run:: isort --check-only pandas -to check that imports are correctly formatted as per the `setup.cfg`. +to check that imports are correctly formatted as per the ``setup.cfg``. -If you see output like the below in :ref:`Continuous Integration ` checks: +If you see output like the below in :ref:``Continuous Integration `` checks: .. code-block:: shell @@ -799,13 +799,13 @@ You should run:: to automatically format imports correctly. This will modify your local copy of the files. -Alternatively, you can run a command similar to what was suggested for ``black`` and ``flake8`` :ref:`right above `:: +Alternatively, you can run a command similar to what was suggested for ``black`` and ``flake8`` :ref:``right above ``:: git diff upstream/master --name-only -- "*.py" | xargs -r isort Where similar caveats apply if you are on OSX or Windows. -You can then verify the changes look ok, then git :ref:`commit ` and :ref:`push `. +You can then verify the changes look ok, then git :ref:``commit `` and :ref:``push ``. .. _contributing.pre-commit: @@ -813,7 +813,7 @@ Pre-commit ~~~~~~~~~~ You can run many of these styling checks manually as we have described above. However, -we encourage you to use `pre-commit hooks `_ instead +we encourage you to use ``pre-commit hooks ``_ instead to automatically run ``black``, ``flake8``, ``isort`` when you make a git commit. This can be done by installing ``pre-commit``:: @@ -880,14 +880,14 @@ You'll also need to 1. Write a new test that asserts a warning is issued when calling with the deprecated argument 2. Update all of pandas existing tests and code to use the new argument -See :ref:`contributing.warnings` for more. +See :ref:``contributing.warnings`` for more. .. _contributing.type_hints: Type hints ---------- -pandas strongly encourages the use of :pep:`484` style type hints. New development should contain type hints and pull requests to annotate existing code are accepted as well! +pandas strongly encourages the use of :pep:``484`` style type hints. New development should contain type hints and pull requests to annotate existing code are accepted as well! 
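As an aside to the type-hints paragraph above (not part of the diff itself), a minimal sketch of the :pep:`484` style being encouraged might look like the following; the function is hypothetical and exists only to show annotated parameters and return types:

.. code-block:: python

    from typing import Optional

    def clip_value(value: float, upper: Optional[float] = None) -> float:
        """Return ``value`` capped at ``upper`` when a bound is given."""
        # Hypothetical helper, not a pandas API: illustrates annotating an
        # optional parameter and the return type.
        if upper is None:
            return value
        return min(value, upper)
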
Style guidelines ~~~~~~~~~~~~~~~~ @@ -920,7 +920,7 @@ You should write maybe_primes: List[Optional[int]] = [] -In some cases in the code base classes may define class variables that shadow builtins. This causes an issue as described in `Mypy 1775 `_. The defensive solution here is to create an unambiguous alias of the builtin and use that without your annotation. For example, if you come across a definition like +In some cases in the code base classes may define class variables that shadow builtins. This causes an issue as described in ``Mypy 1775 ``_. The defensive solution here is to create an unambiguous alias of the builtin and use that without your annotation. For example, if you come across a definition like .. code-block:: python @@ -952,7 +952,7 @@ In some cases you may be tempted to use ``cast`` from the typing module when you obj = cast(str, obj) # Mypy complains without this! return obj.upper() -The limitation here is that while a human can reasonably understand that ``is_number`` would catch the ``int`` and ``float`` types mypy cannot make that same inference just yet (see `mypy #5206 `_. While the above works, the use of ``cast`` is **strongly discouraged**. Where applicable a refactor of the code to appease static analysis is preferable +The limitation here is that while a human can reasonably understand that ``is_number`` would catch the ``int`` and ``float`` types mypy cannot make that same inference just yet (see ``mypy #5206 ``_. While the above works, the use of ``cast`` is **strongly discouraged**. Where applicable a refactor of the code to appease static analysis is preferable .. code-block:: python @@ -968,7 +968,7 @@ With custom types and inference this is not always possible so exceptions are ma pandas-specific types ~~~~~~~~~~~~~~~~~~~~~ -Commonly used types specific to pandas will appear in `pandas._typing `_ and you should use these where applicable. This module is private for now but ultimately this should be exposed to third party libraries who want to implement type checking against pandas. +Commonly used types specific to pandas will appear in ``pandas._typing ``_ and you should use these where applicable. This module is private for now but ultimately this should be exposed to third party libraries who want to implement type checking against pandas. For example, quite a few functions in pandas accept a ``dtype`` argument. This can be expressed as a string like ``"object"``, a ``numpy.dtype`` like ``np.int64`` or even a pandas ``ExtensionDtype`` like ``pd.CategoricalDtype``. Rather than burden the user with having to constantly annotate all of those options, this can simply be imported and reused from the pandas._typing module @@ -979,12 +979,12 @@ For example, quite a few functions in pandas accept a ``dtype`` argument. This c def as_type(dtype: Dtype) -> ...: ... -This module will ultimately house types for repeatedly used concepts like "path-like", "array-like", "numeric", etc... and can also hold aliases for commonly appearing parameters like `axis`. Development of this module is active so be sure to refer to the source for the most up to date list of available types. +This module will ultimately house types for repeatedly used concepts like "path-like", "array-like", "numeric", etc... and can also hold aliases for commonly appearing parameters like ``axis``. Development of this module is active so be sure to refer to the source for the most up to date list of available types. 
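To illustrate the ``Dtype`` alias described in the paragraph above (again, outside the diff), a minimal sketch could reuse it in a hypothetical helper; note that ``pandas._typing`` is private, as the text says, so this is only a sketch of the intended usage:

.. code-block:: python

    import numpy as np
    import pandas as pd
    from pandas._typing import Dtype

    def coerce_column(series: pd.Series, dtype: Dtype) -> pd.Series:
        # ``Dtype`` covers strings like "object", numpy dtypes such as
        # np.int64, and pandas extension dtypes like pd.CategoricalDtype.
        return series.astype(dtype)

    coerce_column(pd.Series([1, 2, 3]), np.float64)
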
Validating type hints ~~~~~~~~~~~~~~~~~~~~~ -pandas uses `mypy `_ to statically analyze the code base and type hints. After making any change you can ensure your type hints are correct by running +pandas uses ``mypy ``_ to statically analyze the code base and type hints. After making any change you can ensure your type hints are correct by running .. code-block:: shell @@ -995,13 +995,13 @@ pandas uses `mypy `_ to statically analyze the code base a Testing with continuous integration ----------------------------------- -The pandas test suite will run automatically on `Travis-CI `__ and -`Azure Pipelines `__ +The pandas test suite will run automatically on ``Travis-CI ``__ and +``Azure Pipelines ``__ continuous integration services, once your pull request is submitted. However, if you wish to run the test suite on a branch prior to submitting the pull request, then the continuous integration services need to be hooked to your GitHub repository. Instructions are here -for `Travis-CI `__ and -`Azure Pipelines `__. +for ``Travis-CI ``__ and +``Azure Pipelines ``__. A pull-request will be considered for merging when you have an all 'green' build. If any tests are failing, then you will get a red 'X', where you can click through to see the individual failed tests. @@ -1013,7 +1013,7 @@ This is an example of a green build. Each time you push to *your* fork, a *new* run of the tests will be triggered on the CI. You can enable the auto-cancel feature, which removes any non-currently-running tests for that same pull-request, for - `Travis-CI here `__. + ``Travis-CI here ``__. .. _contributing.tdd: @@ -1022,7 +1022,7 @@ Test-driven development/code writing ------------------------------------ pandas is serious about testing and strongly encourages contributors to embrace -`test-driven development (TDD) `_. +``test-driven development (TDD) ``_. This development process "relies on the repetition of a very short development cycle: first the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test." @@ -1033,10 +1033,10 @@ use cases and writing corresponding tests. Adding tests is one of the most common requests after code is pushed to pandas. Therefore, it is worth getting in the habit of writing tests ahead of time so this is never an issue. -Like many packages, pandas uses `pytest -`_ and the convenient -extensions in `numpy.testing -`_. +Like many packages, pandas uses ``pytest +``_ and the convenient +extensions in ``numpy.testing +``_. .. note:: @@ -1048,8 +1048,8 @@ Writing tests All tests should go into the ``tests`` subdirectory of the specific package. This folder contains many current examples of tests, and we suggest looking to these for inspiration. If your test requires working with files or -network connectivity, there is more information on the `testing page -`_ of the wiki. +network connectivity, there is more information on the ``testing page +``_ of the wiki. 
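For instance, a small test written in the style the guide recommends might look like the sketch below (the test itself is invented for illustration and does not correspond to an existing pandas test):

.. code-block:: python

    import pandas as pd
    import pandas._testing as tm

    def test_rename_preserves_values():
        ser = pd.Series([1, 2, 3], name="a")
        result = ser.rename("b")
        expected = pd.Series([1, 2, 3], name="b")
        tm.assert_series_equal(result, expected)
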
The ``pandas._testing`` module has many special ``assert`` functions that make it easier to make statements about whether Series or DataFrame objects are @@ -1087,7 +1087,7 @@ pandas existing test structure is *mostly* class-based, meaning that you will ty class TestReallyCoolFeature: pass -Going forward, we are moving to a more *functional* style using the `pytest `__ framework, which offers a richer testing +Going forward, we are moving to a more *functional* style using the ``pytest ``__ framework, which offers a richer testing framework that will facilitate testing and developing. Thus, instead of writing test classes, we will write test functions like this: .. code-block:: python @@ -1195,9 +1195,9 @@ try to find a failing input. Even better, no matter how many random examples it tries, Hypothesis always reports a single minimal counterexample to your assertions - often an example that you would never have thought to test. -See `Getting Started with Hypothesis `_ -for more of an introduction, then `refer to the Hypothesis documentation -for details `_. +See ``Getting Started with Hypothesis ``_ +for more of an introduction, then ``refer to the Hypothesis documentation +for details ``_. .. code-block:: python @@ -1265,7 +1265,7 @@ If the test generates a warning of class ``category`` whose message starts with ``msg``, the warning will be ignored and the test will pass. If you need finer-grained control, you can use Python's usual -`warnings module `__ +``warnings module ``__ to control whether a warning is ignored / raised at different places within a single test. @@ -1300,9 +1300,9 @@ Or with one of the following constructs:: pytest pandas/tests/[test-module].py::[TestClass] pytest pandas/tests/[test-module].py::[TestClass]::[test_method] -Using `pytest-xdist `_, one can +Using ``pytest-xdist ``_, one can speed up local testing on multicore machines. To use this feature, you will -need to install `pytest-xdist` via:: +need to install ``pytest-xdist`` via:: pip install pytest-xdist @@ -1320,7 +1320,7 @@ On Windows, one can type:: This can significantly reduce the time it takes to locally run tests before submitting a pull request. -For more, see the `pytest `_ documentation. +For more, see the ``pytest ``_ documentation. Furthermore one can run @@ -1335,14 +1335,14 @@ Running the performance test suite Performance matters and it is worth considering whether your code has introduced performance regressions. pandas is in the process of migrating to -`asv benchmarks `__ +``asv benchmarks ``__ to enable easy monitoring of the performance of critical pandas operations. These benchmarks are all found in the ``pandas/asv_bench`` directory, and the -test results can be found `here `__. +test results can be found ``here ``__. To use all features of asv, you will need either ``conda`` or -``virtualenv``. For more details please check the `asv installation -webpage `_. +``virtualenv``. For more details please check the ``asv installation +webpage ``_. To install asv:: @@ -1397,7 +1397,7 @@ This will display stderr from the benchmarks, and use your local ``python`` that comes from your ``$PATH``. Information on how to write a benchmark and how to use asv can be found in the -`asv documentation `_. +``asv documentation ``_. Documenting your code --------------------- @@ -1405,12 +1405,12 @@ Documenting your code Changes should be reflected in the release notes located in ``doc/source/whatsnew/vx.y.z.rst``. This file contains an ongoing change log for each release. 
Add an entry to this file to document your fix, enhancement or (unavoidable) breaking change. Make sure to include the -GitHub issue number when adding your entry (using ``:issue:`1234``` where ``1234`` is the +GitHub issue number when adding your entry (using ``:issue:``1234``` where ``1234`` is the issue/pull request number). If your code is an enhancement, it is most likely necessary to add usage examples to the existing documentation. This can be done following the section -regarding documentation :ref:`above `. +regarding documentation :ref:``above ``. Further, to let users know when this feature was added, the ``versionadded`` directive is used. The sphinx syntax for that is: @@ -1420,8 +1420,8 @@ directive is used. The sphinx syntax for that is: This will put the text *New in version 1.1.0* wherever you put the sphinx directive. This should also be put in the docstring when adding a new function -or method (`example `__) -or a new keyword argument (`example `__). +or method (``example ``__) +or a new keyword argument (``example ``__). Contributing your changes to pandas ===================================== @@ -1465,7 +1465,7 @@ The following defines how a commit message should be structured. Please referen relevant GitHub issues in your commit message using GH1234 or #1234. Either style is fine, but the former is generally preferred: -* a subject line with `< 80` chars. +* a subject line with ``< 80`` chars. * One blank line. * Optionally, a commit message body. @@ -1545,7 +1545,7 @@ automatically updated. Pushing them to GitHub again is done by:: git push origin shiny-new-feature This will automatically update your pull request with the latest code and restart the -:ref:`Continuous Integration ` tests. +:ref:``Continuous Integration `` tests. Another reason you might need to update your pull request is to solve conflicts with changes that have been merged into the master branch since you opened your @@ -1568,7 +1568,7 @@ added, you can run ``git commit`` to save those fixes. If you have uncommitted changes at the moment you want to update the branch with master, you will need to ``stash`` them prior to updating (see the -`stash docs `__). +``stash docs ``__). This will effectively store your changes and they can be reapplied after updating. After the feature branch has been update locally, you can now update your pull @@ -1604,7 +1604,7 @@ The branch will still exist on GitHub, so to delete it there do:: Tips for a successful pull request ================================== -If you have made it to the `Review your code`_ phase, one of the core contributors may +If you have made it to the ``Review your code``_ phase, one of the core contributors may take a look. Please note however that a handful of people are responsible for reviewing all of the contributions, which can often lead to bottlenecks. @@ -1614,4 +1614,4 @@ To improve the chances of your pull request being reviewed, you should: - **Ensure you have appropriate tests**. These should be the first part of any PR - **Keep your pull requests as simple as possible**. Larger PRs take longer to review - **Ensure that CI is in a green state**. 
Reviewers may not even look otherwise -- **Keep** `Updating your pull request`_, either by request or every few days +- **Keep** ``Updating your pull request``_, either by request or every few days diff --git a/doc/source/development/contributing_docstring.rst b/doc/source/development/contributing_docstring.rst index 33f30e1d97512..59f54dda7a46d 100644 --- a/doc/source/development/contributing_docstring.rst +++ b/doc/source/development/contributing_docstring.rst @@ -14,7 +14,7 @@ function or method, so programmers can understand what it does without having to read the details of the implementation. Also, it is a common practice to generate online (html) documentation -automatically from docstrings. `Sphinx `_ serves +automatically from docstrings. ``Sphinx ``_ serves this purpose. The next example gives an idea of what a docstring looks like: @@ -25,7 +25,7 @@ The next example gives an idea of what a docstring looks like: """ Add up two integer numbers. - This function simply wraps the `+` operator, and does not + This function simply wraps the ``+`` operator, and does not do anything interesting, except for illustrating what the docstring of a very simple function looks like. @@ -39,7 +39,7 @@ The next example gives an idea of what a docstring looks like: Returns ------- int - The sum of `num1` and `num2`. + The sum of ``num1`` and ``num2``. See Also -------- @@ -60,15 +60,15 @@ Some standards regarding docstrings exist, which make them easier to read, and a be easily exported to other formats such as html or pdf. The first conventions every Python docstring should follow are defined in -`PEP-257 `_. +``PEP-257 ``_. As PEP-257 is quite broad, other more specific standards also exist. In the case of pandas, the numpy docstring convention is followed. These conventions are explained in this document: -* `numpydoc docstring guide `_ - (which is based in the original `Guide to NumPy/SciPy documentation - `_) +* ``numpydoc docstring guide ``_ + (which is based in the original ``Guide to NumPy/SciPy documentation + ``_) numpydoc is a Sphinx extension to support the numpy docstring convention. @@ -76,12 +76,12 @@ The standard uses reStructuredText (reST). reStructuredText is a markup language that allows encoding styles in plain text files. Documentation about reStructuredText can be found in: -* `Sphinx reStructuredText primer `_ -* `Quick reStructuredText reference `_ -* `Full reStructuredText specification `_ +* ``Sphinx reStructuredText primer ``_ +* ``Quick reStructuredText reference ``_ +* ``Full reStructuredText specification ``_ pandas has some helpers for sharing docstrings between related classes, see -:ref:`docstring.sharing`. +:ref:``docstring.sharing``. The rest of this document will summarize all the above guidelines, and will provide additional conventions specific to the pandas project. @@ -108,16 +108,16 @@ backticks. The following are considered inline code: * The name of a parameter * Python code, a module, function, built-in, type, literal... (e.g. ``os``, ``list``, ``numpy.abs``, ``datetime.date``, ``True``) -* A pandas class (in the form ``:class:`pandas.Series```) -* A pandas method (in the form ``:meth:`pandas.Series.sum```) -* A pandas function (in the form ``:func:`pandas.to_datetime```) +* A pandas class (in the form ``:class:``pandas.Series```) +* A pandas method (in the form ``:meth:``pandas.Series.sum```) +* A pandas function (in the form ``:func:``pandas.to_datetime```) .. 
note:: To display only the last component of the linked class, method or - function, prefix it with ``~``. For example, ``:class:`~pandas.Series``` + function, prefix it with ``~``. For example, ``:class:``~pandas.Series``` will link to ``pandas.Series`` but only display the last part, ``Series`` - as the link text. See `Sphinx cross-referencing syntax - `_ + as the link text. See ``Sphinx cross-referencing syntax + ``_ for details. **Good:** @@ -126,9 +126,9 @@ backticks. The following are considered inline code: def add_values(arr): """ - Add the values in `arr`. + Add the values in ``arr``. - This is equivalent to Python `sum` of :meth:`pandas.Series.sum`. + This is equivalent to Python ``sum`` of :meth:``pandas.Series.sum``. Some sections are omitted here for simplicity. """ @@ -144,13 +144,13 @@ backticks. The following are considered inline code: With several mistakes in the docstring. - It has a blank like after the signature `def func():`. + It has a blank like after the signature ``def func():``. The text 'Some function' should go in the line after the opening quotes of the docstring, not in the same line. There is a blank line between the docstring and the first line - of code `foo = 1`. + of code ``foo = 1``. The closing quotes should be in the next line, not in this one.""" @@ -269,11 +269,11 @@ after, and not between the line with the word "Parameters" and the one with the hyphens. After the title, each parameter in the signature must be documented, including -`*args` and `**kwargs`, but not `self`. +``*args`` and ``**kwargs``, but not ``self``. The parameters are defined by their name, followed by a space, a colon, another space, and the type (or types). Note that the space between the name and the -colon is important. Types are not defined for `*args` and `**kwargs`, but must +colon is important. Types are not defined for ``*args`` and ``**kwargs``, but must be defined for all other parameters. After the parameter definition, it is required to have a line with the parameter description, which is indented, and can have multiple lines. The description must start with a capital letter, and @@ -285,13 +285,13 @@ comma at the end of the type. The exact form of the type in this case will be argument means, which can be added after a comma "int, default -1, meaning all cpus". -In cases where the default value is `None`, meaning that the value will not be +In cases where the default value is ``None``, meaning that the value will not be used. Instead of "str, default None", it is preferred to write "str, optional". -When `None` is a value being used, we will keep the form "str, default None". -For example, in `df.to_csv(compression=None)`, `None` is not a value being used, +When ``None`` is a value being used, we will keep the form "str, default None". +For example, in ``df.to_csv(compression=None)``, ``None`` is not a value being used, but means that compression is optional, and no compression is being used if not -provided. In this case we will use `str, optional`. Only in cases like -`func(value=None)` and `None` is being used in the same way as `0` or `foo` +provided. In this case we will use ``str, optional``. Only in cases like +``func(value=None)`` and ``None`` is being used in the same way as ``0`` or ``foo`` would be used, then we will specify "str, int or None, default None". **Good:** @@ -331,13 +331,13 @@ would be used, then we will specify "str, int or None, default None". specified kind. Note the blank line between the parameters title and the first - parameter. 
Also, note that after the name of the parameter `kind` + parameter. Also, note that after the name of the parameter ``kind`` and before the colon, a space is missing. Also, note that the parameter descriptions do not start with a capital letter, and do not finish with a dot. - Finally, the `**kwargs` parameter is missing. + Finally, the ``**kwargs`` parameter is missing. Parameters ---------- @@ -361,9 +361,9 @@ boolean, etc): * str * bool -For complex types, define the subtypes. For `dict` and `tuple`, as more than +For complex types, define the subtypes. For ``dict`` and ``tuple``, as more than one type is present, we use the brackets to help read the type (curly brackets -for `dict` and normal brackets for `tuple`): +for ``dict`` and normal brackets for ``tuple``): * list of int * dict of {str : int} @@ -512,8 +512,8 @@ This section is used to let users know about pandas functionality related to the one being documented. In rare cases, if no related methods or functions can be found at all, this section can be skipped. -An obvious example would be the `head()` and `tail()` methods. As `tail()` does -the equivalent as `head()` but at the end of the `Series` or `DataFrame` +An obvious example would be the ``head()`` and ``tail()`` methods. As ``tail()`` does +the equivalent as ``head()`` but at the end of the ``Series`` or ``DataFrame`` instead of at the beginning, it is good to let the users know about it. To give an intuition on what can be considered related, here there are some @@ -608,8 +608,8 @@ Examples in docstrings, besides illustrating the usage of the function or method, must be valid Python code, that returns the given output in a deterministic way, and that can be copied and run by users. -Examples are presented as a session in the Python terminal. `>>>` is used to -present code. `...` is used for code continuing from the previous line. +Examples are presented as a session in the Python terminal. ``>>>`` is used to +present code. ``...`` is used for code continuing from the previous line. Output is presented immediately after the last line of code generating the output (no blank lines in between). Comments describing the examples can be added with blank lines before and after them. @@ -664,7 +664,7 @@ A simple example could be: 4 Falcon dtype: object - With the `n` parameter, we can change the number of returned rows: + With the ``n`` parameter, we can change the number of returned rows: >>> s.head(n=3) 0 Ant @@ -692,7 +692,7 @@ shown: import pandas as pd Any other module used in the examples must be explicitly imported, one per line (as -recommended in :pep:`8#imports`) +recommended in :pep:``8#imports``) and avoiding aliases. Avoid excessive imports, but if needed, imports from the standard library go first, followed by third-party libraries (like matplotlib). @@ -742,7 +742,7 @@ positional arguments ``head(3)``. def fillna(self, value): """ - Replace missing values by `value`. + Replace missing values by ``value``. Examples -------- @@ -771,7 +771,7 @@ positional arguments ``head(3)``. def contains(self, pattern, case_sensitive=True, na=numpy.nan): """ - Return whether each value contains `pattern`. + Return whether each value contains ``pattern``. In this case, we are illustrating how to use sections, even if the example is simple enough and does not require them. @@ -788,8 +788,8 @@ positional arguments ``head(3)``. 
**Case sensitivity** - With `case_sensitive` set to `False` we can match `a` with both - `a` and `A`: + With ``case_sensitive`` set to ``False`` we can match ``a`` with both + ``a`` and ``A``: >>> s.contains(pattern='a', case_sensitive=False) 0 True @@ -800,7 +800,7 @@ positional arguments ``head(3)``. **Missing values** - We can fill missing values in the output using the `na` parameter: + We can fill missing values in the output using the ``na`` parameter: >>> s.contains(pattern='a', na=False) 0 False @@ -824,9 +824,9 @@ positional arguments ``head(3)``. Try to use meaningful data, when it makes the example easier to understand. - Try to avoid positional arguments like in `df.method(1)`. They + Try to avoid positional arguments like in ``df.method(1)``. They can be all right if previously defined with a meaningful name, - like in `present_value(interest_rate)`, but avoid them otherwise. + like in ``present_value(interest_rate)``, but avoid them otherwise. When presenting the behavior with different parameters, do not place all the calls one next to the other. Instead, add a short sentence @@ -914,7 +914,7 @@ plot will be generated automatically when building the documentation. class Series: def plot(self): """ - Generate a plot with the `Series` data. + Generate a plot with the ``Series`` data. Examples -------- diff --git a/doc/source/development/extending.rst b/doc/source/development/extending.rst index 46c2cbbe39b34..21491602bc8d2 100644 --- a/doc/source/development/extending.rst +++ b/doc/source/development/extending.rst @@ -16,9 +16,9 @@ Registering custom accessors ---------------------------- Libraries can use the decorators -:func:`pandas.api.extensions.register_dataframe_accessor`, -:func:`pandas.api.extensions.register_series_accessor`, and -:func:`pandas.api.extensions.register_index_accessor`, to add additional +:func:``pandas.api.extensions.register_dataframe_accessor``, +:func:``pandas.api.extensions.register_series_accessor``, and +:func:``pandas.api.extensions.register_index_accessor``, to add additional "namespaces" to pandas objects. All of these follow a similar convention: you decorate a class, providing the name of attribute to add. The class's ``__init__`` method gets the object being decorated. For example: @@ -59,9 +59,9 @@ Now users can access your methods using the ``geo`` namespace: This can be a convenient way to extend pandas objects without subclassing them. If you write a custom accessor, make a pull request adding it to our -:ref:`ecosystem` page. +:ref:``ecosystem`` page. -We highly recommend validating the data in your accessor's `__init__`. +We highly recommend validating the data in your accessor's ``__init__``. In our ``GeoAccessor``, we validate that the data contains the expected columns, raising an ``AttributeError`` when the validation fails. For a ``Series`` accessor, you should validate the ``dtype`` if the accessor @@ -75,7 +75,7 @@ Extension types .. warning:: - The :class:`pandas.api.extensions.ExtensionDtype` and :class:`pandas.api.extensions.ExtensionArray` APIs are new and + The :class:``pandas.api.extensions.ExtensionDtype`` and :class:``pandas.api.extensions.ExtensionArray`` APIs are new and experimental. They may change between versions without warning. pandas defines an interface for implementing data types and arrays that *extend* @@ -85,35 +85,35 @@ timezone). Libraries can define a custom array and data type. When pandas encounters these objects, they will be handled properly (i.e. not converted to an ndarray of -objects). 
Many methods like :func:`pandas.isna` will dispatch to the extension +objects). Many methods like :func:``pandas.isna`` will dispatch to the extension type's implementation. If you're building a library that implements the interface, please publicize it -on :ref:`ecosystem.extensions`. +on :ref:``ecosystem.extensions``. The interface consists of two classes. -:class:`~pandas.api.extensions.ExtensionDtype` +:class:``~pandas.api.extensions.ExtensionDtype`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -A :class:`pandas.api.extensions.ExtensionDtype` is similar to a ``numpy.dtype`` object. It describes the +A :class:``pandas.api.extensions.ExtensionDtype`` is similar to a ``numpy.dtype`` object. It describes the data type. Implementors are responsible for a few unique items like the name. One particularly important item is the ``type`` property. This should be the class that is the scalar type for your data. For example, if you were writing an extension array for IP Address data, this might be ``ipaddress.IPv4Address``. -See the `extension dtype source`_ for interface definition. +See the ``extension dtype source``_ for interface definition. .. versionadded:: 0.24.0 -:class:`pandas.api.extension.ExtensionDtype` can be registered to pandas to allow creation via a string dtype name. +:class:``pandas.api.extension.ExtensionDtype`` can be registered to pandas to allow creation via a string dtype name. This allows one to instantiate ``Series`` and ``.astype()`` with a registered string name, for example ``'category'`` is a registered string accessor for the ``CategoricalDtype``. -See the `extension dtype dtypes`_ for more on how to register dtypes. +See the ``extension dtype dtypes``_ for more on how to register dtypes. -:class:`~pandas.api.extensions.ExtensionArray` +:class:``~pandas.api.extensions.ExtensionArray`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This class provides all the array-like functionality. ExtensionArrays are @@ -132,17 +132,17 @@ be backed by a NumPy structured array with two fields, one for the lower 64 bits and one for the upper 64 bits. Or they may be backed by some other storage type, like Python lists. -See the `extension array source`_ for the interface definition. The docstrings +See the ``extension array source``_ for the interface definition. The docstrings and comments contain guidance for properly implementing the interface. .. _extending.extension.operator: -:class:`~pandas.api.extensions.ExtensionArray` operator support +:class:``~pandas.api.extensions.ExtensionArray`` operator support ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. versionadded:: 0.24.0 -By default, there are no operators defined for the class :class:`~pandas.api.extensions.ExtensionArray`. +By default, there are no operators defined for the class :class:``~pandas.api.extensions.ExtensionArray``. There are two approaches for providing operator support for your ExtensionArray: 1. Define each of the operators on your ``ExtensionArray`` subclass. @@ -165,11 +165,11 @@ of the class ``MyExtensionElement``, then if the operators are defined for ``MyExtensionElement``, the second approach will automatically define the operators for ``MyExtensionArray``. -A mixin class, :class:`~pandas.api.extensions.ExtensionScalarOpsMixin` supports this second +A mixin class, :class:``~pandas.api.extensions.ExtensionScalarOpsMixin`` supports this second approach. 
If developing an ``ExtensionArray`` subclass, for example ``MyExtensionArray``, can simply include ``ExtensionScalarOpsMixin`` as a parent class of ``MyExtensionArray``, -and then call the methods :meth:`~MyExtensionArray._add_arithmetic_ops` and/or -:meth:`~MyExtensionArray._add_comparison_ops` to hook the operators into +and then call the methods :meth:``~MyExtensionArray._add_arithmetic_ops`` and/or +:meth:``~MyExtensionArray._add_comparison_ops`` to hook the operators into your ``MyExtensionArray`` class, as follows: .. code-block:: python @@ -211,17 +211,17 @@ will NumPy universal functions ^^^^^^^^^^^^^^^^^^^^^^^^^ -:class:`Series` implements ``__array_ufunc__``. As part of the implementation, -pandas unboxes the ``ExtensionArray`` from the :class:`Series`, applies the ufunc, +:class:``Series`` implements ``__array_ufunc__``. As part of the implementation, +pandas unboxes the ``ExtensionArray`` from the :class:``Series``, applies the ufunc, and re-boxes it if necessary. If applicable, we highly recommend that you implement ``__array_ufunc__`` in your extension array to avoid coercion to an ndarray. See -`the numpy documentation `__ +``the numpy documentation ``__ for an example. As part of your implementation, we require that you defer to pandas when a pandas -container (:class:`Series`, :class:`DataFrame`, :class:`Index`) is detected in ``inputs``. +container (:class:``Series``, :class:``DataFrame``, :class:``Index``) is detected in ``inputs``. If any of those is present, you should return ``NotImplemented``. pandas will take care of unboxing the array from the container and re-calling the ufunc with the unwrapped input. @@ -286,7 +286,7 @@ appropriate pandas ``ExtensionArray`` for this dtype and the passed values: def __from_arrow__(self, array: pyarrow.Array/ChunkedArray) -> ExtensionArray: ... -See more in the `Arrow documentation `__. +See more in the ``Arrow documentation ``__. Those methods have been implemented for the nullable integer and string extension dtypes included in pandas, and ensure roundtrip to pyarrow and the Parquet file format. @@ -302,13 +302,13 @@ Subclassing pandas data structures .. warning:: There are some easier alternatives before considering subclassing ``pandas`` data structures. - 1. Extensible method chains with :ref:`pipe ` + 1. Extensible method chains with :ref:``pipe `` - 2. Use *composition*. See `here `_. + 2. Use *composition*. See ``here ``_. - 3. Extending by :ref:`registering an accessor ` + 3. Extending by :ref:``registering an accessor `` - 4. Extending by :ref:`extension type ` + 4. Extending by :ref:``extension type `` This section describes how to subclass ``pandas`` data structures to meet more specific needs. There are two points that need attention: @@ -317,7 +317,7 @@ This section describes how to subclass ``pandas`` data structures to meet more s .. note:: - You can find a nice example in `geopandas `_ project. + You can find a nice example in ``geopandas ``_ project. Override constructor properties ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -481,7 +481,7 @@ This would be more or less equivalent to: The backend module can then use other visualization tools (Bokeh, Altair,...) to generate the plots. -Libraries implementing the plotting backend should use `entry points `__ +Libraries implementing the plotting backend should use ``entry points ``__ to make their backend discoverable to pandas. The key is ``"pandas_plotting_backends"``. For example, pandas registers the default "matplotlib" backend as follows. 
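The registration block referenced by that last sentence falls outside the changed lines of this diff. As a hedged illustration only, a third-party backend could declare the ``"pandas_plotting_backends"`` entry point along these lines (the package, module, and backend names here are made up):

.. code-block:: python

    # setup.py of a hypothetical third-party plotting backend
    from setuptools import setup

    setup(
        name="pandas-cool-backend",
        version="0.1.0",
        packages=["pandas_cool_backend"],
        entry_points={
            "pandas_plotting_backends": [
                # maps the backend name users pass to pandas onto the
                # module that implements the plotting interface
                "cool_backend = pandas_cool_backend",
            ],
        },
    )
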
diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst index 624c0551de607..a6c66e8c9b962 100644 --- a/doc/source/ecosystem.rst +++ b/doc/source/ecosystem.rst @@ -19,8 +19,8 @@ development to remain focused around it's original requirements. This is an inexhaustive list of projects that build on pandas in order to provide tools in the PyData space. For a list of projects that depend on pandas, see the -`libraries.io usage page for pandas `_ -or `search pypi for pandas `_. +``libraries.io usage page for pandas ``_ +or ``search pypi for pandas ``_. We'd like to make it easier for users to find these projects, if you know of other substantial projects that you feel should be on this list, please let us know. @@ -30,21 +30,21 @@ substantial projects that you feel should be on this list, please let us know. Data cleaning and validation ---------------------------- -`Pyjanitor `__ +``Pyjanitor ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pyjanitor provides a clean API for cleaning data, using method chaining. -`Engarde `__ +``Engarde ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Engarde is a lightweight library used to explicitly state assumptions about your datasets and check that they're *actually* true. -`pandas-path `__ +``pandas-path ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Since Python 3.4, `pathlib `_ has been +Since Python 3.4, ``pathlib ``_ has been included in the Python standard library. Path objects provide a simple and delightful way to interact with the file system. The pandas-path package enables the Path API for pandas through a custom accessor ``.path``. Getting just the filenames from @@ -56,12 +56,12 @@ joining paths, replacing file extensions, and checking if files exist are also a Statistics and machine learning ------------------------------- -`pandas-tfrecords `__ +``pandas-tfrecords ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Easy saving pandas dataframe to tensorflow tfrecords format and reading tfrecords to pandas. -`Statsmodels `__ +``Statsmodels ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Statsmodels is the prominent Python "statistics and econometrics library" and it has @@ -69,18 +69,18 @@ a long-standing special relationship with pandas. Statsmodels provides powerful econometrics, analysis and modeling functionality that is out of pandas' scope. Statsmodels leverages pandas objects as the underlying data container for computation. -`sklearn-pandas `__ +``sklearn-pandas ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Use pandas DataFrames in your `scikit-learn `__ +Use pandas DataFrames in your ``scikit-learn ``__ ML pipeline. -`Featuretools `__ +``Featuretools ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Featuretools is a Python library for automated feature engineering built on top of pandas. It excels at transforming temporal and relational datasets into feature matrices for machine learning using reusable feature engineering "primitives". Users can contribute their own primitives in Python and share them with the rest of the community. -`Compose `__ +``Compose ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Compose is a machine learning tool for labeling data and prediction engineering. It allows you to structure the labeling process by parameterizing prediction problems and transforming time-driven relational data into target values with cutoff times that can be used for supervised learning. 
@@ -90,7 +90,7 @@ Compose is a machine learning tool for labeling data and prediction engineering. Visualization ------------- -`Altair `__ +``Altair ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Altair is a declarative statistical visualization library for Python. @@ -101,7 +101,7 @@ simplicity produces beautiful and effective visualizations with a minimal amount of code. Altair works with Pandas DataFrames. -`Bokeh `__ +``Bokeh ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Bokeh is a Python interactive visualization library for large datasets that natively uses @@ -109,7 +109,7 @@ the latest web technologies. Its goal is to provide elegant, concise constructio graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients. -`Pandas-Bokeh `__ provides a high level API +``Pandas-Bokeh ``__ provides a high level API for Bokeh that can be loaded as a native Pandas plotting backend via .. code:: python @@ -120,11 +120,11 @@ It is very similar to the matplotlib plotting backend, but provides interactive web-based charts and maps. -`Seaborn `__ +``Seaborn ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Seaborn is a Python visualization library based on -`matplotlib `__. It provides a high-level, dataset-oriented +``matplotlib ``__. It provides a high-level, dataset-oriented interface for creating attractive statistical graphics. The plotting functions in seaborn understand pandas objects and leverage pandas grouping operations internally to support concise specification of complex visualizations. Seaborn @@ -132,33 +132,33 @@ also goes beyond matplotlib and pandas with the option to perform statistical estimation while plotting, aggregating across observations and visualizing the fit of statistical models to emphasize patterns in a dataset. -`plotnine `__ +``plotnine ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Hadley Wickham's `ggplot2 `__ is a foundational exploratory visualization package for the R language. -Based on `"The Grammar of Graphics" `__ it +Hadley Wickham's ``ggplot2 ``__ is a foundational exploratory visualization package for the R language. +Based on ``"The Grammar of Graphics" ``__ it provides a powerful, declarative and extremely general way to generate bespoke plots of any kind of data. Various implementations to other languages are available. -A good implementation for Python users is `has2k1/plotnine `__. +A good implementation for Python users is ``has2k1/plotnine ``__. -`IPython vega `__ +``IPython vega ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -`IPython Vega `__ leverages `Vega -`__ to create plots within Jupyter Notebook. +``IPython Vega ``__ leverages ``Vega +``__ to create plots within Jupyter Notebook. -`Plotly `__ +``Plotly ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -`Plotly’s `__ `Python API `__ enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and `D3.js `__. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of `matplotlib, ggplot for Python, and Seaborn `__ can convert figures into interactive web-based plots. Plots can be drawn in `IPython Notebooks `__ , edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has `cloud `__, `offline `__, or `on-premise `__ accounts for private use. +``Plotly’s ``__ ``Python API ``__ enables interactive figures and web shareability. 
Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and ``D3.js ``__. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of ``matplotlib, ggplot for Python, and Seaborn ``__ can convert figures into interactive web-based plots. Plots can be drawn in ``IPython Notebooks ``__ , edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has ``cloud ``__, ``offline ``__, or ``on-premise ``__ accounts for private use. -`Qtpandas `__ +``Qtpandas ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Spun off from the main pandas library, the `qtpandas `__ +Spun off from the main pandas library, the ``qtpandas ``__ library enables DataFrame visualization and manipulation in PyQt4 and PySide applications. -`D-Tale `__ +``D-Tale ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ D-Tale is a lightweight web client for visualizing pandas data structures. It @@ -173,22 +173,22 @@ invoked with the following command import dtale; dtale.show(df) D-Tale integrates seamlessly with jupyter notebooks, python terminals, kaggle -& Google Colab. Here are some demos of the `grid `__ -and `chart-builder `__. +& Google Colab. Here are some demos of the ``grid ``__ +and ``chart-builder ``__. .. _ecosystem.ide: IDE ------ -`IPython `__ +``IPython ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ IPython is an interactive command shell and distributed computing environment. IPython tab completion works with Pandas methods and also attributes like DataFrame columns. -`Jupyter Notebook / Jupyter Lab `__ +``Jupyter Notebook / Jupyter Lab ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Jupyter Notebook is a web application for creating Jupyter notebooks. A Jupyter notebook is a JSON document containing an ordered list @@ -205,17 +205,17 @@ which are utilized by Jupyter Notebook for displaying (Note: HTML tables may or may not be compatible with non-HTML Jupyter output formats.) -See :ref:`Options and Settings ` and -:ref:`Available Options ` +See :ref:``Options and Settings `` and +:ref:``Available Options `` for pandas ``display.`` settings. -`Quantopian/qgrid `__ +``Quantopian/qgrid ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ qgrid is "an interactive grid for sorting and filtering DataFrames in IPython Notebook" built with SlickGrid. -`Spyder `__ +``Spyder ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Spyder is a cross-platform PyQt-based IDE combining the editing, analysis, @@ -223,7 +223,7 @@ debugging and profiling functionality of a software development tool with the data exploration, interactive execution, deep inspection and rich visualization capabilities of a scientific environment like MATLAB or Rstudio. -Its `Variable Explorer `__ +Its ``Variable Explorer ``__ allows users to view, manipulate and edit pandas ``Index``, ``Series``, and ``DataFrame`` objects like a "spreadsheet", including copying and modifying values, sorting, displaying a "heatmap", converting data types and more. @@ -233,9 +233,9 @@ Spyder can also import data from a variety of plain text and binary files or the clipboard into a new pandas DataFrame via a sophisticated import wizard. 
Most pandas classes, methods and data attributes can be autocompleted in -Spyder's `Editor `__ and -`IPython Console `__, -and Spyder's `Help pane `__ can retrieve +Spyder's ``Editor ``__ and +``IPython Console ``__, +and Spyder's ``Help pane ``__ can retrieve and render Numpydoc documentation on pandas objects in rich text with Sphinx both automatically and on-demand. @@ -245,12 +245,12 @@ both automatically and on-demand. API --- -`pandas-datareader `__ +``pandas-datareader ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``pandas-datareader`` is a remote data access library for pandas (PyPI:``pandas-datareader``). It is based on functionality that was located in ``pandas.io.data`` and ``pandas.io.wb`` but was split off in v0.19. -See more in the `pandas-datareader docs `_: +See more in the ``pandas-datareader docs ``_: The following data feeds are available: @@ -271,39 +271,39 @@ The following data feeds are available: * Stooq Index Data * MOEX Data -`Quandl/Python `__ +``Quandl/Python ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Quandl API for Python wraps the Quandl REST API to return Pandas DataFrames with timeseries indexes. -`Pydatastream `__ +``Pydatastream ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PyDatastream is a Python interface to the -`Refinitiv Datastream (DWS) `__ +``Refinitiv Datastream (DWS) ``__ REST API to return indexed Pandas DataFrames with financial data. This package requires valid credentials for this API (non free). -`pandaSDMX `__ +``pandaSDMX ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ pandaSDMX is a library to retrieve and acquire statistical data and metadata disseminated in -`SDMX `_ 2.1, an ISO-standard +``SDMX ``_ 2.1, an ISO-standard widely used by institutions such as statistics offices, central banks, and international organisations. pandaSDMX can expose datasets and related structural metadata including data flows, code-lists, and data structure definitions as pandas Series or MultiIndexed DataFrames. -`fredapi `__ +``fredapi ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -fredapi is a Python interface to the `Federal Reserve Economic Data (FRED) `__ +fredapi is a Python interface to the ``Federal Reserve Economic Data (FRED) ``__ provided by the Federal Reserve Bank of St. Louis. It works with both the FRED database and ALFRED database that contains point-in-time data (i.e. historic data revisions). fredapi provides a wrapper in Python to the FRED HTTP API, and also provides several convenient methods for parsing and analyzing point-in-time data from ALFRED. fredapi makes use of pandas and returns data in a Series or DataFrame. This module requires a FRED API key that you can obtain for free on the FRED website. -`dataframe_sql `__ +``dataframe_sql ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``dataframe_sql`` is a Python package that translates SQL syntax directly into operations on pandas DataFrames. This is useful when migrating from a database to @@ -316,14 +316,14 @@ with pandas. Domain specific --------------- -`Geopandas `__ +``Geopandas ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Geopandas extends pandas data objects to include geographic information which support geometric operations. If your work entails maps and geographical coordinates, and you love pandas, you should take a close look at Geopandas. 
-`xarray `__ +``xarray ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ xarray brings the labeled data power of pandas to the physical sciences by @@ -337,7 +337,7 @@ dimensional arrays, rather than the tabular data for which pandas excels. IO -- -`BCPandas `__ +``BCPandas ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ BCPandas provides high performance writes from pandas to Microsoft SQL Server, @@ -351,30 +351,30 @@ Rigorously tested, it is a complete replacement for ``df.to_sql``. Out-of-core ------------- -`Blaze `__ +``Blaze ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Blaze provides a standard API for doing computations with various in-memory and on-disk backends: NumPy, Pandas, SQLAlchemy, MongoDB, PyTables, PySpark. -`Dask `__ +``Dask ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Dask is a flexible parallel computing library for analytics. Dask provides a familiar ``DataFrame`` interface for out-of-core, parallel and distributed computing. -`Dask-ML `__ +``Dask-ML ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Dask-ML enables parallel and distributed machine learning using Dask alongside existing machine learning libraries like Scikit-Learn, XGBoost, and TensorFlow. -`Koalas `__ +``Koalas ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Koalas provides a familiar pandas DataFrame interface on top of Apache Spark. It enables users to leverage multi-cores on one machine or a cluster of machines to speed up or scale their DataFrame code. -`Odo `__ +``Odo ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Odo provides a uniform API for moving data between different formats. It uses @@ -383,7 +383,7 @@ PyTables, h5py, and pymongo to move data between non pandas formats. Its graph based approach is also extensible by end users for custom formats that may be too specific for the core of odo. -`Pandarallel `__ +``Pandarallel ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pandarallel provides a simple way to parallelize your pandas operations on all your CPUs by changing only one line of code. @@ -398,7 +398,7 @@ If also displays progress bars. # df.apply(func) df.parallel_apply(func) -`Ray `__ +``Ray ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pandas on Ray is an early stage DataFrame library that wraps Pandas and transparently distributes the data and computation. The user does not need to know how many cores their system has, nor do they need to specify how to distribute the data. In fact, users can continue using their previous Pandas notebooks while experiencing a considerable speedup from Pandas on Ray, even on a single machine. Only a modification of the import statement is needed, as we demonstrate below. Once you’ve changed your import statement, you’re ready to use Pandas on Ray just like you would Pandas. @@ -409,10 +409,10 @@ Pandas on Ray is an early stage DataFrame library that wraps Pandas and transpar import ray.dataframe as pd -`Vaex `__ +``Vaex ``__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. Vaex is a python library for Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10\ :sup:`9`) objects/rows per second. 
Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted). +Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. Vaex is a python library for Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10\ :sup:``9``) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted). * vaex.from_pandas * vaex.to_pandas_df @@ -423,20 +423,20 @@ Extension data types -------------------- Pandas provides an interface for defining -:ref:`extension types ` to extend NumPy's type +:ref:``extension types `` to extend NumPy's type system. The following libraries implement that interface to provide types not found in NumPy or pandas, which work well with pandas' data containers. -`Cyberpandas`_ +``Cyberpandas``_ ~~~~~~~~~~~~~~ Cyberpandas provides an extension type for storing arrays of IP Addresses. These arrays can be stored inside pandas' Series and DataFrame. -`Pint-Pandas`_ +``Pint-Pandas``_ ~~~~~~~~~~~~~~ -`Pint-Pandas ` provides an extension type for +``Pint-Pandas `` provides an extension type for storing numeric arrays with units. These arrays can be stored inside pandas' Series and DataFrame. Operations between Series and DataFrame columns which use pint's extension array are then units aware. @@ -447,17 +447,17 @@ Accessors --------- A directory of projects providing -:ref:`extension accessors `. This is for users to +:ref:``extension accessors ``. This is for users to discover new accessors and for library authors to coordinate on the namespace. =============== ========== ========================= =============================================================== Library Accessor Classes Description =============== ========== ========================= =============================================================== -`cyberpandas`_ ``ip`` ``Series`` Provides common operations for working with IP addresses. -`pdvega`_ ``vgplot`` ``Series``, ``DataFrame`` Provides plotting functions from the Altair_ library. -`pandas_path`_ ``path`` ``Index``, ``Series`` Provides `pathlib.Path`_ functions for Series. -`pint-pandas`_ ``pint`` ``Series``, ``DataFrame`` Provides units support for numeric Series and DataFrames. -`composeml`_ ``slice`` ``DataFrame`` Provides a generator for enhanced data slicing. +``cyberpandas``_ ``ip`` ``Series`` Provides common operations for working with IP addresses. +``pdvega``_ ``vgplot`` ``Series``, ``DataFrame`` Provides plotting functions from the Altair_ library. +``pandas_path``_ ``path`` ``Index``, ``Series`` Provides ``pathlib.Path``_ functions for Series. +``pint-pandas``_ ``pint`` ``Series``, ``DataFrame`` Provides units support for numeric Series and DataFrames. +``composeml``_ ``slice`` ``DataFrame`` Provides a generator for enhanced data slicing. =============== ========== ========================= =============================================================== .. 
_cyberpandas: https://cyberpandas.readthedocs.io/en/latest diff --git a/doc/source/getting_started/comparison/comparison_with_sql.rst b/doc/source/getting_started/comparison/comparison_with_sql.rst index aa7218c3e4fad..734be6575e53b 100644 --- a/doc/source/getting_started/comparison/comparison_with_sql.rst +++ b/doc/source/getting_started/comparison/comparison_with_sql.rst @@ -5,10 +5,10 @@ Comparison with SQL ******************** Since many potential pandas users have some familiarity with -`SQL `_, this page is meant to provide some examples of how +``SQL ``_, this page is meant to provide some examples of how various SQL operations would be performed using pandas. -If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>` +If you're new to pandas, you might want to first read through :ref:``10 Minutes to pandas<10min>`` to familiarize yourself with the library. As is customary, we import pandas and NumPy as follows: @@ -19,7 +19,7 @@ As is customary, we import pandas and NumPy as follows: import numpy as np Most of the examples will utilize the ``tips`` dataset found within pandas tests. We'll read -the data into a DataFrame called `tips` and assume we have a database table of the same name and +the data into a DataFrame called ``tips`` and assume we have a database table of the same name and structure. .. ipython:: python @@ -57,7 +57,7 @@ In SQL, you can add a calculated column: FROM tips LIMIT 5; -With pandas, you can use the :meth:`DataFrame.assign` method of a DataFrame to append a new column: +With pandas, you can use the :meth:``DataFrame.assign`` method of a DataFrame to append a new column: .. ipython:: python @@ -75,7 +75,7 @@ Filtering in SQL is done via a WHERE clause. LIMIT 5; DataFrames can be filtered in multiple ways; the most intuitive of which is using -:ref:`boolean indexing ` +:ref:``boolean indexing `` .. ipython:: python @@ -117,7 +117,7 @@ Just like SQL's OR and AND, multiple conditions can be passed to a DataFrame usi # tips by parties of at least 5 diners OR bill total was more than $45 tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)] -NULL checking is done using the :meth:`~pandas.Series.notna` and :meth:`~pandas.Series.isna` +NULL checking is done using the :meth:``~pandas.Series.notna`` and :meth:``~pandas.Series.isna`` methods. .. ipython:: python @@ -139,7 +139,7 @@ where ``col2`` IS NULL with the following query: frame[frame['col2'].isna()] -Getting items where ``col1`` IS NOT NULL can be done with :meth:`~pandas.Series.notna`. +Getting items where ``col1`` IS NOT NULL can be done with :meth:``~pandas.Series.notna``. .. code-block:: sql @@ -155,7 +155,7 @@ Getting items where ``col1`` IS NOT NULL can be done with :meth:`~pandas.Series. GROUP BY -------- In pandas, SQL's GROUP BY operations are performed using the similarly named -:meth:`~pandas.DataFrame.groupby` method. :meth:`~pandas.DataFrame.groupby` typically refers to a +:meth:``~pandas.DataFrame.groupby`` method. :meth:``~pandas.DataFrame.groupby`` typically refers to a process where we'd like to split a dataset into groups, apply some function (typically aggregation) , and then combine the groups together. @@ -179,16 +179,16 @@ The pandas equivalent would be: tips.groupby('sex').size() -Notice that in the pandas code we used :meth:`~pandas.core.groupby.DataFrameGroupBy.size` and not -:meth:`~pandas.core.groupby.DataFrameGroupBy.count`. 
This is because -:meth:`~pandas.core.groupby.DataFrameGroupBy.count` applies the function to each column, returning +Notice that in the pandas code we used :meth:``~pandas.core.groupby.DataFrameGroupBy.size`` and not +:meth:``~pandas.core.groupby.DataFrameGroupBy.count``. This is because +:meth:``~pandas.core.groupby.DataFrameGroupBy.count`` applies the function to each column, returning the number of ``not null`` records within each. .. ipython:: python tips.groupby('sex').count() -Alternatively, we could have applied the :meth:`~pandas.core.groupby.DataFrameGroupBy.count` method +Alternatively, we could have applied the :meth:``~pandas.core.groupby.DataFrameGroupBy.count`` method to an individual column: .. ipython:: python @@ -196,7 +196,7 @@ to an individual column: tips.groupby('sex')['total_bill'].count() Multiple functions can also be applied at once. For instance, say we'd like to see how tip amount -differs by day of the week - :meth:`~pandas.core.groupby.DataFrameGroupBy.agg` allows you to pass a dictionary +differs by day of the week - :meth:``~pandas.core.groupby.DataFrameGroupBy.agg`` allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to specific columns. .. code-block:: sql @@ -216,7 +216,7 @@ to your grouped DataFrame, indicating which functions to apply to specific colum tips.groupby('day').agg({'tip': np.mean, 'day': np.size}) Grouping by more than one column is done by passing a list of columns to the -:meth:`~pandas.DataFrame.groupby` method. +:meth:``~pandas.DataFrame.groupby`` method. .. code-block:: sql @@ -243,8 +243,8 @@ Grouping by more than one column is done by passing a list of columns to the JOIN ---- -JOINs can be performed with :meth:`~pandas.DataFrame.join` or :meth:`~pandas.merge`. By default, -:meth:`~pandas.DataFrame.join` will join the DataFrames on their indices. Each method has +JOINs can be performed with :meth:``~pandas.DataFrame.join`` or :meth:``~pandas.merge``. By default, +:meth:``~pandas.DataFrame.join`` will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices). @@ -273,7 +273,7 @@ INNER JOIN # merge performs an INNER JOIN by default pd.merge(df1, df2, on='key') -:meth:`~pandas.merge` also offers parameters for cases when you'd like to join one DataFrame's +:meth:``~pandas.merge`` also offers parameters for cases when you'd like to join one DataFrame's column with another DataFrame's index. .. ipython:: python @@ -332,7 +332,7 @@ joined columns find a match. As of writing, FULL JOINs are not supported in all UNION ----- -UNION ALL can be performed using :meth:`~pandas.concat`. +UNION ALL can be performed using :meth:``~pandas.concat``. .. ipython:: python @@ -381,8 +381,8 @@ SQL's UNION is similar to UNION ALL, however UNION will remove duplicate rows. Los Angeles 5 */ -In pandas, you can use :meth:`~pandas.concat` in conjunction with -:meth:`~pandas.DataFrame.drop_duplicates`. +In pandas, you can use :meth:``~pandas.concat`` in conjunction with +:meth:``~pandas.DataFrame.drop_duplicates``. .. ipython:: python @@ -429,7 +429,7 @@ Top n rows per group .query('rn < 3') .sort_values(['day', 'rn'])) -the same using `rank(method='first')` function +the same using ``rank(method='first')`` function .. ipython:: python @@ -453,7 +453,7 @@ the same using `rank(method='first')` function Let's find tips with (rank < 3) per gender group for (tips < 2). 
Notice that when using ``rank(method='min')`` function -`rnk_min` remains the same for the same `tip` +``rnk_min`` remains the same for the same ``tip`` (as Oracle's RANK() function) .. ipython:: python diff --git a/doc/source/getting_started/install.rst b/doc/source/getting_started/install.rst index 2196c908ecf37..790e8ab03d621 100644 --- a/doc/source/getting_started/install.rst +++ b/doc/source/getting_started/install.rst @@ -7,13 +7,13 @@ Installation ============ The easiest way to install pandas is to install it -as part of the `Anaconda `__ distribution, a +as part of the ``Anaconda ``__ distribution, a cross platform distribution for data analysis and scientific computing. This is the recommended installation method for most users. Instructions for installing from source, -`PyPI `__, `ActivePython `__, various Linux distributions, or a -`development version `__ are also provided. +``PyPI ``__, ``ActivePython ``__, various Linux distributions, or a +``development version ``__ are also provided. Python version support ---------------------- @@ -28,28 +28,28 @@ Installing pandas Installing with Anaconda ~~~~~~~~~~~~~~~~~~~~~~~~ -Installing pandas and the rest of the `NumPy `__ and -`SciPy `__ stack can be a little +Installing pandas and the rest of the ``NumPy ``__ and +``SciPy ``__ stack can be a little difficult for inexperienced users. The simplest way to install not only pandas, but Python and the most popular -packages that make up the `SciPy `__ stack -(`IPython `__, `NumPy `__, -`Matplotlib `__, ...) is with -`Anaconda `__, a cross-platform +packages that make up the ``SciPy ``__ stack +(``IPython ``__, ``NumPy ``__, +``Matplotlib ``__, ...) is with +``Anaconda ``__, a cross-platform (Linux, Mac OS X, Windows) Python distribution for data analytics and scientific computing. After running the installer, the user will have access to pandas and the -rest of the `SciPy `__ stack without needing to install +rest of the ``SciPy ``__ stack without needing to install anything else, and without needing to wait for any software to be compiled. -Installation instructions for `Anaconda `__ -`can be found here `__. +Installation instructions for ``Anaconda ``__ +``can be found here ``__. A full list of the packages available as part of the -`Anaconda `__ distribution -`can be found here `__. +``Anaconda ``__ distribution +``can be found here ``__. Another advantage to installing Anaconda is that you don't need admin rights to install it. Anaconda can install in the user's home directory, @@ -62,28 +62,28 @@ Installing with Miniconda ~~~~~~~~~~~~~~~~~~~~~~~~~ The previous section outlined how to get pandas installed as part of the -`Anaconda `__ distribution. +``Anaconda ``__ distribution. However this approach means you will install well over one hundred packages and involves downloading the installer which is a few hundred megabytes in size. If you want to have more control on which packages, or have a limited internet bandwidth, then installing pandas with -`Miniconda `__ may be a better solution. +``Miniconda ``__ may be a better solution. -`Conda `__ is the package manager that the -`Anaconda `__ distribution is built upon. +``Conda ``__ is the package manager that the +``Anaconda ``__ distribution is built upon. It is a package manager that is both cross-platform and language agnostic (it can play a similar role to a pip and virtualenv combination). 
-`Miniconda `__ allows you to create a +``Miniconda ``__ allows you to create a minimal self contained Python installation, and then use the -`Conda `__ command to install additional packages. +``Conda ``__ command to install additional packages. -First you will need `Conda `__ to be installed and -downloading and running the `Miniconda -`__ +First you will need ``Conda ``__ to be installed and +downloading and running the ``Miniconda +``__ will do this for you. The installer -`can be found here `__ +``can be found here ``__ The next step is to create a new conda environment. A conda environment is like a virtualenv that allows you to specify a specific version of Python and set of libraries. @@ -113,7 +113,7 @@ To install other packages, IPython for example:: conda install ipython -To install the full `Anaconda `__ +To install the full ``Anaconda ``__ distribution:: conda install anaconda @@ -128,7 +128,7 @@ Installing from PyPI ~~~~~~~~~~~~~~~~~~~~ pandas can be installed via pip from -`PyPI `__. +``PyPI ``__. :: @@ -138,8 +138,8 @@ Installing with ActivePython ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Installation instructions for -`ActivePython `__ can be found -`here `__. Versions +``ActivePython ``__ can be found +``here ``__. Versions 2.7, 3.5 and 3.6 include pandas. Installing using your Linux distribution's package manager. @@ -152,12 +152,12 @@ The commands in this table will install pandas for Python 3 from your distributi :widths: 10, 10, 20, 50 - Debian, stable, `official Debian repository `__ , ``sudo apt-get install python3-pandas`` - Debian & Ubuntu, unstable (latest packages), `NeuroDebian `__ , ``sudo apt-get install python3-pandas`` - Ubuntu, stable, `official Ubuntu repository `__ , ``sudo apt-get install python3-pandas`` - OpenSuse, stable, `OpenSuse Repository `__ , ``zypper in python3-pandas`` - Fedora, stable, `official Fedora repository `__ , ``dnf install python3-pandas`` - Centos/RHEL, stable, `EPEL repository `__ , ``yum install python3-pandas`` + Debian, stable, ``official Debian repository ``__ , ``sudo apt-get install python3-pandas`` + Debian & Ubuntu, unstable (latest packages), ``NeuroDebian ``__ , ``sudo apt-get install python3-pandas`` + Ubuntu, stable, ``official Ubuntu repository ``__ , ``sudo apt-get install python3-pandas`` + OpenSuse, stable, ``OpenSuse Repository ``__ , ``zypper in python3-pandas`` + Fedora, stable, ``official Fedora repository ``__ , ``dnf install python3-pandas`` + Centos/RHEL, stable, ``EPEL repository ``__ , ``yum install python3-pandas`` **However**, the packages in the linux package managers are often a few versions behind, so to get the newest version of pandas, it's recommended to install using the ``pip`` or ``conda`` @@ -179,12 +179,12 @@ In Linux/Mac you can run ``which python`` on your terminal and it will tell you using. If it's something like "/usr/bin/python", you're using the Python from the system, which is not recommended. It is highly recommended to use ``conda``, for quick installation and for package and dependency updates. -You can find simple installation instructions for pandas in this document: `installation instructions `. +You can find simple installation instructions for pandas in this document: ``installation instructions ``. Installing from source ~~~~~~~~~~~~~~~~~~~~~~ -See the :ref:`contributing guide ` for complete instructions on building from the git source tree. Further, see :ref:`creating a development environment ` if you wish to create a *pandas* development environment. 
+See the :ref:``contributing guide `` for complete instructions on building from the git source tree. Further, see :ref:``creating a development environment `` if you wish to create a *pandas* development environment. Running the test suite ---------------------- @@ -192,9 +192,9 @@ Running the test suite pandas is equipped with an exhaustive set of unit tests, covering about 97% of the code base as of this writing. To run it on your machine to verify that everything is working (and that you have all of the dependencies, soft and hard, -installed), make sure you have `pytest -`__ >= 5.0.1 and `Hypothesis -`__ >= 3.58, then run: +installed), make sure you have ``pytest +``__ >= 5.0.1 and ``Hypothesis +``__ >= 3.58, then run: :: @@ -219,10 +219,10 @@ Dependencies ================================================================ ========================== Package Minimum supported version ================================================================ ========================== -`setuptools `__ 24.2.0 -`NumPy `__ 1.16.5 -`python-dateutil `__ 2.7.3 -`pytz `__ 2017.3 +``setuptools ``__ 24.2.0 +``NumPy ``__ 1.16.5 +``python-dateutil ``__ 2.7.3 +``pytz ``__ 2017.3 ================================================================ ========================== .. _install.recommended_dependencies: @@ -230,11 +230,11 @@ Package Minimum support Recommended dependencies ~~~~~~~~~~~~~~~~~~~~~~~~ -* `numexpr `__: for accelerating certain numerical operations. +* ``numexpr ``__: for accelerating certain numerical operations. ``numexpr`` uses multiple cores as well as smart chunking and caching to achieve large speedups. If installed, must be Version 2.6.8 or higher. -* `bottleneck `__: for accelerating certain types of ``nan`` +* ``bottleneck ``__: for accelerating certain types of ``nan`` evaluations. ``bottleneck`` uses specialized cython routines to achieve large speedups. If installed, must be Version 1.2.1 or higher. @@ -250,15 +250,15 @@ Optional dependencies ~~~~~~~~~~~~~~~~~~~~~ Pandas has many optional dependencies that are only used for specific methods. -For example, :func:`pandas.read_hdf` requires the ``pytables`` package, while -:meth:`DataFrame.to_markdown` requires the ``tabulate`` package. If the +For example, :func:``pandas.read_hdf`` requires the ``pytables`` package, while +:meth:``DataFrame.to_markdown`` requires the ``tabulate`` package. If the optional dependency is not installed, pandas will raise an ``ImportError`` when the method requiring that dependency is called. 
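A short sketch of that behaviour, assuming the optional ``tabulate`` package is absent from the environment:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})

    try:
        # DataFrame.to_markdown requires the optional "tabulate" dependency
        print(df.to_markdown())
    except ImportError as err:
        # the error is raised only when the dependent method is actually called
        print("missing optional dependency:", err)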
========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -BeautifulSoup4 4.6.0 HTML parser for read_html (see :ref:`note `) +BeautifulSoup4 4.6.0 HTML parser for read_html (see :ref:``note ``) Jinja2 2.10 Conditional formatting with DataFrame.style PyQt4 Clipboard I/O PyQt5 Clipboard I/O @@ -270,8 +270,8 @@ blosc 1.14.3 Compression for HDF5 fsspec 0.7.4 Handling files aside from local and HTTP fastparquet 0.3.2 Parquet reading / writing gcsfs 0.6.0 Google Cloud Storage access -html5lib 1.0.1 HTML parser for read_html (see :ref:`note `) -lxml 4.3.0 HTML parser for read_html (see :ref:`note `) +html5lib 1.0.1 HTML parser for read_html (see :ref:``note ``) +lxml 4.3.0 HTML parser for read_html (see :ref:``note ``) matplotlib 2.2.3 Visualization numba 0.46.0 Alternative execution engine for rolling operations openpyxl 2.6.0 Reading / writing for xlsx files @@ -284,7 +284,7 @@ pytables 3.4.4 HDF5 reading / writing pyxlsb 1.0.6 Reading for xlsb files qtpy Clipboard I/O s3fs 0.4.0 Amazon S3 access -tabulate 0.8.3 Printing in Markdown-friendly format (see `tabulate`_) +tabulate 0.8.3 Printing in Markdown-friendly format (see ``tabulate``_) xarray 0.12.0 pandas-like API for N-dimensional data xclip Clipboard I/O on linux xlrd 1.2.0 Excel reading @@ -299,21 +299,21 @@ Optional dependencies for parsing HTML ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ One of the following combinations of libraries is needed to use the -top-level :func:`~pandas.read_html` function: +top-level :func:``~pandas.read_html`` function: -* `BeautifulSoup4`_ and `html5lib`_ -* `BeautifulSoup4`_ and `lxml`_ -* `BeautifulSoup4`_ and `html5lib`_ and `lxml`_ -* Only `lxml`_, although see :ref:`HTML Table Parsing ` +* ``BeautifulSoup4``_ and ``html5lib``_ +* ``BeautifulSoup4``_ and ``lxml``_ +* ``BeautifulSoup4``_ and ``html5lib``_ and ``lxml``_ +* Only ``lxml``_, although see :ref:``HTML Table Parsing `` for reasons as to why you should probably **not** take this approach. .. warning:: - * if you install `BeautifulSoup4`_ you must install either - `lxml`_ or `html5lib`_ or both. - :func:`~pandas.read_html` will **not** work with *only* - `BeautifulSoup4`_ installed. - * You are highly encouraged to read :ref:`HTML Table Parsing gotchas `. + * if you install ``BeautifulSoup4``_ you must install either + ``lxml``_ or ``html5lib``_ or both. + :func:``~pandas.read_html`` will **not** work with *only* + ``BeautifulSoup4``_ installed. + * You are highly encouraged to read :ref:``HTML Table Parsing gotchas ``. It explains issues surrounding the installation and usage of the above three libraries. diff --git a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst index c7363b94146ac..94a0ac555f145 100644 --- a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst +++ b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst @@ -102,7 +102,7 @@ What is the median age and ticket fare price of the Titanic passengers? titanic[["Age", "Fare"]].median() The statistic applied to multiple columns of a ``DataFrame`` (the selection of two columns -return a ``DataFrame``, see the :ref:`subset data tutorial <10min_tut_03_subset>`) is calculated for each numeric column. 
+return a ``DataFrame``, see the :ref:``subset data tutorial <10min_tut_03_subset>``) is calculated for each numeric column. .. raw:: html @@ -110,7 +110,7 @@ return a ``DataFrame``, see the :ref:`subset data tutorial <10min_tut_03_subset> The aggregating statistic can be calculated for multiple columns at the -same time. Remember the ``describe`` function from :ref:`first tutorial <10min_tut_01_tableoriented>` tutorial? +same time. Remember the ``describe`` function from :ref:``first tutorial <10min_tut_01_tableoriented>`` tutorial? .. ipython:: python @@ -118,7 +118,7 @@ same time. Remember the ``describe`` function from :ref:`first tutorial <10min_t Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the -:func:`DataFrame.agg` method: +:func:``DataFrame.agg`` method: .. ipython:: python @@ -130,7 +130,7 @@ aggregating statistics for given columns can be defined using the
To user guide -Details about descriptive statistics are provided in the user guide section on :ref:`descriptive statistics `. +Details about descriptive statistics are provided in the user guide section on :ref:``descriptive statistics ``. .. raw:: html @@ -156,7 +156,7 @@ What is the average age for male versus female Titanic passengers? As our interest is the average age for each gender, a subselection on these two columns is made first: ``titanic[["Sex", "Age"]]``. Next, the -:meth:`~DataFrame.groupby` method is applied on the ``Sex`` column to make a group per +:meth:``~DataFrame.groupby`` method is applied on the ``Sex`` column to make a group per category. The average age *for each gender* is calculated and returned. @@ -197,12 +197,12 @@ on the grouped data as well: :align: center .. note:: - The `Pclass` column contains numerical data but actually + The ``Pclass`` column contains numerical data but actually represents 3 categories (or factors) with respectively the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a ``Categorical`` data type to handle this type of data. More information is provided in the user guide - :ref:`categorical` section. + :ref:``categorical`` section. .. raw:: html @@ -216,7 +216,7 @@ What is the mean ticket fare price for each of the sex and cabin class combinati titanic.groupby(["Sex", "Pclass"])["Fare"].mean() Grouping can be done by multiple columns at the same time. Provide the -column names as a list to the :meth:`~DataFrame.groupby` method. +column names as a list to the :meth:``~DataFrame.groupby`` method. .. raw:: html @@ -228,7 +228,7 @@ column names as a list to the :meth:`~DataFrame.groupby` method.
To user guide -A full description on the split-apply-combine approach is provided in the user guide section on :ref:`groupby operations `. +A full description on the split-apply-combine approach is provided in the user guide section on :ref:``groupby operations ``. .. raw:: html @@ -251,7 +251,7 @@ What is the number of passengers in each of the cabin classes? titanic["Pclass"].value_counts() -The :meth:`~Series.value_counts` method counts the number of records for each +The :meth:``~Series.value_counts`` method counts the number of records for each category in a column. .. raw:: html @@ -278,7 +278,7 @@ within each group:
To user guide -The user guide has a dedicated section on ``value_counts`` , see page on :ref:`discretization `. +The user guide has a dedicated section on ``value_counts`` , see page on :ref:``discretization ``. .. raw:: html @@ -303,7 +303,7 @@ The user guide has a dedicated section on ``value_counts`` , see page on :ref:`d
To user guide -A full description on the split-apply-combine approach is provided in the user guide pages about :ref:`groupby operations `. +A full description on the split-apply-combine approach is provided in the user guide pages about :ref:``groupby operations ``. .. raw:: html diff --git a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst index 600a75b156ac4..2ef6791628ed4 100644 --- a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst +++ b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst @@ -23,11 +23,11 @@

-For this tutorial, air quality data about :math:`NO_2` is used, made available by -`openaq `__ and downloaded using the -`py-openaq `__ package. +For this tutorial, air quality data about :math:``NO_2`` is used, made available by +``openaq ``__ and downloaded using the +``py-openaq ``__ package. -The ``air_quality_no2_long.csv`` data set provides :math:`NO_2` +The ``air_quality_no2_long.csv`` data set provides :math:``NO_2`` values for the measurement stations *FR04014*, *BETR801* and *London Westminster* in respectively Paris, Antwerp and London. @@ -59,10 +59,10 @@ Westminster* in respectively Paris, Antwerp and London. For this tutorial, air quality data about Particulate matter less than 2.5 micrometers is used, made available by -`openaq `__ and downloaded using the -`py-openaq `__ package. +``openaq ``__ and downloaded using the +``py-openaq ``__ package. -The ``air_quality_pm25_long.csv`` data set provides :math:`PM_{25}` +The ``air_quality_pm25_long.csv`` data set provides :math:``PM_{25}`` values for the measurement stations *FR04014*, *BETR801* and *London Westminster* in respectively Paris, Antwerp and London. @@ -102,14 +102,14 @@ Concatenating objects

  • -I want to combine the measurements of :math:`NO_2` and :math:`PM_{25}`, two tables with a similar structure, in a single table +I want to combine the measurements of :math:``NO_2`` and :math:``PM_{25}``, two tables with a similar structure, in a single table .. ipython:: python air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0) air_quality.head() -The :func:`~pandas.concat` function performs concatenation operations of multiple +The :func:``~pandas.concat`` function performs concatenation operations of multiple tables along one of the axis (row-wise or column-wise). .. raw:: html @@ -123,9 +123,9 @@ concatenated tables to verify the operation: .. ipython:: python - print('Shape of the `air_quality_pm25` table: ', air_quality_pm25.shape) - print('Shape of the `air_quality_no2` table: ', air_quality_no2.shape) - print('Shape of the resulting `air_quality` table: ', air_quality.shape) + print('Shape of the ``air_quality_pm25`` table: ', air_quality_pm25.shape) + print('Shape of the ``air_quality_no2`` table: ', air_quality_no2.shape) + print('Shape of the resulting ``air_quality`` table: ', air_quality.shape) Hence, the resulting table has 3178 = 1110 + 2068 rows. @@ -178,7 +178,7 @@ index. For example:
    To user guide - Feel free to dive into the world of multi-indexing at the user guide section on :ref:`advanced indexing `. + Feel free to dive into the world of multi-indexing at the user guide section on :ref:``advanced indexing ``. .. raw:: html @@ -192,7 +192,7 @@ index. For example: More options on table concatenation (row and column wise) and how ``concat`` can be used to define the logic (union or intersection) of the indexes on the other axes is provided at the section on -:ref:`object concatenation `. +:ref:``object concatenation ``. .. raw:: html @@ -214,7 +214,7 @@ Add the station coordinates, provided by the stations metadata table, to the cor .. warning:: The air quality measurement station coordinates are stored in a data file ``air_quality_stations.csv``, downloaded using the - `py-openaq `__ package. + ``py-openaq ``__ package. .. ipython:: python @@ -237,7 +237,7 @@ Add the station coordinates, provided by the stations metadata table, to the cor how='left', on='location') air_quality.head() -Using the :meth:`~pandas.merge` function, for each of the rows in the +Using the :meth:``~pandas.merge`` function, for each of the rows in the ``air_quality`` table, the corresponding coordinates are added from the ``air_quality_stations_coord`` table. Both tables have the column ``location`` in common which is used as a key to combine the @@ -261,7 +261,7 @@ Add the parameter full description and name, provided by the parameters metadata .. warning:: The air quality parameters metadata are stored in a data file ``air_quality_parameters.csv``, downloaded using the - `py-openaq `__ package. + ``py-openaq ``__ package. .. ipython:: python @@ -293,8 +293,8 @@ between the two tables. pandas supports also inner, outer, and right joins. More information on join/merge of tables is provided in the user guide section on -:ref:`database style merging of tables `. Or have a look at the -:ref:`comparison with SQL` page. +:ref:``database style merging of tables ``. Or have a look at the +:ref:``comparison with SQL`` page. .. raw:: html @@ -319,7 +319,7 @@ More information on join/merge of tables is provided in the user guide section o
    To user guide -See the user guide for a full description of the various :ref:`facilities to combine data tables `. +See the user guide for a full description of the various :ref:``facilities to combine data tables ``. .. raw:: html diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst index b7475ae7bb132..97f8957dd9ab8 100644 --- a/doc/source/user_guide/categorical.rst +++ b/doc/source/user_guide/categorical.rst @@ -9,9 +9,9 @@ Categorical data This is an introduction to pandas categorical data type, including a short comparison with R's ``factor``. -`Categoricals` are a pandas data type corresponding to categorical variables in +``Categoricals`` are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, -number of possible values (`categories`; `levels` in R). Examples are gender, +number of possible values (``categories``; ``levels`` in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales. @@ -19,22 +19,22 @@ In contrast to statistical categorical variables, categorical data might have an 'strongly agree' vs 'agree' or 'first observation' vs. 'second observation'), but numerical operations (additions, divisions, ...) are not possible. -All values of categorical data are either in `categories` or `np.nan`. Order is defined by -the order of `categories`, not lexical order of the values. Internally, the data structure -consists of a `categories` array and an integer array of `codes` which point to the real value in -the `categories` array. +All values of categorical data are either in ``categories`` or ``np.nan``. Order is defined by +the order of ``categories``, not lexical order of the values. Internally, the data structure +consists of a ``categories`` array and an integer array of ``codes`` which point to the real value in +the ``categories`` array. The categorical data type is useful in the following cases: * A string variable consisting of only a few different values. Converting such a string - variable to a categorical variable will save some memory, see :ref:`here `. + variable to a categorical variable will save some memory, see :ref:``here ``. * The lexical order of a variable is not the same as the logical order ("one", "two", "three"). By converting to a categorical and specifying an order on the categories, sorting and - min/max will use the logical order instead of the lexical order, see :ref:`here `. + min/max will use the logical order instead of the lexical order, see :ref:``here ``. * As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types). -See also the :ref:`API docs on categoricals`. +See also the :ref:``API docs on categoricals``. .. _categorical.objectcreation: @@ -61,8 +61,8 @@ By converting an existing ``Series`` or column to a ``category`` dtype: df["B"] = df["A"].astype('category') df -By using special functions, such as :func:`~pandas.cut`, which groups data into -discrete bins. See the :ref:`example on tiling ` in the docs. +By using special functions, such as :func:``~pandas.cut``, which groups data into +discrete bins. See the :ref:``example on tiling `` in the docs. .. ipython:: python @@ -72,7 +72,7 @@ discrete bins. 
See the :ref:`example on tiling ` in the docs df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels) df.head(10) -By passing a :class:`pandas.Categorical` object to a ``Series`` or assigning it to a ``DataFrame``. +By passing a :class:``pandas.Categorical`` object to a ``Series`` or assigning it to a ``DataFrame``. .. ipython:: python @@ -84,7 +84,7 @@ By passing a :class:`pandas.Categorical` object to a ``Series`` or assigning it df["B"] = raw_cat df -Categorical data has a specific ``category`` :ref:`dtype `: +Categorical data has a specific ``category`` :ref:``dtype ``: .. ipython:: python @@ -112,7 +112,7 @@ only labels present in a given column are categories: df['B'] -Analogously, all columns in an existing ``DataFrame`` can be batch converted using :meth:`DataFrame.astype`: +Analogously, all columns in an existing ``DataFrame`` can be batch converted using :meth:``DataFrame.astype``: .. ipython:: python @@ -138,7 +138,7 @@ behavior: 2. Categories are unordered. To control those behaviors, instead of passing ``'category'``, use an instance -of :class:`~pandas.api.types.CategoricalDtype`. +of :class:``~pandas.api.types.CategoricalDtype``. .. ipython:: python @@ -169,7 +169,7 @@ are consistent among all columns. ``categories = pd.unique(df.to_numpy().ravel())``. If you already have ``codes`` and ``categories``, you can use the -:func:`~pandas.Categorical.from_codes` constructor to save the factorize step +:func:``~pandas.Categorical.from_codes`` constructor to save the factorize step during normal constructor mode: .. ipython:: python @@ -196,13 +196,13 @@ To get back to the original ``Series`` or NumPy array, use .. note:: - In contrast to R's `factor` function, categorical data is not converting input values to + In contrast to R's ``factor`` function, categorical data is not converting input values to strings; categories will end up the same data type as the original values. .. note:: - In contrast to R's `factor` function, there is currently no way to assign/change labels at - creation time. Use `categories` to change the categories after creation time. + In contrast to R's ``factor`` function, there is currently no way to assign/change labels at + creation time. Use ``categories`` to change the categories after creation time. .. _categorical.categoricaldtype: @@ -214,10 +214,10 @@ A categorical's type is fully described by 1. ``categories``: a sequence of unique values and no missing values 2. ``ordered``: a boolean -This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`. +This information can be stored in a :class:``~pandas.api.types.CategoricalDtype``. The ``categories`` argument is optional, which implies that the actual categories should be inferred from whatever is present in the data when the -:class:`pandas.Categorical` is created. The categories are assumed to be unordered +:class:``pandas.Categorical`` is created. The categories are assumed to be unordered by default. .. ipython:: python @@ -227,14 +227,14 @@ by default. CategoricalDtype(['a', 'b', 'c'], ordered=True) CategoricalDtype() -A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas -expects a `dtype`. For example :func:`pandas.read_csv`, -:func:`pandas.DataFrame.astype`, or in the ``Series`` constructor. +A :class:``~pandas.api.types.CategoricalDtype`` can be used in any place pandas +expects a ``dtype``. For example :func:``pandas.read_csv``, +:func:``pandas.DataFrame.astype``, or in the ``Series`` constructor. .. 
note:: As a convenience, you can use the string ``'category'`` in place of a - :class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of + :class:``~pandas.api.types.CategoricalDtype`` when you want the default behavior of the categories being unordered, and equal to the set values present in the array. In other words, ``dtype='category'`` is equivalent to ``dtype=CategoricalDtype()``. @@ -242,7 +242,7 @@ expects a `dtype`. For example :func:`pandas.read_csv`, Equality semantics ~~~~~~~~~~~~~~~~~~ -Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal +Two instances of :class:``~pandas.api.types.CategoricalDtype`` compare equal whenever they have the same categories and order. When comparing two unordered categoricals, the order of the ``categories`` is not considered. @@ -273,7 +273,7 @@ All instances of ``CategoricalDtype`` compare equal to the string ``'category'`` Description ----------- -Using :meth:`~DataFrame.describe` on categorical data will produce similar +Using :meth:``~DataFrame.describe`` on categorical data will produce similar output to a ``Series`` or ``DataFrame`` of type ``string``. .. ipython:: python @@ -288,7 +288,7 @@ output to a ``Series`` or ``DataFrame`` of type ``string``. Working with categories ----------------------- -Categorical data has a `categories` and a `ordered` property, which list their +Categorical data has a ``categories`` and a ``ordered`` property, which list their possible values and whether the ordering matters or not. These properties are exposed as ``s.cat.categories`` and ``s.cat.ordered``. If you don't manually specify categories and ordering, they are inferred from the passed arguments. @@ -316,7 +316,7 @@ It's also possible to pass in the categories in a specific order: .. note:: - The result of :meth:`~Series.unique` is not always the same as ``Series.cat.categories``, + The result of :meth:``~Series.unique`` is not always the same as ``Series.cat.categories``, because ``Series.unique()`` has a couple of guarantees, namely that it returns categories in the order of appearance, and it only includes values that are actually present. @@ -336,7 +336,7 @@ Renaming categories Renaming categories is done by assigning new values to the ``Series.cat.categories`` property or by using the -:meth:`~pandas.Categorical.rename_categories` method: +:meth:``~pandas.Categorical.rename_categories`` method: .. ipython:: python @@ -353,14 +353,14 @@ Renaming categories is done by assigning new values to the .. note:: - In contrast to R's `factor`, categorical data can have categories of other types than string. + In contrast to R's ``factor``, categorical data can have categories of other types than string. .. note:: Be aware that assigning new categories is an inplace operation, while most other operations - under ``Series.cat`` per default return a new ``Series`` of dtype `category`. + under ``Series.cat`` per default return a new ``Series`` of dtype ``category``. -Categories must be unique or a `ValueError` is raised: +Categories must be unique or a ``ValueError`` is raised: .. ipython:: python @@ -369,7 +369,7 @@ Categories must be unique or a `ValueError` is raised: except ValueError as e: print("ValueError:", str(e)) -Categories must also not be ``NaN`` or a `ValueError` is raised: +Categories must also not be ``NaN`` or a ``ValueError`` is raised: .. 
ipython:: python @@ -382,7 +382,7 @@ Appending new categories ~~~~~~~~~~~~~~~~~~~~~~~~ Appending categories can be done by using the -:meth:`~pandas.Categorical.add_categories` method: +:meth:``~pandas.Categorical.add_categories`` method: .. ipython:: python @@ -394,7 +394,7 @@ Removing categories ~~~~~~~~~~~~~~~~~~~ Removing categories can be done by using the -:meth:`~pandas.Categorical.remove_categories` method. Values which are removed +:meth:``~pandas.Categorical.remove_categories`` method. Values which are removed are replaced by ``np.nan``.: .. ipython:: python @@ -419,7 +419,7 @@ Setting categories If you want to do remove and add new categories in one step (which has some speed advantage), or simply set the categories to a predefined scale, -use :meth:`~pandas.Categorical.set_categories`. +use :meth:``~pandas.Categorical.set_categories``. .. ipython:: python @@ -430,7 +430,7 @@ use :meth:`~pandas.Categorical.set_categories`. s .. note:: - Be aware that :func:`Categorical.set_categories` cannot know whether some category is omitted + Be aware that :func:``Categorical.set_categories`` cannot know whether some category is omitted intentionally or because it is misspelled or (under Python3) due to a type difference (e.g., NumPy S1 dtype and Python strings). This can result in surprising behaviour! @@ -477,8 +477,8 @@ This is even true for strings and numeric data: Reordering ~~~~~~~~~~ -Reordering the categories is possible via the :meth:`Categorical.reorder_categories` and -the :meth:`Categorical.set_categories` methods. For :meth:`Categorical.reorder_categories`, all +Reordering the categories is possible via the :meth:``Categorical.reorder_categories`` and +the :meth:``Categorical.set_categories`` methods. For :meth:``Categorical.reorder_categories``, all old categories must be included in the new categories and no new categories are allowed. This will necessarily make the sort order the same as the categories order. @@ -501,9 +501,9 @@ necessarily make the sort order the same as the categories order. .. note:: - If the ``Categorical`` is not ordered, :meth:`Series.min` and :meth:`Series.max` will raise + If the ``Categorical`` is not ordered, :meth:``Series.min`` and :meth:``Series.max`` will raise ``TypeError``. Numeric operations like ``+``, ``-``, ``*``, ``/`` and operations based on them - (e.g. :meth:`Series.median`, which would need to compute the mean between two values if the length + (e.g. :meth:``Series.median``, which would need to compute the mean between two values if the length of an array is even) do not work and raise a ``TypeError``. Multi column sorting @@ -535,7 +535,7 @@ Comparing categorical data with other objects is possible in three cases: * Comparing equality (``==`` and ``!=``) to a list-like object (list, Series, array, ...) of the same length as the categorical data. * All comparisons (``==``, ``!=``, ``>``, ``>=``, ``<``, and ``<=``) of categorical data to - another categorical Series, when ``ordered==True`` and the `categories` are the same. + another categorical Series, when ``ordered==True`` and the ``categories`` are the same. * All comparisons of a categorical data to a scalar. 
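A brief sketch of the scalar and ordered same-categories cases from the list above (the data and variable names are illustrative):

.. code-block:: python

    import pandas as pd

    dtype = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
    cat = pd.Series([1, 2, 3]).astype(dtype)
    cat_base = pd.Series([2, 2, 2]).astype(dtype)

    # comparison against a scalar
    print(cat == 2)

    # ordered comparison against another categorical Series
    # with the same categories
    print(cat > cat_base)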
@@ -614,10 +614,10 @@ When you compare two unordered categoricals with the same categories, the order
Operations
----------
-Apart from :meth:`Series.min`, :meth:`Series.max` and :meth:`Series.mode`, the
+Apart from :meth:``Series.min``, :meth:``Series.max`` and :meth:``Series.mode``, the
following operations are possible with categorical data:
-``Series`` methods like :meth:`Series.value_counts` will use all categories,
+``Series`` methods like :meth:``Series.value_counts`` will use all categories,
even if some categories are not present in the data:
.. ipython:: python
@@ -657,7 +657,7 @@ Data munging
The optimized pandas data access methods ``.loc``, ``.iloc``, ``.at``, and ``.iat``,
work as normal. The only difference is the return type (for getting) and
-that only values already in `categories` can be assigned.
+that only values already in ``categories`` can be assigned.
Getting
~~~~~~~
@@ -695,8 +695,8 @@ of length "1".
df.at["h", "cats"] # returns a string
.. note::
- The is in contrast to R's `factor` function, where ``factor(c(1,2,3))[1]``
- returns a single value `factor`.
+ This is in contrast to R's ``factor`` function, where ``factor(c(1,2,3))[1]``
+ returns a single value ``factor``.
To get a single value ``Series`` of type ``category``, you pass in a list with
a single value:
@@ -732,7 +732,7 @@ an appropriate type:
That means, that the returned values from methods and properties on the accessors of a
``Series`` and the returned values from methods and properties on the accessors of this
-``Series`` transformed to one of type `category` will be equal:
+``Series`` transformed to one of type ``category`` will be equal:
.. ipython:: python
@@ -753,7 +753,7 @@ Setting
~~~~~~~
Setting values in a categorical column (or ``Series``) works as long as the
-value is included in the `categories`:
+value is included in the ``categories``:
.. ipython:: python
@@ -770,7 +770,7 @@ value is included in the `categories`:
except ValueError as e:
print("ValueError:", str(e))
-Setting values by assigning categorical data will also check that the `categories` match:
+Setting values by assigning categorical data will also check that the ``categories`` match:
.. ipython:: python
@@ -837,7 +837,7 @@ The following table summarizes the results of merging ``Categoricals``:
| category (int)    | category (float)       | False                | float (dtype is inferred)   |
+-------------------+------------------------+----------------------+-----------------------------+
-See also the section on :ref:`merge dtypes` for notes about
+See also the section on :ref:``merge dtypes `` for notes about
preserving merge dtypes and performance.
.. _categorical.union:
@@ -846,7 +846,7 @@ Unioning
~~~~~~~~
If you want to combine categoricals that do not necessarily have the same
-categories, the :func:`~pandas.api.types.union_categoricals` function will
+categories, the :func:``~pandas.api.types.union_categoricals`` function will
combine a list-like of categoricals. The new categories will be the union of
the categories being combined.
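The operations and unioning hunks above can be illustrated with a short sketch (illustrative names, not taken from the patch)::

    import pandas as pd
    from pandas.api.types import union_categoricals

    # value_counts reports every category, including those with zero occurrences.
    s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))
    print(s.value_counts())

    # union_categoricals combines categoricals whose categories differ;
    # the result's categories are the union of the inputs' categories.
    left = pd.Categorical(["x", "y"])
    right = pd.Categorical(["y", "z"])
    print(union_categoricals([left, right]))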
@@ -894,7 +894,7 @@ using the ``ignore_ordered=True`` argument.
b = pd.Categorical(["c", "b", "a"], ordered=True)
union_categoricals([a, b], ignore_order=True)
-:func:`~pandas.api.types.union_categoricals` also works with a
+:func:``~pandas.api.types.union_categoricals`` also works with a
``CategoricalIndex``, or ``Series`` containing categorical data, but note that
the resulting array will always be a plain ``Categorical``:
@@ -934,14 +934,14 @@ Getting data in/out
-------------------
You can write data that contains ``category`` dtypes to a ``HDFStore``.
-See :ref:`here ` for an example and caveats.
+See :ref:``here `` for an example and caveats.
It is also possible to write data to and reading data from *Stata* format files.
-See :ref:`here ` for an example and caveats.
+See :ref:``here `` for an example and caveats.
Writing to a CSV file will convert the data, effectively removing any information about the
categorical (categories and ordering). So if you read back the CSV file you have to convert the
-relevant columns back to `category` and assign the right categories and categories ordering.
+relevant columns back to ``category`` and assign the right categories and categories ordering.
.. ipython:: python
@@ -970,9 +970,9 @@ The same holds for writing to a SQL database with ``to_sql``.
Missing data
------------
-pandas primarily uses the value `np.nan` to represent missing data. It is by
-default not included in computations. See the :ref:`Missing Data section
-`.
+pandas primarily uses the value ``np.nan`` to represent missing data. It is by
+default not included in computations. See the :ref:``Missing Data section
+``.
Missing values should **not** be included in the Categorical's ``categories``,
only in the ``values``.
@@ -988,8 +988,8 @@ a code of ``-1``.
s.cat.codes
-Methods for working with missing data, e.g. :meth:`~Series.isna`, :meth:`~Series.fillna`,
-:meth:`~Series.dropna`, all work normally:
+Methods for working with missing data, e.g. :meth:``~Series.isna``, :meth:``~Series.fillna``,
+:meth:``~Series.dropna``, all work normally:
.. ipython:: python
@@ -998,20 +998,20 @@ Methods for working with missing data, e.g. :meth:`~Series.isna`, :meth:`~Series
pd.isna(s)
s.fillna("a")
-Differences to R's `factor`
+Differences to R's ``factor``
---------------------------
The following differences to R's factor functions can be observed:
-* R's `levels` are named `categories`.
-* R's `levels` are always of type string, while `categories` in pandas can be of any dtype.
+* R's ``levels`` are named ``categories``.
+* R's ``levels`` are always of type string, while ``categories`` in pandas can be of any dtype.
* It's not possible to specify labels at creation time. Use ``s.cat.rename_categories(new_labels)``
afterwards.
-* In contrast to R's `factor` function, using categorical data as the sole input to create a
+* In contrast to R's ``factor`` function, using categorical data as the sole input to create a
new categorical series will *not* remove unused categories but create a new categorical series
which is equal to the passed in one!
-* R allows for missing values to be included in its `levels` (pandas' `categories`). Pandas
- does not allow `NaN` categories, but missing values can still be in the `values`.
+* R allows for missing values to be included in its ``levels`` (pandas' ``categories``). Pandas
+ does not allow ``NaN`` categories, but missing values can still be in the ``values``.
Gotchas
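As a small, runnable sketch of the CSV round-trip and missing-data behaviour described in the hunks above (column name and categories are made up for illustration)::

    import io
    import pandas as pd

    df = pd.DataFrame({"cats": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])})
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    buf.seek(0)

    # Reading the CSV back loses the categorical dtype; re-apply it explicitly.
    df2 = pd.read_csv(buf)
    df2["cats"] = df2["cats"].astype(pd.CategoricalDtype(["a", "b", "c"]))
    print(df2["cats"].dtype)

    # Missing values are never categories themselves; they are stored as code -1.
    s = pd.Series(["a", None, "b"], dtype="category")
    print(s.cat.codes.tolist())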
@@ -1053,13 +1053,13 @@ an ``object`` dtype is a constant times the length of the data.
s.astype('category').nbytes
-`Categorical` is not a `numpy` array
+``Categorical`` is not a ``numpy`` array
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Currently, categorical data and the underlying ``Categorical`` is implemented as a Python
object and not as a low-level NumPy array dtype. This leads to some problems.
-NumPy itself doesn't know about the new `dtype`:
+NumPy itself doesn't know about the new ``dtype``:
.. ipython:: python
@@ -1088,7 +1088,7 @@ To check if a Series contains Categorical data, use ``hasattr(s, 'cat')``:
hasattr(pd.Series(['a'], dtype='category'), 'cat')
hasattr(pd.Series(['a']), 'cat')
-Using NumPy functions on a ``Series`` of type ``category`` should not work as `Categoricals`
+Using NumPy functions on a ``Series`` of type ``category`` should not work as ``Categoricals``
are not numeric data (even in the case that ``.categories`` is numeric).
.. ipython:: python
@@ -1107,7 +1107,7 @@ dtype in apply
~~~~~~~~~~~~~~
Pandas currently does not preserve the dtype in apply functions: If you apply along rows you get
-a `Series` of ``object`` `dtype` (same as getting a row -> getting one element will return a
+a ``Series`` of ``object`` ``dtype`` (same as getting a row -> getting one element will return a
basic type) and applying along columns will also convert to object. ``NaN`` values are unaffected.
You can use ``fillna`` to handle missing values before applying a function.
@@ -1125,7 +1125,7 @@ Categorical index
``CategoricalIndex`` is a type of index that is useful for supporting
indexing with duplicates. This is a container around a ``Categorical``
and allows efficient indexing and storage of an index with a large number of duplicated elements.
-See the :ref:`advanced indexing docs ` for a more detailed
+See the :ref:``advanced indexing docs `` for a more detailed
explanation.
Setting the index will create a ``CategoricalIndex``:
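To round off the gotchas and ``CategoricalIndex`` hunks, a brief sketch of the behaviours they describe (assumed example data, not from the patch)::

    import numpy as np
    import pandas as pd

    s = pd.Series(["a", "b", "a"], dtype="category")

    # A cheap check for categorical data on a Series.
    print(hasattr(s, "cat"))                 # True
    print(hasattr(pd.Series(["a"]), "cat"))  # False

    # NumPy does not understand the categorical dtype, so asarray falls back to objects.
    print(np.asarray(s).dtype)               # object

    # A CategoricalIndex stores heavily duplicated labels efficiently.
    df = pd.DataFrame({"v": range(4)}, index=pd.CategoricalIndex(list("abab")))
    print(df.loc["a"])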