DOC: Fix some rendering problems in arrow and io user guide #52530

Merged: 1 commit, Apr 10, 2023

60 changes: 30 additions & 30 deletions doc/source/user_guide/io.rst
@@ -302,7 +302,7 @@ date_format : str or dict of column -> format, default ``None``
format. For anything more complex,
please read in as ``object`` and then apply :func:`to_datetime` as-needed.

- .. versionadded:: 2.0.0
+ .. versionadded:: 2.0.0
dayfirst : boolean, default ``False``
DD/MM format dates, international and European format.
cache_dates : boolean, default True
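
As an aside, a minimal sketch of ``date_format`` in action (the inline CSV and format string are made up for illustration):

.. ipython:: python

    from io import StringIO

    data = "date\n01/12/2023\n02/12/2023"
    pd.read_csv(StringIO(data), parse_dates=["date"], date_format="%d/%m/%Y")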
@@ -385,9 +385,9 @@ on_bad_lines : {{'error', 'warn', 'skip'}}, default 'error'
Specifies what to do upon encountering a bad line (a line with too many fields).
Allowed values are :

- - 'error', raise an ParserError when a bad line is encountered.
- - 'warn', print a warning when a bad line is encountered and skip that line.
- - 'skip', skip bad lines without raising or warning when they are encountered.
+ - 'error', raise an ParserError when a bad line is encountered.
+ - 'warn', print a warning when a bad line is encountered and skip that line.
+ - 'skip', skip bad lines without raising or warning when they are encountered.

.. versionadded:: 1.3.0
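
For illustration, a short sketch of ``on_bad_lines`` (the inline CSV is made up; the row with four fields is dropped under ``"skip"``):

.. ipython:: python

    from io import StringIO

    data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"
    pd.read_csv(StringIO(data), on_bad_lines="skip")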

@@ -1998,12 +1998,12 @@ fall back in the following manner:
* if an object is unsupported it will attempt the following:


- * check if the object has defined a ``toDict`` method and call it.
+ - check if the object has defined a ``toDict`` method and call it.
A ``toDict`` method should return a ``dict`` which will then be JSON serialized.

- * invoke the ``default_handler`` if one was provided.
+ - invoke the ``default_handler`` if one was provided.

- * convert the object to a ``dict`` by traversing its contents. However this will often fail
+ - convert the object to a ``dict`` by traversing its contents. However this will often fail
with an ``OverflowError`` or give unexpected results.

In general the best approach for unsupported objects or dtypes is to provide a ``default_handler``.
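
A minimal sketch of that fallback chain — ``str`` is a serviceable ``default_handler``, and the ``Custom`` class below is hypothetical:

.. ipython:: python

    class Custom:
        def __init__(self, value):
            self.value = value

    # each unsupported object is passed to the handler and stringified
    pd.Series([Custom(1), Custom(2)]).to_json(default_handler=str)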
@@ -2092,19 +2092,19 @@ preserve string-like numbers (e.g. '1', '2') in an axes.

Large integer values may be converted to dates if ``convert_dates=True`` and the data and / or column labels appear 'date-like'. The exact threshold depends on the ``date_unit`` specified. 'date-like' means that the column label meets one of the following criteria:

- * it ends with ``'_at'``
- * it ends with ``'_time'``
- * it begins with ``'timestamp'``
- * it is ``'modified'``
- * it is ``'date'``
+ * it ends with ``'_at'``
+ * it ends with ``'_time'``
+ * it begins with ``'timestamp'``
+ * it is ``'modified'``
+ * it is ``'date'``

.. warning::

When reading JSON data, automatic coercing into dtypes has some quirks:

- * an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization
- * a column that was ``float`` data will be converted to ``integer`` if it can be done safely, e.g. a column of ``1.``
- * bool columns will be converted to ``integer`` on reconstruction
+ * an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization
+ * a column that was ``float`` data will be converted to ``integer`` if it can be done safely, e.g. a column of ``1.``
+ * bool columns will be converted to ``integer`` on reconstruction

Thus there are times where you may want to specify specific dtypes via the ``dtype`` keyword argument.
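
To make the date-like rules and the ``dtype`` escape hatch concrete (``modified`` is one of the labels listed above; the JSON is made up):

.. ipython:: python

    from io import StringIO

    json_str = '{"modified": {"0": 1356998400000, "1": 1357084800000}}'
    pd.read_json(StringIO(json_str)).dtypes  # coerced to datetime64
    pd.read_json(StringIO(json_str), convert_dates=False).dtypes  # stays integer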

@@ -2370,19 +2370,19 @@ A few notes on the generated table schema:

* The default naming roughly follows these rules:

- * For series, the ``object.name`` is used. If that's none, then the
+ - For series, the ``object.name`` is used. If that's none, then the
name is ``values``
- * For ``DataFrames``, the stringified version of the column name is used
- * For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
+ - For ``DataFrames``, the stringified version of the column name is used
+ - For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
fallback to ``index`` if that is None.
- * For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
+ - For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
then ``level_<i>`` is used.

``read_json`` also accepts ``orient='table'`` as an argument. This allows for
the preservation of metadata such as dtypes and index names in a
round-trippable manner.

- .. ipython:: python
+ .. ipython:: python

df = pd.DataFrame(
{
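
A compact sketch of that round trip (``df_rt`` is a throwaway frame, not the ``df`` from the snippet above):

.. ipython:: python

    from io import StringIO

    df_rt = pd.DataFrame({"a": [1.5, 2.5]}, index=pd.Index([0, 1], name="idx"))
    new_df = pd.read_json(StringIO(df_rt.to_json(orient="table")), orient="table")
    new_df.dtypes
    new_df.index.name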
@@ -2780,20 +2780,20 @@ parse HTML tables in the top-level pandas io function ``read_html``.

* Benefits

- * |lxml|_ is very fast.
+ - |lxml|_ is very fast.

- * |lxml|_ requires Cython to install correctly.
+ - |lxml|_ requires Cython to install correctly.

* Drawbacks

- * |lxml|_ does *not* make any guarantees about the results of its parse
+ - |lxml|_ does *not* make any guarantees about the results of its parse
*unless* it is given |svm|_.

- * In light of the above, we have chosen to allow you, the user, to use the
+ - In light of the above, we have chosen to allow you, the user, to use the
|lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
fails to parse

- * It is therefore *highly recommended* that you install both
+ - It is therefore *highly recommended* that you install both
|BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
result (provided everything else is valid) even if |lxml|_ fails.
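
In practice the backend is selected through ``read_html``'s ``flavor`` argument; a sketch, with a stand-in URL:

.. code-block:: python

    url = "https://example.com/page-with-tables.html"  # stand-in
    dfs = pd.read_html(url, flavor="lxml")  # lxml first, html5lib on parse failure
    dfs = pd.read_html(url, flavor="bs4")   # BeautifulSoup4 backed by html5lib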

@@ -2806,22 +2806,22 @@ parse HTML tables in the top-level pandas io function ``read_html``.

* Benefits

- * |html5lib|_ is far more lenient than |lxml|_ and consequently deals
+ - |html5lib|_ is far more lenient than |lxml|_ and consequently deals
with *real-life markup* in a much saner way rather than just, e.g.,
dropping an element without notifying you.

- * |html5lib|_ *generates valid HTML5 markup from invalid markup
+ - |html5lib|_ *generates valid HTML5 markup from invalid markup
automatically*. This is extremely important for parsing HTML tables,
since it guarantees a valid document. However, that does NOT mean that
it is "correct", since the process of fixing markup does not have a
single definition.

- * |html5lib|_ is pure Python and requires no additional build steps beyond
+ - |html5lib|_ is pure Python and requires no additional build steps beyond
its own installation.

* Drawbacks

- * The biggest drawback to using |html5lib|_ is that it is slow as
+ - The biggest drawback to using |html5lib|_ is that it is slow as
molasses. However consider the fact that many tables on the web are not
big enough for the parsing algorithm runtime to matter. It is more
likely that the bottleneck will be in the process of reading the raw
@@ -3211,7 +3211,7 @@ supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_,
which are memory-efficient methods to iterate through an XML tree and extract specific elements
and attributes without holding the entire tree in memory.

- .. versionadded:: 1.5.0
+ .. versionadded:: 1.5.0

.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
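
A sketch of the ``iterparse`` argument (path and element names are placeholders): only the listed children and attributes of each repeating ``row`` element are materialized, one element at a time:

.. code-block:: python

    df = pd.read_xml(
        "/path/to/very_large_file.xml",  # placeholder path
        iterparse={"row": ["shape", "degrees", "sides"]},
    )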
21 changes: 16 additions & 5 deletions doc/source/user_guide/pyarrow.rst
@@ -37,10 +37,10 @@ which is similar to a NumPy array. To construct these from the main pandas data

.. note::

- The string alias ``"string[pyarrow]"`` maps to ``pd.StringDtype("pyarrow")`` which is not equivalent to
- specifying ``dtype=pd.ArrowDtype(pa.string())``. Generally, operations on the data will behave similarly
- except ``pd.StringDtype("pyarrow")`` can return NumPy-backed nullable types while ``pd.ArrowDtype(pa.string())``
- will return :class:`ArrowDtype`.
+ The string alias ``"string[pyarrow]"`` maps to ``pd.StringDtype("pyarrow")`` which is not equivalent to
+ specifying ``dtype=pd.ArrowDtype(pa.string())``. Generally, operations on the data will behave similarly
+ except ``pd.StringDtype("pyarrow")`` can return NumPy-backed nullable types while ``pd.ArrowDtype(pa.string())``
+ will return :class:`ArrowDtype`.

.. ipython:: python
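
One way to observe the difference described in the note above (a sketch; exact return dtypes can vary across pandas versions):

.. ipython:: python

    s1 = pd.Series(["ab", None], dtype="string[pyarrow]")
    s2 = pd.Series(["ab", None], dtype=pd.ArrowDtype(pa.string()))
    s1.str.len().dtype  # NumPy-backed nullable Int64
    s2.str.len().dtype  # int64[pyarrow], an ArrowDtype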

@@ -62,10 +62,14 @@ into :class:`ArrowDtype` to use in the ``dtype`` parameter.
ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))
ser

+ .. ipython:: python

from datetime import time
idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))
idx

+ .. ipython:: python

from decimal import Decimal
decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))
data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]
@@ -78,7 +82,10 @@ or :class:`DataFrame` object.

.. ipython:: python

- pa_array = pa.array([{"1": "2"}, {"10": "20"}, None])
+ pa_array = pa.array(
+     [{"1": "2"}, {"10": "20"}, None],
+     type=pa.map_(pa.string(), pa.string()),
+ )
ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
ser

@@ -133,9 +140,13 @@ The following are just some examples of operations that are accelerated by native
ser.isna()
ser.fillna(0)

+ .. ipython:: python

ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
ser_str.str.startswith("a")

+ .. ipython:: python

from datetime import datetime
pa_type = pd.ArrowDtype(pa.timestamp("ns"))
ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
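
Continuing that snippet, datetime operations on the Arrow-backed series also dispatch to PyArrow compute functions where available (a sketch; coverage varies by version):

.. ipython:: python

    ser_dt.dt.year
    ser_dt.dt.strftime("%Y-%m-%d")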