DEPR: Remove literal string input for read_xml #53809

Merged
merged 23 commits into from
Jul 11, 2023
Changes from 8 commits
Commits
f347e8e
Updating documentation and adding deprecation logic for read_xml.
rmhowe425 Jun 22, 2023
296b45a
Fixing documentation issue and adding unit test
rmhowe425 Jun 23, 2023
69cdc1a
Updating unit tests and documentation.
rmhowe425 Jun 23, 2023
83a9177
Merge branch 'main' into dev/depr/literal-str-read_xml
rmhowe425 Jun 23, 2023
0f0f38b
Fixing unit tests and documentation issues
rmhowe425 Jun 24, 2023
2c848ac
Fixing unit tests and documentation issues
rmhowe425 Jun 24, 2023
b8a582c
Fixing unit tests and documentation issues
rmhowe425 Jun 24, 2023
92bc6fa
Fixing import error in documentation
rmhowe425 Jun 24, 2023
8bbd7c4
Updated deprecation logic per reviewer recommendations.
rmhowe425 Jun 26, 2023
5aece78
Updating deprecation logic and documentation per reviewer recommendat…
rmhowe425 Jun 26, 2023
6f15924
Fixing logic error
rmhowe425 Jun 26, 2023
00f7b15
Fixing implementation per reviewer recommendations.
rmhowe425 Jun 27, 2023
20e7ef2
Updating implementation per reviewer recommendations.
rmhowe425 Jun 27, 2023
526c224
Cleaning up the deprecation logic a bit.
rmhowe425 Jun 27, 2023
9dfa18d
Merge branch 'main' into dev/depr/literal-str-read_xml
rmhowe425 Jun 27, 2023
65f88b9
Updating implementation per reviewer recommendations.
rmhowe425 Jun 27, 2023
ec28efa
Merge branch 'main' into dev/depr/literal-str-read_xml
rmhowe425 Jun 28, 2023
2c58638
Merge branch 'main' into dev/depr/literal-str-read_xml
rmhowe425 Jun 29, 2023
e08f4e0
Merge branch 'main' into dev/depr/literal-str-read_xml
rmhowe425 Jun 30, 2023
ba1edd6
Merge branch 'main' into dev/depr/literal-str-read_xml
rmhowe425 Jul 9, 2023
b7e1fb6
Updating unit tests
rmhowe425 Jul 9, 2023
14d2cb1
Fixing discrepancy in doc string.
rmhowe425 Jul 9, 2023
c215a94
Updating implementation based on reviewer recommendations.
rmhowe425 Jul 11, 2023
13 changes: 7 additions & 6 deletions doc/source/user_guide/io.rst
Expand Up @@ -2920,6 +2920,7 @@ Read an XML string:

.. ipython:: python

from io import StringIO
xml = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
Expand All @@ -2942,7 +2943,7 @@ Read an XML string:
</book>
</bookstore>"""

df = pd.read_xml(xml)
df = pd.read_xml(StringIO(xml))
df

Read a URL with no options:
Expand All @@ -2962,7 +2963,7 @@ as a string:
f.write(xml)

with open(file_path, "r") as f:
df = pd.read_xml(f.read())
df = pd.read_xml(StringIO(f.read()))
df

Read in the content of the "books.xml" as instance of ``StringIO`` or
Expand Down Expand Up @@ -3053,7 +3054,7 @@ For example, below XML contains a namespace with prefix, ``doc``, and URI at
</doc:row>
</doc:data>"""

df = pd.read_xml(xml,
df = pd.read_xml(StringIO(xml),
xpath="//doc:row",
namespaces={"doc": "https://example.com"})
df
Expand Down Expand Up @@ -3083,7 +3084,7 @@ But assigning *any* temporary name to correct URI allows parsing by nodes.
</row>
</data>"""

df = pd.read_xml(xml,
df = pd.read_xml(StringIO(xml),
xpath="//pandas:row",
namespaces={"pandas": "https://example.com"})
df
Expand Down Expand Up @@ -3118,7 +3119,7 @@ However, if XPath does not reference node names such as default, ``/*``, then
</row>
</data>"""

df = pd.read_xml(xml, xpath="./row")
df = pd.read_xml(StringIO(xml), xpath="./row")
df

shows the attribute ``sides`` on ``shape`` element was not parsed as
Expand Down Expand Up @@ -3219,7 +3220,7 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
</row>
</response>"""

df = pd.read_xml(xml, stylesheet=xsl)
df = pd.read_xml(StringIO(xml), stylesheet=xsl)
df

For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v1.5.0.rst
Expand Up @@ -221,6 +221,7 @@ apply converter methods, and parse dates (:issue:`43567`).

.. ipython:: python

from io import StringIO
xml_dates = """<?xml version='1.0' encoding='utf-8'?>
<data>
<row>
Expand All @@ -244,7 +245,7 @@ apply converter methods, and parse dates (:issue:`43567`).
</data>"""

df = pd.read_xml(
xml_dates,
StringIO(xml_dates),
dtype={'sides': 'Int64'},
converters={'degrees': str},
parse_dates=['date']
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v2.1.0.rst
Expand Up @@ -298,13 +298,15 @@ Deprecations
- Deprecated constructing :class:`SparseArray` from scalar data, pass a sequence instead (:issue:`53039`)
- Deprecated falling back to filling when ``value`` is not specified in :meth:`DataFrame.replace` and :meth:`Series.replace` with non-dict-like ``to_replace`` (:issue:`33302`)
- Deprecated literal json input to :func:`read_json`. Wrap literal json string input in ``io.StringIO`` instead. (:issue:`53409`)
- Deprecated literal string input to :func:`read_xml`. Wrap literal string/bytes input in ``io.StringIO`` instead. (:issue:`53767`)
Member

Suggested change
- Deprecated literal string input to :func:`read_xml`. Wrap literal string/bytes input in ``io.StringIO`` instead. (:issue:`53767`)
- Deprecated literal string input to :func:`read_xml`. Wrap literal string/bytes input in ``io.StringIO``/``io.BytesIO`` instead. (:issue:`53767`)

- Deprecated option "mode.use_inf_as_na", convert inf entries to ``NaN`` before instead (:issue:`51684`)
- Deprecated parameter ``obj`` in :meth:`GroupBy.get_group` (:issue:`53545`)
- Deprecated positional indexing on :class:`Series` with :meth:`Series.__getitem__` and :meth:`Series.__setitem__`, in a future version ``ser[item]`` will *always* interpret ``item`` as a label, not a position (:issue:`50617`)
- Deprecated strings ``T``, ``t``, ``L`` and ``l`` denoting units in :func:`to_timedelta` (:issue:`52536`)
- Deprecated the "method" and "limit" keywords on :meth:`Series.fillna`, :meth:`DataFrame.fillna`, :meth:`SeriesGroupBy.fillna`, :meth:`DataFrameGroupBy.fillna`, and :meth:`Resampler.fillna`, use ``obj.bfill()`` or ``obj.ffill()`` instead (:issue:`53394`)
- Deprecated the ``method`` and ``limit`` keywords in :meth:`DataFrame.replace` and :meth:`Series.replace` (:issue:`33302`)
- Deprecated values "pad", "ffill", "bfill", "backfill" for :meth:`Series.interpolate` and :meth:`DataFrame.interpolate`, use ``obj.ffill()`` or ``obj.bfill()`` instead (:issue:`53581`)
-
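
The wrapping described in the `read_xml` deprecation entry can be sketched with the standard library alone. This is a minimal illustration, not pandas code: the helper name `as_xml_buffer` is hypothetical, and `xml.etree.ElementTree` stands in for any parser that accepts file-like objects. With pandas, the wrapped object would be passed as in the updated docs, e.g. `pd.read_xml(StringIO(xml))`.

```python
import io
import xml.etree.ElementTree as ET


def as_xml_buffer(data):
    """Wrap literal XML in a file-like object (hypothetical helper)."""
    if isinstance(data, str):
        return io.StringIO(data)
    if isinstance(data, (bytes, bytearray)):
        return io.BytesIO(data)
    # Anything else (paths, already-open files) is returned unchanged.
    return data


xml_text = "<data><row><shape>square</shape><sides>4</sides></row></data>"

# ElementTree is only a stand-in here for a reader that accepts buffers.
root_from_str = ET.parse(as_xml_buffer(xml_text)).getroot()
root_from_bytes = ET.parse(as_xml_buffer(xml_text.encode("utf-8"))).getroot()

shapes = [row.findtext("shape") for row in root_from_str.iter("row")]
```

Text input maps to `io.StringIO` and bytes input to `io.BytesIO`, which is the distinction the reviewer's suggested change adds to the entry.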

.. ---------------------------------------------------------------------------
.. _whatsnew_210.performance:
21 changes: 18 additions & 3 deletions pandas/io/xml.py
Expand Up @@ -12,6 +12,7 @@
Callable,
Sequence,
)
import warnings

from pandas._libs import lib
from pandas.compat._optional import import_optional_dependency
Expand All @@ -20,6 +21,7 @@
ParserError,
)
from pandas.util._decorators import doc
from pandas.util._exceptions import find_stack_level
from pandas.util._validators import check_dtype_backend

from pandas.core.dtypes.common import is_list_like
Expand Down Expand Up @@ -894,6 +896,9 @@ def read_xml(
string or a path. The string can further be a URL. Valid URL schemes
include http, ftp, s3, and file.

.. deprecated:: 2.1.0
Passing xml literal strings is deprecated.
Member

Could you mention wrapping in StringIO/BytesIO as an alternate?


xpath : str, optional, default './\*'
The XPath to parse required set of nodes for migration to DataFrame.
XPath should return a collection of elements and not a single
Expand Down Expand Up @@ -1049,6 +1054,7 @@ def read_xml(

Examples
--------
>>> import io
>>> xml = '''<?xml version='1.0' encoding='utf-8'?>
... <data xmlns="http://example.com">
... <row>
Expand All @@ -1068,7 +1074,7 @@ def read_xml(
... </row>
... </data>'''

>>> df = pd.read_xml(xml)
>>> df = pd.read_xml(io.StringIO(xml))
>>> df
shape degrees sides
0 square 360 4.0
Expand All @@ -1082,7 +1088,7 @@ def read_xml(
... <row shape="triangle" degrees="180" sides="3.0"/>
... </data>'''

>>> df = pd.read_xml(xml, xpath=".//row")
>>> df = pd.read_xml(io.StringIO(xml), xpath=".//row")
>>> df
shape degrees sides
0 square 360 4.0
Expand All @@ -1108,7 +1114,7 @@ def read_xml(
... </doc:row>
... </doc:data>'''

>>> df = pd.read_xml(xml,
>>> df = pd.read_xml(io.StringIO(xml),
... xpath="//doc:row",
... namespaces={{"doc": "https://example.com"}})
>>> df
Expand All @@ -1119,6 +1125,15 @@ def read_xml(
"""
check_dtype_backend(dtype_backend)

if isinstance(path_or_buffer, str) and "\n" in path_or_buffer:
Member

Is "\n" a reliable way to detect whether it's literal xml?

Member

I think you can use is_file_like (and in your other PRs)

Contributor Author

@rmhowe425 rmhowe425 Jun 26, 2023

@mroeschke Ah thanks for letting me know about is_file_like!

Are we okay with updating the detection logic to include both is_file_like and the check for "\n"? is_file_like doesn't help differentiate between raw xml input and a url, so that's where checking for "\n" could be helpful.

Member

You can use is_url to detect urls.

    warnings.warn(
        "Passing literal xml to 'read_xml' is deprecated and "
        "will be removed in a future version. To read from a "
        "literal string, wrap it in a 'StringIO' object.",
        FutureWarning,
        stacklevel=find_stack_level(),
    )
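
The review thread above questions whether a `"\n"` check reliably identifies literal XML and points to `is_file_like` and `is_url` as alternatives. The following is a rough standard-library sketch of that kind of classification, not the pandas implementation: `classify_xml_input` and `URL_SCHEMES` are hypothetical names, the scheme list comes from the docstring ("http, ftp, s3, and file") with `https` added as an assumption, and treating a leading `<` as a sign of literal XML is likewise an assumption.

```python
import io
import os
from urllib.parse import urlparse

# Schemes named in the read_xml docstring, plus https (assumed).
URL_SCHEMES = {"http", "https", "ftp", "s3", "file"}


def classify_xml_input(obj):
    """Return 'buffer', 'url', 'path', or 'literal' (hypothetical helper)."""
    if hasattr(obj, "read"):
        # File-like objects such as StringIO, BytesIO, or open file handles.
        return "buffer"
    if isinstance(obj, os.PathLike):
        return "path"
    if isinstance(obj, (str, bytes)):
        text = obj.decode("utf-8", "replace") if isinstance(obj, bytes) else obj
        # Assumption: literal XML starts with '<' once leading whitespace is ignored.
        if text.lstrip().startswith("<"):
            return "literal"
        if urlparse(text).scheme in URL_SCHEMES:
            return "url"
        return "path"
    raise TypeError(f"unsupported input type: {type(obj).__name__}")


single_line = "<data><row/></data>"      # literal XML without any newline
multi_line = "<data>\n<row/>\n</data>"   # literal XML with newlines
```

A single-line document such as `single_line` contains no `"\n"`, so a newline-only check would not flag it; that is one way the concern raised in the thread can show up in practice.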

return _parse(
path_or_buffer=path_or_buffer,
xpath=xpath,
77 changes: 46 additions & 31 deletions pandas/tests/io/xml/test_xml.py
Expand Up @@ -247,6 +247,19 @@
)


@td.skip_if_no("lxml")
def test_literal_xml_deprecation():
    # GH 53809
    msg = (
        "Passing literal xml to 'read_xml' is deprecated and "
        "will be removed in a future version. To read from a "
        "literal string, wrap it in a 'StringIO' object."
    )

    with tm.assert_produces_warning(FutureWarning, match=msg):
        read_xml(xml_default_nmsp)
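
The test above relies on pandas' own testing helper. The same check can be sketched with only the standard `warnings` module, shown here against a stand-in function rather than `read_xml` itself. `read_literal` is hypothetical, and it uses a fixed `stacklevel=2` where the PR uses `find_stack_level()`.

```python
import warnings


def read_literal(data):
    """Stand-in for a reader that deprecates literal string input (hypothetical)."""
    if isinstance(data, str) and "\n" in data:
        warnings.warn(
            "Passing literal xml to 'read_xml' is deprecated and "
            "will be removed in a future version. To read from a "
            "literal string, wrap it in a 'StringIO' object.",
            FutureWarning,
            stacklevel=2,
        )
    return data


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # ensure the warning is recorded, not filtered
    read_literal("<data>\n<row/>\n</data>")  # multi-line literal: should warn
    read_literal("books.xml")                # path-like string: should not warn

messages = [str(w.message) for w in caught if issubclass(w.category, FutureWarning)]
```

Recording with `catch_warnings(record=True)` makes it possible to assert both that the literal input warns and that the path-like input stays silent.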


@pytest.fixture(params=["rb", "r"])
def mode(request):
return request.param
Expand Down Expand Up @@ -391,6 +404,11 @@ def test_file_buffered_reader_string(xml_books, parser, mode):
with open(xml_books, mode, encoding="utf-8" if mode == "r" else None) as f:
xml_obj = f.read()

if mode == "rb":
xml_obj = StringIO(xml_obj.decode())
elif mode == "r":
xml_obj = StringIO(xml_obj)

df_str = read_xml(xml_obj, parser=parser)

df_expected = DataFrame(
Expand All @@ -411,6 +429,11 @@ def test_file_buffered_reader_no_xml_declaration(xml_books, parser, mode):
next(f)
xml_obj = f.read()

if mode == "rb":
xml_obj = StringIO(xml_obj.decode())
elif mode == "r":
xml_obj = StringIO(xml_obj)

df_str = read_xml(xml_obj, parser=parser)

df_expected = DataFrame(
Expand Down Expand Up @@ -580,7 +603,7 @@ def test_bad_xpath_lxml(xml_books):

def test_default_namespace(parser):
df_nmsp = read_xml(
xml_default_nmsp,
StringIO(xml_default_nmsp),
xpath=".//ns:row",
namespaces={"ns": "http://example.com"},
parser=parser,
Expand All @@ -606,7 +629,7 @@ def test_default_namespace(parser):

def test_prefix_namespace(parser):
df_nmsp = read_xml(
xml_prefix_nmsp,
StringIO(xml_prefix_nmsp),
xpath=".//doc:row",
namespaces={"doc": "http://example.com"},
parser=parser,
Expand All @@ -630,14 +653,14 @@ def test_prefix_namespace(parser):
@td.skip_if_no("lxml")
def test_consistency_default_namespace():
df_lxml = read_xml(
xml_default_nmsp,
StringIO(xml_default_nmsp),
xpath=".//ns:row",
namespaces={"ns": "http://example.com"},
parser="lxml",
)

df_etree = read_xml(
xml_default_nmsp,
StringIO(xml_default_nmsp),
xpath=".//doc:row",
namespaces={"doc": "http://example.com"},
parser="etree",
Expand All @@ -649,14 +672,14 @@ def test_consistency_default_namespace():
@td.skip_if_no("lxml")
def test_consistency_prefix_namespace():
df_lxml = read_xml(
xml_prefix_nmsp,
StringIO(xml_prefix_nmsp),
xpath=".//doc:row",
namespaces={"doc": "http://example.com"},
parser="lxml",
)

df_etree = read_xml(
xml_prefix_nmsp,
StringIO(xml_prefix_nmsp),
xpath=".//doc:row",
namespaces={"doc": "http://example.com"},
parser="etree",
Expand Down Expand Up @@ -693,7 +716,7 @@ def test_none_namespace_prefix(key):
TypeError, match=("empty namespace prefix is not supported in XPath")
):
read_xml(
xml_default_nmsp,
StringIO(xml_default_nmsp),
xpath=".//kml:Placemark",
namespaces={key: "http://www.opengis.net/kml/2.2"},
parser="lxml",
Expand Down Expand Up @@ -782,7 +805,7 @@ def test_empty_attrs_only(parser):
ValueError,
match=("xpath does not return any nodes or attributes"),
):
read_xml(xml, xpath="./row", attrs_only=True, parser=parser)
read_xml(StringIO(xml), xpath="./row", attrs_only=True, parser=parser)


def test_empty_elems_only(parser):
Expand All @@ -797,7 +820,7 @@ def test_empty_elems_only(parser):
ValueError,
match=("xpath does not return any nodes or attributes"),
):
read_xml(xml, xpath="./row", elems_only=True, parser=parser)
read_xml(StringIO(xml), xpath="./row", elems_only=True, parser=parser)


@td.skip_if_no("lxml")
Expand All @@ -822,8 +845,8 @@ def test_attribute_centric_xml():
</Stations>
</TrainSchedule>"""

df_lxml = read_xml(xml, xpath=".//station")
df_etree = read_xml(xml, xpath=".//station", parser="etree")
df_lxml = read_xml(StringIO(xml), xpath=".//station")
df_etree = read_xml(StringIO(xml), xpath=".//station", parser="etree")

df_iter_lx = read_xml_iterparse(xml, iterparse={"station": ["Name", "coords"]})
df_iter_et = read_xml_iterparse(
Expand Down Expand Up @@ -875,7 +898,10 @@ def test_repeat_names(parser):
</shape>
</shapes>"""
df_xpath = read_xml(
xml, xpath=".//shape", parser=parser, names=["type_dim", "shape", "type_edge"]
StringIO(xml),
xpath=".//shape",
parser=parser,
names=["type_dim", "shape", "type_edge"],
)

df_iter = read_xml_iterparse(
Expand Down Expand Up @@ -917,7 +943,9 @@ def test_repeat_values_new_names(parser):
<family>ellipse</family>
</shape>
</shapes>"""
df_xpath = read_xml(xml, xpath=".//shape", parser=parser, names=["name", "group"])
df_xpath = read_xml(
StringIO(xml), xpath=".//shape", parser=parser, names=["name", "group"]
)

df_iter = read_xml_iterparse(
xml,
Expand Down Expand Up @@ -960,7 +988,7 @@ def test_repeat_elements(parser):
</shape>
</shapes>"""
df_xpath = read_xml(
xml,
StringIO(xml),
xpath=".//shape",
parser=parser,
names=["name", "family", "degrees", "sides"],
Expand Down Expand Up @@ -1339,19 +1367,6 @@ def test_empty_stylesheet(val):


# ITERPARSE


def test_string_error(parser):
with pytest.raises(
ParserError, match=("iterparse is designed for large XML files")
):
read_xml(
xml_default_nmsp,
parser=parser,
iterparse={"row": ["shape", "degrees", "sides", "date"]},
)


def test_file_like_iterparse(xml_books, parser, mode):
with open(xml_books, mode, encoding="utf-8" if mode == "r" else None) as f:
if mode == "r" and parser == "lxml":
Expand Down Expand Up @@ -1532,7 +1547,7 @@ def test_comment(parser):
</shapes>
<!-- comment after root -->"""

df_xpath = read_xml(xml, xpath=".//shape", parser=parser)
df_xpath = read_xml(StringIO(xml), xpath=".//shape", parser=parser)

df_iter = read_xml_iterparse(
xml, parser=parser, iterparse={"shape": ["name", "type"]}
Expand Down Expand Up @@ -1568,7 +1583,7 @@ def test_dtd(parser):
</shape>
</shapes>"""

df_xpath = read_xml(xml, xpath=".//shape", parser=parser)
df_xpath = read_xml(StringIO(xml), xpath=".//shape", parser=parser)

df_iter = read_xml_iterparse(
xml, parser=parser, iterparse={"shape": ["name", "type"]}
Expand Down Expand Up @@ -1604,7 +1619,7 @@ def test_processing_instruction(parser):
</shape>
</shapes>"""

df_xpath = read_xml(xml, xpath=".//shape", parser=parser)
df_xpath = read_xml(StringIO(xml), xpath=".//shape", parser=parser)

df_iter = read_xml_iterparse(
xml, parser=parser, iterparse={"shape": ["name", "type"]}
Expand Down Expand Up @@ -1808,7 +1823,7 @@ def test_read_xml_nullable_dtypes(parser, string_storage, dtype_backend):
string_array_na = ArrowStringArray(pa.array(["x", None]))

with pd.option_context("mode.string_storage", string_storage):
result = read_xml(data, parser=parser, dtype_backend=dtype_backend)
result = read_xml(StringIO(data), parser=parser, dtype_backend=dtype_backend)

expected = DataFrame(
{