How can I force Pandas `read_html` function to read digit field as string not integer #30589

Zhenye-Na · 2019-12-31T20:49:23Z

Is there a way to make read_html keep a digit-only field as str instead of converting it to int?

I have looked at related issues such as #10534 and #21379, and at this test: https://github.com/gte620v/pandas/blob/5cb8243f2dd31cc2155627f29cfc89bbf6d4b84b/pandas/io/tests/test_html.py#L715

I do not think the converters argument fits our use case: the table is updated every day and may gain a new column, in which case we would have to add a new key to the parameter manually.
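One possible way around maintaining the converters keys by hand is to read the table twice: once only to discover the current column names, and a second time with a converters dict built from them. The sketch below is a minimal illustration, not an official pandas recipe. The sample markup and the helper name read_table_as_text are made up for the example, and it assumes a single-level header plus an installed HTML parser (lxml, or bs4 with html5lib).

```python
import io

import pandas as pd

# Made-up sample table; the "Build" column holds digit-only text.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Date</th><th>Build</th></tr></thead>
  <tbody>
    <tr><td>03/04</td><td>0012</td></tr>
    <tr><td>04/05</td><td>0034</td></tr>
  </tbody>
</table>
"""


def read_table_as_text(markup, index=0, **kwargs):
    """Hypothetical helper: return table `index` with every column kept as str."""
    # First pass: only used to learn which columns exist today.
    probe = pd.read_html(io.StringIO(markup), **kwargs)[index]
    # Build the converters dict from whatever columns were found.
    converters = {col: str for col in probe.columns}
    # Second pass: each column goes through str, so no numeric inference.
    return pd.read_html(io.StringIO(markup), converters=converters, **kwargs)[index]


inferred = pd.read_html(io.StringIO(SAMPLE_HTML))[0]
as_text = read_table_as_text(SAMPLE_HTML)

print(list(inferred["Build"]))
print(list(as_text["Build"]))
```

The cost is that the markup is parsed twice, which should be negligible for a small daily table. New columns are picked up automatically because the dict is rebuilt on every run.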

Here is the entire stack trace from running the script:

PS C:\Users\Zhenye.na\Desktop> python3 .\dash-prod.py
.\dash-prod.py:4: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, Iterable
Traceback (most recent call last):
  File ".\dash-prod.py", line 59, in <module>
    df = pd.read_html(response.text, skiprows=1)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 1105, in read_html
    displayed_only=displayed_only,
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 915, in _parse
    for table in tables:
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 213, in <genexpr>
    return (self._parse_thead_tbody_tfoot(table) for table in tables)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 411, in _parse_thead_tbody_tfoot
    header = self._expand_colspan_rowspan(header_rows)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 459, in _expand_colspan_rowspan
    colspan = int(self._attr_getter(td, "colspan") or 1)
ValueError: invalid literal for int() with base 10: '\\"1\\"'

The core read_html usage is as follows:

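import pandas as pd
import requests

response = requests.get(url, headers=hdrs)  # url and hdrs are defined elsewhere in the script
df = pd.read_html(response.text, skiprows=1)[0]  # keep the first table found
print(df)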

I would love to use read_html to extract the table from the response returned by the REST API. I have tested the function on a small table that contains only digits, and it works. The data returned by the REST API, however, contains both characters and digits.

Here is a demo of what the table looks like (assume "DC 1" and "Location 1" are separated by a single '\n' inside each header cell):

Date	DC 1 Location 1	DC 2 Location 2	DC 3 Location 3
03/04	1.23.4	1.23.4	1.23.4
04/05	1.23.4	1.23.4	1.23.4

I assume the error may be caused by the '.' symbol in fields like 1.23.4, but I am not sure how to fix it.
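The last frame of the traceback fails while converting a colspan attribute, not a cell value, so the dots in 1.23.4 are probably not involved. The rejected literal is \"1\" with the backslashes included, which would be consistent with markup whose attribute quotes are still backslash-escaped (for example HTML embedded in a JSON string that was never decoded). That cause is an assumption and is not confirmed in this thread. The sketch below uses a made-up fragment, a simple regex stand-in for the parser's attribute lookup, and a hypothetical unescape_quotes helper to show the failure and one way it might be avoided.

```python
import re

# Made-up fragment: attribute quotes are still backslash-escaped,
# as they would be inside an undecoded JSON string.
escaped_html = '<table><tr><th colspan=\\"1\\">DC 1</th></tr></table>'

# Simplified stand-in for an HTML parser's attribute lookup.
COLSPAN = re.compile(r"colspan=([^\s>]+)")


def colspan_values(markup):
    """Return raw colspan values, stripping one matching pair of plain quotes."""
    values = []
    for raw in COLSPAN.findall(markup):
        if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
            raw = raw[1:-1]
        values.append(raw)
    return values


def unescape_quotes(markup):
    """Hypothetical helper: replace each backslash-escaped double quote with a plain one."""
    return markup.replace('\\"', '"')


broken = colspan_values(escaped_html)
fixed = colspan_values(unescape_quotes(escaped_html))

try:
    int(broken[0])
    parse_failed = False
except ValueError:
    # Same kind of failure as the int() call in the traceback.
    parse_failed = True

print(parse_failed, [int(v) for v in fixed])
```

If the API does return JSON, decoding it with response.json() and passing the decoded HTML field to read_html would likely be cleaner than replacing characters by hand.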

Any ideas or thoughts are appreciated!

Thanks!

Zhenye-Na · 2020-01-02T17:57:43Z

Very interesting: I ran the script again today after installing html5lib, which today's run required but the previous one did not. The result is correct, with no error.

python3 script.py

0        01/08              NaN        NaN         NaN             5.12.5             5.12.5            5.12.5                        NaN
1        01/15           5.13.0     5.13.0         NaN                NaN                NaN               NaN                        NaN
2        01/16              NaN        NaN      5.13.0                NaN                NaN               NaN                        NaN
3        01/22              NaN        NaN         NaN             5.13.5             5.13.5            5.13.5                        NaN
4        01/29           5.14.0     5.14.0         NaN                NaN                NaN               NaN                        NaN

I am going to close this issue since it is not related to the pandas implementation, but an explanation would still be welcome.

Thanks.

Zhenye-Na closed this as completed Jan 2, 2020

holymonson mentioned this issue Feb 14, 2021

ENH: read_html support dtype param #39804

Open
