
How can I force Pandas read_html function to read digit field as string not integer #30589


Closed
Zhenye-Na opened this issue Dec 31, 2019 · 1 comment


@Zhenye-Na

Is there a way to make read_html return a field as str instead of int?

I have looked through related issues such as #10534 and #21379, as well as this test: https://github.com/gte620v/pandas/blob/5cb8243f2dd31cc2155627f29cfc89bbf6d4b84b/pandas/io/tests/test_html.py#L715

I do not think the converters argument fits our use case: the table is updated every day and may gain a new column, in which case we would have to add a new key to the parameter by hand.
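One way around listing the columns by hand is to let read_html discover them first and then build the converters mapping from whatever columns it found. The following is only a minimal sketch under that assumption, not an official pandas recipe: the helper name read_html_as_str and the sample table are made up for illustration, and the HTML is parsed twice, which may matter for large pages.

```python
from io import StringIO

import pandas as pd

# Made-up sample table; the "Code" column would normally be read as integers.
html = """<table>
<tr><th>Date</th><th>Code</th></tr>
<tr><td>03/04</td><td>007</td></tr>
<tr><td>04/05</td><td>042</td></tr>
</table>"""


def read_html_as_str(html_text, **kwargs):
    """Hypothetical helper: read the first table with every column kept as str."""
    # First pass: only used to find out which columns exist today.
    columns = pd.read_html(StringIO(html_text), **kwargs)[0].columns
    # Build one str converter per discovered column, so new columns are covered.
    converters = {col: str for col in columns}
    # Second pass: parse again with the converters applied.
    return pd.read_html(StringIO(html_text), converters=converters, **kwargs)[0]


df = read_html_as_str(html)
print(df)
```

Because the mapping is rebuilt on every call, a column added to the table later should be picked up without any change to the script.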

Here is the full stack trace from calling the function:

PS C:\Users\Zhenye.na\Desktop> python3 .\dash-prod.py
.\dash-prod.py:4: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, Iterable
Traceback (most recent call last):
  File ".\dash-prod.py", line 59, in <module>
    df = pd.read_html(response.text, skiprows=1)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 1105, in read_html
    displayed_only=displayed_only,
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 915, in _parse
    for table in tables:
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 213, in <genexpr>
    return (self._parse_thead_tbody_tfoot(table) for table in tables)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 411, in _parse_thead_tbody_tfoot
    header = self._expand_colspan_rowspan(header_rows)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 459, in _expand_colspan_rowspan
    colspan = int(self._attr_getter(td, "colspan") or 1)
ValueError: invalid literal for int() with base 10: '\\"1\\"'

The core read_html call is as follows:

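import pandas as pd
import requests

response = requests.get(url, headers=hdrs)  # url and hdrs are defined earlier in the script (not shown)
df = pd.read_html(response.text, skiprows=1)[0]  # read_html returns a list of DataFrames
print(df)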

I would like to use read_html to extract the table from the response returned by a REST API. I tested the function on a small table containing only digits, and it works. The data returned from the REST API, however, contains both characters and digits.

Here is a sample of what the table looks like (assume each DC and its Location are separated by a single '\n' within the header cell):

Date     DC 1 Location 1     DC 2 Location 2     DC 3 Location 3
03/04    1.23.4              1.23.4              1.23.4
04/05    1.23.4              1.23.4              1.23.4

I suspect the error is caused by the '.' characters in fields like 1.23.4, but I am not sure how to fix it.
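Worth noting: the traceback fails while converting a colspan attribute, not a cell value. The offending string is backslash, quote, 1, backslash, quote, which is what a JSON-escaped attribute looks like before decoding. The sketch below reproduces the message with plain Python and shows, purely as an assumption about this API that nothing in the thread confirms, how decoding a JSON string restores normal quotes. The sample fragment is made up.

```python
import json

# The exact string from the traceback: backslash, quote, 1, backslash, quote.
raw_attr = r'\"1\"'

try:
    int(raw_attr)
except ValueError as exc:
    message = str(exc)  # same text as the ValueError in the traceback

# Hypothetical payload: an HTML fragment serialized as a JSON string.
json_body = json.dumps('<td colspan="1">1.23.4</td>')
decoded = json.loads(json_body)

print(message)
print(json_body)  # quotes are still escaped: colspan=\"1\"
print(decoded)    # quotes restored: colspan="1"
```

If the response really is JSON, something like response.json() in place of response.text might be the place to start, though that depends on the actual shape of the payload.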

Any ideas or thoughts are appreciated!

Thanks!

@Zhenye-Na
Author

Very interesting: I ran the script again today and had to install html5lib, which was required for today's run but not for the previous one. The result is correct, with no error.

python3 script.py

0        01/08              NaN        NaN         NaN             5.12.5             5.12.5            5.12.5                        NaN
1        01/15           5.13.0     5.13.0         NaN                NaN                NaN               NaN                        NaN
2        01/16              NaN        NaN      5.13.0                NaN                NaN               NaN                        NaN
3        01/22              NaN        NaN         NaN             5.13.5             5.13.5            5.13.5                        NaN
4        01/29           5.14.0     5.14.0         NaN                NaN                NaN               NaN                        NaN
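A possible factor in the different behavior between runs is which parser packages are installed, since read_html can use lxml or BeautifulSoup with html5lib (selectable through its flavor parameter). This is a guess, not something the thread confirms. The small standard-library check below reports which of those packages are importable in the current environment; the helper name installed_backends is made up.

```python
from importlib.util import find_spec

# Packages behind read_html's parser backends.
BACKEND_PACKAGES = ("lxml", "bs4", "html5lib")


def installed_backends():
    """Hypothetical helper: map each backend package name to True if importable."""
    return {name: find_spec(name) is not None for name in BACKEND_PACKAGES}


available = installed_backends()
for name, present in available.items():
    print(f"{name}: {'installed' if present else 'missing'}")
```

Comparing this output between the two machines or runs could help narrow down whether the parser backend played a role.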

I am going to close this issue since it does not appear to be related to the pandas implementation, but an explanation would be welcome.

Thanks.
