BUG: read_html returns empty list #59147
Comments
@Fredrik-M The issue you've encountered seems to relate to how
When processing an
Expected Behavior
Hope this helps.
@Fredrik-M It seems that read_html has no skip_blank_lines flag, and the parser defaults treat it as True. So when a cell contains only spaces, it is skipped as a blank line, and the result is an empty list. Moreover, the HTML parser code that handles HTML data elements has a specific condition that strips whitespace from each line, so a string of spaces is reduced to an empty string before it is passed downstream.
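The mechanism described in this comment can be sketched with the standard library alone. This is not the pandas implementation: CellCollector, stripped_cells and non_blank_rows are hypothetical names, and the parser here is Python's html.parser rather than lxml or bs4. It only illustrates how stripping a whitespace-only cell and then dropping blank entries can leave nothing behind.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the stripped text of every <td> cell (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._buf = []

    def handle_data(self, data):
        # Text between tags, including whitespace-only text, arrives here.
        if self._in_td:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            # Stripping turns a cell holding only spaces into "".
            self.cells.append("".join(self._buf).strip())
            self._in_td = False


def stripped_cells(html):
    """Return the stripped text of each <td> in the given HTML string."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


def non_blank_rows(cells):
    """Drop empty strings, mimicking a 'skip blank lines' default."""
    return [cell for cell in cells if cell]
```

With these two steps chained, a table whose only cell holds a space yields one empty string after stripping and nothing at all after the blank filter, which matches the empty result described above.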
take
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
From the read_html docstring:
It has something to do with the space in the <td> tag in the example. Removing the space causes the function to fail instead.
Expected Behavior
The function should either fail, or return a list containing a DataFrame representing a 1x1 table (either empty or containing the space character in its only cell). Don't know which is more appropriate.
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.9.19.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-30-amd64
Version : #1 SMP Debian 5.10.218-1 (2024-06-01)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.24.1
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.5.0
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : 2.0.30
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None
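Until the behavior described under Expected Behavior is settled, a caller can enforce the "fail" option on its own side. The following is a minimal sketch, assuming the caller prefers an exception to an empty list. require_tables is a hypothetical helper, not part of pandas, and the error message is made up for illustration.

```python
def require_tables(tables):
    """Return the list unchanged, or raise if no tables were parsed.

    Hypothetical caller-side guard: an empty list is treated as a failure.
    """
    if not tables:
        raise ValueError("no tables found in the parsed HTML")
    return tables
```

It could wrap the list returned by pd.read_html, so that an empty result surfaces as a ValueError instead of silently propagating.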