Pandas.read_html missing converted data #15366

Closed
nooperpudd opened this issue Feb 11, 2017 · 4 comments
Labels: IO HTML (read_html, to_html, Styler.apply, Styler.applymap), Usage Question

Comments

nooperpudd commented Feb 11, 2017

pandas version:
'0.19.2'

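import pandas
import requests
from bs4 import BeautifulSoup

def fetch_board_data():
    url = "http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "lxml")
        # Pick out the stock-code table by its CSS class
        table = soup.find("table", class_="table_grey_border")
        # read_html returns a list of DataFrames, one per table found
        board_data = pandas.read_html(table.prettify(), header=0, flavor="bs4")
        return board_data[0]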

Problem description

       股份代號             股份名稱   買賣單位   附註 Unnamed: 4 Unnamed: 5 Unnamed: 6
0         1               長和    500    #          H          O          F
1         2             中電控股    500    #          H          O          F
2         3           香港中華煤氣   1000    #          H          O          F
3         4            九龍倉集團   1000    #          H          O          F
4         5             匯豐控股    400    #          H          O          F
5         6             電能實業    500    #          H          O          F
6         7             凱富能源   2000    #        NaN        NaN        NaN

The values in the 股份代號 (stock code) column should keep their leading zeros: 1 should be 00001, 2 should be 00002.

datatype:
股份代號 int64
股份名稱 object
買賣單位 int64
附註 object
Unnamed: 4 object
Unnamed: 5 object
Unnamed: 6 object
dtype: object
<class 'pandas.core.frame.DataFrame'>

Why are the leading zeros (0000) missing from this column?

Actually, the dtype of 股份代號 should be object.

Contributor

jreback commented Feb 11, 2017

@sinhrks can you have a look

jreback added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap) label on Feb 11, 2017
Member

jorisvandenbossche commented Feb 11, 2017

@nooperpudd Can you try passing dtype={'股份代號': str} to read_html? Values like 00001 are just interpreted as numbers, hence the 1.

Member

jorisvandenbossche commented

Small correction to the above: it is the converters keyword, not dtype (related PR: #13575).
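
To illustrate the converters suggestion, here is a minimal sketch rather than the exact call for the HKEX page. The inline HTML is a hypothetical stand-in for the real table (only the column names and first two rows come from the report), read_codes is an illustrative helper name, and the example assumes an HTML parser backend such as lxml is installed for read_html.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the HKEX table; the real page is not fetched here.
HTML = """
<table class="table_grey_border">
  <tr><th>股份代號</th><th>股份名稱</th><th>買賣單位</th></tr>
  <tr><td>00001</td><td>長和</td><td>500</td></tr>
  <tr><td>00002</td><td>中電控股</td><td>500</td></tr>
</table>
"""


def read_codes(html, **kwargs):
    """Parse the first table in `html` (illustrative helper, not a pandas API)."""
    # read_html returns a list of DataFrames, one per table found.
    return pd.read_html(StringIO(html), header=0, **kwargs)[0]


# Default behaviour: the codes are inferred as integers, so the zeros vanish.
default = read_codes(HTML)
print(default["股份代號"].tolist())  # [1, 2]

# With a converter, each cell is passed through str and left as text.
fixed = read_codes(HTML, converters={"股份代號": str})
print(fixed["股份代號"].tolist())  # ['00001', '00002']
```

The same converters argument should be usable in the original call alongside header=0 and flavor="bs4".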

jorisvandenbossche added this to the No action milestone on Feb 11, 2017
Member

sinhrks commented Feb 12, 2017

Yeah, the problem should be solved by @jorisvandenbossche's answer :)
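
If a frame has already been parsed with integer codes, the zeros can also be restored afterwards. This is a minimal sketch of that fallback, which is not something proposed in the thread: it assumes every code is five digits wide (taken from the 1 -> 00001 example in the report), and CODE_WIDTH is an illustrative name.

```python
import pandas as pd

# Codes as they appear after read_html has inferred them as integers
# (first three rows from the output in the report).
df = pd.DataFrame({"股份代號": [1, 2, 3], "買賣單位": [500, 500, 1000]})

CODE_WIDTH = 5  # assumption: codes are always padded to five digits

# Convert to text, then left-pad with zeros to the assumed width.
df["股份代號"] = df["股份代號"].astype(str).str.zfill(CODE_WIDTH)
print(df["股份代號"].tolist())  # ['00001', '00002', '00003']
```

Passing converters at read time is usually preferable, since it does not rely on knowing the width in advance.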

jreback closed this as completed on Feb 12, 2017
4 participants