No way to force read numerics as string in read_html #10534

Open
adamist521 opened this issue Jul 9, 2015 · 15 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@adamist521

When an HTML table shows 01 in a cell, read_html reads it, interprets it as a float, and drops the leading 0.
Is there an option to read such values as strings?
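
A minimal sketch that reproduces the behaviour described above, assuming an inline table with made-up column names (`code`, `name`) and that a parser backend such as lxml is installed. The HTML is wrapped in `StringIO` because recent pandas versions discourage passing literal HTML strings. In this sketch the column is parsed as integers rather than floats; either way the leading zero is gone.

```python
from io import StringIO

import pandas as pd

# Hypothetical table; the column names and values are illustrative only.
html = """
<table>
  <thead><tr><th>code</th><th>name</th></tr></thead>
  <tbody>
    <tr><td>01</td><td>alpha</td></tr>
    <tr><td>02</td><td>beta</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per table found.
df = pd.read_html(StringIO(html))[0]

# With default settings the "code" cells are inferred as numbers,
# so the text "01" is no longer recoverable from the result.
codes = df["code"].tolist()
print(codes)
```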

@sinhrks sinhrks added IO Data IO issues that don't fit into a more specific label Dtype Conversions Unexpected or buggy dtype conversions IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Jul 11, 2015
@sinhrks
Member

sinhrks commented Jul 11, 2015

Thanks. It would be nice to add a dtype option like the one in read_csv. I only had a quick look, but it looks achievable by passing dtype through to TextParser -> TextFileReader.

PR is welcome:)

@stevenmanton

Just stumbled across this page with the same issue. @gte620v can you explain how to accomplish the raw html parsing given your PR? Thanks!

@gte620v
Contributor

gte620v commented Mar 20, 2017

@stevenmanton

@gte620v thanks for the info. It sounds like you can easily convert back to string, but can't prevent the automatic parsing in the first place. For example, keeping the leading zeros in an integer. Thanks again!

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 25, 2017

@stevenmanton No, it does not convert back to string; it prevents the value from being parsed as numeric in the first place. In any case, leading zeros are preserved if you use converters={'col': str}. See e.g. the example in the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-html-content (you have to scroll down a bit to

Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings.

url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0, converters={'MNC': str})

If you try that example, you will see that the leading zeros are preserved.
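
For anyone who wants to check this without fetching the Wikipedia page, here is a small offline sketch of the same idea. The table contents are placeholders (only the `MNC` column name is taken from the example above), and it assumes a parser backend such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Placeholder table standing in for the remote page; values are made up.
html = """
<table>
  <thead><tr><th>MCC</th><th>MNC</th><th>Operator</th></tr></thead>
  <tbody>
    <tr><td>276</td><td>01</td><td>Operator A</td></tr>
    <tr><td>276</td><td>02</td><td>Operator B</td></tr>
  </tbody>
</table>
"""

# converters maps a column name to a callable applied to each raw cell,
# so the MNC cells are kept as text instead of being inferred as numbers.
dfs = pd.read_html(StringIO(html), header=0, converters={"MNC": str})
table = dfs[0]

mnc = table["MNC"].tolist()
mcc = table["MCC"].tolist()  # no converter here, so this is still numeric
print(mnc, mcc)
```

Only the columns listed in `converters` are affected; the others still go through the usual type inference.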

@gte620v
Contributor

gte620v commented Mar 25, 2017

As @jorisvandenbossche said, the converter does what you want. I made the PR to solve this exact problem.

@stevenmanton

Thanks for the clarification, guys. When I saw "converter" I assumed it was converting back to a string from the inferred type. I'll use this fix :-)

@adrivsh

adrivsh commented Oct 16, 2017

Should we have "dtypes" be an alias for "converters", to match the pd.read_csv argument?

@jorisvandenbossche
Member

Yes, I think we should add a dtype argument (not sure it should be an alias; it might be possible to just pass dtype through to the underlying parser, now that the python parser supports it: #14295).
@adrivsh Want to do a PR for this?
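
To illustrate the read_csv behaviour this proposal would mirror, here is a minimal sketch using `dtype` with the python engine. The CSV text and column names are made up, and this only shows read_csv; it is not a claim that read_html accepts `dtype`.

```python
from io import StringIO

import pandas as pd

# Made-up CSV data with leading zeros in the "code" column.
csv_text = "code,name\n01,alpha\n02,beta\n"

# dtype is applied per column by the parser, so "code" is kept as text
# and no numeric inference happens for it.
df = pd.read_csv(StringIO(csv_text), dtype={"code": str}, engine="python")

codes = df["code"].tolist()
print(codes)
```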

@tuhinsharma121
Contributor

> @stevenmanton No, it does not convert back to string; it prevents the value from being parsed as numeric in the first place. In any case, leading zeros are preserved if you use converters={'col': str}. See e.g. the example in the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-html-content (you have to scroll down a bit to
>
> Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings.
>
> url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
> dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0, converters={'MNC': str})
>
> If you try that example, you will see that the leading zeros are preserved.

I tried using your solution:

import pandas as pd
pd.read_html('https://www.gpw.pl/wskazniki', converters={'C/WK': str}, header=0)[1]

But it removes the "," from the column values.

@jorisvandenbossche
Member

@tuhinsharma121 That seems like a bug (the returned values are strings, but indeed should not remove the ","). Could you open a new issue for that?

@jbsilva

jbsilva commented Jun 5, 2020

Same problem here. Looks like it tries to parse the numbers before converting them to strings.
A workaround is to pass thousands="ª", decimal="ª" (or any other character not in text).
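
A sketch of a related variation, not the exact call above: it assumes that passing `thousands=None` switches off thousands-separator handling, so the string converter sees the raw cell text. The table is a made-up stand-in for the page mentioned earlier (only the `C/WK` column name comes from this thread), and a parser backend such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up table using a decimal comma, as on the page discussed above.
html = """
<table>
  <thead><tr><th>Name</th><th>C/WK</th></tr></thead>
  <tbody>
    <tr><td>AAA</td><td>0,53</td></tr>
    <tr><td>BBB</td><td>1,20</td></tr>
  </tbody>
</table>
"""

# Assumption: thousands=None disables the "," stripping that otherwise
# runs before the converter is applied.
df = pd.read_html(
    StringIO(html),
    converters={"C/WK": str},
    thousands=None,
)[0]

values = df["C/WK"].tolist()
print(values)
```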

@mominali12

Any solution to the "," problem?

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@clehene

clehene commented Jul 16, 2023

Use converters

converters = {
    'col1': str,
    'col2': str,
}
# read_html returns a list of DataFrames, one per matching table
dfs = pd.read_html(str(table), converters=converters)

@tuhinsharma121
Contributor

Can I work on a PR for this?
