read_html: fails to parse column #3606

Closed
timmie opened this issue May 14, 2013 · 10 comments
Labels
Bug IO Data IO issues that don't fit into a more specific label
Milestone

Comments

@timmie
Contributor

timmie commented May 14, 2013

The second column of the table
http://code.google.com/p/pythonxy/wiki/StandardPlugins#Python_packages

is not parsed as shown with this code:

import pandas as pd

url = 'http://code.google.com/p/pythonxy/wiki/StandardPlugins'

# Parse with the default flavor
dfs = pd.read_html(url, attrs={'class': 'wikitable'})
dfs

# Parse again, explicitly requesting the lxml flavor
dfs = pd.read_html(url, flavor='lxml', attrs={'class': 'wikitable'})
dfs

python_core = dfs[0]
python_core[:10]
@cpcloud
Member

cpcloud commented May 14, 2013

I will be submitting a PR soon that should fix issues like this. This is a result of killing whitespace in a table when parsing the raw data, which might not have been the best decision on my part, i.e., I should probably let the user decide what he/she wants to keep.
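
To illustrate why discarding whitespace can cost a column, here is a minimal sketch. It is not a reconstruction of the pandas parser: the helper names `drop_blank` and `keep_blank` and the sample rows are hypothetical, and the assumption is only that a whitespace-only cell gets thrown away instead of kept as an empty placeholder.

```python
def drop_blank(cells):
    """Strip each cell and discard the ones that end up empty."""
    return [c.strip() for c in cells if c.strip()]


def keep_blank(cells):
    """Strip each cell but keep empty ones as placeholders."""
    return [c.strip() for c in cells]


# Hypothetical raw cell texts; the third cell of each row holds no text.
rows = [
    ["Name", "Version", " ", "Description"],
    ["Pip", "1.3.1-2", "\n", "installer"],
    ["Foo", " ", " ", "no version listed"],
]

for row in rows:
    print(len(drop_blank(row)), len(keep_blank(row)))
```

With `drop_blank` the rows come out with different lengths, so values can no longer be matched to their columns. With `keep_blank` every row keeps four cells.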

@cpcloud
Member

cpcloud commented May 15, 2013

@timmie Fixed in #3616. If you want to try it out, the branch is cpcloud/read-html-fixes. I even put a test in there using your table :) (https won't work with lxml, so do url.replace('https', 'http') before passing the URL to read_html.)
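
A plain `url.replace('https', 'http')` rewrites every occurrence of the substring, including any inside the path or query string. A slightly more careful sketch that touches only the scheme is shown here. The helper name `force_http` is hypothetical and not part of pandas; only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def force_http(url: str) -> str:
    """Downgrade an https URL to http; leave every other URL unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        parts = parts._replace(scheme="http")
    return urlunsplit(parts)


url = "https://code.google.com/p/pythonxy/wiki/StandardPlugins"
print(force_http(url))
```

The result of `force_http(url)` can then be passed to `read_html` in place of the original URL.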

@jreback
Contributor

jreback commented May 20, 2013

closed by #3616

@jreback jreback closed this as completed May 20, 2013
@timmie
Contributor Author

timmie commented Aug 2, 2013

@cpcloud

It occurs again.

import pandas as pd
pd.__version__

url = 'http://code.google.com/p/pythonxy/wiki/StandardPlugins'

dfs = pd.read_html(url, attrs={'class': 'wikitable'})

# Restrict the result to tables whose text matches 'Distribute'
match = 'Distribute'
dfs = pd.read_html(url, attrs={'class': 'wikitable'}, match=match)

x = dfs[0]
x.head()



  • The version column shows NaN.
  • The parse result is a list, not a DataFrame, hence x = dfs[0].

using 0.12.dev (yesterday).

@cpcloud
Member

cpcloud commented Aug 2, 2013

I'll take a look. FYI the parse result has always been a list.

@cpcloud
Member

cpcloud commented Aug 2, 2013

@timmie try passing infer_types=False

In [18]: dfs = read_html('http://code.google.com/p/pythonxy/wiki/StandardPlugins', attrs={'class': 'wikitable'}, match='Distribute', infer_types=False)

In [19]: dfs[0]
Out[19]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 91 entries, 0 to 90
Data columns (total 4 columns):
0    91  non-null values
1    91  non-null values
2    91  non-null values
3    91  non-null values
dtypes: object(4)

In [20]: dfs[0].head()
Out[20]:
                0         1 2                                                  3
0          Python     2.7.5                            Python standard libraries
1  Base Libraries   1.1.0-5      shared libraries commonly used by other plugins
2     Base Python   1.3.0-5    A collection of small (in scope and size) but ...
3      Distribute  0.6.45-8    Download, build, install, upgrade, and uninsta...
4             Pip   1.3.1-2    pip is a tool for installing and managing Pyth...

@cpcloud
Member

cpcloud commented Aug 2, 2013

I wonder if it might be useful to parse the src attribute of an img tag... I'll raise an issue.
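
As a rough sketch of that idea, the snippet below uses only the standard library's `html.parser` and falls back to an image's `src` when a cell has no text of its own. It is not how pandas parses tables: the class name `TableCells`, the sample markup, and the image path are all made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []      # one entry per <table>
        self._cell = None     # text fragments of the cell being read, or None
        self._img_src = None  # first <img src> seen inside the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell, self._img_src = [], None
        elif tag == "img" and self._cell is not None and self._img_src is None:
            self._img_src = dict(attrs).get("src")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            # Fall back to the image URL when the cell has no text of its own.
            self.tables[-1][-1].append(text or self._img_src or "")
            self._cell = None


# Hypothetical markup; the third column holds an image or nothing at all.
html = """
<table class="wikitable">
  <tr><td>Python</td><td>2.7.5</td><td><img src="/icons/ok.png"></td><td>Python standard libraries</td></tr>
  <tr><td>Pip</td><td>1.3.1-2</td><td></td><td>pip is a tool</td></tr>
</table>
"""

parser = TableCells()
parser.feed(html)
parser.close()
print(parser.tables[0])
```

Like `read_html`, the sketch returns a list with one entry per table, so even a single match is reached as `parser.tables[0]`.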

@timmie
Contributor Author

timmie commented Aug 6, 2013

OK, shall we add it to the docs, then?

@timmie
Contributor Author

timmie commented Aug 6, 2013

BTW, thank you.

@cpcloud
Member

cpcloud commented Aug 6, 2013

I believe infer_types is in the docs, let me check...


3 participants