df.dtypes.values is not O(1) and repr(df) is therefore slow for large frames #5968

Closed
ghost opened this issue Jan 16, 2014 · 21 comments · Fixed by #5970 or #5973
Labels: Bug, Performance (Memory or execution speed performance)

Comments

ghost commented Jan 16, 2014

For the FEC dataset, it takes about 1.5 sec to get a repr, and %prun attributes essentially all of that time to infer_dtype.
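
A hypothetical reproduction of that measurement (the CSV filename matches the FEC download linked below; cProfile is the plain-script equivalent of IPython's %prun):

import cProfile
import pandas as pd

# Load the FEC contributions file and profile its repr; per the report
# above, the cumulative time concentrates in dtype inference.
df = pd.read_csv('P00000001-ALL.csv', low_memory=False)
cProfile.run('repr(df)', sort='cumulative')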

jreback commented Jan 16, 2014

do you have a link to the dataset...can't seem to find mine

ghost commented Jan 16, 2014

ftp://ftp.fec.gov/FEC/Presidential_Map/2012/P00000001/P00000001-ALL.zip

jreback commented Jan 16, 2014

I believe that this is the problem.

It is trying to see if there are floats in an object array. I would simply not do this at all, or short-circuit it.

Breakpoint 2 at /mnt/home/jreback/pandas/pandas/core/format.py:1663
(Pdb) c
> /mnt/home/jreback/pandas/pandas/core/format.py(1663)_format_strings()
-> is_float = lib.map_infer(vals, com.is_float) & notnull(vals)
(Pdb) l
1658                    # object dtype
1659                    return '%s' % formatter(x)
1660 
1661            vals = self.values
1662 
1663B->         is_float = lib.map_infer(vals, com.is_float) & notnull(vals)
1664            leading_space = is_float.any()
1665 
1666            fmt_values = []
1667            for i, v in enumerate(vals):
1668                if not is_float[i] and leading_space:
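
A minimal sketch of that short-circuit, assuming vals is a NumPy array (illustrative only, not the actual patch in #5970/#5973; float_mask is a hypothetical name, and is_float_dtype is pandas' dtype-checking helper, found under pandas.api.types in current versions):

import numpy as np
from pandas.api.types import is_float_dtype

def float_mask(vals):
    # Float dtype: every non-null element is a float, so no scan is needed.
    if is_float_dtype(vals):
        return ~np.isnan(vals)
    # Any other non-object dtype (int, bool, datetime, ...) cannot hold
    # floats, so the O(n) per-element check can be skipped entirely.
    if vals.dtype != np.object_:
        return np.zeros(len(vals), dtype=bool)
    # Only object arrays still require per-element inspection.
    return np.fromiter((isinstance(v, float) for v in vals),
                       dtype=bool, count=len(vals))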

ghost commented Jan 16, 2014

It's probably there to support the float_format arg of to_string. I'll have to think about it.

jreback commented Jan 16, 2014

For an object dtype you could warn if it 'looks' like a float, but otherwise skip it. The problem is that it checks strings that aren't numbers at all.

ghost commented Jan 16, 2014

Isn't looks_like_float() exactly what map_infer does? How could it be faster if I have to check each value for "appearance"?

ghost commented Jan 16, 2014

It doesn't need to do this for values not displayed in the output. That's it.
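
A sketch of that idea, with hypothetical names (vals stands for a column's values, max_rows for pandas' display.max_rows option): run the float scan only on the rows that repr() will actually render.

import numpy as np

def shown_values(vals, max_rows=60):
    # When the frame is truncated, repr() renders only the head and the
    # tail rows, so only those need the per-element float check.
    if len(vals) <= max_rows:
        return vals
    half = max_rows // 2
    return np.concatenate([vals[:half], vals[-half:]])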

jreback commented Jan 16, 2014

right!

ghost commented Jan 16, 2014

That's not where the bottleneck is.
What's this?

In [10]: %timeit df.dtypes.values
1 loops, best of 3: 178 ms per loop

Aren't dtypes just a lookup?
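
What "just a lookup" should mean here, as a hypothetical illustration (not pandas internals): each column's dtype is stored with its data, so assembling df.dtypes should scale with the number of columns, not the number of rows.

import numpy as np
import pandas as pd

# Throwaway frame; the point is that reading a dtype is an attribute
# lookup on already-stored data, not row-wise inference.
df = pd.DataFrame({'a': np.arange(1_000_000), 'b': ['x'] * 1_000_000})
print({col: df[col].dtype for col in df.columns})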

jreback commented Jan 16, 2014

This issue addresses it, but it needs reworking to make it more internal, as I have indicated: #5740

ghost commented Jan 16, 2014

Related (déjà vu): 3cb6961, #2807 (comment)

jreback commented Jan 16, 2014

I have got a PR...give me a few

ghost commented Jan 16, 2014

There's an off chance this might be the cause of a lot of the slowdowns we saw in 0.13 after the NDFrame refactor. Is the change in behaviour related? If yes, hurrah.

jreback commented Jan 16, 2014

Anything that uses df.apply internally is generally bad (as dsm fixed for str.extract).

ghost commented Jan 16, 2014

#5660

frame_get_dtype_counts | 0.1843 | 0.1113 | 1.6552 |

Less than what I expected, but it should have set bells ringing.

Unrelated in fact.

dsm054 commented Jan 16, 2014

@jreback: dsm->unutbu. Can't take credit for that one. :^)

jreback commented Jan 16, 2014

@dsm054 sorry... you are right!! Morning confusion.

ghost commented Jan 16, 2014

That only cuts it in half. Is this expected?

df=pd.read_csv('P00000001-ALL.csv',low_memory=False)
%timeit df.iloc[:100, 4]
10 loops, best of 3: 88.1 ms per loop

Isn't slicing supposed to be cheap?
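
The expectation behind the question, as a sketch: a positional slice should cost O(slice length), independent of the frame's total size, just as a plain NumPy slice does.

import numpy as np

# Slicing a NumPy array is a constant-time view; df.iloc[:100, 4]
# should behave comparably, with no per-element work.
arr = np.empty(10_000_000, dtype=object)
view = arr[:100]
assert view.base is arr   # shares memory with the original array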

ghost reopened this Jan 16, 2014

jreback commented Jan 16, 2014

let me look

jreback commented Jan 16, 2014

Easy enough... it was inferring the object dtypes internally when there was no need to do so.

In [5]: %timeit df.iloc[:100, 4]
1000 loops, best of 3: 293 µs per loop

ghost commented Jan 16, 2014

2 secs -> 50 ms for repr(df). hellz yeah.

This issue was closed.