
ENH/DOC: wide_to_long performance and docstring clarification #14778

Closed
erikcs opened this issue Nov 30, 2016 · 2 comments
Labels: Docs; Performance (memory or execution speed performance); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

@erikcs
Contributor

erikcs commented Nov 30, 2016

I had to massage some messy data recently, and a big bottleneck turned out to be wide_to_long.

Some test data with many id variables and time variables:

import pandas as pd
import numpy as np
import string

vars = string.ascii_uppercase[0:4]
nyrs = 20
nidvars = 20
N = 5000

yrvars = []
for var in vars:
    for yr in range(1, nyrs + 1):
        yrvars.append(var + str(yr))

yearobs = dict(zip(yrvars, np.random.randn(len(yrvars), N)))
idobs = dict(zip(range(nidvars), np.random.rand(nidvars, N)))
        
frame = pd.concat([pd.DataFrame(idobs), pd.DataFrame(yearobs)], axis=1)

frame.shape
#(5000, 100)

Reshaping with wide_to_long takes around 2 secs.

%%time
frame['id'] = frame.index
res1 = pd.wide_to_long(frame, list(vars), i='id', j='year')
#CPU times: user 1.89 s, sys: 210 ms, total: 2.1 s
#Wall time: 2.13 s

I modified wide_to_long slightly (run the regex on a categorical column, avoid copying the many id variables, postpone type coercion), and the runtime is now:

%%time
res2 = wide_to_long2(frame, list(vars), i='id', j='year')
#CPU times: user 112 ms, sys: 10.8 ms, total: 123 ms
#Wall time: 125 ms

The result is the same

res1.equals(res2)
#True
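For readers who want to see the idea, the three changes listed above can be sketched roughly as follows. This is a minimal illustration and not the actual wide_to_long2 code: the function name wide_to_long_sketch, the digits-only suffix pattern, and the tiny example frame are all assumptions made for the example.

```python
import re

import pandas as pd


def wide_to_long_sketch(df, stubnames, i, j):
    """Hypothetical illustration of the three optimizations; not pandas code."""
    # Columns belonging to each stub (assumes a purely numeric suffix).
    stub_cols = {
        stub: [
            c for c in df.columns
            if isinstance(c, str) and re.match("^" + re.escape(stub) + r"\d+$", c)
        ]
        for stub in stubnames
    }

    def melt_stub(stub):
        # Melt only the key column plus this stub's columns, so the other
        # id variables are not copied once per stub.
        long = pd.melt(df, id_vars=[i], value_vars=stub_cols[stub],
                       var_name=j, value_name=stub)
        # Strip the stub from the few unique category labels instead of
        # running a string operation on every row of the long column.
        cat = long[j].astype("category")
        stripped = cat.cat.rename_categories(lambda name: name[len(stub):])
        long[j] = stripped.astype(object)
        return long.set_index([i, j])

    melted = pd.concat([melt_stub(s) for s in stubnames], axis=1).reset_index()
    # Postponed type coercion: convert the suffix to a number once, at the end.
    melted[j] = pd.to_numeric(melted[j])
    # Attach the remaining id variables with a single merge.
    used = {c for cols in stub_cols.values() for c in cols}
    id_vars = [c for c in df.columns if c != i and c not in used]
    return melted.merge(df[[i] + id_vars], on=i).set_index([i, j])


# Small made-up frame: one extra id variable "x" and two stubs.
df = pd.DataFrame({"id": [0, 1], "x": [0.5, 0.7],
                   "A1": [1, 2], "A2": [3, 4],
                   "B1": [5, 6], "B2": [7, 8]})
result = wide_to_long_sketch(df, ["A", "B"], i="id", j="year")
print(result)
```

The sketch ignores the edge cases a real implementation has to handle (non-numeric suffixes, separators, stubs that are prefixes of other stubs), so it should not be read as a drop-in replacement.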

Docstring clarification

The wide_to_long docstring also contains an unused last parameter stubend : str, which should be removed.

A docstring Note addition about escaping special characters (with for example re.escape) in stubnames could also perhaps be informative, since if the user passes a dataframe with messy stubnames, the function fails with a pretty uninformative error message.
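To illustrate why escaping matters, here is a small standard-library sketch. The helper stub_pattern and the column names are made up for the example, and the digits-only suffix is an assumption; the point is only that a stub containing regex metacharacters needs re.escape before it is embedded in a pattern.

```python
import re


def stub_pattern(stub, suffix=r"\d+"):
    """Hypothetical helper: match '<stub><suffix>' with the stub taken literally."""
    return re.compile("^" + re.escape(stub) + "(" + suffix + ")$")


columns = ["A(quarterly)1", "A(quarterly)2", "B.one1", "Bxone1"]

# Unescaped, the parentheses become a regex group, so the literal "(" never matches.
naive = re.compile("^" + "A(quarterly)" + r"\d+$")
naive_hits = [c for c in columns if naive.match(c)]

# Escaped, the stub is matched character for character.
escaped = stub_pattern("A(quarterly)")
escaped_hits = [c for c in columns if escaped.match(c)]

# An unescaped "." matches any character, so "Bxone1" would slip through as well.
loose_hits = [c for c in columns if re.match("^B.one" + r"\d+$", c)]
strict_hits = [c for c in columns if stub_pattern("B.one").match(c)]

print(naive_hits, escaped_hits, loose_hits, strict_hits)
```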

I can send a PR for this if that is wanted?

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 06f26b5
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0+124.g06f26b5.dirty
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

@jreback
Contributor

jreback commented Nov 30, 2016

@Nuffe sure that would be great! (also you could add your example in the docs if that is useful).

jreback added the Docs, Performance, and Reshaping labels Nov 30, 2016
jreback added this to the Next Major Release milestone Nov 30, 2016
@erikcs
Contributor Author

erikcs commented Dec 1, 2016

Thanks, and sorry for the premature PR; it turned out I had some edge cases that broke the routine. I will get back to this and fix it later. Sorry.

jreback modified the milestones: 0.20.0, Next Major Release Dec 11, 2016
ischurov pushed a commit to ischurov/pandas that referenced this issue Dec 19, 2016
closes pandas-dev#14778

Please see
regex search on long columns by first converting to Categorical, avoid
melting all dataframes with all the id variables, and wait with trying
to convert the "time" variable to `int` until last), and clear up the
docstring.

Author: nuffe

Closes pandas-dev#14779 from nuffe/wide2longfix and squashes the following commits:

df1edf8 [nuffe] asv_bench: fix indentation and simplify
dc13064 [nuffe] Set docstring to raw literal to allow backslashes to be printed (still had to escape them)
295d1e6 [nuffe] Use pd.Index in doc example
1c49291 [nuffe] Can of course get rid negative lookahead now that suffix is a regex
54c5920 [nuffe] Specify the suffix with a regex
5747a25 [nuffe] ENH/DOC: wide_to_long performance and functionality improvements (pandas-dev#14779)