ENH/DOC: wide_to_long performance and docstring clarification #14778

erikcs · 2016-11-30T23:31:15Z

I had to massage some messy data recently, and a big bottleneck turned out to be wide_to_long

Some test data with many id variables and time variables

import pandas as pd
import numpy as np
import string

vars = string.ascii_uppercase[0:4]
nyrs = 20
nidvars = 20
N = 5000

yrvars = []
for var in vars:
    for yr in range(1, nyrs + 1):
        yrvars.append( var + str(yr))
        
yearobs = dict(zip(yrvars, np.random.randn(len(yrvars), N)))
idobs = dict(zip(range(nidvars), np.random.rand(nidvars, N)))
        
frame = pd.concat([pd.DataFrame(idobs), pd.DataFrame(yearobs)], axis=1)

frame.shape
#(5000, 91)

Reshaping with wide_to_long takes around 2 secs.

%%time
frame['id'] = frame.index
res1 = pd.wide_to_long(frame, list(vars), i='id', j='year')
#CPU times: user 1.89 s, sys: 210 ms, total: 2.1 s
#Wall time: 2.13 s

I modified wide_to_long slightly (regex on categorical column / avoid copying many "idvariables", postpone type coercion) and the runtime is now

%%time
res2 = wide_to_long2(frame, list(vars), i='id', j='year')
#CPU times: user 112 ms, sys: 10.8 ms, total: 123 ms
#Wall time: 125 ms

The result is the same

res1.equals(res2)
#True

Docstring clarification

The wide_to_long docstring also contains an unused last parameter stubend : str, which should be removed.

A docstring Note addtion about escaping special characters (with for example re.escape) in stubnames could also perhaps be informative, since if the user passes a dataframe with messy stubnames, the function fails with a pretty uninformative error message.

I can send a PR for this if that is wanted?

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: 06f26b5 python: 2.7.12.final.0 python-bits: 64 OS: Darwin OS-release: 16.1.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: None.None

pandas: 0.19.0+124.g06f26b5.dirty
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2016-11-30T23:53:34Z

@Nuffe sure that would be great! (also you could add your example in the docs if that is useful).

erikcs · 2016-12-01T01:01:06Z

Thanks, sorry for the premature PR, it turned out I had some edges cases that broke the routine. I will get back to this and fix this later. Sorry.

closes pandas-dev#14778 Please see regex search on long columns by first converting to Categorical, avoid melting all dataframes with all the id variables, and wait with trying to convert the "time" variable to `int` until last), and clear up the docstring. Author: nuffe <[email protected]> Closes pandas-dev#14779 from nuffe/wide2longfix and squashes the following commits: df1edf8 [nuffe] asv_bench: fix indentation and simplify dc13064 [nuffe] Set docstring to raw literal to allow backslashes to be printed (still had to escape them) 295d1e6 [nuffe] Use pd.Index in doc example 1c49291 [nuffe] Can of course get rid negative lookahead now that suffix is a regex 54c5920 [nuffe] Specify the suffix with a regex 5747a25 [nuffe] ENH/DOC: wide_to_long performance and functionality improvements (pandas-dev#14779)

jreback added Docs Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 30, 2016

jreback added this to the Next Major Release milestone Nov 30, 2016

erikcs mentioned this issue Dec 1, 2016

ENH/DOC: wide_to_long performance and docstring clarification #14779

Closed

4 tasks

jreback modified the milestones: 0.20.0, Next Major Release Dec 11, 2016

jreback closed this as completed in 86233e1 Dec 13, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH/DOC: wide_to_long performance and docstring clarification #14778

ENH/DOC: wide_to_long performance and docstring clarification #14778

erikcs commented Nov 30, 2016

jreback commented Nov 30, 2016

erikcs commented Dec 1, 2016

ENH/DOC: wide_to_long performance and docstring clarification #14778

ENH/DOC: wide_to_long performance and docstring clarification #14778

Comments

erikcs commented Nov 30, 2016

Docstring clarification

Output of pd.show_versions()

jreback commented Nov 30, 2016

erikcs commented Dec 1, 2016

Output of `pd.show_versions()`