wide_to_long should verify uniqueness #16382


Closed
ohiobicyclist opened this issue May 18, 2017 · 10 comments · Fixed by #16403
Labels: Regression (functionality that used to work in a prior pandas version), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Milestone

Comments

@ohiobicyclist

ohiobicyclist commented May 18, 2017

Code Sample, a copy-pastable example if possible

# This code produces stacked tables in 0.19.2 but respectively an error and a blank table in 0.20.1
import pandas as pd
pd.__version__
seed = []
for i in range(14):
    seed.append([1, 2, 3, 4, 5])
seed.append([1] * 5)
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_B1", "A_A2", "B_B2", "A_A3", "B_B3", "A_A4", 
                   "B_B4", "A_A5", "B_B5", "A_A6", "B_B6", "A_A7", "B_B7", "x"]
test_df
# 0.19.2: a table with the 'i' field all as '1'.  0.20.1:  NotImplementedError
pd.wide_to_long(test_df, ["A_A", "B_B"], i="x", j="colname")
# 0.19.2: a table: 0.20.1: an empty table
# this line "clears" the error above in 0.20.1 by assigning each row a unique identifier.
test_df["x"] = test_df.apply(lambda row: row["A_A1"], axis=1)
pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname")

Problem description

The changelog lists "performance improvements" for pd.wide_to_long, but this is not an improvement for me; for these corner cases I would rather have the old behavior. Are they not mainstream enough to support?

Expected Output

 		A_A 	B_B
x 	colname 		
1 	1 	1 	1
1 	1 	2
1 	1 	3
1 	1 	4
1 	1 	5
1 	2 	1
1 	2 	2
1 	2 	3
1 	2 	4
1 	2 	5
1 	3 	1
1 	3 	2
1 	3 	3
1 	3 	4
1 	3 	5
1 	4 	1
1 	4 	2
1 	4 	3
1 	4 	4
1 	4 	5
1 	5 	1
1 	5 	2
1 	5 	3
1 	5 	4
1 	5 	5
2 	1 	1
2 	1 	2
2 	1 	3
2 	1 	4
2 	1 	5
... 	... 	...
6 	5 	1
6 	5 	2
6 	5 	3
6 	5 	4
6 	5 	5
7 	1 	1
7 	1 	2
7 	1 	3
7 	1 	4
7 	1 	5
7 	2 	1
7 	2 	2
7 	2 	3
7 	2 	4
7 	2 	5
7 	3 	1
7 	3 	2
7 	3 	3
7 	3 	4
7 	3 	5
7 	4 	1
7 	4 	2
7 	4 	3
7 	4 	4
7 	4 	5
7 	5 	1
7 	5 	2
7 	5 	3
7 	5 	4
7 	5 	5

175 rows x 2 columns

2)

 		A_ 	B_
x 	colname 		
1 	A1 	1.0 	NaN
2 	A1 	2.0 	NaN
3 	A1 	3.0 	NaN
4 	A1 	4.0 	NaN
5 	A1 	5.0 	NaN
1 	A2 	1.0 	NaN
2 	A2 	2.0 	NaN
3 	A2 	3.0 	NaN
4 	A2 	4.0 	NaN
5 	A2 	5.0 	NaN
1 	A3 	1.0 	NaN
2 	A3 	2.0 	NaN
3 	A3 	3.0 	NaN
4 	A3 	4.0 	NaN
5 	A3 	5.0 	NaN
1 	A4 	1.0 	NaN
2 	A4 	2.0 	NaN
3 	A4 	3.0 	NaN
4 	A4 	4.0 	NaN
5 	A4 	5.0 	NaN
1 	A5 	1.0 	NaN
2 	A5 	2.0 	NaN
3 	A5 	3.0 	NaN
4 	A5 	4.0 	NaN
5 	A5 	5.0 	NaN
1 	A6 	1.0 	NaN
2 	A6 	2.0 	NaN
3 	A6 	3.0 	NaN
4 	A6 	4.0 	NaN
5 	A6 	5.0 	NaN
... 	... 	... 	...
1 	B2 	NaN 	1.0
2 	B2 	NaN 	2.0
3 	B2 	NaN 	3.0
4 	B2 	NaN 	4.0
5 	B2 	NaN 	5.0
1 	B3 	NaN 	1.0
2 	B3 	NaN 	2.0
3 	B3 	NaN 	3.0
4 	B3 	NaN 	4.0
5 	B3 	NaN 	5.0
1 	B4 	NaN 	1.0
2 	B4 	NaN 	2.0
3 	B4 	NaN 	3.0
4 	B4 	NaN 	4.0
5 	B4 	NaN 	5.0
1 	B5 	NaN 	1.0
2 	B5 	NaN 	2.0
3 	B5 	NaN 	3.0
4 	B5 	NaN 	4.0
5 	B5 	NaN 	5.0
1 	B6 	NaN 	1.0
2 	B6 	NaN 	2.0
3 	B6 	NaN 	3.0
4 	B6 	NaN 	4.0
5 	B6 	NaN 	5.0
1 	B7 	NaN 	1.0
2 	B7 	NaN 	2.0
3 	B7 	NaN 	3.0
4 	B7 	NaN 	4.0
5 	B7 	NaN 	5.0

70 rows x 2 columns

pd_wide_to_long_changes.zip

Output of pd.show_versions()

# Output of pd.show_versions() (after upgrading)

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

@Nuffe does this seem related to #14779

@TomAugspurger added the Regression (functionality that used to work in a prior pandas version) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on May 18, 2017
@TomAugspurger TomAugspurger added this to the 0.20.2 milestone May 18, 2017
@erikcs
Contributor

erikcs commented May 18, 2017

@ohiobicyclist sorry, but I do not understand what you intend to do (with either the 0.19 or the 0.20 version).

From the wide_to_long docstring in 0.20:

Each row of these wide variables are assumed to be uniquely identified by
i (can be a single column name or a list of column names)

This is not directly checked, which is why you get a NotImplementedError from much further down the stack trace.

After you fix this and supply a unique row identifier, you forgot to specify the correct stubnames (as you did in the first example), which is why an empty data frame is returned (empty because no columns matching those stubs were found):

pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname")
Out[20]:
B_B3	A_A4	A_A1	A_A6	A_A7	B_B7	B_B1	B_B5	B_B6	B_B4	A_A2	B_B2	A_A3	A_A5	A_	B_
x	colname																
pd.wide_to_long(test_df, ["A_A", "B_B"], i="x", j="colname")
A_A	B_B
x	colname		
1	1	1	1
1	1	2
1	1	3
1	1	4
1	1	5
1	2	1
1	2	2
1	2	3
1	2	4
1	2	5
1	3	1
...

which returns a new df with 5 * 7 = 35 rows, because the frame has 7 suffixes per stub and 5 unique id values.
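Under a recent pandas version the corrected call can be reproduced end to end; a minimal sketch, assuming `x` is first given unique values (in the original report `x` was all 1s, which is exactly what triggers the error):

```python
import pandas as pd

# Rebuild the reporter's frame, 5 rows x 14 value columns.
seed = [[1, 2, 3, 4, 5] for _ in range(14)]
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_B1", "A_A2", "B_B2", "A_A3", "B_B3", "A_A4",
                   "B_B4", "A_A5", "B_B5", "A_A6", "B_B6", "A_A7", "B_B7"]
test_df["x"] = test_df["A_A1"]  # unique ids 1..5

# With the full stubnames A_A and B_B, all 14 value columns are matched
# and the default numeric-suffix regex picks up the trailing 1..7.
long_df = pd.wide_to_long(test_df, ["A_A", "B_B"], i="x", j="colname")
print(long_df.shape)  # 5 ids * 7 suffixes = (35, 2)
```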

@ohiobicyclist
Author

nuffe -- I'm a little less worried about the first example: worst comes to worst, I can just reset_index (twice if the index isn't already unique) and use that as a unique column. It's annoying that code that works in 0.19.2 breaks in 0.20.1 because of this, but at least there is a work-around. However, I contend the following: if the column names are changed from "A_Ax" "B_Bx" to "A_Ax" "B_Ax", then "A_" and "B_" ARE the correct stubnames, with the suffix strings being "A1", "A2", "A3", etc.
all a substrings notebook.zip

Yet even that doesn't produce a table with data under 0.20.1.

@TomAugspurger
Contributor

@ohiobicyclist can you post the code for the example that should work here in the issue?

@ohiobicyclist
Author

ohiobicyclist commented May 18, 2017

Sure, thanks.

import pandas as pd
seed = []
for i in range(15):
    seed.append([1, 2, 3, 4, 5])
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_A1", "A_A2", "B_A2", "A_A3", "B_A3", "A_A4", 
                   "B_A4", "A_A5", "B_A5", "A_A6", "B_A6", "A_A7", "B_A7", "x"]
pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname")

@TomAugspurger
Contributor

That example has the same issue with the stubnames, right? I think they should be A_A and B_A:

In [13]: pd.wide_to_long(test_df, ['A_A', 'B_A'], i='x', j='y').head()
Out[13]:
     A_A  B_A
x y
1 1    1    1
2 1    2    2
3 1    3    3
4 1    4    4
5 1    5    5
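A self-contained version of that corrected call (same data as the notebook above, with the stubnames matched to the actual column prefixes):

```python
import pandas as pd

seed = [[1, 2, 3, 4, 5] for _ in range(15)]
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_A1", "A_A2", "B_A2", "A_A3", "B_A3", "A_A4",
                   "B_A4", "A_A5", "B_A5", "A_A6", "B_A6", "A_A7", "B_A7", "x"]

# Stubnames A_A and B_A match the column prefixes, leaving numeric
# suffixes 1..7, which the default suffix regex handles; x holds the
# unique ids 1..5.
long_df = pd.wide_to_long(test_df, ["A_A", "B_A"], i="x", j="y")
```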

@erikcs
Contributor

erikcs commented May 18, 2017

@ohiobicyclist when you specify stubnames = ["A_", "B_"], wide_to_long by default expects to find numeric suffixes (as does Stata; this is in the documentation), but that is not the case in your example. You can do what you want as follows (please note the new suffix argument):

In [7]: pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname", suffix='').head()
Out[7]:
            A_  B_
x colname
1 A1       1.0 NaN
  A2       1.0 NaN
  A3       1.0 NaN
  A4       1.0 NaN
  A5       1.0 NaN

@erikcs
Contributor

erikcs commented May 20, 2017

Was this helpful @ohiobicyclist? (I am not sure whether your exact use case is better handled by another function, because judging from your examples I am not sure exactly what you need to do.)

But I agree that checking for a unique i should have been done (I can send a PR for this @TomAugspurger )
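The check could look roughly like this; a minimal sketch of the idea, not the actual patch that landed in #16403 (the helper name and message text are illustrative):

```python
import pandas as pd

def check_unique_id(df, i):
    """Raise early if the id variable(s) `i` do not uniquely identify rows,
    instead of letting a NotImplementedError surface deep in the stack."""
    cols = [i] if isinstance(i, str) else list(i)
    if df.duplicated(subset=cols).any():
        raise ValueError("the id variables need to uniquely identify each row")

# A frame like the reporter's, where x is all 1s, is rejected up front:
bad = pd.DataFrame({"x": [1, 1, 1], "A_A1": [1, 2, 3]})
try:
    check_unique_id(bad, "x")
except ValueError as exc:
    print(exc)  # the id variables need to uniquely identify each row
```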

@jreback
Contributor

jreback commented May 20, 2017

@Nuffe yes pls.

@ohiobicyclist
Author

This (suffix="") is helpful. The thing that makes wide_to_long better than melt for my application is the dual-column (or multi-column) stacked output. A somewhat clearer example would be column names
["key_sampleid1", "value_sampleid1", "key_sampleid2", "value_sampleid2", "key_otherid1", "value_otherid1"], because it can nicely stack the three ids with key and value. suffix="" takes care of that.
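For reference, in later pandas versions `suffix` is a regular expression defaulting to numeric suffixes (`\d+`), so this key/value stacking is spelled with `suffix=r".+"` instead of `suffix=""`; a sketch under that assumption, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "key_sampleid1": ["a", "b"], "value_sampleid1": [10, 20],
    "key_sampleid2": ["c", "d"], "value_sampleid2": [30, 40],
    "key_otherid1": ["e", "f"], "value_otherid1": [50, 60],
})

# sep="_" strips the underscore between stub and suffix; suffix=r".+"
# allows the non-numeric suffixes sampleid1, sampleid2, otherid1.
long_df = pd.wide_to_long(df, stubnames=["key", "value"],
                          i="id", j="sample", sep="_", suffix=r".+")
print(long_df.shape)  # 2 ids * 3 suffixes = (6, 2)
```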

@TomAugspurger changed the title from "Wide_to_long has lost functionality compared to 0.19.2" to "wide_to_long should verify uniqueness" on May 22, 2017
TomAugspurger pushed a commit that referenced this issue May 23, 2017
* BUG: wide_to_long should check for unique id vars (#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue May 29, 2017
…andas-dev#16403)

* BUG: wide_to_long should check for unique id vars (pandas-dev#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)

(cherry picked from commit 04356a8)
TomAugspurger pushed a commit that referenced this issue May 30, 2017
* BUG: wide_to_long should check for unique id vars (#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)

(cherry picked from commit 04356a8)
stangirala pushed a commit to stangirala/pandas that referenced this issue Jun 11, 2017
…andas-dev#16403)

* BUG: wide_to_long should check for unique id vars (pandas-dev#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)