wide_to_long should verify uniqueness #16382


Closed
ohiobicyclist opened this issue May 18, 2017 · 10 comments · Fixed by #16403
Labels: Regression (functionality that used to work in a prior pandas version), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Milestone

Comments

@ohiobicyclist

ohiobicyclist commented May 18, 2017

Code Sample, a copy-pastable example if possible

# This code produces stacked tables in 0.19.2 but respectively an error and a blank table in 0.20.1
import pandas as pd
pd.__version__
seed = []
for i in range(14):
    seed.append([1, 2, 3, 4, 5])
seed.append([1] * 5)
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_B1", "A_A2", "B_B2", "A_A3", "B_B3", "A_A4", 
                   "B_B4", "A_A5", "B_B5", "A_A6", "B_B6", "A_A7", "B_B7", "x"]
test_df
# 0.19.2: a table with the 'i' field all as '1'.  0.20.1:  NotImplementedError
pd.wide_to_long(test_df, ["A_A", "B_B"], i="x", j="colname")
# 0.19.2: a table: 0.20.1: an empty table
# this line "clears" the error above in 0.20.1 by assigning each row a unique identifier.
test_df["x"] = test_df.apply(lambda row: row["A_A1"], axis=1)
pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname")

Problem description

The changelog lists "performance improvements" for pd.wide_to_long, but this is not an improvement for me; for these corner cases I would rather have the old behavior. Are they not mainstream enough to support?

Expected Output

 		A_A 	B_B
x 	colname 		
1 	1 	1 	1
1 	1 	2
1 	1 	3
1 	1 	4
1 	1 	5
1 	2 	1
1 	2 	2
1 	2 	3
1 	2 	4
1 	2 	5
1 	3 	1
1 	3 	2
1 	3 	3
1 	3 	4
1 	3 	5
1 	4 	1
1 	4 	2
1 	4 	3
1 	4 	4
1 	4 	5
1 	5 	1
1 	5 	2
1 	5 	3
1 	5 	4
1 	5 	5
2 	1 	1
2 	1 	2
2 	1 	3
2 	1 	4
2 	1 	5
... 	... 	...
6 	5 	1
6 	5 	2
6 	5 	3
6 	5 	4
6 	5 	5
7 	1 	1
7 	1 	2
7 	1 	3
7 	1 	4
7 	1 	5
7 	2 	1
7 	2 	2
7 	2 	3
7 	2 	4
7 	2 	5
7 	3 	1
7 	3 	2
7 	3 	3
7 	3 	4
7 	3 	5
7 	4 	1
7 	4 	2
7 	4 	3
7 	4 	4
7 	4 	5
7 	5 	1
7 	5 	2
7 	5 	3
7 	5 	4
7 	5 	5

175 rows x 2 columns

2)

 		A_ 	B_
x 	colname 		
1 	A1 	1.0 	NaN
2 	A1 	2.0 	NaN
3 	A1 	3.0 	NaN
4 	A1 	4.0 	NaN
5 	A1 	5.0 	NaN
1 	A2 	1.0 	NaN
2 	A2 	2.0 	NaN
3 	A2 	3.0 	NaN
4 	A2 	4.0 	NaN
5 	A2 	5.0 	NaN
1 	A3 	1.0 	NaN
2 	A3 	2.0 	NaN
3 	A3 	3.0 	NaN
4 	A3 	4.0 	NaN
5 	A3 	5.0 	NaN
1 	A4 	1.0 	NaN
2 	A4 	2.0 	NaN
3 	A4 	3.0 	NaN
4 	A4 	4.0 	NaN
5 	A4 	5.0 	NaN
1 	A5 	1.0 	NaN
2 	A5 	2.0 	NaN
3 	A5 	3.0 	NaN
4 	A5 	4.0 	NaN
5 	A5 	5.0 	NaN
1 	A6 	1.0 	NaN
2 	A6 	2.0 	NaN
3 	A6 	3.0 	NaN
4 	A6 	4.0 	NaN
5 	A6 	5.0 	NaN
... 	... 	... 	...
1 	B2 	NaN 	1.0
2 	B2 	NaN 	2.0
3 	B2 	NaN 	3.0
4 	B2 	NaN 	4.0
5 	B2 	NaN 	5.0
1 	B3 	NaN 	1.0
2 	B3 	NaN 	2.0
3 	B3 	NaN 	3.0
4 	B3 	NaN 	4.0
5 	B3 	NaN 	5.0
1 	B4 	NaN 	1.0
2 	B4 	NaN 	2.0
3 	B4 	NaN 	3.0
4 	B4 	NaN 	4.0
5 	B4 	NaN 	5.0
1 	B5 	NaN 	1.0
2 	B5 	NaN 	2.0
3 	B5 	NaN 	3.0
4 	B5 	NaN 	4.0
5 	B5 	NaN 	5.0
1 	B6 	NaN 	1.0
2 	B6 	NaN 	2.0
3 	B6 	NaN 	3.0
4 	B6 	NaN 	4.0
5 	B6 	NaN 	5.0
1 	B7 	NaN 	1.0
2 	B7 	NaN 	2.0
3 	B7 	NaN 	3.0
4 	B7 	NaN 	4.0
5 	B7 	NaN 	5.0

70 rows x 2 columns

pd_wide_to_long_changes.zip

Output of pd.show_versions()

# Output of pd.show_versions() (after upgrading)

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

@Nuffe does this seem related to #14779

@TomAugspurger added the Regression (functionality that used to work in a prior pandas version) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on May 18, 2017
@TomAugspurger TomAugspurger added this to the 0.20.2 milestone May 18, 2017
@erikcs
Contributor

erikcs commented May 18, 2017

@ohiobicyclist sorry, but I do not understand what you intend to do (with either the 0.19 or the 0.20 version).

From the wide_to_long docstring in 0.20:

Each row of these wide variables are assumed to be uniquely identified by
i (can be a single column name or a list of column names)

This is not directly checked, which is why you get a NotImplementedError from much further down the stack trace.

After you fix this and supply a unique row identifier, you forgot to specify the correct stubnames (as you did in the first example), which is why an empty data frame is returned (empty because no columns matching those stubs were found):

pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname")
Out[20]:
B_B3	A_A4	A_A1	A_A6	A_A7	B_B7	B_B1	B_B5	B_B6	B_B4	A_A2	B_B2	A_A3	A_A5	A_	B_
x	colname																
pd.wide_to_long(test_df, ["A_A", "B_B"], i="x", j="colname")
A_A	B_B
x	colname		
1	1	1	1
1	1	2
1	1	3
1	1	4
1	1	5
1	2	1
1	2	2
1	2	3
1	2	4
1	2	5
1	3	1
...

which returns a new df with 5 * 7 = 35 rows, because the frame has 7 suffixes per stub and 5 unique id values.
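Under a recent pandas version the corrected call can be reproduced end to end; a minimal sketch, assuming `x` is first given unique values (in the original report `x` was all 1s, which is exactly what triggers the error):

```python
import pandas as pd

# Rebuild the reporter's frame, 5 rows x 14 value columns.
seed = [[1, 2, 3, 4, 5] for _ in range(14)]
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_B1", "A_A2", "B_B2", "A_A3", "B_B3", "A_A4",
                   "B_B4", "A_A5", "B_B5", "A_A6", "B_B6", "A_A7", "B_B7"]
test_df["x"] = test_df["A_A1"]  # unique ids 1..5

# With the full stubnames A_A and B_B, all 14 value columns are matched
# and the default numeric-suffix regex picks up the trailing 1..7.
long_df = pd.wide_to_long(test_df, ["A_A", "B_B"], i="x", j="colname")
print(long_df.shape)  # 5 ids * 7 suffixes = (35, 2)
```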

@ohiobicyclist
Author

nuffe -- I'm a little less worried about the first example: worst comes to worst, I can just reset_index (twice if the index isn't already unique) and use that as a unique column. It's annoying that code that works in 0.19.2 breaks in 0.20.1 because of this, but at least there is a work-around. However, I contend the following: if the column names are changed from "A_Ax" "B_Bx" to "A_Ax" "B_Ax", then "A_" and "B_" ARE the correct stubnames, with the suffix strings being "A1", "A2", "A3", etc.
all a substrings notebook.zip

Yet even that doesn't produce a table with data under 0.20.1.

@TomAugspurger
Contributor

@ohiobicyclist can you post the code for the example that should work here in the issue?

@ohiobicyclist
Author

ohiobicyclist commented May 18, 2017

Sure, thanks.

import pandas as pd
seed = []
for i in range(15):
    seed.append([1, 2, 3, 4, 5])
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_A1", "A_A2", "B_A2", "A_A3", "B_A3", "A_A4", 
                   "B_A4", "A_A5", "B_A5", "A_A6", "B_A6", "A_A7", "B_A7", "x"]
pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname")

@TomAugspurger
Contributor

That example has the same issue with the stubnames, right? I think they should be A_A and B_A:

In [13]: pd.wide_to_long(test_df, ['A_A', 'B_A'], i='x', j='y').head()
Out[13]:
     A_A  B_A
x y
1 1    1    1
2 1    2    2
3 1    3    3
4 1    4    4
5 1    5    5
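A self-contained version of that corrected call (same data as the notebook above, with the stubnames matched to the actual column prefixes):

```python
import pandas as pd

seed = [[1, 2, 3, 4, 5] for _ in range(15)]
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_A1", "A_A2", "B_A2", "A_A3", "B_A3", "A_A4",
                   "B_A4", "A_A5", "B_A5", "A_A6", "B_A6", "A_A7", "B_A7", "x"]

# Stubnames A_A and B_A match the column prefixes, leaving numeric
# suffixes 1..7, which the default suffix regex handles; x holds the
# unique ids 1..5.
long_df = pd.wide_to_long(test_df, ["A_A", "B_A"], i="x", j="y")
```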

@erikcs
Contributor

erikcs commented May 18, 2017

@ohiobicyclist when you specify stubnames = ["A_", "B_"], wide_to_long by default expects to find numeric suffixes (as does Stata; this is in the documentation), but that is not the case in your example. You can do what you want as follows (please note the new suffix argument):

In [7]: pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname", suffix='').head()
Out[7]:
            A_  B_
x colname
1 A1       1.0 NaN
  A2       1.0 NaN
  A3       1.0 NaN
  A4       1.0 NaN
  A5       1.0 NaN

@erikcs
Contributor

erikcs commented May 20, 2017

Was this helpful @ohiobicyclist? (I am not sure whether your exact use case is better handled by another function, because judging from your examples I am not sure exactly what you need to do.)

But I agree that checking for a unique i should have been done (I can send a PR for this @TomAugspurger )
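The check could look roughly like this; a minimal sketch of the idea, not the actual patch that landed in #16403 (the helper name and message text are illustrative):

```python
import pandas as pd

def check_unique_id(df, i):
    """Raise early if the id variable(s) `i` do not uniquely identify rows,
    instead of letting a NotImplementedError surface deep in the stack."""
    cols = [i] if isinstance(i, str) else list(i)
    if df.duplicated(subset=cols).any():
        raise ValueError("the id variables need to uniquely identify each row")

# A frame like the reporter's, where x is all 1s, is rejected up front:
bad = pd.DataFrame({"x": [1, 1, 1], "A_A1": [1, 2, 3]})
try:
    check_unique_id(bad, "x")
except ValueError as exc:
    print(exc)  # the id variables need to uniquely identify each row
```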

@jreback
Contributor

jreback commented May 20, 2017

@Nuffe yes pls.

@ohiobicyclist
Author

This (suffix="") is helpful. The thing that makes wide_to_long better than melt for my application is the dual-column (or multi-column) stacked output. A somewhat clearer example would be column names
["key_sampleid1", "value_sampleid1", "key_sampleid2", "value_sampleid2", "key_otherid1", "value_otherid1"], because it can nicely stack the three ids with key and value. suffix="" takes care of that.
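For reference, in later pandas versions `suffix` is a regular expression defaulting to numeric suffixes (`\d+`), so this key/value stacking is spelled with `suffix=r".+"` instead of `suffix=""`; a sketch under that assumption, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "key_sampleid1": ["a", "b"], "value_sampleid1": [10, 20],
    "key_sampleid2": ["c", "d"], "value_sampleid2": [30, 40],
    "key_otherid1": ["e", "f"], "value_otherid1": [50, 60],
})

# sep="_" strips the underscore between stub and suffix; suffix=r".+"
# allows the non-numeric suffixes sampleid1, sampleid2, otherid1.
long_df = pd.wide_to_long(df, stubnames=["key", "value"],
                          i="id", j="sample", sep="_", suffix=r".+")
print(long_df.shape)  # 2 ids * 3 suffixes = (6, 2)
```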

@TomAugspurger changed the title from "Wide_to_long has lost functionality compared to 0.19.2" to "wide_to_long should verify uniqueness" on May 22, 2017
TomAugspurger pushed a commit that referenced this issue May 23, 2017
* BUG: wide_to_long should check for unique id vars (#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue May 29, 2017
…andas-dev#16403)

* BUG: wide_to_long should check for unique id vars (pandas-dev#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)

(cherry picked from commit 04356a8)
TomAugspurger pushed a commit that referenced this issue May 30, 2017
* BUG: wide_to_long should check for unique id vars (#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)

(cherry picked from commit 04356a8)
stangirala pushed a commit to stangirala/pandas that referenced this issue Jun 11, 2017
…andas-dev#16403)

* BUG: wide_to_long should check for unique id vars (pandas-dev#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)