DOC, CI: Correct wide_to_long docstring and add reshape/melt to CI #26273

vandenn · 2019-05-03T06:41:29Z

closes Incorrect example in wide_to_long docstring #25733
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Fix the erroneous example in the wide_to_long docstring (non-integer suffixes portion)
Update code_checks to include reshape/melt.py
Edits based on the PR made by sanjusci (#26010)
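
For context, the non-integer suffix case needs both a separator and a suffix pattern that accepts word characters. The following is a minimal, standard-library-only sketch of how a stub, separator and suffix could combine into a column-matching pattern. The helper name `matching_columns` is hypothetical, and the sketch approximates the idea rather than reproducing the pandas internals.

```python
import re


def matching_columns(columns, stub, sep="", suffix=r"\d+"):
    """Return the columns shaped like <stub><sep><suffix> (hypothetical helper)."""
    # Escape stub and sep so characters such as "(" or "-" are matched literally;
    # the suffix is deliberately left as a regular expression.
    pattern = re.compile(
        r"^{}{}{}$".format(re.escape(stub), re.escape(sep), suffix)
    )
    return [col for col in columns if pattern.match(col)]


columns = ["famid", "birth", "ht_one", "ht_two"]

# The default numeric suffix finds nothing for word suffixes.
numeric = matching_columns(columns, "ht", sep="_")
# A word-character suffix, written as a raw string, picks up both columns.
words = matching_columns(columns, "ht", sep="_", suffix=r"\w+")

print(numeric)  # []
print(words)    # ['ht_one', 'ht_two']
```

This is why the corrected example passes both `sep='_'` and a `\w+` suffix: with the default numeric suffix, neither `ht_one` nor `ht_two` would be treated as a wide variable.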

Docstring validation still reports errors, but all of them were already present at HEAD; the docstring changes in this PR did not introduce any new ones.
Edit: Updated the validation output below to match the latest commit.

$ python scripts/validate_docstrings.py pandas.wide_to_long

################################################################################
####################### Docstring (pandas.wide_to_long)  #######################
################################################################################

Wide panel to long format. Less flexible but more user-friendly than melt.

With stubnames ['A', 'B'], this function expects to find one or more
group of columns with format
A-suffix1, A-suffix2,..., B-suffix1, B-suffix2,...
You specify what you want to call this suffix in the resulting long format
with `j` (for example `j='year'`)

Each row of these wide variables are assumed to be uniquely identified by
`i` (can be a single column name or a list of column names)

All remaining variables in the data frame are left intact.

Parameters
----------
df : DataFrame
    The wide-format DataFrame
stubnames : str or list-like
    The stub name(s). The wide format variables are assumed to
    start with the stub names.
i : str or list-like
    Column(s) to use as id variable(s)
j : str
    The name of the sub-observation variable. What you wish to name your
    suffix in the long format.
sep : str, default ""
    A character indicating the separation of the variable names
    in the wide format, to be stripped from the names in the long format.
    For example, if your column names are A-suffix1, A-suffix2, you
    can strip the hyphen by specifying `sep='-'`

    .. versionadded:: 0.20.0

suffix : str, default '\\d+'
    A regular expression capturing the wanted suffixes. '\\d+' captures
    numeric suffixes. Suffixes with no numbers could be specified with the
    negated character class '\\D+'. You can also further disambiguate
    suffixes, for example, if your wide variables are of the form
    A-one, B-two,.., and you have an unrelated column A-rating, you can
    ignore the last one by specifying `suffix='(!?one|two)'`

    .. versionadded:: 0.20.0

    .. versionchanged:: 0.23.0
        When all suffixes are numeric, they are cast to int64/float64.

Returns
-------
DataFrame
    A DataFrame that contains each stub name as a variable, with new index
    (i, j).

Notes
-----
All extra variables are left untouched. This simply uses
`pandas.melt` under the hood, but is hard-coded to "do the right thing"
in a typical case.

Examples
--------
>>> np.random.seed(123)
>>> df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
...                    "A1980" : {0 : "d", 1 : "e", 2 : "f"},
...                    "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
...                    "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
...                    "X"     : dict(zip(range(3), np.random.randn(3)))
...                   })
>>> df["id"] = df.index
>>> df
  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2 -1.085631   0
1     b     e    1.2    1.3  0.997345   1
2     c     f    0.7    0.1  0.282978   2
>>> pd.wide_to_long(df, ["A", "B"], i="id", j="year")
... # doctest: +NORMALIZE_WHITESPACE
                X  A    B
id year
0  1970 -1.085631  a  2.5
1  1970  0.997345  b  1.2
2  1970  0.282978  c  0.7
0  1980 -1.085631  d  3.2
1  1980  0.997345  e  1.3
2  1980  0.282978  f  0.1

With multiple id columns

>>> df = pd.DataFrame({
...     'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
...     'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
...     'ht1': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
...     'ht2': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
... })
>>> df
   famid  birth  ht1  ht2
0      1      1  2.8  3.4
1      1      2  2.9  3.8
2      1      3  2.2  2.9
3      2      1  2.0  3.2
4      2      2  1.8  2.8
5      2      3  1.9  2.4
6      3      1  2.2  3.3
7      3      2  2.3  3.4
8      3      3  2.1  2.9
>>> l = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age')
>>> l
... # doctest: +NORMALIZE_WHITESPACE
                  ht
famid birth age
1     1     1    2.8
            2    3.4
      2     1    2.9
            2    3.8
      3     1    2.2
            2    2.9
2     1     1    2.0
            2    3.2
      2     1    1.8
            2    2.8
      3     1    1.9
            2    2.4
3     1     1    2.2
            2    3.3
      2     1    2.3
            2    3.4
      3     1    2.1
            2    2.9

Going from long back to wide just takes some creative use of `unstack`

>>> w = l.unstack()
>>> w.columns = w.columns.map('{0[0]}{0[1]}'.format)
>>> w.reset_index()
   famid  birth  ht1  ht2
0      1      1  2.8  3.4
1      1      2  2.9  3.8
2      1      3  2.2  2.9
3      2      1  2.0  3.2
4      2      2  1.8  2.8
5      2      3  1.9  2.4
6      3      1  2.2  3.3
7      3      2  2.3  3.4
8      3      3  2.1  2.9

Less wieldy column names are also handled

>>> np.random.seed(0)
>>> df = pd.DataFrame({'A(weekly)-2010': np.random.rand(3),
...                    'A(weekly)-2011': np.random.rand(3),
...                    'B(weekly)-2010': np.random.rand(3),
...                    'B(weekly)-2011': np.random.rand(3),
...                    'X' : np.random.randint(3, size=3)})
>>> df['id'] = df.index
>>> df # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
   A(weekly)-2010  A(weekly)-2011  B(weekly)-2010  B(weekly)-2011  X  id
0        0.548814        0.544883        0.437587        0.383442  0   0
1        0.715189        0.423655        0.891773        0.791725  1   1
2        0.602763        0.645894        0.963663        0.528895  1   2

>>> pd.wide_to_long(df, ['A(weekly)', 'B(weekly)'], i='id',
...                 j='year', sep='-')
... # doctest: +NORMALIZE_WHITESPACE
         X  A(weekly)  B(weekly)
id year
0  2010  0   0.548814   0.437587
1  2010  1   0.715189   0.891773
2  2010  1   0.602763   0.963663
0  2011  0   0.544883   0.383442
1  2011  1   0.423655   0.791725
2  2011  1   0.645894   0.528895

If we have many columns, we could also use a regex to find our
stubnames and pass that list on to wide_to_long

>>> stubnames = sorted(
...     set([match[0] for match in df.columns.str.findall(
...         r'[A-B]\(.*\)').values if match != [] ])
... )
>>> list(stubnames)
['A(weekly)', 'B(weekly)']

All of the above examples have integers as suffixes. It is possible to
have non-integers as suffixes.

>>> df = pd.DataFrame({
...     'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
...     'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
...     'ht_one': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
...     'ht_two': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
... })
>>> df
   famid  birth  ht_one  ht_two
0      1      1     2.8     3.4
1      1      2     2.9     3.8
2      1      3     2.2     2.9
3      2      1     2.0     3.2
4      2      2     1.8     2.8
5      2      3     1.9     2.4
6      3      1     2.2     3.3
7      3      2     2.3     3.4
8      3      3     2.1     2.9

>>> l = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age',
...                     sep='_', suffix='\w+')
>>> l
... # doctest: +NORMALIZE_WHITESPACE
                  ht
famid birth age
1     1     one  2.8
            two  3.4
      2     one  2.9
            two  3.8
      3     one  2.2
            two  2.9
2     1     one  2.0
            two  3.2
      2     one  1.8
            two  2.8
      3     one  1.9
            two  2.4
3     1     one  2.2
            two  3.3
      2     one  2.3
            two  3.4
      3     one  2.1
            two  2.9

################################################################################
################################## Validation ##################################
################################################################################

11 Errors found:
	Parameter "df" description should finish with "."
	Parameter "i" description should finish with "."
	Parameter "sep" description should finish with "."
	Parameter "suffix" description should finish with "."
	flake8 error: C403 Unnecessary list comprehension - rewrite as a set comprehension.
	flake8 error: E124 closing bracket does not match visual indentation
	flake8 error: E202 whitespace before ']'
	flake8 error: E203 whitespace before ':' (18 times)
	flake8 error: E261 at least two spaces before inline comment
	flake8 error: E741 ambiguous variable name 'l' (2 times)
	flake8 error: W605 invalid escape sequence '\w'
1 Warnings found:
	See Also section not found
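
Several of the flake8 errors listed above (C403, E202, W605) come from the stubnames snippet and the `'\w+'` literal in the examples. The sketch below shows one possible lint-clean form using only the standard library. The `columns` list stands in for `df.columns` and is an assumption for illustration, and `re.search` is swapped in for the `str.findall` call used in the docstring.

```python
import re

# Stand-in for df.columns from the "weekly" example (assumed for illustration).
columns = ["A(weekly)-2010", "A(weekly)-2011",
           "B(weekly)-2010", "B(weekly)-2011", "X", "id"]

# C403 / E202: a set comprehension replaces set([...]) and avoids the
# stray whitespace before the closing bracket.
stubnames = sorted({
    match.group(0)
    for match in (re.search(r"[A-B]\(.*\)", col) for col in columns)
    if match is not None
})
print(stubnames)  # ['A(weekly)', 'B(weekly)']

# W605: a raw string keeps the backslash without an invalid escape sequence.
suffix = r"\w+"
print(re.fullmatch(suffix, "one") is not None)  # True
```

The E741 error would be addressed separately by renaming the variable `l` in the examples to something less ambiguous.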

pep8speaks · 2019-05-03T07:11:47Z

Hello @vandenn! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-03 07:20:34 UTC

codecov · 2019-05-03T07:11:50Z

Codecov Report

Merging #26273 into master will decrease coverage by 51.26%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master   #26273       +/-   ##
===========================================
- Coverage   91.99%   40.72%   -51.27%     
===========================================
  Files         175      175               
  Lines       52379    52379               
===========================================
- Hits        48184    21331    -26853     
- Misses       4195    31048    +26853

Flag	Coverage Δ
#multiple	`?`
#single	`40.72% <ø> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/reshape/melt.py	`12.6% <ø> (-84.88%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.37%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.16%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.1%)`	⬇️
pandas/core/tools/numeric.py	`10.44% <0%> (-89.56%)`	⬇️
... and 131 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e854ccf...25888c3.

codecov · 2019-05-03T07:11:51Z

Codecov Report

Merging #26273 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #26273      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52379    52379              
==========================================
- Hits        48184    48179       -5     
- Misses       4195     4200       +5

Flag	Coverage Δ
#multiple	`90.53% <ø> (ø)`	⬆️
#single	`40.73% <ø> (-0.14%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/reshape/melt.py	`97.47% <ø> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`96.9% <0%> (-0.12%)`	⬇️
pandas/util/testing.py	`90.61% <0%> (-0.11%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e854ccf...61b2793.

WillAyd · 2019-05-03T14:52:23Z

Thanks a lot @vandenn

vandenn · 2019-05-03T14:57:25Z

Thanks again @gfyoung and @WillAyd !

vandenn added 2 commits May 3, 2019 14:19

DOC: Correct wide_to_long docstring

3861066

CI: Add reshape/melt.py to CI checks

a1f8a63

DOC: Fix outputs in wide_to_long docstring to reflect actual results

61b2793

Change "quarterly" columns to "weekly" in order to fit output within line limit.

vandenn force-pushed the fix-wide-to-long branch from 25888c3 to 61b2793 Compare May 3, 2019 07:20

gfyoung added Docs Code Style Code style, linting, code_checks labels May 3, 2019

gfyoung approved these changes May 3, 2019

WillAyd added this to the 0.25.0 milestone May 3, 2019

WillAyd approved these changes May 3, 2019

WillAyd merged commit f46ab96 into pandas-dev:master May 3, 2019

WillAyd mentioned this pull request May 3, 2019

Incorrect example in wide_to_long docstring #26010

Closed

4 tasks

vandenn deleted the fix-wide-to-long branch May 3, 2019 14:56
