Skip to content

Odd behaviour when slicing Pandas dataframe with numeric column headings #15824

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
lvphj opened this issue Mar 28, 2017 · 3 comments
Open

Odd behaviour when slicing Pandas dataframe with numeric column headings #15824

lvphj opened this issue Mar 28, 2017 · 3 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@lvphj
Copy link

lvphj commented Mar 28, 2017

This is a copy of a question I asked at http://stackoverflow.com/questions/43058734/odd-behaviour-when-slicing-pandas-dataframe-with-numeric-column-headings as suggested by a comment.

Consider a Pandas dataframe containing case-control data that can be represented by the following structure:

       caseA  caseN catA
    0      y      1    a
    1      y      1    a
    2      y      1    b
    3      y      1    b
    4      y      1    c
    5      y      1    d
    6      y      1    a
    7      y      1    c
    8      n      0    c
    9      n      0    d
    10     n      0    a
    11     n      0    b
    12     n      0    c
    13     n      0    a
    14     n      0    d
    15     n      0    a
    16     n      0    b
    17     n      0    c
    18     n      0    a
    19     n      0    d

The caseA and caseN variables represent cases and controls as strings and integers, respectively.

I can calculate a 2x2 table to facilitate the calculation of odds and odds ratios using the pandas crosstab method. The default order of the columns is control-case but I change this to case-control which, to my way of thinking, is a bit more intuitive. (This stage may not be relevant to the issue but illustrates the need for changing the order of the columns.)

I then slice the dataframe to print just a select number of rows with columns in the order case - control. This works exactly as expected.

However, if I add a new column to the dataframe (e.g. a column containing the odds values) and then slice the dataframe in exactly the same way, the cases and controls are printed in the wrong order.

The following code snippet illustrates this point:

    df = pd.DataFrame({'caseN':[1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0],
                       'caseA':['y','y','y','y','y','y','y','y','n','n','n','n','n','n','n','n','n','n','n','n'],
                       'catA':['a','a','b','b','c','d','a','c','c','d','a','b','c','a','d','a','b','c','a','d']})
    
    print('\nCross tabulation\n')
    continTab = pd.crosstab(df['catA'],df['caseN'])
    print(continTab)
    
    print('\nReorderd cross tabulation\n')
    continTab = continTab[[1,0]]
    print(continTab)
    
    #print('\n<-- An extra column containg odds has been entered here -->')
    #continTab['odds'] = continTab[1]/continTab[0]
    
    print('\nPrint just a slice contains rows a and c only with 1 - 0 column order\n')
    print(continTab.loc[['a','c'],[1,0]])

On the first run through (with no additional columns added) the sliced table produced is just as expected:

    caseN  1  0
    catA       
    a      3  4
    c      2  3

But if you uncomment the code that calculates the odds column and then re-run the exact same code, the sliced table produced has the column order reversed:

    caseN  0  1
    catA       
    a      4  3
    c      3  2

However, repeating the process using the case-control data described as strings (as found in variable caseA) produces the correct results, just as expected.

Output of pd.show_versions()

# Paste the output here pd.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.4.1.final.0 python-bits: 64 OS: Darwin OS-release: 15.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: en_GB.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 1.5.6
setuptools: 3.6
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: 2.3.0
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: 0.7.5.None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jorisvandenbossche
Copy link
Member

Smaller example(without the case study context, not needed to reproduce):

In [45]: df = pd.DataFrame(np.arange(9).reshape(3, 3))

In [46]: df
Out[46]: 
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8

In [47]: df.loc[[0,1], [1, 0]]
Out[47]: 
   1  0
0  1  0
1  4  3

In [48]: df.columns = [1, 0, 'str']

In [49]: df
Out[49]: 
   1  0  str
0  0  1    2
1  3  4    5
2  6  7    8

In [50]: df.loc[[0,1], [1, 0]]     ## <----- this is 'wrong'
Out[50]: 
   0  1
0  1  0
1  4  3

In [51]: df.loc[:, [1, 0]]    ## <----- but it works correctly when index is sliced
Out[51]: 
   1  0
0  0  1
1  3  4
2  6  7

In [52]: df.columns = [ 0, 1, 'str']  ## and also works when the integers 
                                      ## in the mixed index are in sorted

In [53]: df
Out[53]: 
   0  1  str
0  0  1    2
1  3  4    5
2  6  7    8

In [54]: df.loc[[0,1], [1, 0]]
Out[54]: 
   1  0
0  1  0
1  4  3

@jreback jreback added Bug Difficulty Advanced Indexing Related to indexing on series/frames, not to indexes themselves labels Mar 28, 2017
@jreback jreback added this to the Next Major Release milestone Mar 28, 2017
@jreback
Copy link
Contributor

jreback commented Mar 28, 2017

cc @toobaz if you are interested

@toobaz
Copy link
Member

toobaz commented Mar 28, 2017

@jreback sure, thanks, I took a note of this for a future PR (my next one shouldn't directly touch indexing code)

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

No branches or pull requests

6 participants