setting values in a dataframe with duplicated keys #34034

c-foschi · 2020-05-06T19:23:19Z

I have a dataframe with duplicated keys, and I need to assign values to one column of that dataframe for several different keys. DataFrame indexes support duplicated keys, so I assume this kind of work should be easy; if not, I don't see why dataframes should be allowed to have duplicated keys. Anyway, setting the values like this:

df.loc[key, 'column']= vector

gives me the following error:

ValueError: Must have equal len keys and value when setting with an iterable

even if the number of times key appears in the index of df is equal to the length of vector. I think this should be fixed.

Thank you,
c. foschi


dsaxton · 2020-05-07T02:00:43Z

@c-foschi Can you provide an example that can be copy / pasted which shows the problem? It seems that this functionality does work:

In [1]: df = pd.DataFrame([1, 2, 3], index=[1, 1, 2])

In [2]: df
Out[2]:
   0
1  1
1  2
2  3

In [3]: df.loc[1, 0] = [9, 9]

In [4]: df
Out[4]:
   0
1  9
1  9
2  3

c-foschi · 2020-05-07T12:55:16Z

@dsaxton @MarcoGorelli I tried to reproduce the error with toy examples but I failed many times. Still, every time I run my code, the same error appears. Here it is, I hope it makes some sense to you:

In:

df= pd.read_csv('earthquake_signals.txt')
df.set_index('code', inplace= True)
df.index

Out:

Int64Index([6342156, 6342156, 6342156, 6342156, 6342156, 6342156, 6342156,
        6342156, 6342156, 6342156,
        ...
        1256211, 1256211, 1256211, 1256211, 1256211, 1256211, 1256211,
        1256211, 1256211, 1256211],
       dtype='int64', name='code', length=389750)

In:

len(df.loc[4848666, 'longitude'])

Out:

In:

df.loc[4848666, 'longitude']= np.arange(9)

Out:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-58-e7eaa328acef> in <module>
----> 1 df.loc[4848666, 'longitude']= np.arange(9)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
    188             key = com.apply_if_callable(key, self.obj)
    189         indexer = self._get_setitem_indexer(key)
--> 190         self._setitem_with_indexer(indexer, value)
    191 
    192     def _validate_key(self, key, axis):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _setitem_with_indexer(self, indexer, value)
    609 
    610                     if len(labels) != len(value):
--> 611                         raise ValueError('Must have equal len keys and value '
    612                                          'when setting with an iterable')
    613 

ValueError: Must have equal len keys and value when setting with an iterable

MarcoGorelli · 2020-05-07T13:12:26Z

What version of pandas are you using?

c-foschi · 2020-05-07T13:18:58Z

From pd.__version__ it appears to be version 0.24.2. That may be outdated, actually.

MarcoGorelli · 2020-05-07T13:21:40Z

Yes, please upgrade to the latest version (1.0.3)

c-foschi · 2020-05-07T14:42:57Z

Done. Same error occurs:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-e7eaa328acef> in <module>
----> 1 df.loc[4848666, 'longitude']= np.arange(9)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
    669             key = com.apply_if_callable(key, self.obj)
    670         indexer = self._get_setitem_indexer(key)
--> 671         self._setitem_with_indexer(indexer, value)
    672 
    673     def _validate_key(self, key, axis: int):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _setitem_with_indexer(self, indexer, value)
   1017                     if len(labels) != len(value):
   1018                         raise ValueError(
-> 1019                             "Must have equal len keys and value "
   1020                             "when setting with an iterable"
   1021                         )

ValueError: Must have equal len keys and value when setting with an iterable

MarcoGorelli · 2020-05-07T14:49:53Z

OK, once you read in 'earthquake_signals.txt', can you try making the dataframe as small as possible (by only considering a subset of all rows) such that the error reproduces, and then post here a reproducible example using that data?

c-foschi · 2020-05-07T15:16:34Z

OK, I finally managed to create a toy example with random values:

I have a file like this:

code,longitude,date
6342156,0.966747,a
6342156,0.756199,b
6342156,0.054222,c
6342156,0.743996,d
6342156,0.486753,a
6342156,0.464093,s
6342156,0.430592,d
2261019,0.827252,f
2261019,0.864456,f
2261019,0.866847,d

And my code is:

import numpy as np
import pandas as pd

X = pd.read_csv('toy.csv')
X.set_index('code', inplace=True)
X.loc[2261019, 'longitude'] = np.arange(3)

This raises the error for me.

I should probably add that without the string column everything worked.

MarcoGorelli · 2020-05-07T15:27:36Z

Great, thanks @c-foschi! This reproduces:

import pandas as pd
import numpy as np
from io import StringIO

X = pd.read_csv(
    StringIO(
        """code,longitude,date
6342156,0.966747,a
6342156,0.756199,b
6342156,0.054222,c
6342156,0.743996,d
6342156,0.486753,a
6342156,0.464093,s
6342156,0.430592,d
2261019,0.827252,f
2261019,0.864456,f
2261019,0.866847,d"""
    )
)
X.set_index("code", inplace=True)
X.loc[2261019, "longitude"] = np.arange(3)

CloseChoice · 2020-05-07T22:32:05Z

Seems to work for me on master:

import pandas as pd
import numpy as np
from io import StringIO

X = pd.read_csv(
    StringIO(
        """code,longitude,date
6342156,0.966747,a
6342156,0.756199,b
6342156,0.054222,c
6342156,0.743996,d
6342156,0.486753,a
6342156,0.464093,s
6342156,0.430592,d
2261019,0.827252,f
2261019,0.864456,f
2261019,0.866847,d"""
    )
)
X.set_index("code", inplace=True)
X.loc[2261019, "longitude"] = np.arange(3)
print(X)
# result
         longitude date
code                   
6342156   0.966747    a
6342156   0.756199    b
6342156   0.054222    c
6342156   0.743996    d
6342156   0.486753    a
6342156   0.464093    s
6342156   0.430592    d
2261019   0.000000    f
2261019   1.000000    f
2261019   2.000000    d
# check version
pd.__version__
'1.1.0.dev0+1502.g3ed7dff48'

Do we need a test?

MarcoGorelli · 2020-05-08T10:30:56Z

Thanks @CloseChoice for checking, can confirm it works on master.

> Do we need a test?

I would think so, yes

> I should probably add that without the string column everything worked.

Thanks @c-foschi - yes, if we run this on 1.0.3 but add in

X = X.drop('date', axis=1)

then it works
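
For anyone stuck on an affected release who needs to keep the string column, one possible workaround is to assign by integer position instead of by label. This is only a sketch: it uses a shortened copy of the toy data above, the variable names `rows` and `col` are illustrative, and it has not been verified against 1.0.3 in this thread.

```python
import numpy as np
import pandas as pd
from io import StringIO

# Shortened copy of the toy data from this thread (assumption: fewer rows
# are enough to show the idea).
X = pd.read_csv(
    StringIO(
        """code,longitude,date
6342156,0.966747,a
6342156,0.756199,b
2261019,0.827252,f
2261019,0.864456,f
2261019,0.866847,d"""
    )
)
X = X.set_index("code")

# Integer positions of the rows that carry the duplicated label.
rows = np.flatnonzero(X.index == 2261019)
# Integer position of the target column.
col = X.columns.get_loc("longitude")

# Positional assignment, so no label lookup happens on the duplicated index.
X.iloc[rows, col] = np.arange(3)
print(X)
```

The length of `rows` must match the length of the assigned values, which is the same condition the original `.loc` call was expected to satisfy.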

CloseChoice · 2020-05-08T12:20:52Z

take

simonjayhawkins · 2020-05-08T16:13:31Z

fixed in #31897

3da053c is the first new commit
commit 3da053c
Author: jbrockmendel
Date: Mon Feb 17 16:15:45 2020 -0800

BUG: fix length_of_indexer with boolean mask (#31897)
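
The commit title mentions a boolean mask. As background, `Index.get_loc` returns a slice for a duplicated label in a sorted index and a boolean mask for a duplicated label in an unsorted one. Whether this is the exact code path behind the error reported above is an assumption, not something confirmed in this thread; the small indexes below are illustrative.

```python
import numpy as np
import pandas as pd

# Duplicated label in a sorted (monotonic increasing) index:
# get_loc returns one contiguous slice.
sorted_idx = pd.Index([1, 1, 2])
loc_sorted = sorted_idx.get_loc(1)

# Duplicated label in an unsorted index:
# get_loc returns a boolean mask as long as the whole index.
unsorted_idx = pd.Index([1, 2, 1])
loc_unsorted = unsorted_idx.get_loc(1)

print(type(loc_sorted).__name__, loc_sorted)
print(type(loc_unsorted).__name__, loc_unsorted)

# The number of selected rows is the count of True entries,
# not the length of the mask.
n_selected = int(np.count_nonzero(loc_unsorted))
print(n_selected, len(loc_unsorted))
```

Any length check that compares the assigned values against a mask has to count the `True` entries rather than take `len()` of the mask.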

MarcoGorelli added the Needs Info (Clarification about behavior needed to assess issue) label May 7, 2020

MarcoGorelli added this to the No action milestone May 7, 2020

MarcoGorelli removed this from the No action milestone May 7, 2020

MarcoGorelli added the Indexing (Related to indexing on series/frames, not to indexes themselves) label and removed the Needs Info (Clarification about behavior needed to assess issue) label May 7, 2020

github-actions bot assigned CloseChoice May 8, 2020

CloseChoice mentioned this issue May 8, 2020

add test for setitem from duplicate axis #34071 (Merged)

simonjayhawkins added the Needs Tests (Unit test(s) needed to prevent regressions) label May 8, 2020

simonjayhawkins added this to the 1.1 milestone May 8, 2020

mroeschke closed this as completed in #34071 May 11, 2020
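
A regression test for this behaviour could look roughly like the sketch below. It is not the test that was merged in #34071; the function name and the reduced data are assumptions made for illustration, and it expects a pandas version in which the assignment works (1.1 or later, per this thread).

```python
import numpy as np
import pandas as pd


def test_setitem_with_duplicated_index_and_mixed_dtypes():
    # Hypothetical test name. The frame mirrors the toy data in this thread:
    # a duplicated, unsorted integer index plus a float and a string column.
    df = pd.DataFrame(
        {
            "longitude": [0.966747, 0.756199, 0.827252, 0.864456, 0.866847],
            "date": ["a", "b", "f", "f", "d"],
        },
        index=pd.Index([6342156, 6342156, 2261019, 2261019, 2261019], name="code"),
    )

    # The assignment that raised ValueError on 0.24.2 and 1.0.3.
    df.loc[2261019, "longitude"] = np.arange(3)

    expected = pd.Series(
        [0.966747, 0.756199, 0.0, 1.0, 2.0],
        index=df.index,
        name="longitude",
    )
    pd.testing.assert_series_equal(df["longitude"], expected)
    # The untouched string column must be unchanged.
    assert df["date"].tolist() == ["a", "b", "f", "f", "d"]


test_setitem_with_duplicated_index_and_mixed_dtypes()
```

Keeping the string column in the frame matters here, since the report notes that the error disappeared once that column was dropped.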
