setting values in a dataframe with duplicated keys #34034

Closed
c-foschi opened this issue May 6, 2020 · 13 comments · Fixed by #34071
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@c-foschi

c-foschi commented May 6, 2020

I have a dataframe with duplicated keys in its index, and I need to assign values to a column of that dataframe for some of the keys. Dataframe indexes support duplicated keys, so I assumed this would be easy; otherwise I don't see why dataframes should be allowed to have duplicated keys. However, setting the values like this:

df.loc[key, 'column']= vector

gives me the following error:

ValueError: Must have equal len keys and value when setting with an iterable

even if the number of times key appears in the index of df is equal to the length of vector. I think this should be fixed.

Thank you,
c. foschi

@dsaxton
Member

dsaxton commented May 7, 2020

@c-foschi Can you provide an example that can be copy / pasted which shows the problem? It seems that this functionality does work:

In [1]: df = pd.DataFrame([1, 2, 3], index=[1, 1, 2])

In [2]: df
Out[2]:
   0
1  1
1  2
2  3

In [3]: df.loc[1, 0] = [9, 9]

In [4]: df
Out[4]:
   0
1  9
1  9
2  3

@MarcoGorelli MarcoGorelli added the Needs Info Clarification about behavior needed to assess issue label May 7, 2020
@c-foschi
Author

c-foschi commented May 7, 2020

@dsaxton @MarcoGorelli I tried to reproduce the error with toy examples but I failed many times. Still, every time I run my code, the same error appears. Here it is, I hope it makes some sense to you:

In:

df= pd.read_csv('earthquake_signals.txt')
df.set_index('code', inplace= True)
df.index

Out:

Int64Index([6342156, 6342156, 6342156, 6342156, 6342156, 6342156, 6342156,
        6342156, 6342156, 6342156,
        ...
        1256211, 1256211, 1256211, 1256211, 1256211, 1256211, 1256211,
        1256211, 1256211, 1256211],
       dtype='int64', name='code', length=389750)

In:

len(df.loc[4848666, 'longitude'])

Out:

9

In:

df.loc[4848666, 'longitude']= np.arange(9)

Out:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-58-e7eaa328acef> in <module>
----> 1 df.loc[4848666, 'longitude']= np.arange(9)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
    188             key = com.apply_if_callable(key, self.obj)
    189         indexer = self._get_setitem_indexer(key)
--> 190         self._setitem_with_indexer(indexer, value)
    191 
    192     def _validate_key(self, key, axis):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _setitem_with_indexer(self, indexer, value)
    609 
    610                     if len(labels) != len(value):
--> 611                         raise ValueError('Must have equal len keys and value '
    612                                          'when setting with an iterable')
    613 

ValueError: Must have equal len keys and value when setting with an iterable

@MarcoGorelli
Member

What version of pandas are you using?

@c-foschi
Author

c-foschi commented May 7, 2020

According to pd.__version__, I am on version 0.24.2, which may well be outdated.

@MarcoGorelli
Member

Yes, please upgrade to the latest version (1.0.3)

@MarcoGorelli MarcoGorelli added this to the No action milestone May 7, 2020
@c-foschi
Author

c-foschi commented May 7, 2020

Done. Same error occurs:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-e7eaa328acef> in <module>
----> 1 df.loc[4848666, 'longitude']= np.arange(9)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
    669             key = com.apply_if_callable(key, self.obj)
    670         indexer = self._get_setitem_indexer(key)
--> 671         self._setitem_with_indexer(indexer, value)
    672 
    673     def _validate_key(self, key, axis: int):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _setitem_with_indexer(self, indexer, value)
   1017                     if len(labels) != len(value):
   1018                         raise ValueError(
-> 1019                             "Must have equal len keys and value "
   1020                             "when setting with an iterable"
   1021                         )

ValueError: Must have equal len keys and value when setting with an iterable

@MarcoGorelli
Member

OK, once you read in 'earthquake_signals.txt', can you try making the dataframe as small as possible (by only considering a subset of all rows) such that the error reproduces, and then post here a reproducible example using that data?

@c-foschi
Author

c-foschi commented May 7, 2020

OK, I finally managed to create a toy example with random values.

I have a file like this:

code,longitude,date
6342156,0.966747,a
6342156,0.756199,b
6342156,0.054222,c
6342156,0.743996,d
6342156,0.486753,a
6342156,0.464093,s
6342156,0.430592,d
2261019,0.827252,f
2261019,0.864456,f
2261019,0.866847,d

And my code is:

import numpy as np
import pandas as pd

X = pd.read_csv('toy.csv')
X.set_index('code', inplace=True)
X.loc[2261019, 'longitude'] = np.arange(3)

This raises the error for me.

I should probably add that without the string column everything worked.
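
For anyone stuck on an affected release, one possible workaround is to skip `.loc` assignment and write into a copy of the column's values instead. This is only a sketch and was not proposed by anyone in this thread; the shortened data and the names `mask`, `new_values` and `longitude` are made up for illustration, and it has not been checked against 0.24.2 or 1.0.3 specifically.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the toy file above: duplicated index plus a string column.
X = pd.DataFrame(
    {
        "longitude": [0.966747, 0.756199, 0.827252, 0.864456, 0.866847],
        "date": ["a", "b", "f", "f", "d"],
    },
    index=pd.Index([6342156, 6342156, 2261019, 2261019, 2261019], name="code"),
)

new_values = np.arange(3)

# Boolean mask of the rows whose index label equals the key.
mask = X.index == 2261019

# Guard against a genuine length mismatch before writing anything.
assert mask.sum() == len(new_values)

# Modify a copy of the column's values with plain NumPy, then put the column back.
longitude = X["longitude"].to_numpy().copy()
longitude[mask] = new_values
X["longitude"] = longitude
```

Because the write happens on a NumPy array, the `.loc` setitem path that raises the error is never reached; the cost is that the whole column is replaced rather than updated in place.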

@MarcoGorelli
Member

Great, thanks @c-foschi! This reproduces the error:

import pandas as pd
import numpy as np
from io import StringIO

X = pd.read_csv(
    StringIO(
        """code,longitude,date
6342156,0.966747,a
6342156,0.756199,b
6342156,0.054222,c
6342156,0.743996,d
6342156,0.486753,a
6342156,0.464093,s
6342156,0.430592,d
2261019,0.827252,f
2261019,0.864456,f
2261019,0.866847,d"""
    )
)
X.set_index("code", inplace=True)
X.loc[2261019, "longitude"] = np.arange(3)

@MarcoGorelli MarcoGorelli removed this from the No action milestone May 7, 2020
@MarcoGorelli MarcoGorelli added Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Info Clarification about behavior needed to assess issue labels May 7, 2020
@CloseChoice
Member

CloseChoice commented May 7, 2020

This seems to work for me on master:

import pandas as pd
import numpy as np
from io import StringIO

X = pd.read_csv(
    StringIO(
        """code,longitude,date
6342156,0.966747,a
6342156,0.756199,b
6342156,0.054222,c
6342156,0.743996,d
6342156,0.486753,a
6342156,0.464093,s
6342156,0.430592,d
2261019,0.827252,f
2261019,0.864456,f
2261019,0.866847,d"""
    )
)
X.set_index("code", inplace=True)
X.loc[2261019, "longitude"] = np.arange(3)
print(X)
# result
         longitude date
code                   
6342156   0.966747    a
6342156   0.756199    b
6342156   0.054222    c
6342156   0.743996    d
6342156   0.486753    a
6342156   0.464093    s
6342156   0.430592    d
2261019   0.000000    f
2261019   1.000000    f
2261019   2.000000    d
# check version
pd.__version__
'1.1.0.dev0+1502.g3ed7dff48'

Do we need a test?

@MarcoGorelli
Member

MarcoGorelli commented May 8, 2020

Thanks @CloseChoice for checking, can confirm it works on master.

> Do we need a test?

I would think so, yes

> I should probably add that without the string column everything worked.

Thanks @c-foschi - yes, if we run this on 1.0.3 but add in

X = X.drop('date', axis=1)

then it works
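
A regression test for this case could look roughly like the sketch below. It is not the test that was actually added to pandas; the function name is hypothetical, and the data simply mirrors the toy file from this thread (duplicated index labels plus a string column).

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_loc_setitem_duplicated_index_with_string_column():
    # Hypothetical test name; data copied from the toy example in this issue.
    index = pd.Index([6342156] * 7 + [2261019] * 3, name="code")
    df = pd.DataFrame(
        {
            "longitude": [
                0.966747, 0.756199, 0.054222, 0.743996, 0.486753,
                0.464093, 0.430592, 0.827252, 0.864456, 0.866847,
            ],
            "date": list("abcdasdffd"),
        },
        index=index,
    )

    # Expected result: only the three rows labelled 2261019 change.
    expected = df.copy()
    expected["longitude"] = [
        0.966747, 0.756199, 0.054222, 0.743996, 0.486753,
        0.464093, 0.430592, 0.0, 1.0, 2.0,
    ]

    # The assignment that raised ValueError on 1.0.3.
    df.loc[2261019, "longitude"] = np.arange(3)

    tm.assert_frame_equal(df, expected)
```

Keeping the string column in the frame matters here, since the error did not reproduce once `date` was dropped.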

@CloseChoice
Member

take

@simonjayhawkins simonjayhawkins added the Needs Tests Unit test(s) needed to prevent regressions label May 8, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1 milestone May 8, 2020
@simonjayhawkins
Member

fixed in #31897

3da053c is the first new commit
commit 3da053c
Author: jbrockmendel
Date: Mon Feb 17 16:15:45 2020 -0800

BUG: fix length_of_indexer with boolean mask (#31897)
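
As background on the "boolean mask" in that commit title: `Index.get_loc` returns different kinds of indexer for a duplicated label depending on whether the index is sorted. The sketch below only shows that documented behaviour. Connecting it to this bug is an inference from the commit title, not something stated in the thread, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Duplicated labels in a monotonically increasing index: get_loc returns a slice.
sorted_idx = pd.Index([1, 1, 2])
loc_sorted = sorted_idx.get_loc(1)

# Duplicated labels in a non-monotonic index: get_loc returns a boolean mask.
unsorted_idx = pd.Index([1, 2, 1])
loc_unsorted = unsorted_idx.get_loc(1)

print(type(loc_sorted).__name__, loc_sorted)
print(type(loc_unsorted).__name__, loc_unsorted)
```

The first example in this thread used the sorted index `[1, 1, 2]`, whereas the `code` index in the failing data is not increasing, which may be part of why the small example worked while the real data did not.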

5 participants