.loc[] assignment gives broadcast error #3777
Seems like it's how NumPy handles fancy indexing along multiple axes:

In [11]: x = np.arange(3**3).reshape((3, 3, 3))
In [12]: x.shape
Out[12]: (3, 3, 3)
In [15]: x[[0, 1], :, :].shape
Out[15]: (2, 3, 3)
In [16]: x[:, :, [0,1]].shape
Out[16]: (3, 3, 2)
In [17]: x[[0,1], :, [0,1]].shape
Out[17]: (2, 3)
In [20]: x[[0,1], :, 0:2].shape
Out[20]: (2, 3, 2)
Hah... I was just about to write back. It's straightforward to align the RHS (e.g. the other panel), but then I have to assign the values.
Possibly related: #3738
I suppose one can only fancy-index in one dimension at a time.
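For reference, a small NumPy-only illustration of that distinction (this is standard NumPy behavior, not anything pandas-specific): passing two index lists selects element pairs, while np.ix_ builds the cross product so that several axes can be fancy-indexed at once.

import numpy as np

x = np.arange(27).reshape((3, 3, 3))
x[[0, 1], :, [0, 1]].shape                   # (2, 3): the two lists are paired up
x[np.ix_([0, 1], range(3), [0, 1])].shape    # (2, 3, 2): full cross product of the axes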
It seems in my case there is a workaround:

wp.update(wp2.loc[['Item1', 'Item2'], :, ['A', 'B']])

Would there be any problem with this?
Ah, yes, that is a good solution. FYI, it will be a fair amount slower since it basically goes frame-by-frame, which then goes series-by-series. Another way to approach this is to break the panel into frames, copy/update them as needed, then concat back together, as sketched below.
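A minimal sketch of that frame-by-frame approach, assuming a pre-0.25 pandas where pd.Panel still exists and reusing the wp/wp2 names from the workaround above:

import pandas as pd

frames = {}
for item in wp.items:
    frame = wp[item].copy()          # pull each item out as a DataFrame
    if item in wp2.items:
        frame.update(wp2[item])      # overwrite with aligned values from the other panel
    frames[item] = frame

wp = pd.Panel(frames)                # reassemble the panel from the updated frames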
OK, thanks for the info! I might wait until 0.12 then if the performance hit is not too dramatic for my case.
I just noticed that this issue (or something related) silently caused one of my processing flows to assign erroneous data. I find myself constantly wanting to group on several columns, compute a new column based on the grouped data, and assign that result column back to the original DataFrame, indexing the original DataFrame by the group's MultiIndex name. This worked in 0.10.0: assignment to a DataFrame column indexed via .ix[] with a MultiIndex tuple and a column name, e.g. df.ix[(1, 2, 3, 4), 'new_col'] = np.arange(100). Now (in pandas 0.11.0) it just fills the entire target by repeating the first entry of the array. Is there any fix for this in a dev version? I might have to drop back down to 0.10.0.

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: print pd.__version__
0.11.0
In [4]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [5]: NUM_COLS = 10
In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [7]: index_cols = col_names[:5]
In [8]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [9]: df = df.set_index(index_cols).sort_index()
In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [11]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [12]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [13]: # Create new empty column in DataFrame
....: df['new_col'] = np.nan
....:
In [14]: # Loop through groups and assign original array to new_col column
....: for name, df2 in grp:
....: new_vals = np.arange(df2.shape[0])
....: print 'Group: ', name
....: print 'Expected:\n', pd.Series(new_vals).value_counts()
....: df.ix[name, 'new_col'] = new_vals #This used to work, but now only assigns the first number from the np.array
....: print '\nAssigned:\n', df.ix[name, 'new_col'].value_counts()
....: print '\n'
....:
Group: (0, 0, 0, 0)
Expected:
155 1
48 1
55 1
54 1
53 1
52 1
51 1
50 1
49 1
47 1
57 1
46 1
45 1
44 1
43 1
...
113 1
112 1
111 1
110 1
109 1
108 1
107 1
106 1
105 1
104 1
103 1
102 1
101 1
100 1
0 1
Length: 156, dtype: int64
Assigned:
0 156
dtype: int64
Group: (0, 0, 0, 1)
Expected:
147 1
54 1
52 1
51 1
50 1
49 1
48 1
47 1
46 1
45 1
44 1
43 1
42 1
41 1
40 1
...
108 1
107 1
106 1
105 1
104 1
103 1
101 1
94 1
100 1
99 1
98 1
97 1
96 1
95 1
0 1
Length: 148, dtype: int64
Assigned:
0 148
dtype: int64
@dragoljub I think the issue is related to #3668, which has been fixed in master and will be in the upcoming 0.11.1 (very soon). The issue described here is specifically about Panel <-> Panel assignment via ix/loc.
Any idea if the nightly builds at http://pandas.pydata.org/pandas-build/dev/ have this update? I have not seen any updated binary since April.
I'm seeing this issue again:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: print pd.__version__
0.12.0
In [4]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [5]: NUM_COLS = 10
In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [7]: index_cols = col_names[:5]
In [8]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [9]: df = df.set_index(index_cols).sort_index()
In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [11]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [12]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [13]: # Create new empty column in DataFrame
....: df['new_col'] = np.nan
....:
In [14]: # Loop through groups and assign original array to new_col column
....: for name, df2 in grp:
....: new_vals = np.arange(df2.shape[0])
....: print 'Group: ', name
....: print 'Expected:\n', pd.Series(new_vals).value_counts()
....: df.ix[name, 'new_col'] = new_vals #This used to work, but now only assigns the first number from the np.array
....: print '\nAssigned:\n', df.ix[name, 'new_col'].value_counts()
....: print '\n'
....:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-06123ee45450> in <module>()
4 print 'Group: ', name
5 print 'Expected:\n', pd.Series(new_vals).value_counts()
----> 6 df.ix[name, 'new_col'] = new_vals #This used to work, but now only assigns the first number from the np.array
7 print '\nAssigned:\n', df.ix[name, 'new_col'].value_counts()
8 print '\n'
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __setitem__(self, key, value)
86 indexer = self._convert_to_indexer(key)
87
---> 88 self._setitem_with_indexer(indexer, value)
89
90 def _has_valid_tuple(self, key):
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _setitem_with_indexer(self, indexer, value)
156 # we have an equal len list/ndarray
157 elif len(labels) == 1 and (
--> 158 len(self.obj[labels[0]]) == len(value) or len(plane_indexer[0]) == len(value)):
159 setter(labels[0], value)
160
TypeError: object of type 'slice' has no len()
Group: (0, 0, 0, 0)
Expected:
162 1
40 1
58 1
57 1
56 1
55 1
54 1
53 1
52 1
51 1
50 1
49 1
48 1
47 1
46 1
...
117 1
116 1
115 1
114 1
113 1
111 1
103 1
110 1
109 1
108 1
107 1
106 1
105 1
104 1
0 1
Length: 163, dtype: int64
@dragoljub this is a bug, fixing in master... but I wouldn't do it this way anyhow; do something like the following.
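Jeff's code sample did not survive in this extract; the sketch below only illustrates the apply-based pattern he describes in the next few comments, reusing the df and grp objects from the transcript above (the exact index alignment of the result can differ between pandas versions):

def grp_func(g):
    """Compute the per-group values inside the apply instead of assigning with .ix."""
    return pd.Series(np.arange(len(g)), index=g.index, name='new_col')

out = grp.apply(grp_func)      # one Series covering every row, in group order
df['new_col'] = out.values     # df is already sorted by the group levels, so order matches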
Looks interesting. The ultimate goal I have is to apply various machine learning/clustering algorithms across a DataFrame's subgroups, where each algorithm returns a series of results (one for each row in a subgroup). I think the ultimate solution is the flexible apply (http://pandas.pydata.org/pandas-docs/dev/groupby.html#flexible-apply): df.groupby(['a','b','c','d']).apply(my_ml_function), where the applied function returns the result series and the groupby apply then combines everything into a DataFrame with a new column 'my_ml_function' holding the results of each application. Maybe I just need to think about it differently and expect the result of the flexible apply to be one long series of results, then join that back to the original DataFrame. It would be nice to have the option to simply augment the DataFrame with the column I wish to add. :) Then one command could run a bunch of analytics across subgroups while keeping the results joined to the original data, for plotting etc.
You can do this now. What is too complicated is your assignment method. Just compute what you need in the apply (which is essentially what I did, but more 'manually') and create the resulting structure.
@dragoljub see PR #4766 for the fixes for this issue (on master).
WOW! df.apply() is awesome! Last time I checked, apply did not work with a MultiIndex. For anyone who is interested, here is an example of applying ~625 clustering jobs across 625 groups in a 100k-row DataFrame. The syntax is great and the semantics are powerful. There is just one little thing: there is no lazy way to add a named Series as a column to a DataFrame in place. We have a df.pop() method; maybe we need a df.push() method to do something as simple as df[series.name] = series. df.append() takes a DataFrame, takes longer, and returns a copy. Why does pandas seem to be doing away with in-place methods?

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from sklearn.cluster import DBSCAN as DBSCAN
In [4]: print pd.__version__
0.12.0
In [5]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [6]: NUM_COLS = 10
In [7]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [8]: index_cols = col_names[:5]
In [9]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [10]: df = df.set_index(index_cols).sort_index()
In [11]: df
Out[11]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [12]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [13]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [14]: # Create function to apply clustering on groups
....: def grp_func(df):
....: """Run clustering on subgroup and return series of results."""
....: db = DBSCAN(eps=1, min_samples=1, metric='euclidean').fit(df.values)
....: return pd.Series(db.labels_, name='cluster_id')
....:
....: # Apply clustering on each subgroup of DataFrame
....:
In [15]: %time out = grp.apply(grp_func)
CPU times: user 33.32 s, sys: 0.00 s, total: 33.32 s
Wall time: 33.32 s
In [16]: # Add Cluster ID column to original df, too bad this creates a copy...
....: %timeit df.join(out)
....:
1 loops, best of 3: 217 ms per loop
In [17]: # Much faster but you have to specify column name :(
....: %timeit df['cluster_id'] = out
....:
10 loops, best of 3: 48.3 ms per loop
try
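(The snippet that followed "try" is not preserved in this extract; from the reply below it evidently pointed at DataFrame.insert, roughly:)

df.insert(len(df.columns), out.name, out)   # append the named Series as the last column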
Thanks! Not sure how I missed insert. Still, having to specify the location and the column name is too much work. 😸 Anyone for adding a df.push() method? df.insert(len(df.columns), series.name, series). Or maybe we could just add defaults for the loc and column parameters?
use
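(Again, the snippet after "use" is not preserved; judging from the reply below it was presumably the join-based variant, something like:)

df = df.join(out)   # index-aligned, but returns a copy of the frame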
Join returns a copy (which I don't need) and therefore appears to be about 4x slower than simply inserting a column with alignment, unless of course the assignment does not align on the MultiIndex.

In [16]: # Add Cluster ID column to original df, too bad this creates a copy...
....: %timeit df.join(out)
....:
1 loops, best of 3: 217 ms per loop
In [17]: # Much faster but you have to specify column name :(
....: %timeit df['cluster_id'] = out
....:
10 loops, best of 3: 48.3 ms per loop
Assignment does not copy (though it may do an internal copy); e.g. if you already have float data and add a float column, it will 'copy' it, but that's pretty cheap. join will copy for sure. In-place operations are not generally a good thing; state can change when least expected, so avoid them if at all possible. Why is it a problem to specify the column name? That seems a natural thing IMHO. Also, your timings on the assignment are not valid (it happens to be faster, but not that much faster), because only the first run is a valid timing; after that it is a set, not an insert (which is where the copying happens). A set will just overwrite the data, but an insert can copy it. You need to do the timings in a separate function, where you first copy the frame, then do the action on it.
Interesting, good info Jeff. I guess I'm trying to avoid specifying the column name again because my apply function already sets the series name, and I want that to just propagate. I profiled it again with the functions you described and got similar results.

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from sklearn.cluster import DBSCAN as DBSCAN
In [4]: print pd.__version__
0.12.0
In [5]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [6]: NUM_COLS = 10
In [7]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [8]: index_cols = col_names[:5]
In [9]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [10]: df = df.set_index(index_cols).sort_index()
In [11]: df
Out[11]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [12]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [13]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [14]: # Create function to apply clustering on groups
....: def grp_func(df):
....: """Run clustering on subgroup and return series of results."""
....: db = DBSCAN(eps=1, min_samples=1, metric='euclidean').fit(df.values)
....: return pd.Series(db.labels_, name='cluster_id')
....:
....: # Apply clustering on each subgroup of DataFrame
....:
In [15]: %time out = grp.apply(grp_func)
CPU times: user 34.27 s, sys: 0.00 s, total: 34.27 s
Wall time: 34.27 s
In [16]: # Add Cluster ID column to original df, too bad this creates a copy...
....: %timeit df.join(out)
....:
1 loops, best of 3: 232 ms per loop
In [17]: # Much faster but you have to specify column name :(
....: #%timeit df[out.name] = out
....:
....: # Here is another way to insert
....: #%time df.insert(len(df.columns), out.name + '_2', out)
....:
....: def insert_col(df, ser):
....: df2 = df.copy()
....: ser2 = ser.copy()
....: df2[ser2.name] = ser2
....: return df2
....:
....:
In [18]: def join_col(df, ser):
....: df2 = df.copy()
....: ser2 = ser.copy()
....: df2.join(ser2)
....: return df2
....:
In [19]: %timeit dfa = insert_col(df, out)
10 loops, best of 3: 53.5 ms per loop
In [20]: %timeit dfb = join_col(df, out)
1 loops, best of 3: 223 ms per loop
Closing, as Panel is deprecated.
From the original issue description: those should be the same shape, so I don't see why this assignment shouldn't work. This is with pandas 0.11.0.
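The original code block did not survive this extract; judging from the workaround discussed above (wp.update(wp2.loc[...])), the failing statement was presumably a Panel-to-Panel cross-section assignment of the following form, with wp/wp2 and the labels taken from that comment, so treat this as a hypothetical reconstruction:

# pandas 0.11, pd.Panel: both sides have the same shape, yet the assignment
# raised a broadcast error (hypothetical reconstruction of the reported code)
wp.loc[['Item1', 'Item2'], :, ['A', 'B']] = wp2.loc[['Item1', 'Item2'], :, ['A', 'B']]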