No way to construct mixed dtype DataFrame without total copy, proposed solution #9216
you can simply create an empty frame with an index and columns; you could create these with np.empty if you wish |
Looks like it's copying to me. Unless the code I wrote isn't what you meant, or the copying that occurred is not the copy you thought I was trying to elide. |
you changed the dtype |
The dtype is object, so it's changed regardless of whether I use a float or an int. |
Constructing w/o a copy on mixed type could be done but is quite tricky. The problem is some types require a copy (e.g. object, to avoid memory contention issues). And the internal structure consolidates different types, so adding a new type will necessitate a copy. Avoiding a copy is pretty difficult in most cases. You should just create what you need, get pointers to the data and then overwrite it. Why is that a problem? |
The problem is that in order to create what I need, I have to copy in stuff of the correct dtype, the data of which I have no intention of using. Even assuming that your suggestion of creating an empty DataFrame uses no significant RAM, this doesn't alleviate the cost of copying. If I want to create a 1 gigabyte DataFrame and populate it somewhere else, I'll have to pay the cost of copying a gigabyte of garbage around in memory, which is completely needless. Do you not see this as a problem? Yes, I understand that the internal structure consolidates different types. I'm not sure exactly what you mean by memory contention issues, but in any case objects are not really what's of interest here. Actually, while avoiding copies in general is a hard problem, avoiding them in the way I suggested is fairly easy because I'm supplying all the necessary information from the get-go. It's identical to constructing from data, except that instead of inferring the dtypes and the # of rows from data and copying the data, you specify the dtypes and # of rows directly, and do everything else exactly as you would have done minus the copy. You need an "empty" constructor for every supported column type. For numpy numeric types this is obvious, it needs non-zero work for Categorical, unsure about DatetimeIndex. |
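For the plain numpy numeric dtypes, a minimal sketch of what such a `from_empty` helper could look like on top of today's public API (the helper below is hypothetical, not part of pandas, and whether the allocated arrays are actually reused without a copy depends on the pandas version):

```python
import numpy as np
import pandas as pd

def from_empty(index, columns, dtypes):
    # Hypothetical helper: allocate uninitialized storage per column, wrap it in
    # Series, then assemble the frame with copy=False so nothing is duplicated.
    data = {
        col: pd.Series(np.empty(len(index), dtype=dt), index=index, copy=False)
        for col, dt in zip(columns, dtypes)
    }
    return pd.DataFrame(data, copy=False)

df = from_empty(np.arange(100), ['x', 'y', 'z'],
                [np.float64, np.int64, np.float32])
print(df.dtypes)  # dtypes are preserved; the contents are whatever garbage np.empty left
```

Categorical and DatetimeIndex columns would need their own allocation path, as noted above.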
passing a dict to the constructor and copy=False should work |
So this will work. But you have to be SURE that the arrays that you are passing are distinct dtypes. And once you do anything to this it could copy the underlying data. So YMMV. You can of course pass in |
Initial attempt deleted since it does not work. Have to use this method, which makes the attempt a little less satisfactory:

```python
arr = np.empty(1, dtype=[('x', np.float), ('y', np.int)])
df = pd.DataFrame.from_records(arr).reindex(np.arange(100))
```

If you are really worried about performance, I'm not sure why one wouldn't just use numpy as much as possible since it is conceptually much simpler. |
jreback, thank you for your solution. This seems to work, even for Categoricals (which surprised me). If I encounter issues I'll let you know. I'm not sure what you mean by: if you do anything to this, it could copy. What do you mean by anything? Unless there are COW semantics, I would think what you see is what you get with regards to deep vs shallow copies, at construction time. I still think a from_empty constructor should be implemented, and I don't think it would be that difficult; while this technique works, it does involve a lot of code overhead. In principle this could be done by specifying a single composite dtype and a number of rows.

bashtage, these solutions still write into the entire DataFrame. Since writing is generally slower than reading, this means at best it saves less than half the overhead in question. Obviously if I haven't gone and used numpy, it's because pandas has many awesome features and capabilities that I love, and I don't want to give those up. Were you really asking, or just implying that I should use numpy if I don't want to take this performance hit? |
Scratch this, please, user error, and my apologies. reindex_axis with copy=False worked perfectly. |
True, but all that you need to a new
It was a bit rhetorical - although also a serious suggestion from a performance point of view since numpy makes it much easier to get close to the data-as-a-blob-of-memory access that is important if you are trying to write very high performance code. You can always convert from numpy to pandas when code simplicity is more important than performance. |
I see what you are saying. I still think it should more cleanly be part of the interface rather than a workaround, but as workarounds go it is a good one and easy to implement. Pandas still emphasizes performance as one of its main objectives. Obviously it has higher level features compared to numpy, and those have to be paid for. What we're talking about has nothing to do with those higher level features, and there's no reason why one should be paying for massive copies in places where you don't need them. Your suggestion would be appropriate if someone was making a stink about the cost of setting up the columns, index, etc, which is completely different from this discussion. |
I think you are overestimating the cost of writing vs. the cost of allocating memory in Python -- the expensive part is the memory allocation. The object creation is also expensive.

Both allocate 1GB of memory, one empty and one zeros.

```python
%timeit np.empty(1, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 2.44 µs per loop

%timeit np.zeros(1, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 2.47 µs per loop

%timeit np.zeros(50000000, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 11.7 µs per loop

%timeit np.empty(50000000, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 11.4 µs per loop
```

3µs for zeroing 150,000,000 values. Now compare these for a trivial DataFrame.

```python
%timeit pd.DataFrame([[0]])
1000 loops, best of 3: 426 µs per loop
```

Around 200 times slower for trivial. But it is far worse for larger arrays.

```python
%timeit pd.DataFrame(np.empty((50000000, 3)), copy=False)
1 loops, best of 3: 275 ms per loop
```

Now it takes 275ms -- note that this is not copying anything. The cost is in setting up the index, etc., which is clearly very slow when the array is nontrivially big.

This feels like a premature optimization to me since the other overheads in pandas are so large that the malloc + filling component is near 0 cost. It seems that if you want to allocate anything in a tight loop it must be a numpy array for performance reasons. |
ok, here's what I think we should do, @quicknir if you'd like to make some improvements. 2 issues.
This is slightly non-trivial but would then allow one to pass in an already created ndarray (could be empty) with mixed types pretty easily. Note that this would likely (in a first pass implementation) handle only (int/float/string), as datetime/timedelta need special sanitizing and would make this slightly more complicated. So @bashtage is right from a perf perspective. It makes a lot of sense to simply construct the frame as you want then modify the ndarrays (but you MUST do this by grabbing the blocks, otherwise you will get copies). What I meant above is this. Pandas groups any like-dtype (e.g. int64, int32 are different) into a 'block' (2-d in a frame). These are a contiguous memory ndarray (that is newly allocated, unless it is simply passed in, which only currently works for a single dtype). If you then do a setitem, e.g. |
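A minimal sketch of the "grab the blocks and modify the ndarrays" approach described above (the `_mgr`/`_data` attributes and `.blocks` are private pandas internals and can change between versions, so treat this as illustrative only):

```python
import numpy as np
import pandas as pd

# Build a mixed-dtype frame however is convenient; the contents will be overwritten.
df = pd.DataFrame({'a': np.zeros(5), 'b': np.zeros(5), 'c': np.zeros(5, dtype=np.int64)})

mgr = df._mgr if hasattr(df, '_mgr') else df._data   # BlockManager (attribute name varies by version)
for blk in mgr.blocks:
    # blk.values is the 2-D ndarray backing all columns of this dtype; writing into
    # it fills the DataFrame in place without allocating or copying anything.
    blk.values[:] = 42

print(df)
```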
Even with an optimized constructor I think this can only be considered a convenience function, and not a performance issue. It could be useful to initialize a mixed type Panel, e.g.

```python
dtype = np.dtype([('GDP', np.float64), ('Population', np.int64)])
pd.Panel(items=['AU','AT'],
         major_axis=['1972','1973'],
         minor_axis=['GDP','Population'],
         dtype=[np.float, np.int64])
```
|
this is only an API / convenience issue; agreed the perf is really an incidental issue (and not the driver) |
```python
%timeit pd.DataFrame(np.empty((100, 1000000)))
%timeit pd.DataFrame(np.empty((100, 1000000)), copy=True)
```

So copying into a DataFrame seems to take 20 times longer than all the other work involved in creating the DataFrame, i.e. the copy (and extra allocation) is 95% of the time. The benchmarks you did do not benchmark the correct thing. Whether the copy itself or the allocation is what's taking time doesn't really matter; the point is that if I could avoid copies for a multiple dtype DataFrame the way I can for a single dtype DataFrame, I could save a huge amount of time.

Your two order of magnitude reasoning is also deceiving. This is not the only operation being performed; there are other operations being performed that take time, like disk reads. Right now, the extra copy I need to do to create the DataFrame is taking about half the time in my simple program that just reads the data off disk and into a DataFrame. If it took 1/20th as much time, then the disk read would be dominant (as it should be) and further improvements would have almost no effect. So I want to again emphasize to both of you: this is a real performance issue.

jreback, given that the concatenation strategy does not work for Categoricals, I don't think the improvements you suggested above will work. I think a better starting point would be reindex. The issue right now is that reindex does lots of extra stuff. But in principle, a DataFrame with zero rows has all the information necessary to allow the creation of a DataFrame with the correct number of rows, without doing any unnecessary work. Btw, this makes me really feel like pandas needs a schema object, but that's a discussion for another day. |
I think we will have to agree to disagree. IMO DataFrames are not extreme performance objects in the numeric ecosystem, as shown by the order of magnitude difference between a basic numpy array and a DataFrame creation.

```python
%timeit np.empty((1000000, 100))
1000 loops, best of 3: 1.61 ms per loop

%timeit pd.DataFrame(np.empty((1000000,100)))
100 loops, best of 3: 15.3 ms per loop
```

I think this is even less reason to care about DataFrame performance -- even if you can make it 100% free, the total program time only declines by 50%. I agree that there is scope for you to do a PR here to resolve this issue, whether you want to think of it as a performance issue or as a convenience issue. From my POV, I see it as the latter, since I will always use a numpy array when I care about performance. Numpy does other things, like not using a block manager, which is relatively efficient for some things (like growing the array by adding columns) but bad from other points of view.

There could be two options. The first, an empty constructor as in the example I gave above. This would not copy anything, but would probably null-fill to be consistent with other things in pandas. Null filling is pretty cheap and is not at the root of the problem IMO. The other would be to have a method

```python
DataFrame.from_blocks([np.empty((100,2)),
                       np.empty((100,3), dtype=np.float32),
                       np.empty((100,1), dtype=np.int8)],
                      columns=['f8_0','f8_1','f4_0','f4_1','f4_2','i1_0'],
                      index=np.arange(100))
```

A method of this type would enforce that the blocks have compatible shape, that all blocks have unique types, as well as the usual checks for the shape of the index and columns. This type of method would do nothing to the data and would use it in the BlockManager. |
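A user-space approximation of such a `from_blocks` method, assembled from the existing API (a sketch only: the method itself does not exist in pandas, and whether the concat really avoids copying the per-dtype arrays depends on the version and on block consolidation):

```python
import numpy as np
import pandas as pd

def from_blocks(blocks, columns, index):
    # Hypothetical helper: wrap each per-dtype 2-D array in its own single-block
    # frame (no copy for a single dtype), then glue the frames together.
    frames, start = [], 0
    for blk in blocks:
        ncols = blk.shape[1]
        frames.append(pd.DataFrame(blk, index=index,
                                   columns=columns[start:start + ncols], copy=False))
        start += ncols
    return pd.concat(frames, axis=1, copy=False)

df = from_blocks([np.empty((100, 2)),
                  np.empty((100, 3), dtype=np.float32),
                  np.empty((100, 1), dtype=np.int8)],
                 columns=['f8_0', 'f8_1', 'f4_0', 'f4_1', 'f4_2', 'i1_0'],
                 index=np.arange(100))
print(df.dtypes)
```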
@quicknir you are trying to combine pretty complicated things. Categoricals don't exist in numpy; rather they are a compound dtype-like construct that is specific to pandas. You have to construct and assign them separately (which is actually quite cheap - these are not combined into blocks like other singular dtypes). @bashtage's solution seems reasonable. This could provide some simple checks and simply pass thru the data (and be called by the other internal routines). Normally the user need not concern themselves with the internal repr. Since you really really want to, then you need to be cognizant of this. All that said, I am still not sure why you don't just create a frame exactly like you want. Then grab the block pointers and change the values. It costs the same memory, and as @bashtage points out it is pretty cheap to create essentially a null frame (that has all of the dtype, index, columns) already set. |
Not sure what you mean by the empty constructor, but if you mean constructing a dataframe with no rows and the desired schema and calling reindex, this is the same amount of time as creating with copy=True. Your second proposal is reasonable, but only if you can figure out how to do Categoricals. On that subject, I was going through the code and I realized that Categoricals are non-consolidatable. So on a hunch, I created an integer array and two categorical Series; I then created three DataFrames and concatenated all three. Sure enough, it did not perform a copy even though two of the DataFrames had the same dtype. I will try to see how to get this to work for DatetimeIndex. @jreback I still do not follow what you mean by create the frame exactly like you want. |
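A rough reconstruction of that experiment (the column names and values here are illustrative, and whether the copy is actually elided depends on the pandas version and on block consolidation):

```python
import numpy as np
import pandas as pd

df_int = pd.DataFrame({'i': np.arange(5)})
df_cat1 = pd.DataFrame({'c1': pd.Categorical(['a', 'b', 'a', 'b', 'a'])})
df_cat2 = pd.DataFrame({'c2': pd.Categorical(['x', 'x', 'y', 'y', 'x'])})

combined = pd.concat([df_int, df_cat1, df_cat2], axis=1, copy=False)

# If the concat did not copy, the integer column of the result still shares
# memory with the integer column of the source frame.
print(np.shares_memory(combined['i'].values, df_int['i'].values))
```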
@quicknir why don't you show a code/pseudo-code sample of what you are actually trying to do. |
The previous code was constructing a dictionary of numpy arrays first, and then constructing a DataFrame from that, which was copying everything. About half the time was being spent on that. So I am trying to change it to this scheme. The thing is that constructing df as above, even when you don't care about the contents, is extremely expensive. |
@quicknir dict of np arrays requires lots of copying. You should simply do this:
if you do this by dtype it will be cheap then.
as these block values are views into numpy arrays |
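To illustrate the "views into numpy arrays" point, a minimal sketch (whether a view or a copy is made can vary by pandas version and settings):

```python
import numpy as np
import pandas as pd

arr = np.empty((100, 3))             # a single dtype -> a single block
df = pd.DataFrame(arr, copy=False)

# When no copy was needed, the frame's block is just a view of arr, so writes
# to arr are visible through the DataFrame (and vice versa).
print(np.shares_memory(df.values, arr))
arr[:] = 7.0
print(df.iloc[0, 0])                 # 7.0 if the block is indeed a view
```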
The composition of types isn't known in advance in general, and in the most common use case there is a healthy mix of floats and ints. I guess I don't follow how this will be cheap: if I have 30 float columns and 10 int columns, then yes, the floats will be very cheap. But when you do the ints, unless there is some way to do them all at once that I'm missing, each time you add one more column of ints it will cause the entire int block to be reallocated. The solution you gave me previously is close to working; I can't seem to make it work out for DatetimeIndex. |
An empty constructor would look like

```python
dtype = np.dtype([('a', np.float64), ('b', np.int64), ('c', np.float32)])
df = pd.DataFrame(columns=['a','b','c'], index=np.arange(100), dtype=dtype)
```

This would produce the same output as

```python
dtype = np.dtype([('a', np.float64), ('b', np.int64), ('c', np.float32)])
arr = np.empty(100, dtype=dtype)
df = pd.DataFrame.from_records(arr, index=np.arange(100))
```

only it wouldn't copy data. Basically the constructor would allow a mixed dtype for the following call, which works but only for a single basic dtype.

```python
df = pd.DataFrame(columns=['a','b','c'], index=np.arange(100), dtype=np.float32)
```

The only other feature would be to prevent it from null-filling int arrays, which has the side effect of converting them to object dtype since there is no missing value for ints. |
@jreback It's not reinventing the wheel, as the dtype spec has several major problems:
In short, the dtype of the record representation is more of a workaround for the lack of a proper spec. It lacks several key features and is much poorer performance-wise. |
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
- closes pandas-dev#8571 by defining __copy__/__deepcopy__
There are many many threads on SO asking for this feature. It seems to me that all these problems stem from BlockManager consolidating separate columns into single memory chunks (the 'blocks'). I have a non-consolidating monkey-patched BlockManager:
You can now pass a dict of Series and copy=False and should be OK |
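A short illustration of that usage (hedged: whether the underlying buffers are actually reused without a copy depends on the pandas version):

```python
import numpy as np
import pandas as pd

x = np.empty(100)                    # pre-allocated, uninitialized storage
y = np.empty(100, dtype=np.int64)

df = pd.DataFrame({'x': pd.Series(x), 'y': pd.Series(y)}, copy=False)

# True when the columns are backed by the original arrays rather than copies.
print(np.shares_memory(df['x'].values, x), np.shares_memory(df['y'].values, y))
```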
Might be good to mention the dict of Series with copy=False solution in the DataFrame docs. |
take |
After hours of tearing my hair, I've come to the conclusion that it is impossible to create a mixed dtype DataFrame without copying all of its data in. That is, no matter what you do, if you want to create a mixed dtype DataFrame, you will inevitably create a temporary version of the data (e.g. using np.empty), and the various DataFrame constructors will always make copies of this temporary. This issue has already been brought up, a year ago: #5902.
This is especially terrible for interoperability with other programming languages. If you plan to populate the data in the DataFrame from e.g. a call to C, the easiest way to do it by far is to create the DataFrame in python, get pointers to the underlying data, which are np.arrays, and pass these np.arrays along so that they can be populated. In this situation, you simply don't care what data the DataFrame starts off with, the goal is just to allocate the memory so you know what you're copying to.
This is also just generally frustrating because it implies that in principle (depending potentially on the specific situation, and the implementation specifics, etc) it is hard to guarantee that you will not end up using twice the memory you really should.
This has an extremely simple solution that is already grounded in the quantitative python stack: have a method analogous to numpy's empty. This allocates the space, but does not actually waste any time writing or copying anything. Since empty is already taken, I would propose calling the method from_empty. It would accept an index (mandatory, most common use case would be to pass np.arange(N)), columns (mandatory, typically a list of strings), and types (a list of acceptable types for the columns, same length as columns). The list of types should include support for all numpy numeric types (ints, floats), as well as special Pandas columns such as DatetimeIndex and Categorical.
As an added bonus, since the implementation is in a completely separate method, it will not interfere with the existing API at all.