BUG: DataFrame.from_records() duplicates rows #6011
cannot reproduce this with '0.13.0-246-g1e1907c':

```python
In [1]: x = [dict(a=1, b=2), dict(a=2, b=2)]
   ...: pd.DataFrame.from_records(x)
Out[1]:
   a  b
0  1  2
1  2  2
```

Can you provide a self-contained example that demonstrates this issue? Preferably using
You only get to that part of the code if your data container is an iterator type. You can test it: you should get a StopIteration exception after the 100th value (index 99) if it is behaving correctly.
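The check suggested here can be sketched as follows, with a plain list iterator standing in for the real qr result object:

```python
# Protocol check sketch: a well-behaved iterator yields exactly its
# values and then raises StopIteration on every further next() call.
rows = iter([(i, i * 2) for i in range(100)])  # stand-in for the qr object
for _ in range(100):
    next(rows)                 # consume all 100 values
try:
    next(rows)                 # the 101st call must fail
    print("no StopIteration: object is not a true iterator")
except StopIteration:
    print("StopIteration after value 100: iterator protocol OK")
```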
Could you post a minimal example that reproduces this issue?
Sorry for the delayed response; I was busy completing my project. Here is a procedure for finding this behavior. Before starting the analysis, note that it is not simple to reproduce with a small snippet, because this issue only occurs under a special condition, as described below. Assumptions: PEEWEE (2.1.7) as an ORM for MySQL / CPython 3.3.3.

Please do not suspect that PEEWEE fetches the wrong records. PEEWEE is working fine.

Now, let's jump into the code of pandas' `frame.py`.
@museghost w/o a minimal reproducible example this cannot even be tested. Take the tuples that are generated from your SQL process and try to create an example from them (obfuscate the data if necessary, if it's not showable).
@jreback Here is the sample; please also consider the part of `frame.py` quoted above logically. This is not a technical bug. It is a logical thing, as emphasized!

----- << result >> -----
`<peewee.TuplesQueryResultWrapper object at 0x7f6479a48dd0>`
As you can see, the DataFrame makes one more row than expected, as mentioned.
@museghost I suspect the problem is that qr is NOT a list of tuples, but some other type of object (like a list of Row objects or something). Try this w/o the SQL in the mix and it works just fine. That's why we have tests. Try using
@jreback Before posting this issue, I investigated how it arises. As you know, this is a "logical" thing, not a technical or simple bug. As mentioned in my previous post, whether or not the qr object is a tuple, please look around lines 754 and 759: those lines can produce the duplicated rows. Yes, I understand that if the qr object is an ordinary dictionary or list of tuples, the total counts will match, because Python guarantees that.
There may be a bug. But unless we have a reproducible test, how does this help? Someone may change the code in the future and undo whatever is changed now; that does no one any good. And if it cannot be reproduced w/o the SQL embedded, then I highly doubt it is a bug in that part of the code. Code inspection only goes so far.
@museghost if you look a little higher in the code, the first of the values are explicitly popped off of the iterator (
@jreback SQL is not part of the problem. The root cause is that when any iterator object is passed to from_records() in frame.py, this duplication always happens because of the lines around 754-766. It is a logical thing.
@museghost prove it by making an example which duplicates this w/o the SQL |
@museghost I'll be the first to admit to a bug, but w/o a definitive test case this is impossible to tell |
@jreback Yes, you are very close to the root cause. On line 754, `values = [first_row]` is the root of this problem; that line should be `values = []`.
@museghost pls put up a test case to support your 'view' that this is a bug |
@jreback, I suggest we close pending a self-contained example. Enough is enough.
If you have a self-contained example, pls open a new issue. Thank you.
@museghost You are wrong about what is happening. The pandas code, even if it is not the best code, works well, at least with generators. The fault is your PEEWEE qr object, which is an iterator object and an iterable object at the same time. Generators have both protocols too (they are iterator and iterable at once), but they track the position of the next value. I think the problem is the implementation of the iterator protocol on the qr PEEWEE object: pandas first detects the object as an iterator and reads the first value, then reads the rest of the values using the iterable protocol. (The Counter class works with the iterable protocol, not the iterator one, so it is not valid for counting here.) You can test the behaviour of your qr object:
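The difference described above can be sketched as follows; the `Mixed` class is an illustrative toy, not peewee's actual code:

```python
# A generator returns itself from __iter__, so next() and iteration
# share one position.
def gen():
    yield 1
    yield 2

g = gen()
print(iter(g) is g)            # True: position is preserved across protocols

# A "mixed" object that advances via __next__ but restarts in __iter__
# loses that position, which is the failure mode described above.
class Mixed:
    def __init__(self, rows):
        self._rows = rows
        self._pos = 0
    def __next__(self):
        if self._pos == len(self._rows):
            raise StopIteration
        self._pos += 1
        return self._rows[self._pos - 1]
    def __iter__(self):
        return iter(self._rows)  # bug: restarts from the first row

m = Mixed([1, 2, 3])
next(m)                        # consume the first value via the iterator protocol
print(list(m))                 # [1, 2, 3] -- the consumed value reappears
```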
@y-p, @jreback If you want, this weekend I could do some refactoring and cleanup of the from_records method to ensure that only the iterator protocol is used, and add some tests for that kind of mixed-type object.
@tinproject go for it! pls create a separate issue as well |
@museghost please behave yourself in the future. your treatment of the lead pandas developers (which I see has already been redacted from history) is unacceptable. these are highly technical, experienced developers who are working to make Python a better data language for no compensation; if you would bother looking at the statistics for the project (https://github.com/pydata/pandas/graphs/contributors) you would see that. thank you |
First of all, especially to jreback and y-p, I deeply apologize for my rude attitude. There is no excuse; it was my fault. @tinproject @y-p @jreback: peewee defines the following at line 1282: `def __iter__(self):`. So, my query is the following. Thank you all.
I don't know if this is established in the Python PEPs, but if your object has to comply with generator behaviour, it must track the position of the iterator. For your case, to skip the duplicated values:
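One hedged workaround along these lines: wrap the result object in a plain generator before passing it to from_records, so only the iterator protocol is exercised (the `as_generator` name is mine, not from the thread):

```python
import pandas as pd

def as_generator(result):
    # A generator returns itself from __iter__, so a value consumed
    # with next() is never re-read when the object is iterated again.
    for row in result:
        yield row

rows = [(1, 'a'), (2, 'b'), (3, 'c')]  # stand-in for the SQL result
df = pd.DataFrame.from_records(as_generator(rows))
print(len(df))  # 3 -- no duplicated first row
```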
Until there is a better way to load internally from an iterator, the resulting performance will be the same. I still owe a PR to ensure that only the iterator protocol is used; I'm looking for time.
Using the latest pandas (0.13.0-246-g1e1907c), one critical bug still exists.

For instance, if a tuples object holds some results from SQL and is converted to a DataFrame, the first row of the tuples is added twice to the DataFrame, as below:

```python
qr =  # number of records: 100
df = pd.DataFrame.from_records(qr)  # number of records: 101
```

In the log, you can find that the first and second rows are duplicates.

redacted

As a result, after digging down into the source code of `frame.py`, I found the parts that cause this issue. Based on the latest version of `frame.py`, the relevant code is around line 756. The list `values` already holds `[first_row]` around line 756, and then the whole of `data` is added into the `values` list again around lines 760-769. That makes the first and second rows duplicates. Could you please fix this issue and then update the master branch as well?
```python
751         dtype = None
752         if hasattr(first_row, 'dtype') and first_row.dtype.names:
753             print("hasattr dtype")
754             dtype = first_row.dtype
755
756         #values = [first_row]  ## caused the duplicated first and second row
757         values = []            ## should be fixed
758
759         # if unknown length iterable (generator)
760         if nrows is None:
761             # consume whole generator
762             values += list(data)
763         else:
764             #i = 1  ## caused the duplicated first and second row
                i = 0   ## should be fixed
765             for row in data:
766                 values.append(row)
767                 i += 1
768                 if i >= nrows:
769                     break
```
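The duplication described in the report can be simulated without SQL or pandas by replaying the quoted logic against an object whose `__iter__` forgets rows already consumed by `next()`; the class below is an illustrative stand-in for the peewee wrapper, not its real code:

```python
class ForgetfulResult:
    """Toy result wrapper: next() advances, but __iter__ restarts."""
    def __init__(self, rows):
        self._rows = rows
        self._pos = 0
    def __next__(self):
        if self._pos >= len(self._rows):
            raise StopIteration
        self._pos += 1
        return self._rows[self._pos - 1]
    def __iter__(self):
        return iter(self._rows)  # bug: restarts from row 0

data = ForgetfulResult([(i, i * 10) for i in range(100)])
first_row = next(data)   # the peek done just above the quoted snippet
values = [first_row]     # line 756
values += list(data)     # line 762: list() restarts via __iter__
print(len(values))       # 101 -- the first row appears twice
```

With a true iterator (or a generator), `list(data)` would continue from the second row and `values` would hold exactly 100 entries.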