Skip to content

BUG: DataFrame.from_records() duplicates rows #6011

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
museghost opened this issue Jan 20, 2014 · 22 comments
Closed

BUG: DataFrame.from_records() duplicates rows #6011

museghost opened this issue Jan 20, 2014 · 22 comments

Comments

@museghost
Copy link

Using the latest Pandas "pandas (0.13.0-246-g1e1907c)", one critical bug is still existed.

For instance, if tuples has some results from SQL and is converted to Dataframe, the first row of the tuples is added "twice" times into Dataframe as below.

qr = # number of records : 100
df = pd.DataFrame.from_records(qr) # number of records : 101

In the log, you can find the first and second row have been duplicated.

redacted

As a result, diggling down the source code "frame.py", some parts caused this issue.

Based on the latest version "frame.py", the line number is around 756.

Whatever happened, the list "values" has [first_row] around 756 line, and then the whole "data" is added into "value" list again around 760 ~ 769 line.

It make the first and second row duplicated.

Could you please fix this issue and then update the master branch as well ?

751 dtype = None
752 if hasattr(first_row, 'dtype') and first_row.dtype.names:
753 print("hasattr dtype")
754 dtype = first_row.dtype
755
756 #values = [first_row] ## caused the duplicated first and second row
757 values = [] ## should be fixed
758
759 # if unknown length iterable (generator)
760 if nrows is None:
761 # consume whole generator
762 values += list(data)
763 else:
764 #i = 1 ## caused the duplicated first and second row
i = 0 ## should be fixed
765 for row in data:
766 values.append(row)
767 i += 1
768 if i >= nrows:
769 break

2014-01-21 01 51 42

@ghost
Copy link

ghost commented Jan 20, 2014

cannot reproduce this with '0.13.0-246-g1e1907c':

In [1]: x=[dict(a=1,b=2),dict(a=2,b=2)]
   ...: pd.DataFrame.from_records(x)

Out[1]: 
   a  b
0  1  2
1  2  2

Can you provide a self contained example that demonstrates this issue? preferably using
less exclamation marks in the process.

@tinproject
Copy link
Contributor

You only get to that part of code if your data container it's an iterator type (have __next__() method). Which kind of object are you passing to the DataFrame.from_records() funcion? With whitch pameters?
Are you sure that you container only yields 100 values?

You can test:

for i in range(200):
    print(i, next(data_container))

You should get an StopIteration exception after the 99 (100th) value if its ok.

@jtratner
Copy link
Contributor

Could you post a minimal example that reproduces this issue?

@museghost
Copy link
Author

For completing my project, I am afraid to response delayed.

Here is an procedure how to find this behavior. Before starting analysis, it is not simple to reproduce with a simple code because this issue is occurred with an special condition as below.

Assumption : PEEWEE (2.1.7) as an ORM for MySQL / CPython 3.3.3

  1. retrieving data from MySQL and then making them as a tuple.

    executions = Execution() # this is a model by PEEWEE
    items = executions.select().where((Execution.Trading Server == "1")).tuples()
    qr = items.execute()

Please do not suspect the PEEWEE get a wrong records. PEEWEE is working fine.

  1. Converting the tuple to DataFrame. (** important part **)

    df = pd.DataFrame.from_records(qr, columns=Executions._meta.get_field_names())

Now, let's jump into the code "frame.py" of Pandas.

  1. In the above line, DataFrame recognized the "qr" object as "iterator". So, it runs 738 line as below.
  2. And then, the code goes to 744 or 746 line. Due to this code, "first_row" object gets the first record from PEEWEE result.
  3. (**) At 754 line, "values" object get the result of "first_row", that is, the first record of the PEEWEE result.
  4. (**) If nrows is None, the line 759 is excuted. At this point, you need to think this code more logical.
    Before excuting "values += list(data)", values has already the "first_row". So, when executing the line 759, the "values" object finally faces the situation that is the first and second row are duplicated as below.
   before 759 line, 
   values = ["1st record"],

   after 759 line,
   values = ["1st record", "1st record", "2nd record",....etc...]
  1. Even the codes between 761 and 766 makes the same result as above because of the line "763" values.append(row)

2014-01-24 01 05 31

  1. Conclusion : To avoid this issue,
    • at 754 line should be "values = []"
    • at 761 line should be "i=0".

@jreback
Copy link
Contributor

jreback commented Jan 23, 2014

@museghost w/o a minimal reproducible example this cannot even be tested.

Take the tuples that are generated from your SQL process and try to create an example from them (obfuscate the date if necessary if its not showable)

@museghost
Copy link
Author

@jreback here is the sample and please think the above part of 'frame.py' logically. This is not a technical bug. It is a logical thing as emphased !

executions = Executions()
items = executions.select().where(
    (Executions.TradingServer == "1")).tuples()
qr = items.execute()
print(qr)

cnt = Counter(item[0] for item in qr)
print("len(cnt) : %d" % len(cnt))

itemsCount = 0
for item in qr:
    itemsCount = itemsCount + 1
print("itemCount %d" % itemsCount)

df = pd.DataFrame.from_records(qr, columns=CME.CMEExecution._meta.get_field_names())
print(df.Id.size)
print(df)

----- << result >> -------------------------------------------------------------------------------------------------------------

<peewee.TuplesQueryResultWrapper object at 0x7f6479a48dd0>
len(cnt) : 8204
itemCount 8204
8205
[8205 rows x 13 columns]
8205
Id ExecId Side OrderType
0 1 641 B L
1 1 641 B L
2 2 652 S L

as you see, the DataFrame makes one more rows as metioned.

@jreback
Copy link
Contributor

jreback commented Jan 23, 2014

@museghost I suspect the problem is that qr is NOT a list of tuples, but some other type of object (like a list of Row objects or something). Try this w/o the sql in the mix and it works just fine. That's why we have tests.

Try using pd.read_sql and see

@museghost
Copy link
Author

@jreback Before posting this issue, I expected how this issue was going on. You know, this is "logical" thing, not a technical or simple bug.
The result is not matter of problem. The problem is that the logicial procedure how to manage "values" object in the frame.py code.

As mentioned in my previous post, whatever the qr object is tuple or not, please look at around 754 line and 759 line. Those lines can make the duplicated rows like that.

before 759 line,
values = ["1st record"],

after 759 line,
values = ["1st record", "1st record", "2nd record",....etc...]

That's the problem

yes, I understand that if the qr object is original and simple dictionary or tuples, the total counts might be same because Python gurantees that.
However, if not, frame.py copies all data from user's ojbect to their internal "values" object. that's the problem. what I want to fix in frame.py's logic.

@jreback
Copy link
Contributor

jreback commented Jan 23, 2014

@museghost

their may be a bug. but unless we have a reproducible test, then how does this help?

someone may change the code in the future and may change whatever code is changed now.

that does noone any good.

and if it cannot be reproduced w/o the SQL embeded then I highly doubt it is a bug in that part of the code.

code inspection only goes so far.

@jreback
Copy link
Contributor

jreback commented Jan 23, 2014

@museghost if you look a little higher in the code, the first of the values are explicity popped off of the iterator (next(data); so don't think your duplicates are causes by this part of the code

@museghost
Copy link
Author

@jreback SQL is not a part of problem. The root cause is that when any iterator object passes to from_records() in frame.py, this duplicated issue is always happened becasue of the lines around 754 ~ 766.

It is a logical thing.

@jreback
Copy link
Contributor

jreback commented Jan 23, 2014

@museghost prove it by making an example which duplicates this w/o the SQL

@jreback
Copy link
Contributor

jreback commented Jan 23, 2014

@museghost I'll be the first to admit to a bug, but w/o a definitive test case this is impossible to tell

@museghost
Copy link
Author

@jreback yes. you are almost closed to the root cause. In the line 754, value=[first_row] is the root of this problem. that line shoudl be value = [].

@jreback
Copy link
Contributor

jreback commented Jan 23, 2014

@museghost pls put up a test case to support your 'view' that this is a bug

@ghost
Copy link

ghost commented Jan 23, 2014

@jreback , I suggest we close pending a self-contained example. enough is enough.

@jreback
Copy link
Contributor

jreback commented Jan 23, 2014

@museghost

if you have a self-contained example, pls open a new issue

thank you

@jreback jreback closed this as completed Jan 23, 2014
@tinproject
Copy link
Contributor

@museghost you are wrong on what are happening. Pandas code, even it's not the best code, works well, at least with generators.

The fault it's your PEEWEE qr object that it's an iterator object and an iterable object at the same time.
Both things are not the same, an iterator have a __next__() method that returns the current value and steps for the next one. An iterable have a __iter__() method that returns an iterator.

Generators have both protocols, they are iterator and iterable at the same time, but they track the position of the next value.

I thought the problem it's the implementation of the iterator protocol on the qr PEEWEE object. Firtsly pandas detects the object as an iterator and read the first value, then read the rest of the values using list() that use the iterable protocol. When you call iter(qr) surely the qr object returns an iterator to the whole container, starting at the first position. If you previously call next(qr) the iterator returned with iter(qr) should point to the second element, not the first one, hence the duplicated values.

Counter class works with the iterable protocol, not the iterator, so it's not valid for count.

You can test the behaviour of your qr object:

# obtain qr object
print(next(qr))
>>> First value printed
print(next(qr))
>>> Second value printed
print(next(iter(qr)))
>>> Should be the third value the printed.

@y-p, @jreback if you want this weekend I could make some refactor and clean to the from_records method to ensure that only the iterator protocol is used, and add some test to that kind of mixed type objects.

@jreback
Copy link
Contributor

jreback commented Jan 23, 2014

@tinproject go for it!

pls create a separate issue as well

@wesm
Copy link
Member

wesm commented Jan 24, 2014

@museghost please behave yourself in the future. your treatment of the lead pandas developers (which I see has already been redacted from history) is unacceptable. these are highly technical, experienced developers who are working to make Python a better data language for no compensation; if you would bother looking at the statistics for the project (https://github.com/pydata/pandas/graphs/contributors) you would see that. thank you

@museghost
Copy link
Author

First of all, especially for jreback and y-p, I deeply apologize my rude attitude for you. There is no execue, It was my fault.
I know how hard you contribute this project for everyone, and I always have appreciated you and your team.
In the future, I am sure I am going to discuss any issue with the good manner.

@tinproject @y-p @jreback :
For technical side, yes, tinproject's saying is right. When analyzing qr object in PEEWEE, the below line is found.
It is about "iter" function and self.idx sets 0 everytime when __iter is called. That's the main root cause of the duplicated issue.
In fact, when qr object is called in "for" loop or list(), "for" loop or list() called the iter of the object (in the case, "qr"), and then qr always returns the first items.

1282 def iter(self):
1283 print("iter")
1284 self.__idx = 0
1285
1286 if not self._populated:
1287 return self
1288 else:
1289 return iter(self._result_cache)

So, my query is the following:
if any user of PANDAS uses their own object with customized iterator or generator, how does PANDAS handle such complated or customized case ?

Thank you all

@tinproject
Copy link
Contributor

I don't know if this is stablished on Python PEPs but if your object have to comply generator behaviour track the position of the iterator.

For your case to skip the duplicates values:

df = pd.DataFrame.from_records(list(qr))

Until there is a better way to load internally from an iterator, the result preformance will be the same.

I'm still owe a PR to ensure that only the iterator protocol is used, I'm looking for time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants