Iterating through TableIterator with where clause can incorrectly ignore data #8014
Comments
you would need to narrow this down to a small reproducible example.
This is the reproducible example which I did narrow down. As I noted, simply writing test frames of length chunksize * some multiple followed by a DataFrame of length 64 does not reproduce the problem. I explained that the problem only occurs once 200000 + 58689 + 41375 rows are written. Comparing the concatenated frame to the original will fail because the concatenated frame is missing the last 64 elements, as I noted.
In order for this to be tested it has to be constructed as a test. Once you have a test that reliably fails, then you can debug. Please see the format here: https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_pytables.py
It looks like a bug, but it could be that your selection is simply incorrect. Hard to tell without a reliable test.
@jreback, understood, I'll work up a test. Thanks.
This can be added to https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_pytables.py and reproduces the problem. Regards.
@bboerner ok, this was very subtle; fixed in #8029. Essentially a chunk could happen to select more than the chunksize (which happens in this example when the date crosses a day boundary). This might actually be a bug in PyTables, not sure. In any event you can completely bypass this by simply selecting the coordinates with the where clause up front, then using the chunksize to iterate over them (and selecting the data). This also conforms to the docs, where you will get chunksize rows in every chunk (possibly fewer in the last one) OF THE RESULTS.
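For reference, a minimal sketch of the workaround described above (resolving the where clause to coordinates up front, then iterating over them in chunksize-sized slices); the file name, key, and where clause here are placeholders, not from the original report:

```python
import pandas as pd

chunksize = 100000
with pd.HDFStore('test.h5') as store:
    # evaluate the where clause once, up front, to get the matching row coordinates
    coords = store.select_as_coordinates('df', where="index >= Timestamp('2013-01-01')")

    # then pull the data in chunksize-sized slices of those coordinates;
    # every chunk except possibly the last contains exactly chunksize rows
    chunks = [store.select('df', where=coords[start:start + chunksize])
              for start in range(0, len(coords), chunksize)]
    result = pd.concat(chunks)
```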
@jreback, thanks, verified the fix. Regards.
@jreback, testing the fix with my app I'm seeing another problem. I'll debug further and look for a reproducible test case.
ok, I haven't merged this yet as I cleaned up the code (but it passes the same tests).
@jreback, I've updated the test case to check for a subset of data - the result will be a "hang" in the list comprehension (the iterator never exits). I committed this to bboerner@8e24e8a, or I can submit a pull request if you prefer (ignore the commit of test_pytables2.py; that was a subset I was using and didn't mean to commit). Let me know. Regards.
Fix test case: bboerner@7f7e1b5
This reverts commit a6d0ec3.
This reverts commit 3fcb774. Conflicts: pandas/io/tests/test_pytables2.py
This reverts commit b8447d7. Conflicts: pandas/io/tests/test_pytables2.py
Expected behaviour: for an appendable table stored using HDFStore, the summed length of the DataFrames returned by an iterator with a where clause should equal the length of the DataFrame returned using the same where clause but with iterator=False (e.g. TableIterator.get_values()).
The attached code generates appendable tables of size 100064, 200064, ..., 400064. It uses a where clause that is a superset of all possible values to get DataFrames with iterator=False, with and without the where clause, and with iterator=True, also with and without the where clause. In all cases except iterator=True with the where clause, the length of the returned DataFrames is correct.
For the failure cases, closer inspection in IPython shows that it is the last 64 rows which are not being returned.
Note: in create_file() the appending of DataFrames with lengths of 58689 and 41375 was chosen specifically to reproduce the problem. I originally encountered the problem with a dataset of length 174000064 where the last append was of size 41375. I attempted to reproduce the problem by creating tables of various lengths in chunks of 100000 with a final append of 64, and wasn't able to do so.
Creating the table with the last chunk = 41375 and a total length exceeding 300000 does, in my tests, reproduce the problem.
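A rough sketch of the reproduction described above; the column name, index values, where clause, and file name are placeholders (not the original attachment), but the append sizes follow the description: several 100000-row appends followed by 58689- and 41375-row appends.

```python
import numpy as np
import pandas as pd


def create_file(path, n_chunks):
    # total length: n_chunks * 100000 + 58689 + 41375 rows
    sizes = [100000] * n_chunks + [58689, 41375]
    total = sum(sizes)
    df = pd.DataFrame({'value': np.random.randn(total)},
                      index=pd.date_range('2014-01-01', periods=total, freq='S'))
    # append the frame to an appendable table in the sizes described above
    with pd.HDFStore(path, mode='w') as store:
        pos = 0
        for n in sizes:
            store.append('df', df.iloc[pos:pos + n])
            pos += n
    return df


df = create_file('test.h5', 3)               # 400064 rows in total, > 300000
where = "index >= Timestamp('2013-01-01')"   # a superset of all index values

with pd.HDFStore('test.h5') as store:
    full = store.select('df', where=where)                        # iterator=False
    chunks = list(store.select('df', where=where,
                               iterator=True, chunksize=100000))  # iterator=True

# expected: both selections return every row; in the failure case the
# concatenated chunks are missing the last 64 rows
print(len(full), sum(len(c) for c in chunks))
```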
Output:
INSTALLED VERSIONS
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-32-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: None
IPython: 1.2.1
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: 3.3.3
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None