-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Update tokenizer to fix #8679 #8661 #8752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -3036,7 +3036,7 @@ def test_line_comment(self): | |||
|
|||
def test_comment_skiprows(self): | |||
data = """# empty | |||
random line | |||
random line with trailing spaces |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you leave the original test in, and add a new test (you can just copy-paste), same method is fine. Also add the issue numbers as a comment.
pls add a release note in v0.15.2 (reference both issues) |
cc @mdmueller pls have a look |
|
@selasley ok, rebase again, you'll have a merge conflict (as you are including the whole of v0.15.2.txt again). You should be fine otherwise, repush and i'll have a look. |
I added my comments to whatsnew/v0.15.2.txt from e8b3862, did a rebase/squash and pushed the result. Let me know if I was supposed to do something different. |
commit 3c16f33 contains a bug fix in tokenize_delim_customterm, general code cleanup and updates to various files to e8b3862 versions |
you seem to be picking up other commits and then rebasing them in what are you doing to rebase? |
On Nov 10, 2014, at 3:37 AM, jreback [email protected] wrote: Thanks for the link. I'll definitely read through it. I've been doing git rebase -i HEAD~2 then changing the second pick to squash and editing the commit comments. Then I do a git push origin +trailing_spaces_fix For the latest commit I copied several files from the latest master branch to my trailing_spaces_fix branch to make sure the tests still passed. In retrospect that was probably not a good thing to do.= |
@selasley can you rebase this. and can you post a memory comparison of old vs new? (use a relatively small example) |
I'll try again to rebase tonight. According to the Allocations instrument with Xcode, reading the file allocated 4.67GB total with pandas 0.15.1 and 1.40GB with the new pandas. With 100000 rows to skip the numbers are |
I found a problem with custom line terminators in 0.15.1 and in my version |
@selasley fine to fix (I think their is an open issue about line terminators - can't find it in s quick search) |
I think I rebased properly this time for the trailing spaces fix. Thanks for your patience |
rebase looks good! https://github.com/pydata/pandas/wiki/Performance-Testing see: I think it makes send to add s vbench for skiprows as well |
I added a 20000 row, skiprows=10000 benchtest to io_bench.py
|
perf looks good (and vbench). you need to rebase again. somehow picking up odd things.
|
git rebase trailing_spaces_fix -i origin/master and git rebase -i trailing_spaces_fix origin/master gave fatal errors so I tried the rebase that seemed to work yesterday |
@selasley pls rebase and push. is this working locally? |
I am trying to follow the recipe on the wiki. Starting from scratch |
the |
OK, I did a girl push with -f. I mistakenly thought +trailing_spaces_fix was the same as --force from some stack overflow answer. Now that I read the git push docs I see they are different. |
https://travis-ci.org/pydata/pandas/jobs/41103767#L251 it's not cythonizing/compiling correctly |
I misinterpreted the failure message on Travis. It builds without errors on my mac with clang from Xcode 6.1 and Cython 0.21.1, but if I run nosetests in the pandas directory I get the ImportError: cannot import name 'hashtable' error. |
@@ -66,6 +66,7 @@ Enhancements | |||
- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`). | |||
- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`. | |||
- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`). See :ref:`here <io.stata-categorical>` for more information on importing categorical variables from Stata data files. | |||
- Reduce memory usage when skiprows is an integer in read_csv (:issue:`8681`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to performance section
looks good. minor comment. pls rebase / squash. ping when green. |
I moved the comment to performance in whatsnew and rebased |
df = self.read_csv(StringIO(data.replace(',', ' ')), | ||
header=None, delim_whitespace=True, | ||
skiprows=[0,1,2,3,5,6], skip_blank_lines=True) | ||
tm.assert_almost_equal(df.values, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare an actual dataframe here (and below), NOT the values, just tm.assert_frame_equal
, this makes sure dtypes, index, columns are all correct
… 8661, 8679 ENH CSV: Reduce memory usage when skiprows is an integer in read_csv, issue 8681
replaced tm.assert_almost_equal(df.values, expected) with tm.assert_frame_equal(df, expected) where expected is now a dataframe. I had to push a couple of times. The latest travis build should succeed |
thanks! |
Although it works for "skiprows = integer" now, there is still problem with "skiprows = list of rows" |
I can reproduce the problem. I'll look into it |
Test name | head[ms] | base[ms] | ratio |frame_to_csv_date_formatting | 10.0880 | 10.4263 | 0.9676 | read_csv_default_converter | 1.2707 | 1.1950 | 1.0633 |Test name | head[ms] | base[ms] | ratio |Seed used: 1234 Target [8602c23] : BUG: Fix buffer overflows in tokenizer.c with certain malformed input files. GH9205 |
Update tokenizer's handling of skipped lines.
Fixes a problem with read_csv(delim_whitespace=True) and
read_table(engine='c') when lines being skipped have trailing
whitespace.
closes #8679
closes #8681