adding a NotImplementedError for simultaneous use of nrows and chunksize... #7085

michaelaye · 2014-05-09T01:24:13Z

..., as the user intention most likely is to get a TextFileReader, when using the chunksize option.
Fixes #6774

jreback · 2014-05-09T01:26:40Z

needs a test (which should fail w/o your fix and pass with it)

in pandas/io/tests/test_parsers.py
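As a rough illustration of the kind of test being asked for, here is a minimal sketch. It does not use pandas: `read_rows_sketch` and `TestReadRowsSketch` are hypothetical stand-ins, and only the idea (reject `nrows` together with `chunksize`, then assert that with `assertRaises`) comes from this thread.

```python
import unittest


def read_rows_sketch(lines, nrows=None, chunksize=None):
    """Hypothetical stand-in for a reader function; this is not pandas code."""
    if nrows is not None and chunksize is not None:
        # Reject the combination up front instead of silently ignoring one option.
        raise NotImplementedError(
            "'nrows' and 'chunksize' cannot be used together yet."
        )
    if chunksize is not None:
        # Return an iterator of chunks, loosely mimicking a chunked reader.
        return (lines[i:i + chunksize] for i in range(0, len(lines), chunksize))
    # nrows=None slices the whole list.
    return lines[:nrows]


class TestReadRowsSketch(unittest.TestCase):
    def test_nrows_and_chunksize_raise(self):
        # Fails if the check above is removed, passes with it in place.
        self.assertRaises(
            NotImplementedError,
            read_rows_sketch, ["a", "b", "c"], nrows=2, chunksize=1,
        )
```

Removing the `if nrows is not None and chunksize is not None` branch makes the test fail, which is the "fails without the fix" property requested above.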

michaelaye · 2014-05-09T01:36:51Z

THAT's where they are hidden, was looking for parsing tests. :)
But I'm totally puzzled there: I see self.assertRaises everywhere and can't find its definition? The ParserTests class is not derived from anything other than object, so where does self.assertRaises come from??

jreback · 2014-05-09T01:40:00Z

that class is just included in other classes, which inherit from TestCase, which ultimately inherits from unittest's TestCase (and that's where assertRaises is defined)

you can look at

klass.mro() to see the hierarchy if u r curious
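A small sketch of that lookup, assuming plain `unittest.TestCase` in place of the pandas `tm.TestCase` wrapper; the class names ending in `Sketch` are made up for illustration.

```python
import unittest


class ParserTestsSketch(object):
    # Plain mixin: it calls self.assertRaises but does not define it.
    def test_bad_int(self):
        self.assertRaises(ValueError, int, "not a number")


class TestPythonParserSketch(ParserTestsSketch, unittest.TestCase):
    # Inherits the test from the mixin and assertRaises from TestCase.
    pass


# The method resolution order lists the classes searched for an attribute.
mro_names = [klass.__name__ for klass in TestPythonParserSketch.mro()]
print(mro_names)

# Find the first class in the MRO whose own namespace defines assertRaises.
definer = next(
    klass for klass in TestPythonParserSketch.mro()
    if "assertRaises" in vars(klass)
)
print(definer.__name__)
```

The mixin on its own would fail with an AttributeError if instantiated and run directly; it only works once it is combined with a TestCase subclass.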

michaelaye · 2014-05-09T02:01:20Z

Sorry, again confused. What's the idea of the separation of ParserTests and TestPythonParser? Because I see both classes performing asserts ?
Also, many test methods are defined more than once? Does it not matter because they are all executed after each other, independent of the method name, because they are tests? Feels dirty though...

jreback · 2014-05-09T03:40:52Z

when tests are used in multiple test classes they are separated to a base class

most of the tests in the parser are run 3x with different arguments (mostly the engine type eg c or python)

use -v when u run tests to see how tests are executed

michaelaye · 2014-05-09T03:55:38Z

Not all of them really have different arguments. Look at these two: isn't it just a forgotten copy-paste, or indecision about whether to run these 2 tests separately or within one method? The first test in the first method looks exactly the same as the test in the second method definition.

    def test_integer_overflow_bug(self):
        # #2601
        data = "65248E10 11\n55555E55 22\n"

        result = self.read_csv(StringIO(data), header=None, sep=' ')
        self.assertTrue(result[0].dtype == np.float64)

        result = self.read_csv(StringIO(data), header=None, sep='\s+')
        self.assertTrue(result[0].dtype == np.float64)

    def test_integer_overflow_bug(self):
        # #2601
        data = "65248E10 11\n55555E55 22\n"

        result = self.read_csv(StringIO(data), header=None, sep=' ')
        self.assertTrue(result[0].dtype == np.float64)

jreback · 2014-05-09T03:57:13Z

it's calling self.read_csv

that sets different engines on each run

michaelaye · 2014-05-09T04:05:17Z

you mean there's something that keeps track of how often a method signature was called and uses a different engine the next time it's there?
IOW, the execution of the method and its results depend on what methods have been defined before it in the source?
How would I know then what engine is used, which one is used first? Can I not control it?

jreback · 2014-05-09T04:09:44Z

that's why u define multiple classes to have the same methods called but with different arguments to test different cases

it makes it simpler than writing a test twice

nose calls the tests in alphabetical order I think

no state is kept

it's all in each test

idea being it should behave identically on both parsers

there are specific tests on one or the other parser for certain options that are only supported on one (which are in effect unimplemented features ATM)
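The pattern described here can be sketched as follows. Everything in this block is hypothetical (`parse_sketch`, the two "engines", the class names); it only mirrors the structure of a shared base class whose tests call `self.parse`, with each concrete test class fixing the engine.

```python
import csv
import io
import unittest


def parse_sketch(text, engine):
    """Hypothetical parser with two interchangeable 'engines'."""
    if engine == "split":
        return [line.split(",") for line in text.splitlines()]
    if engine == "csv":
        return [row for row in csv.reader(io.StringIO(text))]
    raise ValueError("unknown engine: %r" % engine)


class SharedTests(object):
    # Not collected by itself: no 'Test' prefix and not a TestCase.
    def test_two_rows(self):
        expected = [["a", "b"], ["1", "2"]]
        self.assertEqual(self.parse("a,b\n1,2\n"), expected)


class TestSplitEngine(SharedTests, unittest.TestCase):
    def parse(self, text):
        return parse_sketch(text, engine="split")


class TestCsvEngine(SharedTests, unittest.TestCase):
    def parse(self, text):
        return parse_sketch(text, engine="csv")
```

The single `test_two_rows` method runs once per concrete class, so no state has to be carried between runs: the engine is decided entirely by which class the test instance belongs to.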

jtratner · 2014-05-09T06:55:30Z

Super trivial point: unless this will be supported someday, or could be supported, you should make this a TypeError instead (for bad function signature - eg what Python throws if you pass the wrong number of arguments).

Your call on what it should be tho.
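For reference, a minimal sketch of the behaviour being compared to: Python itself raises TypeError when a call does not match the function signature. The function `add` is just an illustrative name.

```python
def add(a, b):
    return a + b


try:
    add(1)  # one required positional argument is missing
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name)

# NotImplementedError, by contrast, is a RuntimeError subclass and signals
# functionality that is missing rather than a malformed call.
not_implemented_is_runtime = issubclass(NotImplementedError, RuntimeError)
```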

jreback · 2014-05-09T12:49:09Z

@michaelaye TypeError is better here as @jtratner points out, though I do see from your original request that this could be an enhancement proposal (e.g. n chunks from first nrows)...either ok for now

michaelaye · 2014-05-09T18:34:01Z

Ok, I dug deeper into the parser tests and still don't get why the above method is defined twice in the class ParserTests.
I understand that only classes with Test in the front of the name are being called, and that TestCParserHighMemory, TestCParserLowMemory and TestPythonParser derive from ParserTests and that the read_csv method has different keywords and is being overwritten at the beginning of these Test-classes.
What I don't get:

when overwriting read_csv, I don't understand how read_csv is available in the scope of that method? Shouldn't it say return self.read_csv(...) here?

class TestCParserHighMemory(ParserTests, tm.TestCase):

    def read_csv(self, *args, **kwds):
        kwds = kwds.copy()
        kwds['engine'] = 'c'
        kwds['low_memory'] = False
        return read_csv(*args, **kwds)

This does not explain why there would be two identically named methods in ParserTests. If both were called, then the total number of times test_integer_overflow_bug is called should be 6, not 3, but it is 3:

$ cat allout.txt|grep "test_integer_overflow_bug "
test_integer_overflow_bug (pandas.io.tests.test_parsers.TestCParserHighMemory) ... ok
test_integer_overflow_bug (pandas.io.tests.test_parsers.TestCParserLowMemory) ... ok
test_integer_overflow_bug (pandas.io.tests.test_parsers.TestPythonParser) ... ok
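On the scope question above, a minimal sketch may help. The module-level `read_csv` here is a hypothetical stand-in for the function the test module imports; `Reader` is likewise made up.

```python
def read_csv(*args, **kwds):
    """Hypothetical module-level function standing in for the imported one."""
    return ("module-level", kwds)


class Reader(object):
    def read_csv(self, *args, **kwds):
        kwds = dict(kwds, engine="c")
        # A bare name inside a method is looked up in the local, enclosing
        # function, module, and builtin scopes. The class body is skipped,
        # so this calls the module-level function, not this method.
        return read_csv(*args, **kwds)


result = Reader().read_csv(sep=",")
print(result)
```

Writing `self.read_csv(...)` inside the method instead would call the method itself and recurse without end, which is why the bare name is used there.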

jreback · 2014-05-09T20:38:23Z

so delete the first one (looks like an older version); it doesn't matter really, because the later definition overrides the first definition

e.g.

class Foo:

    def func(....):
        ....

    def func(....):
        ....

The first 'func' is simply overwritten
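A runnable version of the same idea, as a small sketch: a class body is executed top to bottom, so the second `def` rebinds the name `func` and only one attribute survives.

```python
class Foo(object):
    def func(self):
        return "first"

    def func(self):  # rebinds the name; the first function object is dropped
        return "second"


# Only one attribute called 'func' exists in the class namespace.
names = [name for name in vars(Foo) if name == "func"]
value = Foo().func()
print(names, value)
```

This is also why the duplicated test shows up 3 times rather than 6 in the output above: each test class sees a single `test_integer_overflow_bug`, the later definition.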

michaelaye · 2014-05-09T21:20:15Z

Ok, will remove the first version. Don't know what the difference between override and overwrite is, and why the later definition would override the first one, but would NOT overwrite it, but I guess that's some internals I don't care now.

jreback · 2014-05-30T14:27:44Z

looks good, pls rebase on master and add a whatsnew entry for 0.14.1. (we are now just putting in 1 place)

jreback · 2014-06-03T23:45:07Z

@michaelaye can you add a whatsnew entry? otherwise good to go (you can put in API changes section)

michaelaye · 2014-06-03T23:46:53Z

Nah, still have to do the tests and learn about rebasing.

jreback · 2014-06-03T23:48:20Z

ahh..right...ok...lmk

jreback · 2014-06-10T15:41:23Z

update?

jreback · 2014-06-22T12:41:26Z

@michaelaye ?

michaelaye · 2014-06-23T05:52:12Z

i'm working on the test. saw now that read_csv is being imported into the namespace. But considering all the redefinitions of read_csv that are happening in the test module, I find that really obscuring. Having a parsers.read_csv would be much clearer and less confusing, IMHO.

I'm hampered by my main work machine being broken this week, so i'm working with this awfully slow Win7 netbook 🐌

michaelaye · 2014-06-23T09:44:01Z

Okay, as required, I verified that my test fails when my patch is not in. Now a quick read on rebasing, and deciding which commit should survive, and I'm done.

michaelaye · 2014-06-23T09:57:47Z

So, I only have 2 commits to replay, is it okay to just execute a 'git rebase master' on my patch branch and push the results up into the PR?

jreback · 2014-06-23T10:18:57Z

yep

jreback · 2014-06-23T10:21:00Z

perfect

just need a short release note in v0.14.1 (ref the original issue)

then squash (if u can't no worries)

but almost the same as a rebase
just do it -i and use s

michaelaye · 2014-06-24T01:07:58Z

one last help to understand what is the preferred git procedure:
doc/source/v0.14.1.txt is not yet in my patch branch, but of course available in master. How should I get this file into my patch branch? If I do a rebase now I would have it, but then I create another commit. If I merge master into my patch branch, then there's nothing to rebase. Please advise.

jreback · 2014-06-24T01:13:35Z

git rebase -i origin/master

will rebase u on top of pandas master

michaelaye · 2014-06-24T01:22:10Z

yes, and then i have to make a change to the what's new file, creating another commit, meaning even if I squash my previous 2 i now have to create another one.

jreback · 2014-06-24T01:24:13Z

that's fine
u create commits
then rebase / reorder / squash as needed

michaelaye · 2014-06-24T01:31:32Z

Are you suggesting a second rebase after I made the changes to v0.14.1.txt?

michaelaye · 2014-06-24T01:34:27Z

or should I cherry-pick the file, make the changes and then rebase? That sounds quite hackish for the git log, I think.

jreback · 2014-06-24T01:34:31Z

u can rebase as much as u need

all rebasing does is rewrite the commit history

eg allows u to reorder and/or combine commits

michaelaye · 2014-06-24T01:35:15Z

ok, thanks.

jreback · 2014-06-24T01:36:38Z

you can make whatever changes u want
it's in your branch

I think rebasing is really much nicer than merging in as it allows u to make a consistent history - purists don't like this because it rewrites the history - but I don't think it's a big deal

when I merge it will be a single merge commit that takes your commit (kind of like a cherry pick)

…hunksize. For read_csv() the user intention most likely is to get a TextFileReader, when using the chunksize option, but simultaneous use of nrows is not implemented yet. This raises now a NotImplementedError. Test and entry to current whatsnew source (v0.14.1.txt) added. Fixes pandas-dev#6774

michaelaye · 2014-06-24T02:36:20Z

darn git has some empty diffs with github.. i will let u know when i'm done.. :(

michaelaye · 2014-06-24T02:41:02Z

ok, this should be it. Thanks for your patience.

adding a NotImplementedError for simultaneous use of nrows and chunksize...

jreback · 2014-06-24T13:16:58Z

thanks @michaelaye !

jreback added API Design and removed API Design labels May 9, 2014

jreback added this to the 0.15.0 milestone May 9, 2014

jreback modified the milestones: 0.14.1, 0.15.0 May 30, 2014

jreback added this to the 0.15.0 milestone Jun 22, 2014

jreback removed this from the 0.14.1 milestone Jun 22, 2014

jreback modified the milestones: 0.14.1, 0.15.0 Jun 24, 2014

jreback added a commit that referenced this pull request Jun 24, 2014

Merge pull request #7085 from michaelaye/add_notimp_error

647f771

adding a NotImplementedError for simultaneous use of nrows and chunksize...

jreback merged commit 647f771 into pandas-dev:master Jun 24, 2014
