Skip to content

adding a NotImplementedError for simultaneous use of nrows and chunksize... #7085

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 24, 2014

Conversation

michaelaye
Copy link
Contributor

..., as the user intention most likely is to get a TextFileReader, when using the chunksize option.
Fixes #6774

@jreback
Copy link
Contributor

jreback commented May 9, 2014

needs a test ( which should fail w/o your fix) and passes with

in pandas/io/tests/test_parser.py

@michaelaye
Copy link
Contributor Author

THAT's where they are hidden, was looking for parsing tests. :)
But I'm totally puzzled there: I see self.assertRaises everywhere and can't find it's definition? the ParserTests class is not derived from anything else than object so where does the self.assertRaises come from??

@jreback
Copy link
Contributor

jreback commented May 9, 2014

that class is just includes in other classes which inherit from TestCase which inherits from unit test intimately (and that's whee assertRaises is defined)

you can look at

klass.mro to see the hierarchy if u r curious

@michaelaye
Copy link
Contributor Author

Sorry, again confused. What's the idea of the separation of ParserTests and TestPythonParser? Because I see both classes performing asserts ?
Also, many test methods are defined more than once? Does it not matter because they are all executed after each other, independent of the method name, because they are tests? Feels dirty though...

@jreback
Copy link
Contributor

jreback commented May 9, 2014

when tests are used in multiple test classes they are separated to a base class

most of the tests in the parser are run 3x with different arguments (mostly the engine type eg c or python)

use -v when u run tests to see how tests are executed

@michaelaye
Copy link
Contributor Author

Not really all having different arguments. Look at these two, isn't it just a copy-paste forgotten or lack of decision if to run these 2 tests separately or within one method? The first test in first method looks like to be exactly the same as the test in the second method definition.

    def test_integer_overflow_bug(self):
        # #2601
        data = "65248E10 11\n55555E55 22\n"

        result = self.read_csv(StringIO(data), header=None, sep=' ')
        self.assertTrue(result[0].dtype == np.float64)

        result = self.read_csv(StringIO(data), header=None, sep='\s+')
        self.assertTrue(result[0].dtype == np.float64)

    def test_integer_overflow_bug(self):
        # #2601
        data = "65248E10 11\n55555E55 22\n"

        result = self.read_csv(StringIO(data), header=None, sep=' ')
        self.assertTrue(result[0].dtype == np.float64)

@jreback
Copy link
Contributor

jreback commented May 9, 2014

it's calling self.read_cav

that sets different engines on each run

@michaelaye
Copy link
Contributor Author

you mean there's something that keeps track how often a method signature was called and is using a different engine the next time it's there?
IOW, the execution of the method and its results depend on what methods have been defined before it in the source?
How would I know then what engine is used, which one is used first? Can I not control it?

@jreback
Copy link
Contributor

jreback commented May 9, 2014

that's why u define multiple classes to have the same methods called but with different arguments to test different cases

it makes it simpler then writing a tests twice

nose calls the tests in alphabetical order I think

no state is kept

it's all in each tests

idea being is it should behave identically on both parsers

their are specifc tests on one or the other parser for certain options that are only supported on one ( which are in effect uninplemented features ATM)

@jtratner
Copy link
Contributor

jtratner commented May 9, 2014

Super trivial point: unless this will be supported someday, or could be supported, you should make this a TypeError instead (for bad function signature - eg what Python throws if you pass the wrong number of arguments).

Your call on what it should be tho.

@jreback
Copy link
Contributor

jreback commented May 9, 2014

@michaelaye TypeError is better here as @jtratner points out, though I do see from your original request that this could be an enhancement proposal (e.g. n chunks from first nrows)...either ok for now

@jreback jreback added this to the 0.15.0 milestone May 9, 2014
@michaelaye
Copy link
Contributor Author

Ok, I digged deeper into the parser tests and still don't get why above method is defined twice in the class ParserTests.
I understand that only classes with Test in the front of the name are being called, and that TestCParserHighMemory, TestCParserLowMemory and TestPythonParser derive from ParserTests and that the read_csv method has different keywords and is being overwritten at the beginning of these Test-classes.
What I don't get:

  • when overwriting read_csv, I don't understand how read_csv is available in the scope of that method? Shouldn't it say return self.read_csv(...) here?
class TestCParserHighMemory(ParserTests, tm.TestCase):

    def read_csv(self, *args, **kwds):
        kwds = kwds.copy()
        kwds['engine'] = 'c'
        kwds['low_memory'] = False
        return read_csv(*args, **kwds)
  • This not explain why there would be two exactly same-named methods in ParserTests. If they would be both called, then the total number of times test_integer_overflow_bug is being called should be 6, not 3, but it is 3:
$ cat allout.txt|grep "test_integer_overflow_bug "
test_integer_overflow_bug (pandas.io.tests.test_parsers.TestCParserHighMemory) ... ok
test_integer_overflow_bug (pandas.io.tests.test_parsers.TestCParserLowMemory) ... ok
test_integer_overflow_bug (pandas.io.tests.test_parsers.TestPythonParser) ... ok

@jreback
Copy link
Contributor

jreback commented May 9, 2014

so delete the first one (looks like an older version); it doesn't matter really, because the later definition overrides the first definition

e.g.

class Foo:

     def func(....):
          ........

     def func(....):
            ....

The first 'func' never is simply overwritten

@michaelaye
Copy link
Contributor Author

Ok, will remove the first version. Don't know what the difference between override and overwrite is, and why the later definition would override the first one, but would NOT overwrite it, but I guess that's some internals I don't care now.

@jreback jreback modified the milestones: 0.14.1, 0.15.0 May 30, 2014
@jreback
Copy link
Contributor

jreback commented May 30, 2014

looks good, pls rebase on master and add a whatsnew entry for 0.14.1. (we are now just putting in 1 place)

@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

@michaelaye can you add a whatsnew entry? otherwise good to go (you can put in API changes section)

@michaelaye
Copy link
Contributor Author

Nah, still have to do the tests and learn about rebasing.

@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

ahh..rght...ok...lmk

@jreback
Copy link
Contributor

jreback commented Jun 10, 2014

update?

@jreback
Copy link
Contributor

jreback commented Jun 22, 2014

@michaelaye ?

@jreback jreback added this to the 0.15.0 milestone Jun 22, 2014
@jreback jreback removed this from the 0.14.1 milestone Jun 22, 2014
@michaelaye
Copy link
Contributor Author

i'm working on the test. saw now that read_csv is being imported into the namespace. But considering all the redefinitions of read_csv that are happening in the test module, I find that really obscuring. Having a parsers.read_csv would be much clearer and less confusing, IMHO.

I'm hampered by my main work machine being broken this week, so i'm working with this awfully slow Win7 netbook 🐌

@michaelaye
Copy link
Contributor Author

Okay, as required, I verified that my test fails when my patch is not in. Now a quick read on rebase-ing, and deciding what commit to survive and I'm done.

@michaelaye
Copy link
Contributor Author

So, I only have 2 commits to replay, is it okay to just execute a 'git rebase master' on my patch branch and push the results up into the PR?

@jreback
Copy link
Contributor

jreback commented Jun 23, 2014

yep

@jreback
Copy link
Contributor

jreback commented Jun 23, 2014

perfect

just need a short release note in v0.14.1 (ref the original issue)

then squash (if u can't no worries)

but almost the same as a rebase
just do it -i and use s

@michaelaye
Copy link
Contributor Author

one last help to understand what is the preferred git procedure:
doc/source/v0.14.1.txt is not yet in my patch branch, but of course available in master. How should I get this file into my patch branch? If I do a rebase now I would have it, but then I create another commit. If I merge master into my patch branch, then there's nothing to rebase. Please advise.

@jreback
Copy link
Contributor

jreback commented Jun 24, 2014

git rebase -i origin/master

will rebase u on top of pandas master

@michaelaye
Copy link
Contributor Author

yes, and then i have to a make a change to the what's new file, creating another commit, meaning even so I squash my previous 2 i now have to create another one.

@jreback
Copy link
Contributor

jreback commented Jun 24, 2014

that's fine
u create commits
then rebase / reorder / squash as needed

@michaelaye
Copy link
Contributor Author

Are you suggesting a second rebase after I made the changes to v0.14.1.txt?

@michaelaye
Copy link
Contributor Author

or should I cherry-pick the file, make the changes and then rebase? That's sounds quite hackish for the git log, I think.

@jreback
Copy link
Contributor

jreback commented Jun 24, 2014

u can rebase and much as u need

all rebasing does is rewrites the commit history

eg allows u to reorder and/or combine commits

@michaelaye
Copy link
Contributor Author

ok, thanks.

@jreback
Copy link
Contributor

jreback commented Jun 24, 2014

you can make whatever changes u want
it's in your branch

I think rebasing is really much nicer than merging in as it allows u to make a consistent history - purist don't like this because it rewrites the history - but I don't think it's a big deal

when I merge it will be a single merge commit that takes your commit (kind of like a cherry pick)

…hunksize.

For read_csv() the user intention most likely is to get a TextFileReader, when using the chunksize option, but simultaneous use of nrows is not implemented yet. This raises now a NotImplementedError. Test and entry to current whatsnew source (v0.14.1.txt) added.
Fixes pandas-dev#6774
@michaelaye
Copy link
Contributor Author

darn git has some empty diffs with github.. i will let u know when i'm done.. :(

@michaelaye
Copy link
Contributor Author

ok, this should be it. Thanks for your patience.

@jreback jreback modified the milestones: 0.14.1, 0.15.0 Jun 24, 2014
jreback added a commit that referenced this pull request Jun 24, 2014
adding a NotImplementedError for simultaneous use of nrows and chunksize...
@jreback jreback merged commit 647f771 into pandas-dev:master Jun 24, 2014
@jreback
Copy link
Contributor

jreback commented Jun 24, 2014

thanks @michaelaye !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv: chunksize clashes with nrows
3 participants