
nrows limit fails reading well formed csv files from Australian electricity market data #7626


Closed
ChristopherShort opened this issue Jul 1, 2014 · 18 comments
Labels: Bug, IO CSV (read_csv, to_csv)

Comments

@ChristopherShort

When reading Australian electricity market data files, read_csv reads past the nrows limit for certain nrows values and consequently fails.

These market data files are four CSV files combined into a single file, so the file has multiple headers and the number of fields varies across rows.

The first set of data is from rows 1-1442.

The intent was to extract the first set of data with nrows=1442.

Testing several arbitrary CSV files from this data source shows well-formed CSV: 120 fields in rows 1 to 1442 (with a 10-field row at row 0).

# csvFile is the open file object for the extracted CSV (see the example below)
lines = [len(line.strip().split(',')) for i, line in enumerate(csvFile) if i < 1442]
s = pd.Series(lines)
print(s.value_counts())

returns
120 1441
10 1
dtype: int64

Other Python examples that read the market data using the csv module work fine.

In the reproducible example below, the code works for nrows < 824 but fails for any value above that.

Testing on arbitrary files suggests the 824 limit varies - sometimes a few rows more, sometimes a few rows fewer.

import requests, io, zipfile
import pandas as pd

url = 'http://www.nemweb.com.au/Reports/CURRENT/Public_Prices/PUBLIC_PRICES_201406290000_20140630040528.zip'

# get the zip archive
request = requests.get(url)

# make the archive available as a byte stream
zipdata = io.BytesIO()
zipdata.write(request.content)
thezipfile = zipfile.ZipFile(zipdata, mode='r')

# there is only one csv file per archive - read it into a pandas DataFrame
fname = thezipfile.namelist()[0]

# works for nrows <= 823
with thezipfile.open(fname) as csvFile:
    df1 = pd.read_csv(csvFile, header=1, index_col=4, parse_dates=True, nrows=823)
    print(df1.head())

# fails for nrows > 823 (reopen the file so the second read starts from the top)
with thezipfile.open(fname) as csvFile:
    df1 = pd.read_csv(csvFile, header=1, index_col=4, parse_dates=True, nrows=824)
    print(df1.head())
@ChristopherShort
Author

I meant to add - the error returned is:

CParserError: Error tokenizing data. C error: Expected 120 fields in line 1443, saw 130

The error itself is expected - the field count changes past row 1442 - but with nrows set at or below 1442 (or even 823), the parser should never read that far.

I also tested nrows on CSV files created arbitrarily from numpy arrays, but couldn't reproduce the error I get with the real data.

(And apologies for poorly formed markdown above - first time posting :-)

@jreback
Contributor

jreback commented Jul 1, 2014

why don't you create a test: pull the header and 2 rows from each section (and limit the number of fields), then try reading it with nrows. If this is a bug, we would need a reproducible example.

@ChristopherShort
Author

thanks - but I'm unclear on your request - that is, I thought I did what you asked already.

I created a reproducible example with the code at the bottom of my post - admittedly in IPython rather than as a plain Python script.

I'm trying to extract the first section (rows 1-1442 of a 3366 row file) - this is where my problem occurs.

Was my code example unclear?

For reproducibility purposes, the bulk of the code deals with downloading a zip file, but the actual test is the few lines from 'with thezipfile.open(fname) as csvFile:' onwards.

I'm expecting it to be a subtle bug (or I'm doing something very wrong) - the nrows parameter clearly works on the various other examples I threw at it, including much longer files.

But at the same time, these electricity market files are well-formed CSV files (they are part of the market data process in a live electricity market where auctions have been run every 5 minutes for the past 15 years) - and pandas fails to parse the files I used while developing the code.

@jreback
Contributor

jreback commented Jul 1, 2014

no, what I mean is we need an example that you can simply copy and paste (and not use an external URL).
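To illustrate the kind of copy-pasteable test being asked for, here is a minimal sketch with made-up inline data (the column names and values are hypothetical; a sample this small may not actually trigger the failure, which, as the thread later shows, depends on the size of the file):

import io
import pandas as pd

# hypothetical inline data: a narrow 3-column section followed by a
# wider 5-column section, mimicking the structure of the market files
data = io.StringIO(
    "a,b,c\n"
    "1,2,3\n"
    "4,5,6\n"
    "x,y,z,u,v\n"
    "1,2,3,4,5\n"
)

# request only the narrow section; the parser should not need to
# tokenize the wider rows below it
df = pd.read_csv(data, nrows=2)
print(df)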

@ChristopherShort
Author

thanks - you've given me an idea: I can test it by breaking just the relevant part of the CSV file out.

But if it turns out to be related to the file structure itself, I'm not sure how to provide a test without a link to a sample file. Would a GitHub repo with some sample CSV files and a few lines of code be suitable as the test?

@jreback
Contributor

jreback commented Jul 1, 2014

see what you come up with. this is either a bug (which should be reproducible with a generated test file of a specific structure), a problem with the csv itself, or incorrect use of the options.

We need a self-contained test in order to narrow down the problem.

lmk what you find.

@ChristopherShort
Author

Thanks.

If the file contains only the relevant section (or just rows to skip at the front) - no error.

That implies it's not a problem with my use of the options either, I think.

If the file includes the very next line (no. 1443) - the 130-field header for the next section - it fails for any nrows > 823.

I also experimented with deleting a small, arbitrary number of rows at the end of the section, just before the next header row, to see if the issue related to that particular line ending. It still fails.

I'm not sure I can create a test file - other than the sample files I've been experimenting with.

I'll go and figure out how to make a GitHub repo and perhaps we can take it from there.

For info - the full error at the fail point is:

/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
1128
1129 try:
-> 1130 data = self._reader.read(nrows)
1131 except StopIteration:
1132 if nrows is None:

/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader.read (pandas/parser.c:7146)()

/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7547)()

/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._read_rows (pandas/parser.c:7979)()

/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7853)()

/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.raise_parser_error (pandas/parser.c:19604)()

CParserError: Error tokenizing data. C error: Expected 120 fields in line 1443, saw 130

@ChristopherShort
Author

How about this for a test:

  1. Create two CSV files (1442 rows by 120 cols and 5 rows by 130 cols)
  2. Concatenate them
  3. Read the joined CSV file back into a DataFrame with the nrows option <= 1442

It fails in the same way - though nrows could be much larger before failure occurred than in the examples above (where the files contained more strings).

In the example below it fails for nrows > 1360 and works fine for lower values.

import numpy as np
import pandas as pd

# two CSV files with different record lengths
pd.DataFrame(np.random.uniform(size=1442*120).reshape(1442, 120)).to_csv('test120.csv', index=False)
pd.DataFrame(np.random.uniform(size=5*130).reshape(5, 130)).to_csv('test130.csv', index=False)

# concatenate them into a single file
filenames = ['test120.csv', 'test130.csv']
with open('testNrows.csv', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

# fails for nrows > 1360
df = pd.read_csv('testNrows.csv', nrows=1361)

@jreback jreback added this to the 0.15.0 milestone Jul 3, 2014
@jreback
Contributor

jreback commented Jul 3, 2014

ok great, must be a bug somewhere

@ChristopherShort
Author

I have some time now to try and look at this bug, but not much experience.

Do you have any recommendations on things I should know first?

@jreback
Contributor

jreback commented Oct 27, 2014

well it's going to be in parser.pyx

IMHO it's not so easy to debug Cython

I would start by putting in print statements to figure out what it is doing on this file

@ChristopherShort
Author

OK - thanks

@ChristopherShort
Author

This is a small update (and to see if any thoughts occur to you).

Before I went to look at parser.pyx, I tried to generalise the test file above in order to explore row/column variations and see whether there was a boundary to the error.

I didn't get far in exploring row parameters before realising the error appears to occur randomly.

The code below loops over the test 3 times, printing the number of rows and the memory size of the dataframe for each failed run.

The number of errors differs between runs (I've seen a run with no error at all).

The dataframe memory size doesn't appear relevant - when I printed it for all tests, bigger ones passed and smaller ones failed; nothing obvious to look for.

Another indication that it's not a memory thing: a typo that set the number of columns to 12 instead of 120 produced the error every single time read_csv was called.

I'll go look at parser.pyx to see where I could put some print statements - but, as you say, it's probably in a Cython call somewhere, and I'm an economist (not a programmer) who last used C, sparingly, 20 years ago.

import numpy as np
import pandas as pd

def test_RowCount(size_1=(1442, 120), rowCount=1361):   # original parameters where failure occurred

    df_1 = pd.DataFrame(np.random.uniform(size=size_1))
    df_1.to_csv('test120.csv', index=False)

    # create a combined csv file ('testNrows.csv') of different record lengths
    filenames = ['test120.csv', 'test130.csv']
    with open('testNrows.csv', 'w') as outfile:
        for fname in filenames:
            with open(fname) as infile:
                for line in infile:
                    outfile.write(line)

    try:
        df = pd.read_csv('testNrows.csv', header=0, nrows=rowCount)
    except pd.parser.CParserError as error:
        print(error)
        print('Rows: ', size_1[0])
        print('Memory (MB): ', df_1.memory_usage(index=True).sum() / 1024 / 1024, '\n')


### Write out 1 file of a different record length for later use in test_RowCount
size_2 = (1, 130)
df_2 = pd.DataFrame(np.random.uniform(size=size_2))
df_2.to_csv('test130.csv', index=False)


### Loop for testing various row counts and record lengths
for j in range(3):
    print('Run ', j)
    for i in range(1442, 1361, -1):
        test_RowCount(size_1=(i, 120), rowCount=1360)

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jzwinck
Contributor

jzwinck commented Nov 21, 2016

@jreback This is a real problem. It's still present in 0.19. It can be worked around by passing engine='python', but that is not a real solution (a sketch of the workaround follows the links below). Stack Overflow has now discovered this problem at least twice:

  1. http://stackoverflow.com/questions/25985817/why-is-pandas-read-csv-not-reading-the-right-number-of-rows
  2. http://stackoverflow.com/questions/37040634/pandas-read-csv-with-engine-c-issue-bug-or-feature
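A sketch of that workaround against the file from the original report (the CSV file name here is a placeholder, and nrows=1000 is simply a value above the ~823 threshold at which the C engine was reported to fail):

import pandas as pd

# workaround (slower): the Python engine stops at nrows without
# tokenizing the wider rows further down the file
df = pd.read_csv('your_prices_file.csv',   # placeholder file name
                 header=1, index_col=4, parse_dates=True,
                 nrows=1000, engine='python')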

@jreback
Contributor

jreback commented Nov 21, 2016

well if you have a reproducible example, please show it

@jzwinck
Contributor

jzwinck commented Nov 21, 2016

@jreback OK, the input file is 516 KB. Where would you like me to put it? I tried removing "unnecessary" rows from it but this bug doesn't reproduce if you shrink the file a lot.

@jreback
Contributor

jreback commented Nov 21, 2016

best to put this up in a separate repo or gist, and share a URL to access it.

@jzwinck
Contributor

jzwinck commented Nov 22, 2016

@jreback I have uploaded a file which reproduces this error: https://gist.githubusercontent.com/jzwinck/838882fbc07f7c3a53992696ef364f66

Simply download that file and run this:

import pandas as pd
pd.read_table('pandas_issue_7626.txt', skiprows=2195, nrows=100)

It fails, saying:

File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:9884)
File "pandas/parser.pyx", line 880, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10347)
File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:10870)
File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10741)
File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:25878)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 6 fields in line 2355, saw 14

Since we told Pandas to skip 2195 lines and then read 100 rows, it should never have seen line 2355 (2195 skipped lines + 1 header line + 100 data rows only reaches line 2296).

jeffcarey added commits to jeffcarey/pandas that referenced this issue Nov 25 - Nov 29, 2016
jeffcarey added a commit to jeffcarey/pandas that referenced this issue Dec 2, 2016
…andas-dev#7626)

Fixed code formatting

Added test to C Parser Only suite, added whatsnew entry
@jreback jreback modified the milestones: 0.19.2, Next Major Release Dec 5, 2016
@jreback jreback closed this as completed in 4378f82 Dec 6, 2016
jorisvandenbossche pushed a commit that referenced this issue Dec 15, 2016
closes #7626

Subsets of tabular files with different "shapes" will now load when a valid skiprows/nrows is given as an argument.

Conditions for error: 1) There are different "shapes" within a tabular data file, i.e. different numbers of columns. 2) A "narrower" set of columns is followed by a "wider" (more columns) one, and the narrower set is laid out such that the end of a 262144-byte block occurs within it.

Issue summary: The C engine for parsing files reads in 262144 bytes at a time. Previously, the "start_lines" variable in tokenizer.c/tokenize_bytes() was set incorrectly to the first line in that chunk, rather than the overall first row requested. This led to incorrect logic on when to stop reading when nrows is supplied by the user. This always happened but only caused a crash when a wider set of columns followed in the file. In other cases, extra rows were read in but then harmlessly discarded. This pull request always uses the first requested row for comparisons, so only nrows rows will be parsed when supplied.

Author: Jeff Carey <[email protected]>

Closes #14747 from jeffcarey/fix/7626 and squashes the following commits:

cac1bac [Jeff Carey] Removed duplicative test
6f1965a [Jeff Carey] BUG: Corrects stopping logic when nrows argument is supplied (Fixes #7626)

(cherry picked from commit 4378f82)

 Conflicts:
	pandas/io/tests/parser/c_parser_only.py
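As a quick sketch of the trigger condition described in the commit message above, one can check where the parser's first 262144-byte block ends relative to the narrow section, using the testNrows.csv file generated earlier in this thread:

# 262144 bytes is the block size the C parser reads at a time,
# per the commit message above
CHUNK = 262144

with open('testNrows.csv', 'rb') as f:
    first_block = f.read(CHUNK)

# if this count is less than the number of narrow (120-column) rows, the
# block boundary falls inside the narrow section - the layout that, combined
# with the wider rows that follow, triggered the crash on affected versions
print(first_block.count(b'\n'))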