nrows limit fails reading well formed csv files from Australian electricity market data #7626
I meant to say - the sort of error returned is a CParserError ("Expected 120 fields in line 1443, saw 130"; the full traceback is quoted in a later comment). Which is expected - the field size changes past row 1442 - but for these files read_csv reads past the requested nrows value (1442, or anything above 823). I also tested nrows on arbitrarily created csv files built from numpy arrays but couldn't reproduce the error I get with the real data I was working with. (And apologies for poorly formed markdown above - first time posting :-)
why don't you create a test: pull the header and 2 rows from each section (then limit the number of fields), then try this using nrows to skip. If this is a bug, we would need a reproducible example.
thanks - but I'm unclear on your request; I thought I did what you asked already. I created a reproducible example with the code at the bottom of my post - admittedly in IPython rather than a straight Python file. I'm trying to extract the first section (rows 1-1442 of a 3366 row file) - this is where my problem occurs. Was my code example unclear? For reproducibility purposes, the bulk of the code deals with downloading a zip file, but the test is in the five lines from 'with thezipfile.open(fname) as csvFile:' onwards. I'm expecting it to be a subtle bug (or I'm doing something very wrong) - the nrows parameter clearly works on the various examples I threw at it that were much larger in row length. But at the same time, these electricity market files are well formed CSV files (they are part of the market data process in a live electricity market where auctions have been run every 5 minutes for the past 15 years) - and pandas is failing to parse the files I used in developing the code.
no, what I mean is we need an example that you can simply copy and paste (and not use an external URL). |
thanks - you've given me a thought: I can test it by just breaking the relevant part of the CSV file out. But if it turns out to be related to the file structure itself, I'm not sure how to provide a test without a link to a sample file. Would creating a GitHub repo with some sample csv files and a few lines of code be suitable as the test?
see what you come up with. this is either a bug, which can be reproduced via a generated type of test file (e.g. create a specific structure), a problem with the csv, or incorrect use of the options. We need a self-contained test in order to narrow down the problem. lmk what you find.
Thanks. If the file is only the relevant section (or rows to skip at the front) - no error. This implies it's not a problem with my use of the options either, I think. If the file structure includes the very next line (no. 1443) - the 130 field header for the next section - it fails with any nrows > 823. I also experimented with deleting an arbitrary (but small) number of rows at the end of the section before the next header row, to see if the issue related to that particular line ending. Again it fails. I'm not sure I can create a test file other than the sample files I've been experimenting with. I'll go and figure out how to make a GitHub repo and perhaps we can take it from there. For info - the full error at the fail point is:

/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader.read (pandas/parser.c:7146)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7547)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._read_rows (pandas/parser.c:7979)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7853)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.raise_parser_error (pandas/parser.c:19604)()
CParserError: Error tokenizing data. C error: Expected 120 fields in line 1443, saw 130
how about this for a test: create two CSV files (1442 rows by 120 cols and 5 rows by 130 cols) and concatenate them. It fails in the same way - though the nrows parameter could be much larger before failure occurred, relative to the examples above (where the files contained more strings). In the example below it fails for nrows > 1360 and works fine for lower values.
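The code example that accompanied this comment was not preserved in the thread; below is a minimal sketch of the described test, assuming the filenames test120.csv/test130.csv that appear later in the thread:

```python
import numpy as np
import pandas as pd

# Two csv files with different "shapes", as described above.
pd.DataFrame(np.random.uniform(size=(1442, 120))).to_csv('test120.csv', index=False)
pd.DataFrame(np.random.uniform(size=(5, 130))).to_csv('test130.csv', index=False)

# Concatenate them into one file, mimicking the multi-section market data files.
with open('testNrows.csv', 'w') as outfile:
    for fname in ['test120.csv', 'test130.csv']:
        with open(fname) as infile:
            outfile.write(infile.read())

# Reported behaviour: works for nrows <= 1360, but for larger values raises
# "Expected 120 fields ... saw 130", even though nrows never reaches the
# 130-column rows.
df = pd.read_csv('testNrows.csv', header=0, nrows=1361)
```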
ok great, must be a bug somewhere |
I have some time now to try and look at this bug, but not much experience. Do you have any recommendations on things I should know first? |
well it's going to be in parser.pyx IMHO; not so easy to debug cython. I would start by putting print statements in to figure out what it is doing on this file
OK - thanks |
This is a small update (and to see if any thoughts occur to you). Before I went to look at parser.pyx, I tried to generalise the test file above in order to explore row/column variations and see if there was a boundary to the error. I didn't get far in exploring row parameters before realising the error appears to occur randomly. The code below loops over the 'test' 3 times, printing the number of rows in each failed example as well as the memory size of the dataframe in the failed run. There is a different number of errors across different runs (I've seen one run with no errors at all). The dataframe memory size doesn't appear relevant - when I printed it for all tests, bigger ones passed, smaller ones failed, nothing obvious to look for. Also indicating that it's not a memory thing: a typo that set the number of columns to 12, instead of 120, got the error each and every time read_csv was called. I'll go and look at parser.pyx to see where I could put some print statements - but, as you say, it's probably in a cython call somewhere, and I'm an economist (not a programmer) who last used C sparingly 20 years ago.

import numpy as np
import pandas as pd

def test_RowCount(size_1=(1442, 120), rowCount=1361):  # original parameters where failure occurred
    # Write the "narrow" (120-column) section to its own csv file
    df_1 = pd.DataFrame(np.random.uniform(size=size_1))
    df_1.to_csv('test120.csv', index=False)

    # Create a combined csv file ('testNrows.csv') of different record lengths
    filenames = ['test120.csv', 'test130.csv']
    with open('testNrows.csv', 'w') as outfile:
        for fname in filenames:
            with open(fname) as infile:
                for line in infile:
                    outfile.write(line)

    # Read back only the narrow section; nrows should keep us clear of the 130-column rows
    try:
        df = pd.read_csv('testNrows.csv', header=0, nrows=rowCount)
    except pd.parser.CParserError as error:  # CParserError lived under pd.parser in pandas of this vintage
        print(error)
        print('Rows: ', size_1[0])
        print('Memory (MB): ', df_1.memory_usage(index=True).sum() / 1024 / 1024, '\n')
    # except:
    #     print("Unexpected error: ", sys.exc_info()[0])

### Write out 1 file of a different record length for later use in the test_RowCount function
size_2 = (1, 130)
df_2 = pd.DataFrame(np.random.uniform(size=size_2))
df_2.to_csv('test130.csv', index=False)

### Loop for testing various row counts and record lengths
for j in range(3):
    print('Run ', j)
    for i in range(1442, 1361, -1):
        # print(i)
        test_RowCount(size_1=(i, 120), rowCount=1360)
@jreback This is a real problem. It's still present in 0.19. It can be worked around by |
well if u have a reproducible example pls show it |
@jreback OK, the input file is 516 KB. Where would you like me to put it? I tried removing "unnecessary" rows from it but this bug doesn't reproduce if you shrink the file a lot.
best to put this up on a separate repo or gist, and use a URL to access. |
@jreback I have uploaded a file which reproduces this error: https://gist.githubusercontent.com/jzwinck/838882fbc07f7c3a53992696ef364f66 Simply download that file and run this:
It fails, saying:
Since we told Pandas to read from line 2195 for 100 rows, it should never have seen line 2355. |
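The exact command and error output did not survive in this thread; below is a minimal sketch of the reproduction as described, assuming the gist file is saved locally as badfile.csv (a hypothetical name) and that skiprows=2194, nrows=100 correspond to "read from line 2195 for 100 rows":

```python
import pandas as pd

# Hypothetical local filename for the gist file linked above.
# skiprows/nrows are inferred from the comment: start at line 2195, read 100 rows.
df = pd.read_csv('badfile.csv', skiprows=2194, nrows=100, header=None)

# On affected pandas versions (e.g. 0.19) this raises a CParserError about an
# unexpected field count on a line well past the requested 100 rows
# (line 2355 in the report above), instead of returning the 100 rows.
print(df.shape)
```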
closes #7626

Subsets of tabular files with different "shapes" will now load when a valid skiprows/nrows is given as an argument.

Conditions for the error:
1) There are different "shapes" within a tabular data file, i.e. different numbers of columns.
2) A "narrower" set of columns is followed by a "wider" (more columns) one, and the narrower set is laid out such that the end of a 262144-byte block occurs within it.

Issue summary: The C engine for parsing files reads in 262144 bytes at a time. Previously, the "start_lines" variable in tokenizer.c/tokenize_bytes() was set incorrectly to the first line in that chunk, rather than the overall first row requested. This led to incorrect logic on when to stop reading when nrows is supplied by the user. This always happened but only caused a crash when a wider set of columns followed in the file. In other cases, extra rows were read in but then harmlessly discarded. This pull request always uses the first requested row for comparisons, so only nrows will be parsed when supplied.

Author: Jeff Carey <[email protected]>

Closes #14747 from jeffcarey/fix/7626 and squashes the following commits:

cac1bac [Jeff Carey] Removed duplicative test
6f1965a [Jeff Carey] BUG: Corrects stopping logic when nrows argument is supplied (Fixes #7626)

(cherry picked from commit 4378f82)

Conflicts: pandas/io/tests/parser/c_parser_only.py
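To make the two conditions above concrete, here is an illustrative sketch (not taken from the thread; the sizes and the filename narrow_then_wide.csv are arbitrary) that builds a file where a 262144-byte chunk boundary falls inside a narrow 120-column section, followed by 130-column rows:

```python
import numpy as np
import pandas as pd

# A "narrow" block of rows large enough that a 262144-byte parser chunk boundary
# falls inside it, followed by a handful of "wider" rows.
narrow_rows, narrow_cols, wide_cols = 3000, 120, 130  # chosen so the csv well exceeds 256 KiB

pd.DataFrame(np.random.uniform(size=(narrow_rows, narrow_cols))).to_csv(
    'narrow_then_wide.csv', index=False)
with open('narrow_then_wide.csv', 'a') as f:
    pd.DataFrame(np.random.uniform(size=(5, wide_cols))).to_csv(f, index=False)

# Request only rows from the narrow section. On affected versions the C engine
# could keep tokenizing past nrows, hit the 130-field rows, and raise
# "Expected 120 fields ... saw 130"; after the fix only nrows rows are parsed.
df = pd.read_csv('narrow_then_wide.csv', nrows=narrow_rows - 10)
print(df.shape)
```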
Reading Australian electricity market data files, read_csv reads past the nrows limit for certain nrows values and consequently fails.
These market data files are 4 csv files combined into a single csv file, so each file has multiple headers and a variable field count across its rows.
The first set of data is from rows 1-1442.
The intent was to extract the first set of data with nrows = 1442.
Testing several arbitrary CSV files from this data source shows well formed CSV - 120 fields in rows 1 to 1442 (with a 10-field header at row 0). Counting the fields per row returns:

120    1441
10        1
dtype: int64
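The code that produced this count was not included above; a minimal sketch of one way to get a per-row field count like this, assuming the combined market file has been saved locally as market_data.csv (a hypothetical name):

```python
import csv
import pandas as pd

# Count the number of fields on each row of the file; the filename is a
# stand-in for one of the downloaded market data files.
with open('market_data.csv', newline='') as f:
    field_counts = pd.Series([len(row) for row in csv.reader(f)])

# value_counts() gives output of the form shown above:
# 120 fields on 1441 rows, 10 fields on 1 row (the file-level header).
print(field_counts.value_counts())
```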
Other Python examples of reading the market data using the csv module work fine.
In the reproducible example below, the code works for nrows < 824 but fails for any value above it.
Testing on arbitrary files suggests the 824 limit is variable - sometimes a few rows more, sometimes a few rows fewer.
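The reproducible example referred to above did not survive in this thread. Below is a rough sketch of the kind of code described in the comments (download a zipped market file, open the CSV inside it, and read the first section with nrows=1442); the URL is a placeholder and the read_csv options are assumptions:

```python
import io
import zipfile
import requests
import pandas as pd

# Placeholder URL: any of the combined Australian electricity market data
# archives described above would do here.
url = 'http://example.com/market_data.zip'

resp = requests.get(url)
thezipfile = zipfile.ZipFile(io.BytesIO(resp.content))
fname = thezipfile.namelist()[0]  # the CSV inside the archive

# The failing step: skip the 10-field row 0 and read the first section
# (rows 1-1442). On affected versions this raises a CParserError once
# nrows exceeds roughly 823.
with thezipfile.open(fname) as csvFile:
    df = pd.read_csv(io.TextIOWrapper(csvFile), skiprows=1, nrows=1442, header=None)

print(df.shape)
```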