read_csv: date_parser called once with arrays and then many times with strings (from each single row) as arguments #9376
Comments
The reason the date parser is called both on the whole column and on the individual rows is that pandas first tries the provided function as a vectorized function (on the whole column at once), and if that fails (which is the case with your function), it applies the function to each row. This way the user can provide either a vectorized or a scalar function. It could indeed be better performance-wise to call the date parser once on the whole array, but in that case you have to provide a function that can handle array input yourself.
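A minimal sketch of the "try vectorized first, then fall back to row by row" pattern described above. This is not pandas' actual implementation: apply_parser, parse_scalar and parse_vectorized are hypothetical names, plain Python lists stand in for columns, and the date format is an assumption chosen for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # assumed format, for illustration only


def apply_parser(parser, *columns):
    """Hypothetical helper: call `parser` on whole columns, else on each row."""
    try:
        # First attempt: hand the parser the complete columns at once.
        return parser(*columns), "vectorized"
    except Exception:
        # The error from the vectorized attempt is swallowed here, which is
        # why the caller never sees it. Second attempt: one call per row.
        return [parser(*row) for row in zip(*columns)], "scalar"


def parse_scalar(date, time):
    # Only works on single strings; with lists, `date + " "` raises TypeError.
    return datetime.strptime(date + " " + time, FMT)


def parse_vectorized(dates, times):
    # Works on whole columns, so it is called exactly once.
    return [datetime.strptime(d + " " + t, FMT) for d, t in zip(dates, times)]


dates = ["2015-01-01", "2015-01-02"]
times = ["10:00", "11:30"]
scalar_result, scalar_mode = apply_parser(parse_scalar, dates, times)
vector_result, vector_mode = apply_parser(parse_vectorized, dates, times)
```

With the scalar parser, the first call fails and the function is then invoked once per row; with the vectorized parser, a single call is enough. Both produce the same parsed values.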
BTW, the reason your function fails for vectorized input is that the addition of the different parts (
Thank you, I understand the problem now, and I think pandas could have handled this better (by informing me of the error). I never knew something went wrong, because pandas caught this error silently. If this small fact could be added to the docs (for read_csv and any other similar functions), that would help immensely. Or even better, pandas could warn me (though I see the problem with implementing that given the current usage).
@cmeeren In this case, I don't think it is entirely pandas' fault, as you yourself silenced the warning by adding the bare except and then falling back to the string version of your function. So you would never have seen the numpy TypeError anyway (coincidentally, I just read a blog post about that yesterday :-) https://realpython.com/blog/python/the-most-diabolical-python-antipattern/, and I agree, we still use it too much in pandas as well). E.g. if you limit your function to only the first part like
you would already get a more informative error ("AttributeError: 'str' object has no attribute 'astype'"), but I agree, you don't see the error from the vectorized try (the error is for the scalar case). The easiest way, if you have an error, is to just try the But you are certainly welcome to add a clarification to the docs about the vectorized/scalar behaviour.
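To illustrate the point about the bare except, here is a small sketch. It is not the original function from this issue: parse_silently and parse_loudly are hypothetical names, and the only assumption is that the array path relies on an .astype method, which a plain str does not have.

```python
def parse_silently(value):
    """Hypothetical parser that hides every failure of the array path."""
    try:
        return value.astype(int)  # array path: needs an object with .astype
    except:  # bare except: any error, including bugs, vanishes here
        return int(value)  # string path, used as a silent fallback


def parse_loudly(value):
    """Same array path, but without the safety net."""
    return value.astype(int)


# The silent version quietly takes the string path for scalar input.
silent_result = parse_silently("2015")

# The loud version surfaces the problem for scalar input.
try:
    parse_loudly("2015")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```

Calling the function by hand like this, once with a single string and once with an array, is a quick way to see which of the two input styles it actually supports.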
@jorisvandenbossche My original function did indeed look like what you suggest. However, I did not think it would be fruitful to try the function on some arrays manually, since I had no idea how pandas called the function and this was not mentioned in the docs. I have now submitted a pull request (#9377).
Small GitHub-related question: If this PR is merged, can I safely delete my fork?
Indeed, once it is merged, you can delete it (there will even be a button on GitHub to do that on your remote origin).
Pandas version info:
Consider the following script, containing data to be loaded, a date parsing function, and pd.read_csv():
The print statement in parse_date outputs
With parse_date as written above, the DataFrame is loaded properly, even when the first return statement in parse_date is changed to return None:
It strikes me as odd that the date parser is called with both whole columns and individual rows. I could not find any documentation regarding this behaviour. It immediately raises two questions in my mind: