Duplicated rows when using fetchmany_arrow method #286
Thanks for the fulsome write-up. I'm writing here to acknowledge that we've seen this and are working up a reproduction. Can you please share your version of Python and of this library?
Ah, I forgot to include those:
Thanks for the quick reply! Let me know if there's any other information I can provide.
@susodapop Totally off-topic, but you should probably look up 'fulsome' before you accidentally insult someone with it, like I did once in front of over 200 people. (-:
Fair point, @Adomatic ;) Let's just say I'm trying to accelerate the revival of its positive connotations as discussed here:
Hi @morleytj, is this issue still occurring? I have not been able to reproduce it.
Hi @andrefurlan-db, I've just rerun my pipeline and the issue is still occurring. It very consistently duplicates the header line, which gives me an easy way to check, and it happens even in the new queries I've added, so it doesn't seem to be tied to anything specific to a given query. Is there any additional information I could provide to help reproduce it? I'm happy to share other package versions or environment info; if it's relevant, the cluster I'm running these scripts on uses CentOS 7.
For example, I've just pulled a file in this manner which has an individual's ID and a couple of summary variables, such as the most recent date associated with that individual (each line should be unique, since the query groups by ID). The file that was pulled had 3,738,958 lines and, after dropping duplicates, 3,560,930 lines, coming out to 178,028 duplicate lines. The query is along the lines of the sketch below.
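A rough illustration of a query of that shape; the table and column names here are hypothetical, since the exact query was not included:

```python
# Hypothetical sketch only: "events", "event_date", and the column names are
# illustrative, not the actual schema from the report.
query = """
    SELECT
        id,
        MAX(event_date) AS most_recent_date,
        COUNT(*)        AS n_records
    FROM events
    GROUP BY id
"""
```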
Checking in to update that this issue is still occurring as of April 2024, and I'm curious whether anyone has replicated it.
Some extra context I noticed today: there has very consistently been an extra header line generated whenever rows were duplicated. One of the files I pulled today didn't have that extra header line, so I checked, and it indeed didn't have any duplicates, though the other two files did. This file is the smallest of the three I pulled and has only two columns. At first I thought this might be because the batch size was larger than the retrieved table, but that isn't the case: the batch size was 10,000 and the deduplicated table has 37,616 rows. Unsure if it's relevant, but in the interest of providing all information, this unduplicated table consists of two columns of unique IDs, and it's a SELECT DISTINCT of two of the columns from one of the two larger queries, one of which is duplicated.
@morleytj can you please try to pass use_cloud_fetch=False to sql.connect() and check whether the duplicates still appear?
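For reference, a minimal sketch of opening a connection with CloudFetch disabled; the hostname, HTTP path, and token are placeholders:

```python
from databricks import sql

# Placeholder connection details; the relevant part is use_cloud_fetch=False,
# which makes the connector skip the CloudFetch download path.
connection = sql.connect(
    server_hostname="<workspace-hostname>",
    http_path="<http-path>",
    access_token="<personal-access-token>",
    use_cloud_fetch=False,
)
```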
@kravets-levko I just ran my pipeline with that argument in the connection call, and the output files no longer contain any duplicated rows.
Thank you @morleytj! This indicates that we probably have one more issue with the CloudFetch feature, which is sad. But at least we have a direction. I'll ask you to test it a bit more with CloudFetch disabled, just to make sure it indeed helps. If you see duplicated rows again, please let me know.
Unfortunate, but thank you for the help in identifying the source of the error! The pipeline I'm using runs daily, so I will keep you updated on the health of the output.
@kravets-levko is this already resolved? I was still using 2.9.6 and playing around with enabling cloud fetch, which gives improved performance, but if it returns incorrect output I probably need to avoid it for now.
Hello here 👋 We just released v3.3.0, which includes a refactoring of the CloudFetch-related code. Please give it a try and let me know whether it helps with your issues (remember to enable CloudFetch via use_cloud_fetch=True). If you still see any issues, please enable debug logging (see #383 (comment)) and share the log output. Thank you!
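A minimal sketch of turning on debug logging for the connector, assuming the library logs under the "databricks.sql" logger namespace (the linked comment in #383 has the maintainers' exact instructions):

```python
import logging

# Emit connector log records to stderr at DEBUG level; this assumes the library
# logs under the "databricks.sql" logger namespace.
logging.basicConfig(level=logging.INFO)
logging.getLogger("databricks.sql").setLevel(logging.DEBUG)
```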
Hello, I've been using this package to automate some SQL pulldowns of a fairly large dataset, but have realized after running it that the fetchmany_arrow() method is potentially overlapping its returned results. I have included details of the code I am running and the results below.
MWE
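The original code block was not preserved here; the following is a rough sketch of the kind of pulldown loop described, with placeholder connection details, query, and write step. The batch size of 10,000 matches the report.

```python
# Rough sketch of the kind of pulldown loop described in this issue; hostname,
# token, query, and output path are placeholders, not the actual values used.
from databricks import sql

BATCH_SIZE = 10_000

with sql.connect(
    server_hostname="<workspace-hostname>",
    http_path="<http-path>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT id, variable, value FROM source_table")  # placeholder query
        first_batch = True
        while True:
            batch = cursor.fetchmany_arrow(BATCH_SIZE)  # returns a pyarrow.Table
            if batch.num_rows == 0:
                break
            # Append each batch to the output file; write the header only once.
            batch.to_pandas().to_csv(
                "output.tsv", sep="\t", mode="a", header=first_batch, index=False
            )
            first_batch = False
```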
Example data format
The data is stored in the table in a long format.
Error
The resulting file written to disk has duplicated rows in it, including the header line.
Interestingly, the first duplicated row appears exactly at index 10000 (the same size I used for fetchmany_arrow). The first duplicated value is not the same as the first value in the first batch, so it is not a complete repeat of that batch; the duplicated row's first occurrence is at index 9216. I wonder whether this indicates some level of overlap between the fetch calls. This is also supported by the fact that all of the following rows, for a set number of rows, are duplicated.
The total number of duplicates is 3,849,407, representing ~0.05% of the total number of records.
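A sketch (not from the thread) of locating the duplicated rows in the written file and comparing their positions with the 10,000-row batch size; the file name and separator are assumptions:

```python
# Locate duplicated rows in the written output and report where they start;
# "output.tsv" and the tab separator are assumptions.
import pandas as pd

df = pd.read_csv("output.tsv", sep="\t")

dup_mask = df.duplicated(keep="first")   # True for every repeated occurrence
n_duplicates = int(dup_mask.sum())
first_dup_idx = int(dup_mask.idxmax())   # e.g. 10000 in the report

# Index of the earlier row that the first duplicate repeats (e.g. 9216 in the report).
first_occurrence_idx = int(df[df.eq(df.loc[first_dup_idx]).all(axis=1)].index[0])

print(n_duplicates, first_dup_idx, first_occurrence_idx)
```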
Initial investigation of potential error sources:
I have checked to make sure the duplicates are not coming from the database side by running a duplicate-check query along the lines of the sketch below.
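A sketch of that kind of duplicate check, reusing an open cursor as in the connection sketch above; the table and column names are hypothetical:

```python
# Count fully identical rows on the database side; returns no rows if the
# source table itself contains no duplicates. Names are hypothetical.
cursor.execute("""
    SELECT id, variable, value, COUNT(*) AS n
    FROM source_table
    GROUP BY id, variable, value
    HAVING COUNT(*) > 1
""")
print(cursor.fetchall())  # an empty list means no duplicates in the source table
```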
However, this returns an empty set, indicating that the duplicates are being generated during retrieval or writing. The possible sources of the error are therefore the writing step or the retrieval step, and given that duplicated rows appear at the start of each batch, I believe the error originates somewhere in the fetchmany_arrow() call.
Hopefully the error is not in my code somewhere, haha, and hopefully this is helpful in tracking down the potential issue.
Best,
Theodore