Skip to content

Merge() without a common column (LARGE DATASET, so no fake_key=1) #21742

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lucasmengual92 opened this issue Jul 5, 2018 · 4 comments
Closed
Labels
Duplicate Report Duplicate issue or pull request Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question

Comments

@lucasmengual92
Copy link

Code Sample, a copy-pastable example if possible

#RIPE FEATURES
    start_decimal_ip  end_decimal_ip  isSL  ...    isSUBNET  isINTERNET  isISP
0          621457152       621457663     1  ...           0           1      0
1         3273214112      3273214143     0  ...           0           0      0
2         1280513280      1280513535     1  ...           1           0      0
3         3581584960      3581584975     0  ...           0           0      0
4         3564859176      3564859179     1  ...           0           0      0
5         1049781248      1049782271     1  ...           0           0      0
6         3262479184      3262479194     0  ...           0           0      0

#IPS DATA
  target              ip  decimal_ip
0      1    62.184.112.1  1052274689
1      1     194.30.89.1  3256768769
2      1  194.117.123.83  3262479187
3      1  195.235.249.73  3287021897

Problem description

I want to marge two dataframes, one table (ips_data) with the IP, IP in decimal format, target (if isp or not). The other table (ripe_features) has the start and end point of an IP range (in decimal), and several values. My goal is to obtain the values of ripe_features when the IP of ips_data is between the range. I did a first attampt creating a fake_key =1, but my ripe_features database is large (over 4 million rows), so this method seems unhealthy.

I want to know if there´s a better way to dod this kind of merge. I was thinking of doing a new list/column 'ip' in the ripe_features with for() loop for each IP in start/end range to return the IP, else return null, and then do a merge with on='ip'.

Expected Output

target,ip,isSL,isSA,isADSL,isDSL,isWIFI,isHOTSPOT,isBLOCK,isSUBNET,isINTERNET,isISP
1,194.117.123.83,0,1,0,0,0,0,0,0,0,0
1,195.55.122.48,1,0,0,0,0,0,0,0,0,0
1,37.15.99.177,0,0,1,0,0,0,0,0,0,1
@gfyoung gfyoung added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Performance Memory or execution speed performance Usage Question labels Jul 5, 2018
@gfyoung
Copy link
Member

gfyoung commented Jul 5, 2018

cc @jreback

@jreback
Copy link
Contributor

jreback commented Jul 5, 2018

better to ask this question on SO; there are some examples of doing this type of interval based merging

cc @jschendel

@lucasmengual92
Copy link
Author

Yeah, I've already posted in SO too. thanks!

@chris-b1
Copy link
Contributor

folding into #21901

@chris-b1 chris-b1 added the Duplicate Report Duplicate issue or pull request label Aug 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question
Projects
None yet
Development

No branches or pull requests

4 participants