WEB: Obfuscate workgroup email addresses #51209

Closed
datapythonista opened this issue Feb 7, 2023 · 9 comments · Fixed by #51266

@datapythonista
Member

We recently added our workgroup email addresses to our website: https://pandas.pydata.org/about/team.html#workgroups

While this has been useful, and we received relevant emails from people who otherwise wouldn't know how to contact us easily, we also started receiving spam. I'm unsure whether the spam is generated manually by people ending up on our website, or by bots fetching our email addresses automatically. But in case it's the latter, I think it'd be good to see if we can easily obfuscate the email addresses in the HTML code.

I guess there are many options, but one very easy thing that could stop some of the spam would be to prepend a string to the email addresses in the HTML and then remove it via JavaScript. This won't help against spammers collecting our addresses manually, or using scrapers with JavaScript support like Selenium, but with some luck most of the spam comes from simpler bots that just fetch the HTML.

The idea would be that for example if the address is [email protected], the html generated from the markdown is something like <a href="mailto:[email protected]">[email protected]</a>, and then we have a simple javascript block that removes the no and makes the final html rendered to the user contain the right address.
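The prepend-and-strip idea can be sketched in Python. This is only an illustration under assumptions: the address `team@example.org`, the helper names, and the exact placement of the `no` prefix are hypothetical. On the real site the stripping step would run as JavaScript in the browser; `strip_prefix` here only mimics what such a script would compute.

```python
import re

PREFIX = "no"  # assumed junk prefix, to be stripped on the client side


def obfuscated_link(address: str, prefix: str = PREFIX) -> str:
    """Build the anchor that would be written into the static HTML."""
    fake = prefix + address
    return f'<a href="mailto:{fake}">{fake}</a>'


def strip_prefix(html: str, prefix: str = PREFIX) -> str:
    """Mimic the browser-side script: drop the prefix after 'mailto:' and after '>'."""
    return re.sub(rf"(mailto:|>){re.escape(prefix)}", r"\1", html)


raw = obfuscated_link("team@example.org")  # hypothetical address
print(raw)                # what a bot fetching the plain HTML would see
print(strip_prefix(raw))  # what the visitor would see after the script runs
```

A bot that only reads the served HTML would collect the prefixed address, which does not exist, while the markdown source needs nothing more than the prefix added to each address.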

This is the file where this should be implemented: https://github.com/pandas-dev/pandas/blob/main/web/pandas/about/team.md#-workgroupname-

@Kabiirk
Contributor

Kabiirk commented Feb 7, 2023

Hi,

First-time contributor to pandas here. There are many ways to obfuscate emails on websites and prevent bots from scraping them, for example:

  • address[at]pandas[dot]pydata[dot]org (this would be text on the website itself, but it means more work for the user to copy it and replace these characters, and I think bots can just do the same replacement)
  • Use special HTML characters to our advantage (this doesn't guarantee protection, but would reduce bot scraping a bit):
    <a href="mailto:address&commat;pandas&period;pydata&period;com"> user&commat;domain&period;com</a>
  • Encode it completely like :
    <a href="&#x6d;&#x61;&#x69;&#x6c;&#x74;&#x6f;&#x3a;&#x62;&#x65;&#x6e;&#x75;&#x74;&#x7a;&#x65;&#x72;&#x40;&#x64;&#x6f;&#x6d;&#x61;&#x69;&#x6e;&#x2e;&#x64;&#x65;">email</a>.
    
  • or other methods (WIP, will see what I can find)
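As a rough illustration of the entity-encoding options in the list above, the sketch below encodes a link target as hexadecimal character references and then decodes it again. The address and the helper name `to_hex_entities` are hypothetical. It also shows why this approach offers limited protection: any scraper that decodes entities recovers the address in a single call.

```python
import html


def to_hex_entities(text: str) -> str:
    """Encode every character as a hexadecimal HTML character reference."""
    return "".join(f"&#x{ord(ch):x};" for ch in text)


address = "user@example.org"  # hypothetical address
encoded = to_hex_entities("mailto:" + address)
link = f'<a href="{encoded}">email</a>'
print(link)

# A scraper that decodes entities gets the original text back.
decoded = html.unescape(encoded)
print(decoded)
```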

A few resources which I found were as follows :

As per my understanding of the scope of work for this issue, I would need to edit the Markdown file mentioned by you to parse the text differently but display it as close to the email id as possible. If so, please let me know if I can work on this issue.

Thanks & Regards

@datapythonista
Member Author

Thanks for the help @Kabiirk. What you say is correct, just keep in mind these goals:

  • Prevent scrapers from getting our correct email addresses
  • Allow visitors of the website to find and use our email addresses easily
  • Keep things easy in our code/markdown so maintenance is straightforward

Some of the things you mention make a lot of sense, but seem to overcomplicate things, since they'd require writing code that does the encoding or transformation of the email address in our web generator script. That's why I thought that just prepending some text to the addresses was a better idea. In any case, it's great if you can work on this, and I'm open to ideas, just keep in mind those goals. Thanks!

@Kabiirk
Contributor

Kabiirk commented Feb 7, 2023

Thanks,
Will keep these goals in mind.

To assign this issue to me, do I need to use a TAKE command in this thread?

@datapythonista
Member Author

I assigned it to you. For next time, yes, you need to write just take (lowercase) in a comment (that's a hack we implemented since GitHub won't allow you to assign directly).

@Kabiirk
Contributor

Kabiirk commented Feb 7, 2023

Thanks,
Understood

@Kabiirk
Contributor

Kabiirk commented Feb 7, 2023

Hi,

Facing some challenges

Challenge 1

While building the website from source (pandas/web) with the command below:

C:/path_to_local_pandas_fork/pandas/web> python pandas_web.py pandas

The static site is being generated, but looks like this
image

while the Official site looks like this:
image

Potential Reason

image

Way Forward

I think this is a rate-limiting and CSS issue. Since my main work is with the emails, I don't think this should be a problem, so I'll carry on with my work. But since I am going to run this command frequently during testing, I hope there will be no issues if I do that?


Challenge 2

Also, while initially building the static site, I got the following error at 4 instances:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

The lines of code that caused these errors were:

pandas/web/pandas_web.py

Lines 113 to 114 in 11d856f

with open(os.path.join(posts_path, fname)) as f:
    html = md.convert(f.read())

pandas/web/pandas_web.py

Lines 343 to 344 in 11d856f

with open(config_fname) as f:
    context = yaml.safe_load(f)

pandas/web/pandas_web.py

Lines 415 to 416 in 11d856f

with open(os.path.join(source_path, fname)) as f:
    content = f.read()

pandas/web/pandas_web.py

Lines 425 to 426 in 11d856f

with open(os.path.join(target_path, fname), "w") as f:
    f.write(content)

So I did some troubleshooting and found out that this error occurs because we aren't telling the open() call which codec to use when reading the file. Because of this, the file is opened with the system default codec, which is OS dependent.

Potential Reason

OS : Windows 10 Home Single Language
This may be because my OS's default character encoding is not utf-8.

Possible Fix [This has only been implemented in my local Branch] :

In all 4 instances, I modified pandas_web.py to specify the character encoding when opening the file, i.e. explicitly telling open() that we are reading a utf-8 encoded file. This solved the issue for me. For example, I did something like:

with open(filepath, encoding='utf8') as f:
    content = f.read()
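The failure and the fix can be reproduced on any OS with a small sketch. It assumes the Windows default codec in question is cp1252 (the source only says it is not utf-8), and the file name and contents are made up for the demonstration.

```python
import os
import tempfile

# Write a small UTF-8 file containing a non-ASCII character.
# "Á" is encoded in UTF-8 as the bytes C3 81, and byte 0x81 is undefined in cp1252.
path = os.path.join(tempfile.mkdtemp(), "sample.md")
with open(path, "w", encoding="utf-8") as f:
    f.write("Álvaro\n")

# Simulate reading with a Windows default codec (assumed here to be cp1252).
try:
    with open(path, encoding="cp1252") as f:
        f.read()
    legacy_ok = True
except UnicodeDecodeError as exc:
    legacy_ok = False
    print(f"cp1252 failed: {exc}")

# Naming the codec explicitly makes the read independent of the OS default.
with open(path, encoding="utf-8") as f:
    content = f.read()
print(content)
```

The same `encoding` argument applies when opening a file for writing, as in the fourth instance listed above.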

Way Forward

After I am done with the current issue I am working on, should I open a separate issue for this?

@Kabiirk
Contributor

Kabiirk commented Feb 8, 2023

Hi,

I have implemented a JavaScript based solution for protecting workgroup emails:
image

This looks the same as the current workgroup email on the website. The mailto: link also works when I hover over and click on the email:
image

To test this, I wrote a Web Scraper in BeautifulSoup which was not able to detect Workgroup email IDs (both after mailto: & between <a></a> tags) when webpage used my JavaScript implementation to write emails. The result is as follows :
image

Note : This will not stop all scraping bots, but should be able to stop a lot of them while at the same time be simple to maintain and easily accessible by the end-user.
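A check of this kind could look roughly like the sketch below. The original test used BeautifulSoup; this version uses only the standard library so that it runs without extra dependencies. The obfuscated page is a hypothetical stand-in, not the implementation from the PR, and the address is made up.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class MailtoCollector(HTMLParser):
    """Collect the targets of mailto: links, as a simple bot might."""

    def __init__(self):
        super().__init__()
        self.mailtos = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("mailto:"):
                    self.mailtos.append(value[len("mailto:"):])


def scrape(page: str) -> set:
    """Return addresses found via mailto: links or a regex over the raw HTML."""
    parser = MailtoCollector()
    parser.feed(page)
    return set(parser.mailtos) | set(EMAIL_RE.findall(page))


plain_page = '<a href="mailto:team@example.org">team@example.org</a>'
# Hypothetical obfuscated page: the address only exists after the script runs.
script_page = (
    '<span id="contact"></span>'
    '<script>var a = ["team", "example.org"].join("@");'
    'var l = document.createElement("a");'
    'l.href = "mailto:" + a; l.textContent = a;'
    'document.getElementById("contact").appendChild(l);</script>'
)

print(scrape(plain_page))   # the plain link is found
print(scrape(script_page))  # nothing is found without executing the script
```

A scraper that executes JavaScript, such as one driven by Selenium, would still see the final address, which matches the limitation noted above.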

Please let me know if I can go ahead and make the PR.

Regards.

@datapythonista
Member Author

Thanks for the work on this @Kabiirk, sounds great. Sure, go ahead and open the PR (for next time, feel free to open a PR anytime, even if you're unsure of the approach...). You can tag me on it.

@Kabiirk
Contributor

Kabiirk commented Feb 9, 2023

@datapythonista Thanks for the help 😄 ! I'll keep that in mind. Opening the PR in a while.

Please let me know whether I should open a separate issue for the UnicodeDecodeError I was facing while building the website with pandas_web.py, or fix it in this PR itself.

Kabiirk added a commit to Kabiirk/pandas that referenced this issue Feb 9, 2023
@pandas-dev pandas-dev deleted a comment from jayam30 Feb 16, 2023
mroeschke pushed a commit that referenced this issue Feb 24, 2023
* WEB: Obfuscating workgroup email addresses to fix Issue #51209

* WEB: Fixtrailling whitespace: Obfuscating workgroup email addresses to fix Issue

* WEB: Obfuscated & regex-proof workgroup email addresses

* WEB: refactor suggestions post code review

---------

Co-authored-by: Marc Garcia <[email protected]>