PrivacyLens

  • 2 Devlogs
  • 6 Total hours

Reads privacy policies and gives you summaries so you know where your data is going!

4h 31m 46s logged

Devlog 2!!

It turns out there’s a reason web scraping is such a pain for so many people, and I found that out the HARD way. Normally, programs trying to get info off of a website are expected to check the site’s robots.txt file to see if the creators of the website:

  1. Are okay with you scraping
  2. Will make it easy for you to scrape
  3. Will sue you into oblivion (satire)
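
For anyone curious what that check looks like, here is a minimal sketch. It is not a full implementation of the Robots Exclusion Protocol (RFC 9309): it only understands `User-agent` and `Disallow` lines with plain prefix matching, and it ignores `Allow` rules and wildcards. The function names `parseRobots` and `isAllowed`, the sample file, and the bot name `examplebot` are all made up for illustration.

```javascript
// Sketch only: handles User-agent and Disallow lines with prefix matching.
// Allow rules, wildcards, and longest-match precedence are NOT handled.
function parseRobots(text) {
  const groups = []; // each group: { agents: [...], disallow: [...] }
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    if (!line) continue;
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      // An empty Disallow value means "nothing is disallowed", so skip it.
      if (field === "disallow" && current && value) {
        current.disallow.push(value);
      }
    }
  }
  return groups;
}

function isAllowed(robotsText, userAgent, path) {
  const groups = parseRobots(robotsText);
  const ua = userAgent.toLowerCase();
  // Prefer a group that names this agent; otherwise fall back to "*".
  const group =
    groups.find((g) => g.agents.includes(ua)) ||
    groups.find((g) => g.agents.includes("*"));
  if (!group) return true; // no applicable rules: everything is allowed
  return !group.disallow.some((prefix) => path.startsWith(prefix));
}

// Hypothetical robots.txt content, just for the example.
const sampleRobots = [
  "User-agent: *",
  "Disallow: /private/",
  "",
  "User-agent: examplebot",
  "Disallow: /",
].join("\n");

console.log(isAllowed(sampleRobots, "ExampleBot", "/privacy"));
console.log(isAllowed(sampleRobots, "otherbot", "/privacy"));
```

Keep in mind that robots.txt is a request, not an enforcement mechanism, so a site that wants to block bots usually backs it up with other measures.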

These things are great for regulating scrapers, but unfortunately, browser extensions are collateral damage.

The way a browser extension gets info off a webpage is VERY similar to web scraping but not exactly the same thing, so websites hostile to web crawlers also end up inadvertently harming extensions that try to read site content.

What exact problems did I face to find all this out, you ask?

Well, my project reads privacy policies off of websites, which happen to be copyrighted material, so web developers take care to prevent any automated copying of the policies. However, my project needs to read those policies in order to send them to the backend and summarize the whole thing.

Most of my initial attempts resulted in scraped text cluttered with JS, HTML tags, formatting, and other junk that wayyy increased the number of tokens I used when feeding the policy into the AI.

This is when I discovered Mozilla’s Readability library. It was SUCH a lifesaver, as it automatically stripped out the junk and left the REAL text. It way outperformed my previous methods and cut token usage by up to a factor of three!
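
Here is a rough sketch of what the extraction step can look like. It relies on two things from Readability itself: the `Readability` constructor takes a document, and `parse()` returns an object with a `textContent` field (or `null` if nothing readable is found). The helper name `extractPolicyText` and the idea of passing the constructor in as a parameter are my own choices for illustration, not necessarily how PrivacyLens is wired up.

```javascript
// Sketch: pull readable text out of a page with Mozilla's Readability.
// `ReadabilityCtor` is passed in so the helper works whether the library is
// loaded as a global script or imported through a bundler.
function extractPolicyText(doc, ReadabilityCtor) {
  // Readability modifies the DOM it is given, so work on a clone.
  const clone = doc.cloneNode(true);
  const article = new ReadabilityCtor(clone).parse();
  if (!article || !article.textContent) return null; // nothing readable found
  // Collapse runs of whitespace so the text sent to the backend is compact.
  return article.textContent.replace(/\s+/g, " ").trim();
}

// In a content script where Readability.js is already loaded, usage might be:
//   const policyText = extractPolicyText(document, Readability);
```

Returning `null` instead of an empty string makes it easy for the caller to tell "no policy text found" apart from a real result before spending any tokens on it.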

Currently, I just copied the library’s JS file into my frontend, but I know that’s not prod-ready, so I’ll likely have to import it another way and add a build step 😑

(Image is of the Readability Library)

1h 2m 23s logged

Are you tired of getting your data stolen by companies without even knowing why?

I’m fixing that with PrivacyLens, an AI tool that summarizes a site’s privacy policy and gives you the information you need to know when you visit. PrivacyLens shows you the major red flags and some green flags in the website’s privacy policy. Right now, I’m working on making the UI better and looking for clean, thematic layouts that match the investigator-type look I’m going for.
