ME Vs GOOGLE

And I WON!!!

Pokedex Devlog #3

Hey everyone

I have been working on a real-life Pokedex: you just take a photo of a Pokemon and get its stats on your screen. To achieve this, I decided to train a yolov8n-cls model on images of Pokemon. But getting those images was a nightmare because of GOOGLE!!!

Fighting Google’s anti-bot systems

I built a custom scraper to pull Pokemon images from Google Images. From the very start of the project, Google kept blocking me, and since I had never touched Selenium before, it took me a long time to get past Google’s anti-bot systems. My laptop is very low end, so I was running Selenium in headless mode, which triggered even more red flags for my bot. I switched from regular Selenium to undetected_chromedriver, but Google was still flagging me with a big captcha that my little bot couldn’t solve. The main issue was that in headless mode Selenium runs with a very small window size, which leaves clear machine tracks. I fixed that by using a larger window size and adding a user agent string chosen from a list of 10 different UA strings.
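
For anyone who wants to try the same fix, here is a rough sketch of what the window size and user agent setup could look like. It is not the actual scraper code: the two UA strings, the 1920x1080 size, the random pick, and the helper names build_chrome_args and make_driver are all assumptions for illustration.

```python
import random

# Hypothetical pool: the post used a list of 10 UA strings but does not list them.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def build_chrome_args(width=1920, height=1080, rng=random):
    """Return Chrome flags for a headless run with a desktop-sized window."""
    user_agent = rng.choice(USER_AGENTS)  # assumption: the UA is picked at random
    return [
        "--headless=new",                   # headless, because the laptop is slow
        f"--window-size={width},{height}",  # avoid the tiny default headless window
        f"--user-agent={user_agent}",
    ]


def make_driver():
    """Start undetected_chromedriver with the flags above."""
    # Imported here so build_chrome_args can be used without a browser installed.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    for arg in build_chrome_args():
        options.add_argument(arg)
    return uc.Chrome(options=options)
```

Calling make_driver() once per scraping session means each run gets one UA string from the pool. Whether Google still shows a captcha will depend on more than these two flags, so treat this as a starting point.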

Getting those Images

Not being flagged by Google was easy in comparison to this. Targeting the images inside the search results was an absolute nightmare, because instead of fixed static classes or ids Google uses dynamic strings that change frequently. I wanted to write code that anyone can use anywhere, which is why I couldn’t target those ever-changing class names. I tried a lot of things, from targeting images inside nested divs, to images inside divs with specific roles (e.g. listitem), to images whose links have /imgres in the href attribute. After 6+ hours of coding and trial and error, plus a lot more researching and inspecting the elements of Google Images results, I found the answer: target the div with the role of main and take the images inside it that have a src or data-deferred attribute, using the XPath //div[@role='main']//img[@data-deferred or @src]. That gives me the images which are inside the search results, and I can then just use infinite scrolling to load more of them.
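
The XPath plus infinite scrolling idea can be sketched roughly like this. Only the XPath itself comes from this post; the function name collect_image_urls, the scroll limit, the pause, and the "stop when nothing new shows up" rule are assumptions, and how each src gets saved to disk is left out.

```python
import time

# The XPath from this post: images inside the main results area.
IMAGE_XPATH = "//div[@role='main']//img[@data-deferred or @src]"


def collect_image_urls(driver, max_images=500, max_scrolls=50, pause=1.5):
    """Scroll the results page and collect unique image src values."""
    urls = []
    seen = set()
    for _ in range(max_scrolls):
        found_new = False
        # "xpath" is the value of selenium's By.XPATH constant.
        for img in driver.find_elements("xpath", IMAGE_XPATH):
            src = img.get_attribute("src")
            if src and src not in seen:
                seen.add(src)
                urls.append(src)
                found_new = True
        if len(urls) >= max_images or not found_new:
            break  # enough images, or the page stopped giving new ones
        # Infinite scrolling: jump to the bottom so more results load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
    return urls[:max_images]
```

With a driver that already has a Google Images results page open, collect_image_urls(driver, max_images=500) returns at most 500 src values. For a Pokemon with few results, the loop ends early once a pass finds nothing new.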

The Results

Knowing how to target those images, I absolutely destroyed Google in the fight: I finished the scraper and used it to scrape a total of 74,718 images in one night without getting blocked a single time. My goal was 500 images for each Pokemon in generation 1, that is 75,500 images in total. The scraper worked better than expected and I ended up just 782 images short of the goal. And even for those missing images, the culprit was Google, not my little scraper: Google simply didn’t have that many images for a few unpopular Pokemon.
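
The numbers are easy to double check. The only input that is not written out above is the generation 1 count of 151 Pokemon, which is what the 75,500 target implies.

```python
POKEMON_IN_GEN_1 = 151     # generation 1 Pokedex size
TARGET_PER_POKEMON = 500   # goal per Pokemon, from this post
SCRAPED_TOTAL = 74_718     # images scraped in one night, from this post

target_total = POKEMON_IN_GEN_1 * TARGET_PER_POKEMON
shortfall = target_total - SCRAPED_TOTAL
coverage = SCRAPED_TOTAL / target_total

print(f"target: {target_total}, short by: {shortfall}, coverage: {coverage:.1%}")
```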

What’s next?

After winning the battle against Google, I will use the images to train my model today; I have already split them. Since my laptop is slow, I will use Google Colab for training, using their own tool to train the model on the images they were trying so hard to hide XD... and I really hope that scraping that many images is not illegal.
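
As a rough idea of what that training step could look like with the ultralytics package, here is a minimal sketch. It assumes the split is laid out as dataset/train/<pokemon>/ and dataset/val/<pokemon>/, which is the folder layout ultralytics uses for classification; the folder name, the epoch count, the image size, and both helper names are placeholders, not the real settings.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def count_images_per_class(split_dir):
    """Return {class_name: image_count} for one split folder, e.g. dataset/train."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def train_classifier(dataset_dir, epochs=20, imgsz=224):
    """Fine-tune yolov8n-cls on a folder that contains train/ and val/ subfolders."""
    # Imported here so the counting helper works without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO("yolov8n-cls.pt")  # pretrained classification weights
    return model.train(data=str(dataset_dir), epochs=epochs, imgsz=imgsz)
```

Running count_images_per_class("dataset/train") before training is a quick way to spot the few unpopular Pokemon that ended up with fewer images than the rest.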