Devlog by @Ansh904

@Ansh904 on Pokedex · about 2 months ago

10h 26m 13s logged

Heyy Everyone

Pokedex Devlog 2 !!!

The Selenium scraper is finally complete and working well. It still hits a captcha sometimes, though that is rare.

I started building this scraper without any knowledge of Selenium. I began with a very basic bot that went to Google and typed “pokedex”, then slowly added more features, like running in headless mode and using WebDriverWait instead of hardcoded time delays. I once crashed my computer while printing the unreadable, minified source code of a Google results page, XD. I was also running into captchas a lot, so I switched to undetected_chromedriver and gave it a user agent string so that it wouldn’t be flagged.
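A minimal sketch of that setup, not the actual script. The function names are made up for illustration, and waiting on the search box by the name "q" is an assumption about Google's markup. The third-party imports are done inside the functions so the flag-building helper can be tried without a browser installed.

```python
def chrome_arguments(user_agent, headless=True):
    """Build the Chrome command-line flags for the scraper (hypothetical helper)."""
    args = [f"--user-agent={user_agent}"]
    if headless:
        # "--headless=new" selects Chrome's newer headless mode.
        args.append("--headless=new")
    return args


def make_driver(user_agent, headless=True):
    """Start undetected_chromedriver with the flags above."""
    import undetected_chromedriver as uc  # third-party, imported lazily

    options = uc.ChromeOptions()
    for arg in chrome_arguments(user_agent, headless):
        options.add_argument(arg)
    return uc.Chrome(options=options)


def wait_for_search_box(driver, timeout=10):
    """Explicit wait: block until the search box exists instead of sleeping."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumption: the search box is the element named "q".
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.NAME, "q"))
    )
```

Compared with a fixed time.sleep, the explicit wait returns as soon as the element shows up and raises a timeout error if it never does.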

My first bot was able to scrape the headers of the Google search results. I then moved on to an image scraping bot. The first version was very simple: it went to Google Images and downloaded every img tag it could find. However, regardless of the target, it scraped only 21 images, the first being the Google logo. This was because of lazy loading: Google Images only loads more thumbnails as you scroll down the page.

To fix this I implemented an infinite scrolling loop that scrolled until it reached the end of the page or hit the target. But it still wasn’t getting near the target. After running the script without headless mode, I found out that the user agent string I gave it was for Chrome 91, which was released in 2021. After switching to a newer version, Chrome 135, it was able to reach its target, but it was also downloading junk images like logos, icons and spacers that were present in the search results.
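The scrolling loop can be sketched roughly like this. It is an illustration under assumptions, not the original code: the name scroll_until, the pause length and the max_rounds safety cap are all made up, and it counts every img on the page rather than only thumbnails. It only relies on the driver's execute_script method.

```python
import time


def scroll_until(driver, target, pause=1.0, max_rounds=200):
    """Scroll down until `target` images are loaded or the page stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        count = driver.execute_script("return document.images.length")
        if count >= target:
            return count  # hit the target
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded thumbnails time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # page did not grow, so this is the end of the page
        last_height = new_height
    return driver.execute_script("return document.images.length")
```

The return value is the number of img elements present when the loop stopped, which makes it easy to log how close a run got to its target.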

I spent 2 days fixing this problem, trying to point at the img tags of the thumbnail images we see in Google Images. I wanted to write code that would work every time, which is why I couldn’t just point at the ever-changing class name of the img tag. I tried targeting imgs inside specific divs, like an img inside a div with the role of listitem, or img tags whose href attribute started with “/imgres”, but it didn’t work. After a lot of tries I got it right by targeting the imgs inside the div with the role of main whose dimensions are more than 100×100 px. It worked like a charm. I then quickly turned it into a function and gave it 10 Pokemon with a target of 100 images each, and boom, I got 1000 images in total.
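A rough sketch of that filter, assuming a Selenium driver. The function name thumbnail_sources and the MIN_SIDE constant are hypothetical, and reading the src attribute is an assumption about how the images get downloaded afterwards. The string "css selector" is the value of Selenium's By.CSS_SELECTOR, used here so the snippet needs no imports.

```python
MIN_SIDE = 100  # pixels; anything this size or smaller is treated as junk


def thumbnail_sources(driver, min_side=MIN_SIDE):
    """Return the src of every large-enough img inside the role="main" div."""
    # "css selector" is what selenium's By.CSS_SELECTOR evaluates to.
    imgs = driver.find_elements("css selector", 'div[role="main"] img')
    sources = []
    for img in imgs:
        size = img.size  # WebElement.size is a dict with "width" and "height"
        if size["width"] > min_side and size["height"] > min_side:
            src = img.get_attribute("src")
            if src:  # skip images that have no src yet
                sources.append(src)
    return sources
```

Filtering on the role and the rendered size avoids depending on class names, which is the part of the markup that keeps changing.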

I have also decided that labelling 15,100 images would take ridiculously long, and the model would not be very accurate either. So instead I am building a classification model. Labelling for that is easy: I am downloading each Pokemon’s images into a folder with its name. Now that I don’t have to worry about spending hours downloading or labelling them, I will use 500 images per Pokemon and create a much more accurate model. Currently I am running the script over all Pokemon in gen 1, 151 in total. It has reached Rattata, Pokedex number 19, which means images of 18 out of 151 Pokemon are downloaded. I will keep the script running until all are complete and start training the classification model tomorrow.
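With one folder per Pokemon, the folder name is the label, so no manual labelling step is needed. A minimal sketch of reading such a layout back as (path, label) pairs follows. The function name labelled_images and the list of file extensions are assumptions, and the training code may well use a library loader instead.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed file types


def labelled_images(root):
    """Return (image_path, label) pairs, using each folder name as the label."""
    root = Path(root)
    samples = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((image_path, class_dir.name))
    return samples
```

At 500 images for each of the 151 Pokemon, a full run would give 151 × 500 = 75,500 labelled samples.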