Scraper and crawler built with Playwright and Cheerio

Playwright_Scraper

Versions and Differences

BFS version (BFS.js and bfs-scrape.js)

The BFS version uses a breadth-first search approach: the crawler processes every link found at the current depth level (the siblings) before moving on to deeper levels. This makes it explore a site more thoroughly than the other versions.
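The level-by-level order described above can be sketched roughly as follows. This is only an illustration, not the code in the repository: `bfsCrawl`, `getLinks`, `maxDepth` and the toy site are hypothetical names, and `getLinks` stands in for "open the page with Playwright and extract its links with Cheerio".

```javascript
// Minimal sketch of a breadth-first crawl order (assumed names throughout).
// `getLinks(url)` is a hypothetical async function returning the links on a page.
async function bfsCrawl(startUrl, getLinks, maxDepth = 2) {
  const visited = new Set([startUrl]); // never queue the same URL twice
  const order = [];                    // pages in the order they were processed
  let currentLevel = [startUrl];

  for (let depth = 0; depth <= maxDepth && currentLevel.length > 0; depth++) {
    const nextLevel = [];
    // Finish every page at this depth before touching the next depth.
    for (const url of currentLevel) {
      order.push({ url, depth });
      for (const link of await getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link);
          nextLevel.push(link);
        }
      }
    }
    currentLevel = nextLevel;
  }
  return order;
}

// Toy in-memory "site" used instead of real network requests.
const toySite = {
  '/': ['/a', '/b'],
  '/a': ['/a1', '/'],
  '/b': ['/b1'],
  '/a1': [],
  '/b1': ['/a'],
};
const toyGetLinks = async (url) => toySite[url] ?? [];

bfsCrawl('/', toyGetLinks).then((order) => {
  console.log(order.map((page) => `${page.depth}: ${page.url}`).join('\n'));
});
```

In a real crawler the `visited` set also protects against link cycles, such as `/a` linking back to `/` in the toy site.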

Scrape Everything (scrape-everything.js)

This version puts essentially no limits on the crawler and lets it follow whatever it finds. Not recommended.

Scrape Domain Scope Only (scrape-within-domain-only.js)

Scrapes only within the domain scope. It is a weaker alternative to the BFS version: it follows links in a straight line instead of level by level, so it does not scan every page.
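One way to keep a crawl inside the domain scope is to filter every extracted link before it is queued. The sketch below is an assumption about how such a filter could look, not the repository's code: `sameDomainLinks` is a hypothetical helper, and "same domain" is taken to mean an exact hostname match, so subdomains are excluded.

```javascript
// Keep only links that stay on the same hostname as the start URL.
// Assumption: "domain scope" means an exact hostname match.
function sameDomainLinks(startUrl, hrefs) {
  const startHost = new URL(startUrl).hostname;
  const kept = new Set(); // a Set removes duplicates

  for (const href of hrefs) {
    let resolved;
    try {
      resolved = new URL(href, startUrl); // resolves relative links against the page
    } catch {
      continue; // skip malformed hrefs
    }
    if (!['http:', 'https:'].includes(resolved.protocol)) continue; // skip mailto:, etc.
    if (resolved.hostname !== startHost) continue;                  // skip other sites
    resolved.hash = ''; // "#section" points at the same page
    kept.add(resolved.href);
  }
  return [...kept];
}

console.log(
  sameDomainLinks('https://example.com/docs/', [
    '/about',
    'page2',
    'https://other.org/x',
    'mailto:me@example.com',
    'https://example.com/about#team',
  ])
);
```

The `URL` class used here is a Node.js global, so nothing extra has to be installed for it.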

Requirements

First, install Node.js and npm.

Arch

sudo pacman -S nodejs npm

Debian/Ubuntu

curl -fsSL https://deb.nodesource.com/setup_18.x -o nodesource_setup.sh

sudo bash nodesource_setup.sh

sudo apt install nodejs

Then install Playwright and the other dependencies

npm init playwright@latest

npm install cheerio

path, url and fs are built into Node.js, so they do not need to be installed from npm.
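As a quick check of which of the modules named above ship with Node.js itself, the list of built-ins can be queried at runtime. This is a small sketch; `needsInstall` is a hypothetical helper name, while `builtinModules` is exported by Node's own `module` package.

```javascript
const { builtinModules } = require('module');

// Return the names that are NOT Node.js built-ins,
// i.e. the ones that actually have to come from npm.
function needsInstall(names) {
  return names.filter((name) => !builtinModules.includes(name));
}

console.log(needsInstall(['path', 'url', 'fs', 'cheerio', 'playwright']));
```

Only the names left in the printed array need an `npm install`.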