# Playwright Scraper
Scraper and crawler built with Playwright and Cheerio
# Versions and Differences
**BFS version**
The BFS version uses a breadth-first search approach.
To explore pages more thoroughly, the crawler processes all links at the current depth level (siblings) before moving on to deeper levels.
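
The level-by-level order described above can be sketched as follows. This is a minimal illustration and not the project's actual code: `site`, `getLinks`, and `crawlBfs` are hypothetical names, and `getLinks` reads from an in-memory map instead of loading pages with Playwright and parsing them with Cheerio (a real crawler would do that step asynchronously).

```javascript
// Hypothetical in-memory site: each URL maps to the links found on that page.
const site = {
  "/": ["/a", "/b"],
  "/a": ["/a1", "/"],
  "/b": ["/b1"],
  "/a1": [],
  "/b1": [],
};

// Stand-in for "load the page and extract its links" (assumption for this sketch).
function getLinks(url) {
  return site[url] ?? [];
}

// Visit every page at the current depth before descending one level.
function crawlBfs(startUrl, maxDepth, getLinksFn) {
  const visited = new Set([startUrl]);
  const order = [];
  let currentLevel = [startUrl];

  for (let depth = 0; depth <= maxDepth && currentLevel.length > 0; depth++) {
    const nextLevel = [];
    for (const url of currentLevel) {
      order.push({ url, depth });
      for (const link of getLinksFn(url)) {
        // Skip links that were already queued or visited.
        if (!visited.has(link)) {
          visited.add(link);
          nextLevel.push(link);
        }
      }
    }
    currentLevel = nextLevel;
  }
  return order;
}

// Example: crawl two levels below the start page.
const pages = crawlBfs("/", 2, getLinks);
console.log(pages.map((p) => `${p.depth}:${p.url}`).join(" "));
```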
**Scrape Everything**
This version places no restrictions on the crawler, so it follows any link it finds (not recommended).
**Scrape Domain Scope only**
Scrapes only within the domain scope. It is a weaker alternative to the BFS version: it follows links in a straight line and does not scan everything.
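
One way to decide whether a link is within the domain scope is to compare hostnames using Node's built-in `URL` class. The sketch below is an assumption about how such a check could look, not the project's actual implementation. `isInScope` is a hypothetical helper, and it treats subdomains as out of scope.

```javascript
// Returns true when `link` resolves to the same hostname as `baseUrl`.
// Relative links are resolved against `baseUrl` first.
function isInScope(link, baseUrl) {
  try {
    const target = new URL(link, baseUrl);
    return target.hostname === new URL(baseUrl).hostname;
  } catch {
    // Unparseable links are treated as out of scope.
    return false;
  }
}

// Example: keep only the links that stay on the start page's host.
const start = "https://example.com/";
const found = ["/about", "https://example.com/blog", "https://other.example.org/"];
console.log(found.filter((link) => isInScope(link, start)));
```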
# Requirements
First, install Node.js and npm:
**Arch**
```bash
# npm is packaged separately from nodejs on Arch
sudo pacman -S nodejs npm
yay -S playwright
```
**Debian/Ubuntu**
```bash
curl -sL https://deb.nodesource.com/setup_18.x -o nodesource_setup.sh
sudo bash nodesource_setup.sh
sudo apt install nodejs
```
Then install Playwright and the other dependencies:
```bash
npm init playwright@latest
# path, url, and fs are built into Node.js and do not need to be installed
npm install cheerio
```