Web scraper with Prisma and TypeScript (part 1)

Published on: 11/23/2023

Cover image: pirates loading treasure chests from a tropical island onto a ship; some chests are still buried in the sand, and open chests filled with gold are visible on deck. Watercolor.

Contents

  1. Welcome to the treasure island
  2. Acknowledgements
  3. What we're making
  4. Building a shovel
  5. Saving to local database and extracting data
    1. Extending the scraper
    2. Downloading images and instantiating prisma
  6. Finished scraper

Welcome to the treasure island

How arrrr... ya Engineers? 🦜

Seems to me we have found ourselves a mighty trrr...easure after our last trip. We just need to dig it up...

Stop dilly-dallying and let's get to work. Today we are digging data out of web pages and claiming it as our own.

This is a two-part article. In the first part we will build the full scraper from start to finish, and in the second we'll take care of the backend and frontend.

Acknowledgements

A note to people who do this daily: please send your suggestions on how to improve the crawler. It is my first attempt at such a project and it is not as efficient as it could be (no concurrency), but it was fun to make, and if possible I would love to learn more from you.

For those who want to follow along, here's the link to the stepped solution. The main branch has the finished code, and the other branches are split into steps matching this article.

The original repo on GitHub requires some changes but presents a working, finished project. It is not a curated repository, so you will be able to see the thought process, what went right and what needed changing in the course of building it... and what bugs 🐛 crept in.

The REST API will be very limited, as it's not a proper implementation of a backend server.

The initial scraper implementation was inspired by: https://www.zenrows.com/blog/javascript-web-crawler-nodejs#create-web-crawler-in-node

After a longer-than-usual intro, let's get started.

What we're making

We found a treasure trove 🏝️ of data, but the problem is that it's on a website with no way to export it. To dig up that treasure we'll need some tools:

  1. A web scraper that will get the data from the external website,
  2. save it to a local SQLite database,
  3. download all the images associated with the products (so we're not overloading the site too much),
  4. and modify the data on the fly to adjust it for the frontend.
  5. Then a "REST API" that will serve the data from the database.
  6. Finally, a frontend that will fetch the data from our "REST" server and display it to the user,
  7. allow filtering and sorting the products,
  8. and keep the state of the frontend app in the search parameters.

In part 1 we'll go through points 1-4 and in part 2 we'll finish our application.

Building a shovel

We have to get the data for all the products listed on a certain website. For that we need to dig our way through the DOM to collect the following information about each product:

  1. name
  2. price
  3. currency (if it can be extracted)
  4. image
  5. link

For the source of data we'll be using the ScrapeMe Pokémon shop site, which has all its products listed across ~50 pages.

To fetch the pages we'll use the Axios library, and to query the DOM of the fetched pages we'll use CheerioJS.

Since we want to use both TypeScript and ESM modules, we'll use the tsx package to run our scraper.

To start, clone the init branch from the stepped repo:
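
git clone --branch init https://github.com/ethernal/web-scraper-stepped-solution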

This will clone the repository into a web-scraper-stepped-solution folder. If you have already created a folder, you can add a . (dot) at the end of the command to clone into the current directory.

After the repository is cloned, install the dependencies:
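
npm install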

Then in the root of the project (next to the src directory) create a scraper.ts file, import both axios and cheerio, and create a main function that will run the scraper:

Then in the package.json file, inside the "scripts" section, add the following:
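
"scripts": {
  //...
  "scrap": "tsx scraper.ts"
  //...
}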

Running the script with npm run scrap should print:
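
Hello World. Scraping 5 pages.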

If you have issues at this stage you can open an issue on GitHub or contact me directly.

The maxPages parameter will be important to control the number of pages to crawl.

To index visited pages we'll use a Set. This structure is similar to an array, but it does not allow duplicates, so it is perfect for keeping track of which pages were visited. Even if we try to add the same page a second time, the Set will not add the same value again. Read more about Sets on MDN here.
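
As a quick illustration (this snippet is not part of the scraper itself), adding the same value to a Set twice simply has no effect:

const visited = new Set<string>();
visited.add("https://scrapeme.live/shop");
visited.add("https://scrapeme.live/shop"); // duplicate, silently ignored
console.log(visited.size); // 1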

Inside the main function we'll first look for all the pages that we can crawl and add their links to the queue. Then we'll visit each page (downloading it with Axios) and check for the data we need in the HTML content. Let's dive into the code; I'll guide you with the comments alongside it.

We'll first check the navigation for all pages and add them to the crawl list:

Navigation URLs accessible with '.page-numbers a' selector

The image shows that it is possible to access the link for a page with .page-numbers a selector.

Now for the actual product data we can use the li.product a.woocommerce-LoopProduct-link selector:

Product data in HTML accessible with 'li.product a.woocommerce-LoopProduct-link' selector

The keen-eyed will also note that there is a lot more data there that we are interested in. We'll get it all later; for now let's get a minimal version up and running.

With that in mind let's modify the main function inside the scraper.ts:

Running this code should print an array of product URLs, but since they're just logged to the console we cannot do much with them yet.

The repository has a branch with the code at this stage here.

The "magic" happens between the lines where we first get the content of the webpage and then send it to cheerio for parsing.

After that we can query it with jQuery-like syntax:

For now our scraper doesn't do much, but everything is in place for us to get all that shiny data if we so desire.

For reference, the full source of the initial scraper is included in the code listings at the end of this article.

Saving to local database and extracting data

If you have any issues along the way, here is the finished code for this section: prisma and seeding script.

Now we will add database connectivity to our scraper; this will allow us to process the data later without accessing the site again.

First we need to add Prisma to our project as a development dependency:
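
npm install prisma --save-dev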

and then initialize the database connection:
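
npx prisma init --datasource-provider sqlite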

This will create a new prisma folder with a schema.prisma file inside, and a .env file in the root of the project.

We'll modify these files in a moment.

Before continuing, make sure that any .env files are ignored in .gitignore, BUT not .env.dist.

You can exclude a file pattern from being ignored in .gitignore by adding a ! before the pattern or file name.

Add the following at the end of the .gitignore file:
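
# local env files
.env*.local
.env
.env.*
!.env.dist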

Create an additional file: .env.dist. We will use the .env file as our source of truth in production and .env.dist as a sample or reference file.

When you clone the full repository, the .env file will not be present, so .env.dist will tell you which environment variables you need to set to run the project successfully.

In the .env set DATABASE_URL to the following:
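
DATABASE_URL="file:./data/dev.db"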

You can give .env.dist the same contents as .env, but remember to keep all secrets out of the .env.dist file.

With this configuration in place, Prisma (when run) will create a data subfolder with a dev.db file inside the prisma folder.

Now we need to design our database schema. We'll keep it simple but fairly robust for the use case. We'll assume the most important aspect for our users is the price of the product but we'll keep all additional information in the database as JSON.

Prisma does not support the JSON field type for SQLite, so we'll use a string and parse it when necessary.
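
As a rough sketch of the idea (the exact fields here are just an example), we serialize the extra details when writing and parse them when reading:

const details = { name: "Bulbasaur", currency: "£", image: "./images/001-350x350.png" };
const stored = JSON.stringify(details); // goes into the `data` String column
const restored = JSON.parse(stored); // what we work with after reading it back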

We'll also have the User model to test our database connection.

Modify the prisma/schema.prisma file:

Points to note about the schema:

  1. The id field is auto-generated, unique, and uses a CUID as its value.
  2. The createdAt and updatedAt fields are auto-generated.
  3. We make sure that data is collected only once for a given url, so it is a unique field.
  4. data will hold any data we extract but do not need to operate on directly (e.g. filter by).
  5. dataType is there to distinguish between different types of data we could collect. It will not be used in our examples; it's there to give you an idea of how to expand the scraper's capabilities, e.g. with templates.

Let's get ready to test our database connection.

And this is a real gem in this article:

It is possible to SEED the Prisma database with a seed script written in TypeScript and using ESM imports!

Try it in a project without tsx installed. I have; it's not fun...

Add a prisma seed script to the package.json file. It's a "standalone" configuration entry that does not belong in the "scripts" section:
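
"prisma": {
  "seed": "tsx prisma/seed.ts"
},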

Before writing the seed script, generate the Prisma client:
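
npx prisma generate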

This will generate all the methods that we will need inside the seed script and the scraper to work with Prisma.

After that we need to synchronize the model with the database. For now we can push the model, as the database does not exist yet and we do not care about migrations.
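
npx prisma db push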

This will create all required folders and the database file.

We are finally ready to create the seed.ts file inside the prisma folder:

Run the seeding script with:
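
npx prisma db seed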

npx will run the prisma binary from node_modules and execute seed.ts with tsx. It should print to the console:

Our database connection works, and thanks to using upsert the seed script can be run multiple times; feel free to try it again.

Update the .gitignore file to keep our database from being committed to the repository:
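
# prisma database location
prisma/data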

A side note to Next.js developers reading this: this method also works with Next.js! Tested and confirmed. You're welcome!

Extending the scraper

If you have any issues along the way, here is the finished code for this section: saving product data to the database.

Now that we have our database connected and confirmed it works, let's add the same capability to the scraper. We'll also extract more information about each product and save it to the database.

As you can see not much was really needed in the scraper code itself to make it work.

We have added a function that saves the product data to the database and extracted more information about each product from the website. The Prisma call works the same way as in the seeding function, but if you do have questions please contact me.

If you run the scraper now you should see the products in the database. You can use HeidiSQL to view the data.

Added data should also be displayed in the console:

Now there are two last things we need to take care of. If you check inside the database you'll see that the images are linked from the original site.

image showing that product images are linked from the original site

This is not good on many levels: we might be overusing the bandwidth of the external site, and these images may be removed at some point. Let's limit our impact on the site by downloading the images to our own folder and serving them ourselves.

The second issue is that at some point you may see an error saying that more than 10 PrismaClient instances have been created.

Downloading images and instantiating prisma

If you have any issues along the way, here is the finished code for this section: saving product data to the database.

To remedy that we'll create a single instance of PrismaClient. Create a src/lib/prisma.ts file:

From now on we can reference the prisma variable from lib/prisma.ts instead of creating a new client every time. This will be important in part 2.

To download the files we will use Node and Axios. First let's create a function that will receive a list of files to download; note that we are using prisma from our lib folder here:

We have imported fs and path to make working with local files possible, and imported prisma from the lib folder:

Then we declare the function that will accept an array of strings representing links to the files we need to download:

As we loop through all the files with a for loop, we'll extract the file name from the link. With it we can set a custom save location for each file.

We also save the full url to the original file as a local variable.

Before downloading files we need to make sure that the target folder exists:

Then we loop over all the files in the list and wait for the stream to be saved to disk with the createFile function we declared earlier.

The createFile function wraps the stream in a Promise so that it resolves after the file is written to disk and closed, or rejects if something goes wrong. Without it we would not save all the files. Each file is created inside the ./public/images folder.

We keep a count of how many files have been downloaded and present that information to the user. That way we know if all the files were downloaded.

Finally, in the main function we're adding a Set called imageSrc where we'll keep links to the images we want to download:

If you have NOT copied the code above please remember to remove the creation of the new PrismaClient:

Now that the files will be available locally, let's modify the product data before saving it in the database so that the image url points to a local image and not to one from the original website:

And finally, just before the end of the main function, download the images:

You can also add the images folder to the .gitignore file to keep the source tree clean:
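
# images folder
public/images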

If you inspect the data in HeidiSQL now, you will see that we have relative image paths set for the products.

image showing that product images now point to local, relative paths

With this, our scraper (the shovel to unearth the treasures) is complete.

In this part of the series we have:

  • Created a web scraper that visits all related pages automatically.
  • Extracted the information about each product from the web page.
  • Modified that information before saving it.
  • Saved the product information in the database.
  • Downloaded the images.

In the next part we'll create a frontend and a backend to serve the data and visually present it to the users.

See you soon!

The full source of the finished scraper.ts is included in the code listings at the end of this article.

Finished scraper

It wasn't as hard as it seemed, right?

Let me know what you think or send a PR with improvements!

See you next time!

Code listings

DATABASE_URL="file:./dev.db"

# local env files
.env*.local
.env
.env.*
!.env.dist

DATABASE_URL="file:./data/dev.db"

// This is your Prisma schema file,
// learn more about it in the docs: https://pris.ly/d/prisma-schema

generator client {
  provider = "prisma-client-js"
}

datasource db {
  provider = "sqlite"
  url      = env("DATABASE_URL")
}

model User {
  id        String @id @default(cuid())
  email     String @unique
  firstName String
  lastName  String
}

model ScrappedData {
  id        String   @id @default(cuid())
  url       String   @unique
  price     Float
  data      String // serialize and deserialize from string until support is added in prisma
  dataType  String   @default("product")
  createdAt DateTime @default(now())
  updatedAt DateTime @updatedAt
}

# prisma database location
prisma/data

# images folder
public/images

git clone --branch init https://github.com/ethernal/web-scraper-stepped-solution

git clone --branch init https://github.com/ethernal/web-scraper-stepped-solution .

npm install
scraper.ts

import axios from 'axios';
import cheerio from 'cheerio';

async function main(maxPages = 5) {
  console.log(`Hello World. Scraping ${maxPages} pages.`);
}

main()
  .then(() => {
    process.exit(0);
  })
  .catch((e) => {
    // logging the error message
    console.error(e);

    process.exit(1);
  });

Hello World. Scraping 5 pages.
scraper.ts

async function main(maxPages = 5) {
  // start with the initial webpage to visit
  const paginationURLsToVisit = ["https://scrapeme.live/shop"];

  // list of URLs the crawler has visited
  const visitedURLs: Set<string> = new Set();

  const productURLs = new Set();

  // iterating until the queue is empty
  // or the iteration limit is hit
  while (
    paginationURLsToVisit.length !== 0 &&
    visitedURLs.size <= maxPages
  ) {
    // get the current url to crawl
    const paginationURL = paginationURLsToVisit.pop();

    // if the queue is empty, break the loop
    if (paginationURL === undefined) {
      break;
    }

    // retrieving the HTML content of the page from paginationURL
    const pageHTML = await axios.get(paginationURL);

    // adding the current webpage to the set of
    // web pages already crawled
    visitedURLs.add(paginationURL);

    // initializing cheerio on the current webpage
    const $ = cheerio.load(pageHTML.data);

    // get all pagination URLs and for each page...
    // see image above
    $(".page-numbers a").each((index, element) => {
      const paginationURL = $(element).attr("href");

      // if the link has no href, skip to the next element
      if (paginationURL === undefined) {
        return;
      }
      // adding the pagination URL to the queue
      // of web pages to crawl, if it wasn't yet crawled
      if (
        !visitedURLs.has(paginationURL) &&
        !paginationURLsToVisit.includes(paginationURL)
      ) {
        paginationURLsToVisit.push(paginationURL);
      }
    });

    // retrieve the product URLs
    $("li.product a.woocommerce-LoopProduct-link").each((index, element) => {
      const productURL = $(element).attr("href");
      productURLs.add(productURL);
    });
  }
  //...
  console.log([...productURLs]);
}

[
  'https://scrapeme.live/shop/Bulbasaur/',
  'https://scrapeme.live/shop/Ivysaur/',
  'https://scrapeme.live/shop/Venusaur/',
  'https://scrapeme.live/shop/Charmander/',
  'https://scrapeme.live/shop/Charmeleon/',
  'https://scrapeme.live/shop/Charizard/',
  'https://scrapeme.live/shop/Squirtle/',
  'https://scrapeme.live/shop/Wartortle/',
  'https://scrapeme.live/shop/Blastoise/',
  //...
]
scraper.ts

// retrieving the HTML content of the page from paginationURL
const pageHTML = await axios.get(paginationURL);

// adding the current webpage to the set of
// web pages already crawled
visitedURLs.add(paginationURL);

// initializing cheerio on the current webpage
const $ = cheerio.load(pageHTML.data);

scraper.ts

$(".page-numbers a").each((index, element) => {
  //...
});
scraper.ts

import axios from 'axios';
import cheerio from 'cheerio';

async function main(maxPages = 5) {
  // start with the initial webpage to visit
  const paginationURLsToVisit = ["https://scrapeme.live/shop"];

  // list of URLs the crawler has visited
  const visitedURLs: Set<string> = new Set();
  const productURLs = new Set();

  // iterating until the queue is empty
  // or the iteration limit is hit
  while (
    paginationURLsToVisit.length !== 0 &&
    visitedURLs.size <= maxPages
  ) {
    // get the current url to crawl
    const paginationURL = paginationURLsToVisit.pop();

    // if the queue is empty, break the loop
    if (paginationURL === undefined) {
      break;
    }

    // retrieving the HTML content of the page from paginationURL
    const pageHTML = await axios.get(paginationURL);

    // adding the current webpage to the set of
    // web pages already crawled
    visitedURLs.add(paginationURL);

    // initializing cheerio on the current webpage
    const $ = cheerio.load(pageHTML.data);

    // get all pagination URLs and for each page...
    // see image above
    $(".page-numbers a").each((index, element) => {
      const paginationURL = $(element).attr("href");
      // if the link has no href, skip to the next element
      if (paginationURL === undefined) {
        return;
      }

      // adding the pagination URL to the queue
      // of web pages to crawl, if it wasn't yet crawled
      if (
        !visitedURLs.has(paginationURL) &&
        !paginationURLsToVisit.includes(paginationURL)
      ) {
        paginationURLsToVisit.push(paginationURL);
      }
    });

    // retrieve the product URLs
    $("li.product a.woocommerce-LoopProduct-link").each((index, element) => {
      const productURL = $(element).attr("href");
      productURLs.add(productURL);
    });
  }
  console.log([...productURLs]);
}

main()
  .then(() => {
    process.exit(0);
  })
  .catch((e) => {
    // logging the error message
    console.error(e);

    process.exit(1);
  });
npm install prisma --save-dev

npx prisma init --datasource-provider sqlite

npx prisma generate

Prisma schema loaded from prisma\schema.prisma

added 2 packages, and audited 192 packages in 4s

✔ Installed the @prisma/client and prisma packages in your project
✔ Generated Prisma Client (v5.6.0) to .\node_modules\@prisma\client in 80ms

npx prisma db push

Environment variables loaded from .env
Prisma schema loaded from prisma\schema.prisma
Datasource "db": SQLite database "dev.db" at "file:./data/dev.db"

SQLite database dev.db created at file:./data/dev.db

Your database is now in sync with your Prisma schema. Done in 37ms
seed.ts

// prisma/seed.ts
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

async function main() {
  // we're using upsert to make sure we're not trying to insert same data twice (it will error out)
  await prisma.user.upsert({
    where: { email: `admin@example.com` },
    create: {
      email: `admin@example.com`,
      firstName: 'Admin',
      lastName: 'Super',
    },
    update: {
      firstName: 'Admin',
      lastName: 'Super',
    },
  });
}

console.log('Seeding the database');
main()
  .catch((e) => {
    console.error(e);
    process.exit(1);
  })
  .finally(async () => {
    await prisma.$disconnect();
    console.log('Completed seeding the database');
  });

npx prisma db seed

Seeding the database
Completed seeding the database
scraper.ts

import axios from 'axios';
import cheerio from 'cheerio';

import { PrismaClient } from '@prisma/client';

async function main(maxPages = 5) {
  const prisma = new PrismaClient();

  // initialized with the webpage to visit
  const paginationURLsToVisit = ["https://scrapeme.live/shop"];
  const visitedURLs: Set<string> = new Set();
  const products = new Set();

  // iterating until the queue is empty
  // or the iteration limit is hit
  while (paginationURLsToVisit.length !== 0 && visitedURLs.size <= maxPages) {
    // the current url to crawl
    const paginationURL = paginationURLsToVisit.pop();

    // if the queue is empty, skip the current iteration and continue the loop
    if (paginationURL === undefined) {
      continue;
    }

    // retrieving the HTML content from paginationURL
    const pageHTML = await axios.get(paginationURL);

    // adding the current webpage to the
    // web pages already crawled
    visitedURLs.add(paginationURL);

    // initializing cheerio on the current webpage
    const $ = cheerio.load(pageHTML.data);

    // retrieving the pagination URLs
    $(".page-numbers a").each((index, element) => {
      const paginationURL = $(element).attr("href");

      // if the queue is empty, skip to the next element in the loop
      if (paginationURL === undefined) {
        return;
      }

      // adding the pagination URL to the queue
      // of web pages to crawl, if it wasn't yet crawled
      if (
        !visitedURLs.has(paginationURL) &&
        !paginationURLsToVisit.includes(paginationURL)
      ) {
        paginationURLsToVisit.push(paginationURL);
      }
    });
    console.log("Adding products...");

    // retrieving the product URLs
    $("li.product a.woocommerce-LoopProduct-link").each((index, element) => {
      // extract all information about the product

      const productURL = $(element).attr("href");
      const productImg = $(element).find("img").attr("src");
      const productName = $(element).find("h2").text();
      const productPrice = $(element).find(".woocommerce-Price-amount").text();
      const productPriceCurrency = $(element)
        .find(".woocommerce-Price-currencySymbol")
        .text();

      const product = {
        name: productName,
        price: productPrice.replaceAll(productPriceCurrency, ""), // remove currency symbol from the price
        currency: productPriceCurrency,
        image: productImg,
        url: productURL,
      };

      products.add(product);

      if (productURL === undefined) {
        return;
      }

      // create a function that will save the information about the product to the database
      const addData = async (data: typeof product) => {
        // use upsert to create row if it does not exist or update data if it has changed since last run
        // sqlite and prisma don't support createMany so we need to use per element inserts
        await prisma.scrappedData.upsert({
          where: {
            url: data.url,
          },
          create: {
            url: data.url,
            price: parseFloat(data.price),
            data: JSON.stringify(data),
          },
          update: {
            price: parseFloat(data.price),
            data: JSON.stringify(data),
          },
        });
      };

      // Here we're saving scrapped data to the database
      addData(product);

      console.log(`Added: ${JSON.stringify(product, undefined, 2)}`);
    });
  }

  console.log("Products added.");
}

main()
  .then(() => {
    process.exit(0);
  })
  .catch((e) => {
    // logging the error message
    console.error(e);

    process.exit(1);
  });
scraper.ts

const productURL = $(element).attr("href");
const productImg = $(element).find("img").attr("src");
const productName = $(element).find("h2").text();
const productPrice = $(element).find(".woocommerce-Price-amount").text();
const productPriceCurrency = $(element).find(".woocommerce-Price-currencySymbol").text();

Adding products...
Added: {
  "name": "Bulbasaur",
  "price": "63.00",
  "currency": "£",
  "image": "https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png",
  "url": "https://scrapeme.live/shop/Bulbasaur/"
}
Added: {
  "name": "Ivysaur",
  "price": "87.00",
  "currency": "£",
  "image": "https://scrapeme.live/wp-content/uploads/2018/08/002-350x350.png",
  "url": "https://scrapeme.live/shop/Ivysaur/"
}

src/lib/prisma.ts

// see: https://www.prisma.io/docs/guides/other/troubleshooting-orm/help-articles/nextjs-prisma-client-dev-practices
import { PrismaClient } from '@prisma/client';

const globalForPrisma = global as unknown as { prisma: PrismaClient };

export const prisma =
  globalForPrisma.prisma ||
  new PrismaClient({
    log: ['info', 'warn', 'error'],
  });

if (process.env.NODE_ENV !== 'production') globalForPrisma.prisma = prisma;

scraper.ts

import axios, { AxiosResponse } from 'axios';
import cheerio from 'cheerio';
import fs from 'fs';
import path from 'path';

import { prisma } from './src/lib/prisma';

async function createFile(name: string, response: AxiosResponse<any, any>) {
  const file = fs.createWriteStream(`./public/images/${name}`);

  return new Promise<boolean>((resolve, reject) => {
    response.data.pipe(file);
    file.on('finish', () => {
      file.close();
      resolve(true);
    });
    file.on('error', () => {
      file.close();
      reject(false);
    });
  });
}

async function downloadFiles(downloadFiles: Array<string>) {
  console.log(`Starting download of ${downloadFiles.length} files. This will take few minutes. Please be patient...`);

  let fileDownloadCount = 0;

  const dir = './public/images/';

  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true });
  }

  for (let i = 0; i < downloadFiles.length; i++) {
    const link = downloadFiles[i];
    const name = path.basename(link);
    const url = link;
    console.log('Downloading file: ' + name);

    const response = await axios({
      url,
      method: 'GET',
      responseType: 'stream',
    });

    const createdFile = await createFile(name, response);

    if (createdFile) fileDownloadCount++;
  }

  console.log(`Finished downloading ${fileDownloadCount} of ${downloadFiles.length} files.`);
}

async function main(maxPages = 5) {

  // initialized with the webpage to visit
  const paginationURLsToVisit = ["https://scrapeme.live/shop"];
  const visitedURLs: Set<string> = new Set();
  const products = new Set();
  const imageSrc = new Set<string>();
  //...
}
scraper.ts

import fs from 'fs';
import path from 'path';

import { prisma } from './src/lib/prisma';

scraper.ts

async function downloadFiles(downloadFiles: Array<string>) {
  //...
}

scraper.ts

async function downloadFiles(downloadFiles: Array<string>) {
  console.log(`Starting download of ${downloadFiles.length} files. This will take few minutes. Please be patient...`);

  let fileDownloadCount = 0;

  const dir = './public/images/';

  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true });
  }
  //...
}

scraper.ts

for (let i = 0; i < downloadFiles.length; i++) {
  const link = downloadFiles[i];
  const name = path.basename(link);
  const url = link;
  console.log('Downloading file: ' + name);

  const response = await axios({
    url,
    method: 'GET',
    responseType: 'stream',
  });

  const createdFile = await createFile(name, response);

  if (createdFile) fileDownloadCount++;
}

scraper.ts

async function createFile(name: string, response: AxiosResponse<any, any>) {
  const file = fs.createWriteStream(`./public/images/${name}`);

  return new Promise<boolean>((resolve, reject) => {
    response.data.pipe(file);
    file.on('finish', () => {
      file.close();
      resolve(true);
    });
    file.on('error', () => {
      file.close();
      reject(false);
    });
  });
}

async function main(maxPages = 5) {

  // initialized with the webpage to visit
  const paginationURLsToVisit = ["https://scrapeme.live/shop"];
  const visitedURLs: Set<string> = new Set();
  const products = new Set();
  const imageSrc = new Set<string>();
  //...
}

import { PrismaClient } from '@prisma/client'; // <- remove this

async function main(maxPages = 5) {
  const prisma = new PrismaClient(); // <- remove this
}

scraper.ts

const productURL = $(element).attr("href");
const productImg = $(element).find("img").attr("src");
const productName = $(element).find("h2").text();
const productPrice = $(element).find(".woocommerce-Price-amount").text();
const productPriceCurrency = $(element).find(".woocommerce-Price-currencySymbol").text();

// save original image url in the imageSrc to be downloaded later
if (productImg !== undefined) {
  imageSrc.add(productImg);
}

// but change the data that is saved to the DB to point to the local image
// when using vite relative path files are served from the public folder by default so there is no need to add the folder to the path - it will produce a warning in the console
const localProductImg = (productImg !== undefined) ? `./images/${path.basename(productImg)}` : productImg;

const product = {
  name: productName,
  price: productPrice.replaceAll(productPriceCurrency, ""),
  currency: productPriceCurrency,
  image: localProductImg,
  url: productURL
};

scraper.ts

async function main(maxPages = 5) {
  //...
      addData(product);
    });
  }

  // logging the crawling results
  console.log('Products added.');
  console.log('Downloading images...');
  await downloadFiles(Array.from(imageSrc));
  console.log('Done!');
}
scraper.ts

import axios, { AxiosResponse } from 'axios';
import cheerio from 'cheerio';
import fs from 'fs';
import path from 'path';

import { prisma } from './src/lib/prisma';

async function createFile(name: string, response: AxiosResponse<any, any>) {
  const file = fs.createWriteStream(`./public/images/${name}`);

  return new Promise<boolean>((resolve, reject) => {
    response.data.pipe(file);
    file.on('finish', () => {
      file.close();
      resolve(true);
    });
    file.on('error', () => {
      file.close();
      reject(false);
    });
  });
}

async function downloadFiles(downloadFiles: Array<string>) {
  console.log(`Starting download of ${downloadFiles.length} files. This will take few minutes. Please be patient...`);

  let fileDownloadCount = 0;

  const dir = './public/images/';

  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true });
  }

  for (let i = 0; i < downloadFiles.length; i++) {
    const link = downloadFiles[i];
    const name = path.basename(link);
    const url = link;
    console.log('Downloading file: ' + name);

    const response = await axios({
      url,
      method: 'GET',
      responseType: 'stream',
    });

    const createdFile = await createFile(name, response);

    if (createdFile) fileDownloadCount++;
  }

  console.log(`Finished downloading ${fileDownloadCount} of ${downloadFiles.length} files.`);
}

async function main(maxPages = 1) {

  // initialized with the webpage to visit
  const paginationURLsToVisit = ["https://scrapeme.live/shop"];
  const visitedURLs: Set<string> = new Set();
  const products = new Set();
  const imageSrc = new Set<string>();

  // iterating until the queue is empty
  // or the iteration limit is hit
  while (paginationURLsToVisit.length !== 0 && visitedURLs.size <= maxPages) {
    // the current url to crawl
    const paginationURL = paginationURLsToVisit.pop();

    // if the queue is empty, skip the current iteration and continue the loop
    if (paginationURL === undefined) {
      continue;
    }

    // retrieving the HTML content from paginationURL
    const pageHTML = await axios.get(paginationURL);

    // adding the current webpage to the
    // web pages already crawled
    visitedURLs.add(paginationURL);

    // initializing cheerio on the current webpage
    const $ = cheerio.load(pageHTML.data);

    // retrieving the pagination URLs
    $(".page-numbers a").each((index, element) => {
      const paginationURL = $(element).attr("href");
      // if the queue is empty, skip to the next element in the loop
      if (paginationURL === undefined) {
        return;
      }

      // adding the pagination URL to the queue
      // of web pages to crawl, if it wasn't yet crawled
      if (
        !visitedURLs.has(paginationURL) &&
        !paginationURLsToVisit.includes(paginationURL)
      ) {
        paginationURLsToVisit.push(paginationURL);
      }
    });
    console.log("Adding products...");

    // retrieving the product URLs
    $("li.product a.woocommerce-LoopProduct-link").each((index, element) => {
      // extract all information about the product
      const productURL = $(element).attr("href");
      const productImg = $(element).find("img").attr("src");
      const productName = $(element).find("h2").text();
      const productPrice = $(element).find(".woocommerce-Price-amount").text();
      const productPriceCurrency = $(element).find(".woocommerce-Price-currencySymbol").text();

      if (productImg !== undefined) {
        imageSrc.add(productImg);
      }

      // when using vite relative path files are served from the public folder by default so there is no need to add the folder to the path - it will produce a warning in the console
      const localProductImg = (productImg !== undefined) ? `./images/${path.basename(productImg)}` : productImg;

      const product = {
        name: productName,
        price: productPrice.replaceAll(productPriceCurrency, ""),
        currency: productPriceCurrency,
        image: localProductImg,
        url: productURL
      };

      products.add(product);

      if (productURL === undefined) {
        return;
      }

      // create a function that will save the information about the product to the database
      const addData = async (data: typeof product) => {
        // use upsert to create row if it does not exist or update data if it has changed since last run
        // sqlite and prisma don't support createMany so we need to use per element inserts
        await prisma.scrappedData.upsert({
          where: {
            url: data.url,
          },
          create: {
            url: data.url,
            price: parseFloat(data.price),
            data: JSON.stringify(data),
          },
          update: {
            price: parseFloat(data.price),
            data: JSON.stringify(data),
          },
        });
      };

      // Here we're saving scrapped data to the database
      addData(product);
    });
  }

  console.log("Products added.");
  console.log('Downloading images...');
  await downloadFiles(Array.from(imageSrc));
  console.log('Done!');
}

main()
  .then(() => {
    process.exit(0);
  })
  .catch((e) => {
    // logging the error message
    console.error(e);
    process.exit(1);
  });
1"scripts": {
2 //...
3 "scrap": "tsx scraper.ts"
4 //...
5}
package.json
1{
2 //...
3 "type": "module",
4 "scripts": {
5 "dev": "vite",
6 "scrap": "tsx scraper.ts",
7 //...
8 },
9 "prisma": {
10 "seed": "tsx prisma/seed.ts"
11 },
12 "dependencies": {
13 //...
14 }
15 //...
16}
schema.prisma

generator client {
  provider = "prisma-client-js"
}

datasource db {
  provider = "sqlite"
  url      = env("DATABASE_URL")
}