
Node.js Scraper


I have written a scraper in TypeScript, running on Node 10.12.0.

Problem: the code randomly goes to sleep after a few hours and I have to restart it. My best guess is that it gets stuck on a URL request.

Tools/packages used:

  • Puppeteer
  • Cheerio
  • TypeScript

Code:

import * as cheerio from "cheerio";
import * as request from "request";
import * as fs from "fs";
import * as shell from "shelljs";
import pup = require("puppeteer");
class App {
    // @ts-ignore
    public browser: pup.Browser;
    public appendToFile(file: string, content: string): Promise < string > {
        return new Promise < string > ((resolve, reject) => {
            try {
                fs.appendFileSync(file, content);
                resolve("DONE");
            } catch (e) {
                reject(e);
            }
        });
    }
    public loadPage(url: string): Promise < any > {
        return new Promise < any > ((resolve, reject) => {
            request.get(url, async (err, res, html) => {
                if (!err && res.statusCode === 200) {
                    resolve(html);
                } else {
                    if (err) {
                        reject(err);
                    } else {
                        reject(res);
                    }
                }
            });
        });
    }
    public step1(url: string): Promise < string > {
        return new Promise < string > (async (resolve, reject) => {
            let page: pup.Page | undefined;
            try {
                let next = false;
                let urlLink = url;
                let first = true;
                let header = "unknown";
                let f = url.split("/");
                let folder = f[f.length - 3];
                folder = folder || header;
                let path = "data/" + folder;
                shell.mkdir("-p", path);
                page = await this.browser.newPage();

                await page.goto(url, {
                    timeout: 0
                });
                let count = 1;
                do {
                    next = false;
                    let res = await page.evaluate(() => {
                        let e = document.querySelectorAll(".ch-product-view-list-container.list-view li ul > li > h6 > a");
                        let p: string[] = [];
                        e.forEach((v) => {
                            p.push(("") + (v.getAttribute("href") as string));
                        });
                        return p;
                    });

                    // for(const l of res) {
                    //     try {
                    //         await this.step2(l, "" , "")
                    //     } catch(er) {
                    //         this.appendToFile("./error.txt", l + "::" + url + "\n").catch(e=>e)
                    //     }
                    // }

                    let p = [];
                    let c = 1;
                    for (const d of res) {
                        p.push(await this.step2(d, folder, c.toString()).catch((_e) => {
                            console.log(_e);
                            fs.appendFileSync("./error-2.txt", urlLink + " ### " + d + "\n");
                        }));
                        c++;
                    }
                    await Promise.all(p);

                    await this.appendToFile("./processed.txt", urlLink + ":" + count.toString() + "\n").catch(e => e);
                    count++;
                    console.log(urlLink + ":" + count);
                    let e = await page.evaluate(() => {
                        let ele = document.querySelector("#pagination-next") as Element;
                        let r = ele.getAttribute("style");
                        return r || "";
                    });
                    if (e === "") {
                        next = true;

                        await page.click("#pagination-next");
                        // console.log('waitng')
                        await page.waitFor(1000);
                        // console.log('done wait')
                        // await page.waitForNavigation({waitUntil: 'load'}).catch(e=> console.log(e));
                        //     await Promise.all([
                        //         page.click("#pagination-next"),
                        //         page.waitForNavigation({ waitUntil: 'networkidle0'}),
                        //     ]);
                    }
                } while (next);
                // await page.close();
                resolve("page all scrapped");
            } catch (errrr) {
                reject(errrr);
            } finally {
                if (page !== undefined) {
                    await page.close().catch(e => e);
                }
            }
        });
    }
    public step2(url: string, folder: string, file: string): Promise < string > {
        return new Promise < string > (async (resolve, reject) => {
            try {
                let html = await this.loadPage(url).catch(e => reject(e));
                let $ = cheerio.load(html);
                let ress: any = {};
                let t = $(".qal_title_heading").text();
                if (t) {
                    ress.header = t.replace(/"/g, "'").replace(/\n|\r|\t/g, "");
                }
                let d = $("div.ch_formatted_text.qal_thread-content_text.asker").html();
                if (d) {
                    ress.body = d.replace(/"/g, "'").replace(/\n|\r|\t/g, "");
                }
                // let sprit = "-------------------------------";
                let filename = "data" + file + ".json"; // ((t.replace(/[^\w\s]/gi, "")).substring(0,250)+".txt")
                let data = JSON.stringify(ress); // t +sprit + d + "\n---end---\n";
                await this.appendToFile("./data/"+ folder + "/" +filename, data+",\n")
                    .then((r) => {
                        resolve(r);
                    });
            } catch (err) {
                reject(err);
            }
        });
    }
}
async function main() {
    process.on("SIGTERM", () => {
        console.log("SigTerm received");
        process.exit(1);
    });
    process.on("SIGINT", () => {
        console.log("SigInt received");
        process.exit(1);
    });
    let path = "data/unknown";
    shell.mkdir("-p", path);
    let c = new App();
    let list: string[] = [];
    console.log(process.argv[2]);
    require("fs").readFileSync(process.argv[2], "utf-8").split(/\r?\n/).forEach((line: string) => {
        list.push(line);
    });
    console.log("total links->" + list.length);

    c.browser = await pup.launch({
        headless: true
    });
    for (const l of list) {
        await c.step1(l).then(e => {
            fs.appendFileSync("./processed.txt", l);
        }).catch(e => {
            fs.appendFileSync("./error.txt", l);
        });
    }
}
main();

Let me know if you need anything else from me. Also, this is all of the code.

Answer:

So, I figured out two issues.

  1. Chrome (under Puppeteer) consumes a lot of CPU, and the trend looks like this: at the start it sits at moderate usage, then gradually climbs. In my case it started at around 4% usage and reached 100% after a day. I have filed an issue on their repo.
  2. I did not specify a timeout on the request. It was: request.get(url, async (err, res, html) => { and it should be: request.get(url, { timeout: 1500 }, async (err, res, html) => { (a corrected loadPage is sketched below).

So far, my code has been running fine for more than a day. The only remaining issue is the high CPU usage, but that is not my concern right now.
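For reference, here is a minimal sketch of the loadPage method from the class above with that timeout applied. The 1500 ms value is simply the one from point 2, and collapsing the if/else around reject into reject(err || res) is only a stylistic simplification; with the request library, a hit timeout is reported as an error in the callback (ETIMEDOUT or ESOCKETTIMEDOUT), so it falls into the same reject path instead of hanging:

    public loadPage(url: string): Promise<string> {
        return new Promise<string>((resolve, reject) => {
            // The timeout option keeps a stalled request from hanging forever;
            // when it fires, the callback is invoked with an error and we reject.
            request.get(url, { timeout: 1500 }, (err, res, html) => {
                if (!err && res.statusCode === 200) {
                    resolve(html);
                } else {
                    reject(err || res);
                }
            });
        });
    }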
