hene.dev

Trying out scraping with puppeteer

I tried out puppeteer, up to the point of actually scraping a page with it.

(I mostly just followed what is written in puppeteer's README.md.)

Creating the directory and moving into it

$ mkdir scraping
$ cd scraping

Dockerfile

I used the contents of puppeteer/troubleshooting.md at v1.12.1 (GoogleChrome/puppeteer on GitHub) as-is.

Dockerfile

FROM node:8-slim

# See https://crbug.com/795759
RUN apt-get update && apt-get install -yq libgconf-2-4

# Install latest chrome dev package and fonts to support major charsets (Chinese, Japanese, Arabic, Hebrew, Thai and a few others)
# Note: this installs the necessary libs to make the bundled version of Chromium that Puppeteer
# installs, work.
RUN apt-get update && apt-get install -y wget --no-install-recommends \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
    && apt-get update \
    && apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont \
      --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get purge --auto-remove -y curl \
    && rm -rf /src/*.deb

# It's a good idea to use dumb-init to help prevent zombie chrome processes.
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init

# Uncomment to skip the chromium download when installing puppeteer. If you do,
# you'll need to launch puppeteer with:
#     browser.launch({executablePath: 'google-chrome-unstable'})
# ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true

# Install puppeteer so it's available in the container.
RUN npm i puppeteer

# Add user so we don't need --no-sandbox.
RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
    && mkdir -p /home/pptruser/Downloads \
    && chown -R pptruser:pptruser /home/pptruser \
    && chown -R pptruser:pptruser /node_modules

# Run everything after as non-privileged user.
USER pptruser

ENTRYPOINT ["dumb-init", "--"]
CMD ["google-chrome-unstable"]
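
The commented-out lines in the Dockerfile note that, if the Chromium download is skipped, puppeteer has to be launched with an explicit executablePath. As a rough sketch of how a script might pick its launch options from that environment variable (buildLaunchOptions is a hypothetical helper name, not part of puppeteer):

```javascript
// Hypothetical helper (not part of puppeteer): choose launch options
// depending on whether the bundled Chromium download was skipped.
function buildLaunchOptions(env) {
  // The Dockerfile comment sets PUPPETEER_SKIP_CHROMIUM_DOWNLOAD to "true".
  if (env.PUPPETEER_SKIP_CHROMIUM_DOWNLOAD === "true") {
    // Use the Chrome package installed by apt-get in the image.
    return { executablePath: "google-chrome-unstable" };
  }
  // Otherwise let puppeteer use the Chromium it downloaded itself.
  return {};
}

// Usage inside an async function (assumes puppeteer is installed):
//   const puppeteer = require("puppeteer");
//   const browser = await puppeteer.launch(buildLaunchOptions(process.env));
```

With the ENV line left commented out, as in this post, the helper returns an empty object and puppeteer falls back to its bundled Chromium.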

Build

$ docker build -t puppeteer-chrome-linux .

...

Chromium downloaded to /node_modules/puppeteer/.local-chromium/linux-672088
npm WARN saveError ENOENT: no such file or directory, open '/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/package.json'
npm WARN !invalid#1 No description
npm WARN !invalid#1 No repository field.
npm WARN !invalid#1 No README data
npm WARN !invalid#1 No license field.

...

npm warned that there was no package.json, so I created one:

$ npm init

For now, pressing Enter at every prompt should be fine.

$ npm init
package name: (scraping)
version: (1.0.0)
description:
entry point: (index.js)
test command:
git repository:
keywords:
author:
license: (ISC)
About to write to /Users/username/scraping/package.json:

{
  "name": "scraping",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}


Is this OK? (yes)

This creates package.json.

{
  "name": "scraping",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}

Let's run the docker build once more.

$ docker build -t puppeteer-chrome-linux .
Sending build context to Docker daemon  259.3MB
Step 1/10 : FROM node:8-slim
 ---> bce75035da07
Step 2/10 : RUN apt-get update && apt-get install -yq libgconf-2-4
 ---> Using cache
 ---> 5032dee55575
Step 3/10 : RUN apt-get update && apt-get install -y wget --no-install-recommends     && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -     && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'     && apt-get update     && apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont       --no-install-recommends     && rm -rf /var/lib/apt/lists/*     && apt-get purge --auto-remove -y curl     && rm -rf /src/*.deb
 ---> Using cache
 ---> 184f2fc73b93
Step 4/10 : ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
Downloading [==================================================>]   46.4kB/46.4kB
 ---> Using cache
 ---> d5c9df362ff8
Step 5/10 : RUN chmod +x /usr/local/bin/dumb-init
 ---> Using cache
 ---> 10101be0f2da
Step 6/10 : RUN npm i puppeteer
 ---> Using cache
 ---> 6c21b8e2a3cc
Step 7/10 : RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser     && mkdir -p /home/pptruser/Downloads     && chown -R pptruser:pptruser /home/pptruser     && chown -R pptruser:pptruser /node_modules
 ---> Using cache
 ---> 5b1bc9596344
Step 8/10 : USER pptruser
 ---> Using cache
 ---> b954549a38d0
Step 9/10 : ENTRYPOINT ["dumb-init", "--"]
 ---> Using cache
 ---> 520e7b037be7
Step 10/10 : CMD ["google-chrome-unstable"]
 ---> Using cache
 ---> 4544bcada3d1
Successfully built 4544bcada3d1
Successfully tagged puppeteer-chrome-linux:latest

It succeeded.

puppeteer

Install puppeteer.

$ npm i -D puppeteer

package.json

   "author": "",
-  "license": "ISC"
+  "license": "ISC",
+  "devDependencies": {
+    "puppeteer": "^1.18.1"
+  }
 }

Add a puppeteer entry to scripts in package.json.

   "description": "",
   "main": "index.js",
   "scripts": {
-    "test": "echo \"Error: no test specified\" && exit 1"
+    "puppeteer": "node app/script/index.js"
   },

docker-compose

Write docker-compose.yml.

version: "2"

services:
  main:
    build: "."
    container_name: "scraping"
    volumes:
      - "./app/script:/app/script"

Build it.

$ docker-compose up --build

Scraping

Access a site and take a screenshot of it.

app/script/index.js

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await page.screenshot({ path: "./app/data/example.png" });

  await browser.close();
})();

Let's try scraping.

$ yarn puppeteer
yarn run v1.6.0
$ node app/script/index.js
(node:77270) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open './app/data/example.png'
  -- ASYNC --
    at Page.<anonymous> (/Users/username/scraping/node_modules/puppeteer/lib/helper.js:111:15)
    at /Users/username/scraping/app/script/index.js:7:14
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
(node:77270) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:77270) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
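
The UnhandledPromiseRejectionWarning appears because the async function in index.js has no catch. One possible way to report the failure explicitly is sketched below; runMain is a hypothetical helper name, and the exit-code convention is an assumption rather than anything puppeteer requires.

```javascript
// Hypothetical wrapper: run an async entry point and report failures
// instead of leaving the promise rejection unhandled.
async function runMain(main) {
  try {
    await main();
    return 0; // exit code for success
  } catch (err) {
    console.error("scraping failed:", err.message);
    return 1; // non-zero exit code for failure
  }
}

// Usage:
//   runMain(async () => { /* puppeteer code goes here */ })
//     .then((code) => { process.exitCode = code; });
```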

Create the app/data directory and run it again.

$ yarn puppeteer
yarn run v1.6.0
$ node app/script/index.js
✨  Done in 1.78s.
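
Instead of creating app/data by hand, the script could create the output directory itself before calling page.screenshot. The following is a sketch under some assumptions: ensureParentDir and takeScreenshot are hypothetical names, the puppeteer module is passed in as an argument so the function is easy to try out, and fs.mkdirSync with recursive: true needs Node.js 10.12 or later (newer than the node:8 image used in the Dockerfile).

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: create the directory that will hold outPath.
// { recursive: true } needs Node.js 10.12+ and does nothing if it exists.
function ensureParentDir(outPath) {
  const dir = path.dirname(outPath);
  fs.mkdirSync(dir, { recursive: true });
  return dir;
}

// Hypothetical function: same steps as index.js, but the output directory
// is created first and the browser is closed even if a step throws.
async function takeScreenshot(puppeteer, url, outPath) {
  ensureParentDir(outPath);
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path: outPath });
  } finally {
    await browser.close();
  }
}

// Usage (assumes puppeteer is installed):
//   takeScreenshot(require("puppeteer"), "https://example.com",
//                  "./app/data/example.png");
```

Closing the browser in finally keeps a failed screenshot from leaving a Chromium process behind.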

app/data/example.png

[Image: example.png]

I was able to take the screenshot.
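
A screenshot is only one possible result. As a sketch of pulling text out of the page instead, the function below reads the page title and the first h1 with page.title() and page.$eval(). scrapeHeading is a hypothetical name, the puppeteer module is again passed in as an argument, and the presence of an h1 on the target page is an assumption.

```javascript
// Hypothetical function: open a page and return its title and first <h1> text.
// page.$eval rejects if no element matches the selector.
async function scrapeHeading(puppeteer, url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    const title = await page.title();
    // The callback runs inside the browser page, not in Node.js.
    const heading = await page.$eval("h1", (el) => el.textContent.trim());
    return { title, heading };
  } finally {
    await browser.close();
  }
}

// Usage (assumes puppeteer is installed):
//   scrapeHeading(require("puppeteer"), "https://example.com")
//     .then((result) => console.log(result));
```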
