Trying out scraping with puppeteer
I tried out puppeteer and got as far as scraping a page with it.
(This mostly just follows what the puppeteer README.md describes.)
Creating the directory and moving into it
$ mkdir scraping
$ cd scraping
Dockerfile
I used the Dockerfile from puppeteer/troubleshooting.md at v1.12.1 (GoogleChrome/puppeteer on GitHub) exactly as written there.
Dockerfile
FROM node:8-slim
# See https://crbug.com/795759
RUN apt-get update && apt-get install -yq libgconf-2-4
# Install latest chrome dev package and fonts to support major charsets (Chinese, Japanese, Arabic, Hebrew, Thai and a few others)
# Note: this installs the necessary libs to make the bundled version of Chromium that Puppeteer
# installs, work.
RUN apt-get update && apt-get install -y wget --no-install-recommends \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
    && apt-get update \
    && apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont \
      --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get purge --auto-remove -y curl \
    && rm -rf /src/*.deb
# It's a good idea to use dumb-init to help prevent zombie chrome processes.
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init
# Uncomment to skip the chromium download when installing puppeteer. If you do,
# you'll need to launch puppeteer with:
#     browser.launch({executablePath: 'google-chrome-unstable'})
# ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true
# Install puppeteer so it's available in the container.
RUN npm i puppeteer
# Add user so we don't need --no-sandbox.
RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
    && mkdir -p /home/pptruser/Downloads \
    && chown -R pptruser:pptruser /home/pptruser \
    && chown -R pptruser:pptruser /node_modules
# Run everything after as non-privileged user.
USER pptruser
ENTRYPOINT ["dumb-init", "--"]
CMD ["google-chrome-unstable"]
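The commented-out lines in the Dockerfile say that if PUPPETEER_SKIP_CHROMIUM_DOWNLOAD is enabled, puppeteer has to be launched with an executablePath. Here is a minimal sketch of how a script could switch on that. It assumes the ENV line has been uncommented so the variable is visible at runtime as "true". buildLaunchOptions and launchBrowser are hypothetical helper names of my own, not part of puppeteer.

```javascript
// Hypothetical helper (not part of puppeteer): pick launch options
// depending on whether the bundled Chromium download was skipped.
function buildLaunchOptions(env) {
  const options = {};
  // Assumption: the Dockerfile's ENV line was uncommented, so the value is "true".
  if (env.PUPPETEER_SKIP_CHROMIUM_DOWNLOAD === "true") {
    // Same value as in the Dockerfile comment.
    options.executablePath = "google-chrome-unstable";
  }
  return options;
}

// Launches a browser with the options above. puppeteer is required lazily
// so that buildLaunchOptions can be used without puppeteer installed.
async function launchBrowser() {
  const puppeteer = require("puppeteer");
  return puppeteer.launch(buildLaunchOptions(process.env));
}
```

With the ENV line left commented out, buildLaunchOptions returns an empty object and puppeteer falls back to the Chromium it downloaded itself.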
Build
$ docker build -t puppeteer-chrome-linux .
...
Chromium downloaded to /node_modules/puppeteer/.local-chromium/linux-672088
npm WARN saveError ENOENT: no such file or directory, open '/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/package.json'
npm WARN !invalid#1 No description
npm WARN !invalid#1 No repository field.
npm WARN !invalid#1 No README data
npm WARN !invalid#1 No license field.
...
npm complained that package.json was missing, so I created one:
$ npm init
For now, pressing Enter at every prompt should be fine.
$ npm init
package name: (scraping)
version: (1.0.0)
description:
entry point: (index.js)
test command:
git repository:
keywords:
author:
license: (ISC)
About to write to /Users/username/scraping/package.json:
{
  "name": "scraping",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}
Is this OK? (yes)
This creates package.json.
{
  "name": "scraping",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}
Let's run the docker build once more.
$ docker build -t puppeteer-chrome-linux .
Sending build context to Docker daemon  259.3MB
Step 1/10 : FROM node:8-slim
 ---> bce75035da07
Step 2/10 : RUN apt-get update && apt-get install -yq libgconf-2-4
 ---> Using cache
 ---> 5032dee55575
Step 3/10 : RUN apt-get update && apt-get install -y wget --no-install-recommends     && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -     && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'     && apt-get update     && apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont       --no-install-recommends     && rm -rf /var/lib/apt/lists/*     && apt-get purge --auto-remove -y curl     && rm -rf /src/*.deb
 ---> Using cache
 ---> 184f2fc73b93
Step 4/10 : ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
Downloading [==================================================>]   46.4kB/46.4kB
 ---> Using cache
 ---> d5c9df362ff8
Step 5/10 : RUN chmod +x /usr/local/bin/dumb-init
 ---> Using cache
 ---> 10101be0f2da
Step 6/10 : RUN npm i puppeteer
 ---> Using cache
 ---> 6c21b8e2a3cc
Step 7/10 : RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser     && mkdir -p /home/pptruser/Downloads     && chown -R pptruser:pptruser /home/pptruser     && chown -R pptruser:pptruser /node_modules
 ---> Using cache
 ---> 5b1bc9596344
Step 8/10 : USER pptruser
 ---> Using cache
 ---> b954549a38d0
Step 9/10 : ENTRYPOINT ["dumb-init", "--"]
 ---> Using cache
 ---> 520e7b037be7
Step 10/10 : CMD ["google-chrome-unstable"]
 ---> Using cache
 ---> 4544bcada3d1
Successfully built 4544bcada3d1
Successfully tagged puppeteer-chrome-linux:latest
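The build succeeded.
(Every step reports "Using cache", so the image is the same as in the first build. The earlier messages were warnings from running npm i in the container's / directory, where there is no package.json, because the Dockerfile never copies one in.)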
puppeteer
Install puppeteer.
$ npm i -D puppeteer
package.json
   "author": "",
-  "license": "ISC"
+  "license": "ISC",
+  "devDependencies": {
+    "puppeteer": "^1.18.1"
+  }
 }
Add an entry to scripts in package.json.
   "description": "",
   "main": "index.js",
   "scripts": {
-    "test": "echo \"Error: no test specified\" && exit 1"
+    "puppeteer": "node app/script/index.js"
   },
docker-compose
Write docker-compose.yml.
version: "2"
services:
  main:
    build: "."
    container_name: "scraping"
    volumes:
      - "./app/script:/app/script"
Build it.
$ docker-compose up --build
Scraping
Access a site and take a screenshot of it.
app/script/index.js
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await page.screenshot({ path: "./app/data/example.png" });
  await browser.close();
})();
Let's try scraping.
$ yarn puppeteer
yarn run v1.6.0
$ node app/script/index.js
(node:77270) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open './app/data/example.png'
  -- ASYNC --
    at Page.<anonymous> (/Users/username/scraping/node_modules/puppeteer/lib/helper.js:111:15)
    at /Users/username/scraping/app/script/index.js:7:14
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
(node:77270) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:77270) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
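The screenshot could not be written because the app/data directory does not exist. I created the directory and ran the script again.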
$ yarn puppeteer
yarn run v1.6.0
$ node app/script/index.js
✨  Done in 1.78s.
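Creating app/data by hand works. As an alternative, the script itself could create the directory before taking the screenshot. Below is a minimal sketch of that idea. ensureParentDir is a hypothetical helper name, and the code assumes Node.js 10.12 or later, where fs.mkdirSync accepts the recursive option. The node:8-slim image used in the Dockerfile is older than that, so treat this as something for a newer Node.js only.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: create the parent directory of filePath if it is missing.
// Assumption: Node.js 10.12+, where { recursive: true } is supported and
// calling it on an existing directory is not an error.
function ensureParentDir(filePath) {
  const dir = path.dirname(filePath);
  fs.mkdirSync(dir, { recursive: true });
  return dir;
}

// Intended use in app/script/index.js, just before the screenshot call:
//   ensureParentDir("./app/data/example.png");
//   await page.screenshot({ path: "./app/data/example.png" });
```

Because the helper only looks at the path passed in, the output location can change later without touching the directory handling.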
app/data/example.png

The screenshot was taken successfully.
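The first run printed an UnhandledPromiseRejectionWarning because the async function in app/script/index.js has no catch. Here is a minimal sketch of one way to handle that, keeping the same puppeteer calls as the script. takeScreenshot and run are hypothetical names of my own, and passing puppeteer in as an argument is just a choice I made so the flow can be exercised without launching a real browser.

```javascript
// Same flow as app/script/index.js, but the browser is closed even on failure.
async function takeScreenshot(puppeteer, url, outputPath) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path: outputPath });
  } finally {
    // Runs whether or not goto or screenshot threw.
    await browser.close();
  }
}

// Hypothetical wrapper: log the error and mark the process as failed
// instead of leaving the rejection unhandled.
function run(task) {
  return task().catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}

// Usage in the script:
//   const puppeteer = require("puppeteer");
//   run(() => takeScreenshot(puppeteer, "https://example.com", "./app/data/example.png"));
```

With this shape, a missing app/data directory would still fail, but the error is logged once and the process ends with a non-zero exit code.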
