Trying out scraping with puppeteer
I tried puppeteer, getting as far as running a basic scrape.
(This mostly just follows what is written in the puppeteer README.md.)
Creating and entering the directory
$ mkdir scraping
$ cd scraping
Dockerfile
I copied the Dockerfile verbatim from puppeteer/troubleshooting.md at v1.12.1 (GoogleChrome/puppeteer on GitHub).
Dockerfile
FROM node:8-slim
# See https://crbug.com/795759
RUN apt-get update && apt-get install -yq libgconf-2-4
# Install latest chrome dev package and fonts to support major charsets (Chinese, Japanese, Arabic, Hebrew, Thai and a few others)
# Note: this installs the necessary libs to make the bundled version of Chromium that Puppeteer
# installs, work.
RUN apt-get update && apt-get install -y wget --no-install-recommends \
&& wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
&& sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
&& apt-get update \
&& apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get purge --auto-remove -y curl \
&& rm -rf /src/*.deb
# It's a good idea to use dumb-init to help prevent zombie chrome processes.
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init
# Uncomment to skip the chromium download when installing puppeteer. If you do,
# you'll need to launch puppeteer with:
# browser.launch({executablePath: 'google-chrome-unstable'})
# ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true
# Install puppeteer so it's available in the container.
RUN npm i puppeteer
# Add user so we don't need --no-sandbox.
RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
&& mkdir -p /home/pptruser/Downloads \
&& chown -R pptruser:pptruser /home/pptruser \
&& chown -R pptruser:pptruser /node_modules
# Run everything after as non-privileged user.
USER pptruser
ENTRYPOINT ["dumb-init", "--"]
CMD ["google-chrome-unstable"]
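The comment in the Dockerfile says that when the Chromium download is skipped, puppeteer has to be launched with an explicit executablePath. A minimal sketch of how a script might choose its launch options follows. buildLaunchOptions and launchBrowser are hypothetical helper names, and treating any non-empty PUPPETEER_SKIP_CHROMIUM_DOWNLOAD value as "skipped" is an assumption of this sketch.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
// Assumption: a non-empty PUPPETEER_SKIP_CHROMIUM_DOWNLOAD means the bundled
// Chromium was not downloaded, so the system Chrome must be used instead.
function buildLaunchOptions(env = process.env) {
  const options = { headless: true };
  if (env.PUPPETEER_SKIP_CHROMIUM_DOWNLOAD) {
    // Same executable name as in the Dockerfile comment.
    options.executablePath = "google-chrome-unstable";
  }
  return options;
}

// Hypothetical helper: launches a browser with the options chosen above.
async function launchBrowser() {
  // Loaded lazily so the helper above can be used without puppeteer installed.
  const puppeteer = require("puppeteer");
  return puppeteer.launch(buildLaunchOptions());
}
```

With this in place, a script would call launchBrowser() instead of puppeteer.launch() and behave the same whether or not the ENV line in the Dockerfile is uncommented.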
Build
$ docker build -t puppeteer-chrome-linux .
...
Chromium downloaded to /node_modules/puppeteer/.local-chromium/linux-672088
npm WARN saveError ENOENT: no such file or directory, open '/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/package.json'
npm WARN !invalid#1 No description
npm WARN !invalid#1 No repository field.
npm WARN !invalid#1 No README data
npm WARN !invalid#1 No license field.
...
npm warned that package.json was missing, so I ran:
$ npm init
Pressing Enter at every prompt should be fine for now.
$ npm init
package name: (scraping)
version: (1.0.0)
description:
entry point: (index.js)
test command:
git repository:
keywords:
author:
license: (ISC)
About to write to /Users/username/scraping/package.json:
{
  "name": "scraping",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}
Is this OK? (yes)
This creates package.json:
{
  "name": "scraping",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}
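The interactive prompts can also be skipped with npm init -y, which accepts the defaults. As a rough illustration of what those defaults amount to, the sketch below writes a file with the same fields as the one shown above, but only when no package.json exists yet. ensurePackageJson is a hypothetical helper and not part of npm.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: writes a default package.json into `dir` if it has none.
// Returns true when a file was created and false when one already existed.
function ensurePackageJson(dir) {
  const file = path.join(dir, "package.json");
  if (fs.existsSync(file)) {
    return false;
  }
  const pkg = {
    name: path.basename(dir), // npm init also defaults to the directory name
    version: "1.0.0",
    description: "",
    main: "index.js",
    scripts: { test: 'echo "Error: no test specified" && exit 1' },
    author: "",
    license: "ISC",
  };
  fs.writeFileSync(file, JSON.stringify(pkg, null, 2) + "\n");
  return true;
}
```

Calling ensurePackageJson(process.cwd()) from the scraping directory would leave an existing package.json untouched.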
Let's run the docker build again.
$ docker build -t puppeteer-chrome-linux .
Sending build context to Docker daemon 259.3MB
Step 1/10 : FROM node:8-slim
---> bce75035da07
Step 2/10 : RUN apt-get update && apt-get install -yq libgconf-2-4
---> Using cache
---> 5032dee55575
Step 3/10 : RUN apt-get update && apt-get install -y wget --no-install-recommends && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' && apt-get update && apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont --no-install-recommends && rm -rf /var/lib/apt/lists/* && apt-get purge --auto-remove -y curl && rm -rf /src/*.deb
---> Using cache
---> 184f2fc73b93
Step 4/10 : ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
Downloading [==================================================>] 46.4kB/46.4kB
---> Using cache
---> d5c9df362ff8
Step 5/10 : RUN chmod +x /usr/local/bin/dumb-init
---> Using cache
---> 10101be0f2da
Step 6/10 : RUN npm i puppeteer
---> Using cache
---> 6c21b8e2a3cc
Step 7/10 : RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser && mkdir -p /home/pptruser/Downloads && chown -R pptruser:pptruser /home/pptruser && chown -R pptruser:pptruser /node_modules
---> Using cache
---> 5b1bc9596344
Step 8/10 : USER pptruser
---> Using cache
---> b954549a38d0
Step 9/10 : ENTRYPOINT ["dumb-init", "--"]
---> Using cache
---> 520e7b037be7
Step 10/10 : CMD ["google-chrome-unstable"]
---> Using cache
---> 4544bcada3d1
Successfully built 4544bcada3d1
Successfully tagged puppeteer-chrome-linux:latest
The build succeeded, with every step served from the cache.
puppeteer
Install puppeteer:
$ npm i -D puppeteer
package.json
"author": "",
- "license": "ISC"
+ "license": "ISC",
+ "devDependencies": {
+ "puppeteer": "^1.18.1"
+ }
}
In package.json, replace the default test script with a puppeteer script:
"description": "",
"main": "index.js",
"scripts": {
- "test": "echo \"Error: no test specified\" && exit 1"
+ "puppeteer": "node app/script/index.js"
},
docker-compose
Write docker-compose.yml:
version: "2"
services:
  main:
    build: "."
    container_name: "scraping"
    volumes:
      - "./app/script:/app/script"
Build it.
$ docker-compose up --build
Scraping
Visit a site and take a screenshot.
app/script/index.js
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await page.screenshot({ path: "./app/data/example.png" });
  await browser.close();
})();
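The script above only captures an image. As a hedged sketch of how text could be pulled from the same page, the helpers below use page.$eval to read the first element matching a selector. scrapeText and scrapeHeading are hypothetical names, and the default "h1" selector is an assumption about the target page.

```javascript
// Hypothetical helper: navigates `page` to `url` and returns the trimmed text
// of the first element matching `selector`.
// Note: page.$eval rejects if no element matches the selector.
async function scrapeText(page, url, selector = "h1") {
  await page.goto(url);
  const text = await page.$eval(selector, (el) => el.textContent);
  return text.trim();
}

// Hypothetical helper: launches a browser, scrapes one value, and closes the
// browser even when scraping fails.
async function scrapeHeading(url) {
  // Loaded lazily so scrapeText can be exercised with a stand-in page object.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    return await scrapeText(page, url);
  } finally {
    await browser.close();
  }
}
```

A call such as scrapeHeading("https://example.com").then(console.log) would print whatever the page's first h1 contains.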
Let's try the scrape.
$ yarn puppeteer
yarn run v1.6.0
$ node app/script/index.js
(node:77270) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open './app/data/example.png'
-- ASYNC --
at Page.<anonymous> (/Users/username/scraping/node_modules/puppeteer/lib/helper.js:111:15)
at /Users/username/scraping/app/script/index.js:7:14
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)
(node:77270) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:77270) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
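The UnhandledPromiseRejectionWarning above appears because nothing catches the rejection from the async function. One way to handle it, sketched under the assumption that a failed scrape should log a message and exit with a non-zero code, is shown below. runTask is a hypothetical helper name.

```javascript
// Hypothetical helper: runs an async task and turns a rejection into a logged
// message plus exit code 1, instead of leaving the rejection unhandled.
async function runTask(task, log = console.error) {
  try {
    await task();
    return 0;
  } catch (err) {
    log(`scrape failed: ${err.message}`);
    return 1;
  }
}
```

The script's async function could then be passed in as runTask(main).then((code) => { process.exitCode = code; }), where main is a stand-in name for that function.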
Create the app/data directory and run it again.
$ yarn puppeteer
yarn run v1.6.0
$ node app/script/index.js
✨ Done in 1.78s.
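Rather than creating app/data by hand, the script could create the output directory itself before taking the screenshot. This is a minimal sketch that assumes Node.js 10.12 or later, where fs.mkdirSync accepts the recursive option. ensureParentDir and captureScreenshot are hypothetical helper names.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: creates the directory that will contain `filePath`.
// Assumes Node.js 10.12+; with { recursive: true } an existing directory is
// not an error and missing parent directories are created as well.
function ensureParentDir(filePath) {
  const dir = path.dirname(filePath);
  fs.mkdirSync(dir, { recursive: true });
  return dir;
}

// Hypothetical helper: same flow as the script above, with the directory check
// added and the browser closed even if a step fails.
async function captureScreenshot(url, outputPath) {
  const puppeteer = require("puppeteer"); // loaded lazily
  ensureParentDir(outputPath);
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path: outputPath });
  } finally {
    await browser.close();
  }
}
```

Calling captureScreenshot("https://example.com", "./app/data/example.png") would then work on a fresh checkout where app/data does not exist yet.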
The screenshot was saved to app/data/example.png.