Building a Web Crawler with TypeScript
t554in32
8 years ago
<p>Install TypeScript globally:</p>
<pre> <code class="language-typeScript">npm install -g typescript</code></pre>
<p>The current version is 2.0.3, which no longer requires the typings command. However, the TypeScript version bundled with VS Code is 1.8, so some extra configuration is needed; the steps are described later in this article.</p>
<p>Test the tsc command:</p>
<pre> <code class="language-typeScript">tsc</code></pre>
<p>Create the project folder:</p>
<pre> <code class="language-typeScript">mkdir test-typescript-spider</code></pre>
<p>Enter the folder:</p>
<pre> <code class="language-typeScript">cd test-typescript-spider</code></pre>
<p>Initialize the project:</p>
<pre> <code class="language-typeScript">npm init</code></pre>
<p>Install the superagent and cheerio modules:</p>
<pre> <code class="language-typeScript">npm i --save superagent cheerio</code></pre>
<p>Install the matching type declaration modules:</p>
<pre> <code class="language-typeScript">npm i --save @types/superagent
npm i --save @types/cheerio</code></pre>
<p>Install TypeScript locally in the project (this step is required):</p>
<pre> <code class="language-typeScript">npm i --save typescript</code></pre>
<p>Open the project folder in VS Code. Create a tsconfig.json file in that folder and paste in the following configuration:</p>
<pre> <code class="language-typeScript">{
    "compilerOptions": {
        "target": "ES6",
        "module": "commonjs",
        "noEmitOnError": true,
        "noImplicitAny": true,
        "experimentalDecorators": true,
        "sourceMap": false,
        // "sourceRoot": "./",
        "outDir": "./out"
    },
    "exclude": [
        "node_modules"
    ]
}</code></pre>
<p>In VS Code, open "File" - "Preferences" - "Workspace Settings".</p>
<p>Add the following to settings.json (without this setting, VS Code asks which TypeScript version to use whenever the project is opened):</p>
<pre> <code class="language-typeScript">{
    "typescript.tsdk": "node_modules/typescript/lib"
}</code></pre>
<p>Create an api.ts file and paste in the following code:</p>
<pre> <code class="language-typeScript">import superagent = require('superagent');
import cheerio = require('cheerio');

// Wraps superagent's callback API in a Promise so it can be awaited
export const remote_get = function (url: string) {
    const promise = new Promise<superagent.Response>(function (resolve, reject) {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err);
                    reject(err);
                }
            });
    });
    return promise;
}</code></pre>
<p>Create an app.ts file and write some test code:</p>
<pre> <code class="language-typeScript">import api = require('./api');

const go = async () => {
    let res = await api.remote_get('http://www.baidu.com/');
    console.log(res.text);
}
go();</code></pre>
<p>Run the compiler:</p>
<pre> <code class="language-typeScript">tsc</code></pre>
<p>Then run the compiled output:</p>
<pre> <code class="language-typeScript">node out/app</code></pre>
<p>Check that the output is correct.</p>
<p>Now try to fetch the article links on the first page of http://cnodejs.org/.</p>
<p>Change app.ts to the following:</p>
<pre> <code class="language-typeScript">import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    let titles: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        titles.push($(element).find('.topic_title').first().text().trim());
        // The href already starts with "/", so the base URL has no trailing slash
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    console.log(titles, urls);
}
go();</code></pre>
<p>The output shows that both the article titles and the links have been collected.</p>
<p>Now go one level deeper and fetch the article contents:</p>
<pre> <code class="language-typeScript">import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    // each() does not wait for the async callbacks, so all requests start at almost the same time
    $('.topic_title_wrapper').each(async (index, element) => {
        let url = 'http://cnodejs.org' + $(element).find('.topic_title').first().attr('href');
        const res_content = await api.remote_get(url);
        const $_content = cheerio.load(res_content.text);
        console.log($_content('.topic_content').first().text());
    });
}
go();</code></pre>
<p>Because the requests hit the server too quickly, many of them fail with a 503 error.</p>
<p>The fix:</p>
<p>Add a helper.ts file:</p>
<pre> <code class="language-typeScript">// Resolves after the given number of seconds
export const wait_seconds = function (seconds: number) {
    return new Promise(resolve => setTimeout(resolve, seconds * 1000));
}</code></pre>
<p>Change api.ts to:</p>
<pre> <code class="language-typeScript">import superagent = require('superagent');
import cheerio = require('cheerio');

// Must be declared async because it uses await
export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls:
        string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;
}

export const get_content = async function (url: string) {
    const res = await remote_get(url);
    const $ = cheerio.load(res.text);
    return $('.topic_content').first().text();
}

export const remote_get = function (url: string) {
    const promise = new Promise<superagent.Response>(function (resolve, reject) {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err);
                    reject(err);
                }
            });
    });
    return promise;
}</code></pre>
<p>Change app.ts to:</p>
<pre> <code class="language-typeScript">import api = require('./api');
import helper = require('./helper');

const go = async () => {
    // get_index_urls() fetches the index page itself, so no separate request is needed here
    let urls = await api.get_index_urls();
    for (let i = 0; i < urls.length; i++) {
        // Wait one second before each content request
        await helper.wait_seconds(1);
        let text = await api.get_content(urls[i]);
        console.log(text);
    }
}
go();</code></pre>
<p>The output shows that the program now waits one second before requesting the next content page.</p>
<p>Now try storing the fetched data in a database.</p>
<p>Install the mongoose module:</p>
<pre> <code class="language-typeScript">npm i --save mongoose
npm i --save @types/mongoose</code></pre>
<p>Next, define the schema. First create a models folder:</p>
<pre> <code class="language-typeScript">mkdir models</code></pre>
<p>Create index.ts in the models folder:</p>
<pre> <code class="language-typeScript">import * as mongoose from 'mongoose';

mongoose.connect('mongodb://127.0.0.1/cnodejs_data', {
    server: { poolSize: 20 }
}, function (err) {
    if (err) {
        process.exit(1);
    }
});

// models
// The path must match the file name Article.ts exactly on case-sensitive file systems
export const Article = require('./Article');</code></pre>
<p>Create IArticle.ts in the models folder:</p>
<pre> <code class="language-typeScript">// Use the primitive type string, not the String wrapper object type
interface IArticle {
    title: string;
    url: string;
    text: string;
}
export = IArticle;</code></pre>
<p>Create Article.ts in the models folder:</p>
<pre> <code class="language-typeScript">import mongoose = require('mongoose');
import
    IArticle = require('./IArticle');

interface IArticleModel extends IArticle, mongoose.Document { }

const ArticleSchema = new mongoose.Schema({
    title: { type: String },
    url: { type: String },
    text: { type: String },
});

const Article = mongoose.model<IArticleModel>("Article", ArticleSchema);
export = Article;</code></pre>
<p>Change api.ts to:</p>
<pre> <code class="language-typeScript">import superagent = require('superagent');
import cheerio = require('cheerio');
import models = require('./models');
const Article = models.Article;

export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;
}

export const fetch_content = async function (url: string) {
    const res = await remote_get(url);
    const $ = cheerio.load(res.text);
    let article = new Article();
    article.text = $('.topic_content').first().text();
    // Strip the site's "置顶" (pinned) and "精华" (featured) labels from the title
    article.title = $('.topic_full_title').first().text().replace('置顶', '').replace('精华', '').trim();
    article.url = url;
    console.log('Fetched: ' + article.title);
    // Await the save so that a database error reaches the caller's try/catch
    await article.save();
}

export const remote_get = function (url: string) {
    return new Promise<superagent.Response>((resolve, reject) => {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    reject(err);
                }
            });
    });
}</code></pre>
<p>Change app.ts to:</p>
<pre> <code class="language-typeScript">import api = require('./api');
import helper = require('./helper');

(async () => {
    try {
        let urls = await api.get_index_urls();
        for (let i = 0; i < urls.length; i++) {
            await helper.wait_seconds(1);
            await api.fetch_content(urls[i]);
        }
    } catch (err) {
        console.log(err);
    }
    console.log('Done!');
})();</code></pre>
<p>Run:</p>
<pre> <code class="language-typeScript">tsc
node out/app</code></pre>
<p>Watch the output, then check the database.</p>
<p>The articles have been stored successfully!</p>
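<p>The "wait, then fetch the next page" loop used above can be pulled out into a small reusable helper. The following is only a sketch under stated assumptions: the names sleepMs, ThrottledOutcome and runThrottled are hypothetical and do not come from the original project, and the helper records a failing item instead of stopping the whole run, which is a design choice rather than what the code above does.</p>

```typescript
// Hypothetical helper, not part of the original project.
const sleepMs = (ms: number): Promise<void> =>
    new Promise<void>(resolve => setTimeout(resolve, ms));

interface ThrottledOutcome<T, R> {
    results: R[];
    failures: { item: T; error: any }[];
}

// Runs `task` on one item at a time, pausing `delayMs` between items.
// A failing item is recorded in `failures` instead of aborting the whole run.
async function runThrottled<T, R>(
    items: T[],
    task: (item: T) => Promise<R>,
    delayMs: number
): Promise<ThrottledOutcome<T, R>> {
    const outcome: ThrottledOutcome<T, R> = { results: [], failures: [] };
    for (let i = 0; i < items.length; i++) {
        if (i > 0) {
            await sleepMs(delayMs); // no pause before the first item
        }
        try {
            outcome.results.push(await task(items[i]));
        } catch (error) {
            outcome.failures.push({ item: items[i], error });
        }
    }
    return outcome;
}
```

<p>With the functions from this article, a call could look like runThrottled(urls, api.fetch_content, 1000), which would fetch and store one article per second and collect any URLs that failed.</p>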
<p>Addendum: an improved version of the remote_get method that adds retry on error and support for a proxy server.</p>
<p>It drops the superagent library in favor of the request library, and is for reference only:</p>
<pre> <code class="language-typeScript">import request = require('request');
import { wait_seconds } from './helper';

// `config` is assumed to be defined elsewhere, for example with config.retries = 3
let current_retry = config.retries || 0;

export const remote_get = async function (url: string, proxy?: string) {
    // Always pause briefly before each request
    await wait_seconds(2);
    if (!proxy) {
        proxy = '';
    }
    const promise = new Promise<string>(function (resolve, reject) {
        console.log('get: ' + url + ', using proxy: ' + proxy);
        let options: request.CoreOptions = {
            headers: {
                'Cookie': '',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
                'Referer': 'https://www.baidu.com/'
            },
            encoding: 'utf-8',
            method: 'GET',
            proxy: proxy,
            timeout: 3000,
        }
        request(url, options, async function (err, response, body) {
            console.log('got:' + url);
            if (!err) {
                body = body.toString();
                // Success: reset the shared retry counter
                current_retry = config.retries || 0;
                console.log('bytes:' + body.length);
                resolve(body);
            } else {
                console.log(err);
                if (current_retry <= 0) {
                    // Out of retries: reset the counter and give up
                    current_retry = config.retries || 0;
                    reject(err);
                } else {
                    console.log('retry...(' + current_retry + ')');
                    current_retry--;
                    try {
                        let body = await remote_get(url, proxy);
                        resolve(body);
                    } catch (e) {
                        reject(e);
                    }
                }
            }
        });
    });
    return promise;
}</code></pre>
<p>Also, it may be better to merge IArticle.ts and Article.ts into a single file. For reference, here is how I wrote another model:</p>
<pre> <code class="language-typeScript">import mongoose = require('mongoose');

interface IProxyModel {
    uri: string;
    ip: string;
    port: string;
    info: string;
}

export interface IProxy extends IProxyModel, mongoose.Document { }

const ProxySchema = new mongoose.Schema({
    uri: { type: String },
    ip: { type: String },
    port: { type: String },
    info: { type: String },
});

export const Proxy = mongoose.model<IProxy>("Proxy", ProxySchema);</code></pre>
<p>Then import it like this:</p>
<pre> <code class="language-typeScript">import { IProxy, Proxy } from './models';</code></pre>
<p>Here Proxy can be used for operations such as new, find and where:</p>
<pre> <code class="language-typeScript">let x = new Proxy();
let xx =
    await Proxy.find({});
let xxx = await Proxy.where('aaa', 123).exec();</code></pre>
<p>IProxy, on the other hand, is used as the type when passing entity objects around, for example:</p>
<pre> <code class="language-typeScript">function xxx(p: IProxy) {
}</code></pre>
<p>Source: https://segmentfault.com/a/1190000007326795</p>
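<p>The retry logic in the addendum keeps its counter in a module-level variable, so overlapping requests would share it. As an alternative, the retry step can be written as a generic wrapper whose counter is local to each call. This is a minimal sketch under stated assumptions, not the author's method: the names pauseMs and withRetry are hypothetical, and the fixed delay between attempts is an arbitrary choice.</p>

```typescript
// Hypothetical helper, not part of the original article's code.
const pauseMs = (ms: number): Promise<void> =>
    new Promise<void>(resolve => setTimeout(resolve, ms));

// Calls `operation` until it succeeds or `retries` extra attempts are used up.
// The attempt counter is local to each call, so concurrent calls share no state.
async function withRetry<T>(
    operation: () => Promise<T>,
    retries: number,
    delayMs: number
): Promise<T> {
    let lastError: any;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
            if (attempt < retries) {
                await pauseMs(delayMs); // wait before the next attempt
            }
        }
    }
    // Every attempt failed: surface the last error to the caller
    throw lastError;
}
```

<p>Assuming a remote_get without built-in retries, a call might look like withRetry(() => remote_get(url, proxy), 3, 2000).</p>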