Building a Web Crawler with TypeScript

t554in32, 8 years ago
<p>Install TypeScript globally:</p>
<pre><code class="language-typeScript">npm install -g typescript</code></pre>
<p>The current version at the time of writing is 2.0.3, which no longer needs the typings command. VS Code, however, bundles TypeScript 1.8, so some extra configuration is required; the steps in this article cover it.</p>
<p>Check that the tsc command works:</p>
<pre><code class="language-typeScript">tsc</code></pre>
<p>Create a folder for the project:</p>
<pre><code class="language-typeScript">mkdir test-typescript-spider</code></pre>
<p>Enter the folder:</p>
<pre><code class="language-typeScript">cd test-typescript-spider</code></pre>
<p>Initialize the project:</p>
<pre><code class="language-typeScript">npm init</code></pre>
<p>Install the superagent and cheerio modules:</p>
<pre><code class="language-typeScript">npm i --save superagent cheerio</code></pre>
<p>Install the matching type declaration packages (note that npm's lowercase -s flag means "silent", not "save", so only --save is used here):</p>
<pre><code class="language-typeScript">npm i --save @types/superagent
npm i --save @types/cheerio</code></pre>
<p>Install TypeScript inside the project as well. This step is required, because the VS Code setting further down points at this local copy:</p>
<pre><code class="language-typeScript">npm i --save typescript</code></pre>
<p>Open the project folder in VS Code. Create a tsconfig.json file in it and paste in the following configuration:</p>
<pre><code class="language-typeScript">{
    "compilerOptions": {
        "target": "ES6",
        "module": "commonjs",
        "noEmitOnError": true,
        "noImplicitAny": true,
        "experimentalDecorators": true,
        "sourceMap": false,
        // "sourceRoot": "./",
        "outDir": "./out"
    },
    "exclude": [
        "node_modules"
    ]
}</code></pre>
<p>In VS Code, open File > Preferences > Workspace Settings.</p>
<p>Add the following to settings.json (without it, VS Code asks which TypeScript version to use every time the project is opened):</p>
<pre><code class="language-typeScript">{
    "typescript.tsdk": "node_modules/typescript/lib"
}</code></pre>
<p>Create an api.ts file and paste in the following code:</p>
<pre><code class="language-typeScript">import superagent = require('superagent');
import cheerio = require('cheerio');

// Wrap superagent's callback API in a Promise so that it can be awaited.
export const remote_get = function(url: string) {

    const promise = new Promise<superagent.Response>(function (resolve, reject) {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err)
                    reject(err);
                }
            });
    });
    return promise;
}</code></pre>
<p>Create an app.ts file and write some test code:</p>
<pre><code class="language-typeScript">import api = require('./api');
const go = async () => {
    let res = await api.remote_get('http://www.baidu.com/');
    console.log(res.text);
}
go();</code></pre>
<p>Compile:</p>
<pre><code class="language-typeScript">tsc</code></pre>
<p>Then run:</p>
<pre><code class="language-typeScript">node out/app</code></pre>
<p>Check that the output is the HTML of the page.</p>
<p>Now try to collect the article links on the first page of http://cnodejs.org/.</p>
<p>Change app.ts to the following:</p>
<pre><code class="language-typeScript">import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    let titles: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        titles.push($(element).find('.topic_title').first().text().trim());
        // The href already starts with '/', so the base has no trailing slash.
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    })
    console.log(titles, urls);
}
go();</code></pre>
<p>The output shows that both the titles and the links of the articles have been collected.</p>
<p>Now try going one level deeper and fetching the content of each article:</p>
<pre><code class="language-typeScript">import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    $('.topic_title_wrapper').each(async (index, element) => {
        let url = ('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
        const res_content = await api.remote_get(url);
        const $_content = cheerio.load(res_content.text);
        console.log($_content('.topic_content').first().text());
    })

}
go();</code></pre>
<p>This produces many 503 errors, because the server is being hit too aggressively: each() does not wait for the async callback to finish, so all the requests are sent almost at once.</p>
<p>The fix:</p>
<p>Add a helper.ts file:</p>
<pre><code class="language-typeScript">// Resolve after the given number of seconds.
export const wait_seconds = function (seconds: number) {
    return new Promise<void>(resolve => setTimeout(resolve, seconds * 1000));
}</code></pre>
<p>Change api.ts to:</p>
<pre><code class="language-typeScript">import superagent = require('superagent');
import cheerio = require('cheerio');

// Must be async, because it awaits remote_get.
export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;
}
export const get_content = async function (url: string) {
    const res = await remote_get(url);
    const $ = cheerio.load(res.text);
    return $('.topic_content').first().text();
}

export const remote_get = function (url: string) {

    const promise = new Promise<superagent.Response>(function (resolve, reject) {

        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err)
                    reject(err);
                }
            });
    });
    return promise;
}</code></pre>
<p>Change app.ts to:</p>
<pre><code class="language-typeScript">import api = require('./api');
import helper = require('./helper');

const go = async () => {
    let urls = await api.get_index_urls();
    // A plain for loop, so that each await really pauses the iteration.
    for (let i = 0; i < urls.length; i++) {
        await helper.wait_seconds(1);
        let text = await api.get_content(urls[i]);
        console.log(text);
    }
}
go();</code></pre>
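<p>The one-request-at-a-time loop above can be factored into a small reusable helper. The following is only a sketch: run_throttled and wait_ms are hypothetical names that do not appear in the original project, and the delay is given in milliseconds rather than seconds.</p>

```typescript
// Resolve after the given number of milliseconds.
const wait_ms = (ms: number): Promise<void> =>
    new Promise<void>(resolve => setTimeout(resolve, ms));

// Run `task` on each item strictly one after another, pausing `delayMs`
// before every call. Unlike a callback passed to each(), the await inside
// a plain for loop suspends the loop until the task has finished.
async function run_throttled<T, R>(
    items: T[],
    delayMs: number,
    task: (item: T) => Promise<R>
): Promise<R[]> {
    const results: R[] = [];
    for (const item of items) {
        await wait_ms(delayMs);
        results.push(await task(item));
    }
    return results;
}
```

<p>Under these assumptions, the loop in app.ts would reduce to something like: const texts = await run_throttled(urls, 1000, api.get_content);</p>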
<p>The output shows that the program now waits one second before requesting the next content page.</p>
<p>Next, try storing the fetched data in a database.</p>
<p>Install the mongoose module and its type declarations:</p>
<pre><code class="language-typeScript">npm i --save mongoose
npm i --save @types/mongoose</code></pre>
<p>Then define the schema. First create a models folder:</p>
<pre><code class="language-typeScript">mkdir models</code></pre>
<p>Create index.ts in the models folder:</p>
<pre><code class="language-typeScript">import * as mongoose from 'mongoose';

mongoose.connect('mongodb://127.0.0.1/cnodejs_data', {
    server: { poolSize: 20 }
}, function (err) {
    if (err) {
        // Give up if the database cannot be reached.
        process.exit(1);
    }
});

// models
// The path must match the file name Article.ts exactly on case-sensitive file systems.
export const Article = require('./Article');</code></pre>
<p>Create IArticle.ts in the models folder:</p>
<pre><code class="language-typeScript">interface IArticle {
    title: string;
    url: string;
    text: string;
}
export = IArticle;</code></pre>
<p>Create Article.ts in the models folder:</p>
<pre><code class="language-typeScript">import mongoose = require('mongoose');
import IArticle = require('./IArticle');
interface IArticleModel extends IArticle, mongoose.Document { }

const ArticleSchema = new mongoose.Schema({
    title: { type: String },
    url: { type: String },
    text: { type: String },
});

const Article = mongoose.model<IArticleModel>("Article", ArticleSchema);
export = Article;</code></pre>
<p>Change api.ts to:</p>
<pre><code class="language-typeScript">import superagent = require('superagent');
import cheerio = require('cheerio');
import models = require('./models');
const Article = models.Article;

export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');

    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;

}
export const fetch_content = async function (url: string) {
    const res = await
        remote_get(url);

    const $ = cheerio.load(res.text);
    let article = new Article();
    article.text = $('.topic_content').first().text();
    // Strip the "置顶" (pinned) and "精华" (featured) badges from the title.
    article.title = $('.topic_full_title').first().text().replace('置顶', '').replace('精华', '').trim();
    article.url = url;
    console.log('Fetched: ' + article.title);
    // Wait for the write to finish so that save errors reach the caller.
    await article.save();

}
export const remote_get = function (url: string) {

    return new Promise<superagent.Response>((resolve, reject) => {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    reject(err);
                }
            });
    });
}</code></pre>
<p>Change app.ts to:</p>
<pre><code class="language-typeScript">import api = require('./api');
import helper = require('./helper');

(async () => {

    try {
        let urls = await api.get_index_urls();
        for (let i = 0; i < urls.length; i++) {
            await helper.wait_seconds(1);
            await api.fetch_content(urls[i]);
        }
    } catch (err) {
        console.log(err);
    }

    console.log('Done!');

})();</code></pre>
<p>Run:</p>
<pre><code class="language-typeScript">tsc
node out/app</code></pre>
<p>Watch the output, then check the database.</p>
<p>The articles have been saved successfully.</p>
<p>Addendum: an improved version of remote_get that retries on error and supports a proxy server.</p>
<p>It drops the superagent library in favour of the request library, and is offered for reference only:</p>
<pre><code class="language-typeScript">import request = require('request');
import { wait_seconds } from './helper';

// `config` is assumed to be defined elsewhere, for example with config.retries = 3.
let current_retry = config.retries || 0;
export const remote_get = async function (url: string, proxy?: string) {
    // Pause briefly before every request.
    await wait_seconds(2);
    if (!proxy) {
        proxy = '';
    }
    const promise = new Promise<string>(function (resolve, reject) {
        console.log('get: ' + url + ',  using proxy: ' + proxy);
        let options: request.CoreOptions = {
            headers: {
                'Cookie': '',
                'User-Agent':
                    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
                'Referer': 'https://www.baidu.com/'
            },
            encoding: 'utf-8',
            method: 'GET',
            proxy: proxy,
            timeout: 3000,
        }
        request(url, options, async function (err, response, body) {
            console.log('got:' + url);
            if (!err) {
                body = body.toString();
                // Success: reset the retry budget for the next URL.
                current_retry = config.retries || 0;
                console.log('bytes:' + body.length);
                resolve(body);
            } else {
                console.log(err);
                if (current_retry <= 0) {
                    // Out of retries: reset the budget and give up on this URL.
                    current_retry = config.retries || 0;
                    reject(err);
                } else {
                    console.log('retry...(' + current_retry + ')')
                    current_retry--;
                    try {
                        let body = await remote_get(url, proxy);
                        resolve(body);
                    } catch (e) {
                        reject(e);
                    }
                }
            }
        });
    });
    return promise;
}</code></pre>
<p>Also, it is probably better to merge IArticle.ts and Article.ts into a single file. Another model of mine shows the pattern:</p>
<pre><code class="language-typeScript">import mongoose = require('mongoose');

interface IProxyModel {
    uri: string;
    ip: string;
    port: string;
    info: string;
}
export interface IProxy extends IProxyModel, mongoose.Document { }

const ProxySchema = new mongoose.Schema({
    uri: { type: String },
    ip: { type: String },
    port: { type: String },
    info: { type: String },
});
export const Proxy = mongoose.model<IProxy>("Proxy", ProxySchema);</code></pre>
<p>Importing it then looks like this:</p>
<pre><code class="language-typeScript">import { IProxy, Proxy } from './models';</code></pre>
<p>Here Proxy is the model, used for operations such as new, find and where:</p>
<pre><code class="language-typeScript">let x = new Proxy();
let xx = await Proxy.find({});
let xxx = await Proxy.where('aaa',123).exec();</code></pre>
<p>IProxy, in turn, is the type to use when passing document objects around, for example:</p>
<pre><code class="language-typeScript">function xxx(p:IProxy){
}</code></pre>
<p>Source: https://segmentfault.com/a/1190000007326795</p>
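<p>As a side note on the retry logic in the improved remote_get: it tracks the remaining attempts in the module-level variable current_retry, which is fine for a crawler that fetches one URL at a time but would be shared between calls if several ran concurrently. One possible alternative, shown here only as a sketch, keeps the counter local to each call. The name with_retry is hypothetical and is not part of the original code.</p>

```typescript
// Run an async operation, retrying up to `retries` extra times on failure.
// The attempt counter is a local variable, so concurrent calls do not
// share (and cannot corrupt) each other's retry budget.
async function with_retry<T>(
    operation: () => Promise<T>,
    retries: number
): Promise<T> {
    let lastError: any;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            return await operation();
        } catch (err) {
            // Remember the failure and try again if attempts remain.
            lastError = err;
        }
    }
    // Every attempt failed: surface the most recent error.
    throw lastError;
}
```

<p>With such a helper, remote_get itself would only need to perform a single request, and a caller could write something like: const body = await with_retry(() => remote_get(url, proxy), 3);</p>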