WebMagic 0.4.1 Released: A Java Crawler Framework

jopen, 12 years ago

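This release strengthens Ajax crawling and makes several functional improvements. It also introduces an important scripting feature, "webmagic-script", in preparation for the upcoming WebMagic-Avalon plan.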

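Enhancements:

Fixed an issue where the Spider occasionally failed to exit after all pages had been crawled. A detailed analysis of the problem is available separately for interested readers.
Replaced the algorithm in SmartContentSelector, which extracts the main body text, with the content-extraction algorithm from Harbin Institute of Technology (https://code.google.com/p/cx-extractor/ ). It performed well in testing. Usage: Html.getSmartContent().
Added more HTTP information to Page, including the HTTP status code, "Page.getStatusCode()", and the unparsed response body, "Page.getRawText()".
Added monitoring information to Spider, including the number of pages crawled, "Spider.getPageCount()", the running status, "Spider.getStatus()", and the number of active threads, "Spider.getThreadAlive()".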
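
To give a feel for the kind of heuristic behind the new body-text extraction, here is a minimal sketch of a line-block text-density approach in the spirit of cx-extractor. It is not the cx-extractor code and not WebMagic's SmartContentSelector: the class name, the tag-stripping regexes, and the BLOCK_WIDTH and THRESHOLD values are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LineBlockSketch {

    // Assumed values for illustration; a real implementation tunes these.
    static final int BLOCK_WIDTH = 3;  // number of consecutive lines per block
    static final int THRESHOLD = 40;   // minimum block length that starts a body run

    public static String extract(String html) {
        // Crude tag stripping that keeps line breaks (a real extractor is more careful).
        String text = html
                .replaceAll("(?is)<script.*?</script>", "")
                .replaceAll("(?is)<style.*?</style>", "")
                .replaceAll("<[^>]*>", "");
        String[] lines = text.split("\r?\n", -1);
        int n = lines.length;

        // Length of each line, ignoring whitespace.
        int[] len = new int[n];
        for (int i = 0; i < n; i++) {
            len[i] = lines[i].replaceAll("\\s+", "").length();
        }

        // block[i] = total length of lines i .. i + BLOCK_WIDTH - 1.
        int[] block = new int[n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < Math.min(n, i + BLOCK_WIDTH); j++) {
                block[i] += len[j];
            }
        }

        // A run starts where a block is dense enough and ends at the first empty block.
        // Keep the run that contains the most text.
        int bestStart = -1, bestEnd = -1, bestScore = 0;
        int i = 0;
        while (i < n) {
            if (block[i] < THRESHOLD) {
                i++;
                continue;
            }
            int start = i;
            int score = 0;
            while (i < n && block[i] > 0) {
                score += len[i];
                i++;
            }
            if (score > bestScore) {
                bestScore = score;
                bestStart = start;
                bestEnd = i;
            }
        }
        if (bestStart < 0) {
            return "";
        }

        // Join the non-empty lines of the winning run.
        List<String> kept = new ArrayList<>();
        for (int k = bestStart; k < bestEnd; k++) {
            String line = lines[k].trim();
            if (!line.isEmpty()) {
                kept.add(line);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        // Made-up page: short navigation, two body paragraphs, short footer.
        String html = "<div>Home About</div>\n\n\n"
                + "<p>This is the first long paragraph of the article body text.</p>\n"
                + "<p>This is the second long paragraph of the article body text.</p>\n"
                + "\n\n\n<div>Copyright</div>";
        System.out.println(extract(html));
    }
}
```

The idea is that navigation and footer lines are short and isolated, while body text forms a dense run of long lines, so no knowledge of the page's markup structure is needed. In WebMagic itself the call remains Html.getSmartContent().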

For Ajax, annotation mode now supports JsonPath expressions for extraction. Sample code:

import java.util.List;

import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;

public class AppStore {

    // Each field is filled from the JSON response using a JsonPath expression.
    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..trackName")
    private String trackName;

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..description")
    private String description;

    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..userRatingCount")
    private int userRatingCount;

    // multi = true collects every match into a list.
    @ExtractBy(type = ExtractBy.Type.JsonPath, value = "$..screenshotUrls", multi = true)
    private List<String> screenshotUrls;

    public static void main(String[] args) {
        AppStore appStore = OOSpider.create(Site.me(), AppStore.class)
                .<AppStore>get("http://itunes.apple.com/lookup?id=653350791&country=cn&entity=software");
        System.out.println(appStore.trackName);
        System.out.println(appStore.description);
        System.out.println(appStore.userRatingCount);
        System.out.println(appStore.screenshotUrls);
    }
}
  

For the meaning and detailed usage of JsonPath expressions, see: http://www.oschina.net/p/jsonpath
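
The `$..name` form used in the sample is JsonPath's deep-scan operator: it selects every value stored under that key at any depth of the document. The sketch below imitates that behaviour over plain Java maps and lists so the semantics can be seen without any library. The class and method names are hypothetical, the data is a made-up stand-in for a parsed lookup response, and a real JsonPath engine supports far more than this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeepScanSketch {

    // Collects every value stored under `key` at any depth, in document order,
    // which is roughly what the JsonPath expression "$..key" selects.
    public static List<Object> deepScan(Object node, String key) {
        List<Object> found = new ArrayList<>();
        collect(node, key, found);
        return found;
    }

    private static void collect(Object node, String key, List<Object> found) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
                if (key.equals(entry.getKey())) {
                    found.add(entry.getValue());
                }
                // Keep descending: the same key may appear again further down.
                collect(entry.getValue(), key, found);
            }
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                collect(child, key, found);
            }
        }
        // Scalars (strings, numbers, booleans, null) have no children to visit.
    }

    public static void main(String[] args) {
        // Hand-built stand-in for a parsed JSON response; the values are invented.
        Map<String, Object> app = new LinkedHashMap<>();
        app.put("trackName", "Example App");
        app.put("userRatingCount", 42);

        Map<String, Object> root = new LinkedHashMap<>();
        root.put("resultCount", 1);
        root.put("results", Collections.singletonList(app));

        System.out.println(deepScan(root, "trackName"));
    }
}
```

Because the scan ignores where the key sits in the tree, the expression keeps working even when the field is wrapped in an outer object or array, which is convenient for Ajax responses whose envelope is not interesting.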

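WebMagic's long-term goal is to become a complete product that lets even people who cannot program build a basic crawler with simple scripts, and that encourages script sharing. This is the WebMagic-Avalon plan. Feature discussion takes place at https://github.com/code4craft/webmagic/issues/43 and suggestions of all kinds are welcome.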

The first phase aims to deliver scripting support. The corresponding documentation:

https://github.com/code4craft/webmagic/tree/master/webmagic-scripts

WebMagic mailing list: https://groups.google.com/forum/#!forum/webmagic-java
