Python-goose: A Python Library for Extracting Article Body Text

jopen, 11 years ago

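Python-goose is a Python rewrite of Goose, an article extraction tool originally written in Java. Its goal is to take any news article or article-style web page and extract not only the main body of the article but also all of its metadata and images. Chinese web pages are supported.
Python-goose can extract the following information:

Main body text of the article
Main image of the article
Any YouTube/Vimeo videos embedded in the article
Meta description
Meta tags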

Python-goose is licensed under Apache 2.0.

Example

>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
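
The fields read one by one in the session above can also be gathered into a single dictionary, which is handy when processing many pages. The following is only a minimal sketch: `summarize_article` and `preview_chars` are hypothetical names that are not part of Python-goose, and the code assumes only the attributes shown in the example (`title`, `meta_description`, `cleaned_text`, `top_image.src`). It also assumes, defensively, that `top_image` may be missing when no image is found. A stand-in object is used so that the sketch runs without the library or network access.

```python
from types import SimpleNamespace


def summarize_article(article, preview_chars=150):
    """Collect the fields from the example session into a plain dict.

    `article` is assumed to expose title, meta_description, cleaned_text
    and top_image.src, as in the example above. This helper is
    hypothetical and is not provided by Python-goose itself.
    """
    # Assumption: top_image may be absent or None when no image was found.
    top_image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "meta_description": article.meta_description,
        "preview": article.cleaned_text[:preview_chars],
        "top_image_src": top_image.src if top_image is not None else None,
    }


# Stand-in for an extracted article; the values are placeholders, not real output.
fake_article = SimpleNamespace(
    title="Occupy London loses eviction fight",
    meta_description="Placeholder description",
    cleaned_text="(CNN) -- Occupy London protesters who have been camped outside",
    top_image=SimpleNamespace(src="http://example.com/top.jpg"),
)

summary = summarize_article(fake_article, preview_chars=20)
print(summary["title"])
print(summary["preview"])
```

With the library installed, the object returned by `g.extract(url=url)` would presumably be passed in place of `fake_article`.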

Project homepage: http://www.open-open.com/lib/view/home/1404377634108
