libextract: an open-source Python module for extracting the main content of a page


Extracts the main text of HT/XML documents based on statistical features.

Libextract provides two ready-made extractors: api.articles and api.tabular.

libextract.api.articles(document, encoding='utf-8', count=5)

Given an HTML document, and optionally the encoding and the number of predictions (count) to return (in descending rank), articles returns a list of HTML nodes likely containing the article text of a given website.

The extraction algorithm is based on text length. Refer to rodricios.github.io/eatiht for an in-depth explanation.
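To get a feel for the idea, here is a rough, hypothetical sketch of a text-length heuristic written directly against lxml. This is not libextract's actual implementation (see the eatiht link above for the real algorithm); the function name and scoring are illustrative only:

from collections import Counter
from lxml import html

def rank_parents_by_text_length(doc, count=5):
    # Toy heuristic: credit each parent element with the length of the text
    # its children carry, then return the highest-scoring parents.
    tree = html.fromstring(doc)
    scores = Counter()
    for node in tree.iter():
        parent = node.getparent()
        if parent is not None and node.text and node.text.strip():
            scores[parent] += len(node.text)
    return scores.most_common(count)  # [(element, score), ...] in descending rank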

libextract.api.tabular(document, encoding='utf-8', count=5)

Given an HTML document, and optionally the encoding and the number of predictions (count) to return (in descending rank), tabular returns a list of HTML nodes likely containing "tabular" data (i.e., table and table-like elements).
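Both extractors take the raw document (e.g. response bytes) and share the same optional keyword arguments, so requesting a different number of ranked predictions looks like this (a quick sketch reusing the Wikipedia page from the usage section below):

from requests import get
from libextract.api import articles, tabular

resp = get('http://en.wikipedia.org/wiki/Information_extraction')
top3_text = articles(resp.content, count=3)    # top 3 article candidates
top3_tables = tabular(resp.content, count=3)   # top 3 table-like candidates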

Installation

pip install libextract

Usage

Extracting text-nodes from a wikipedia page:

from requests import get
from libextract.api import articles

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = articles(r.content)

Libextract uses Python's de facto HT/XML processing library, lxml.

The predictions returned by both api.articles and api.tabular are lxml HtmlElement objects (along with the associated metric used to rank each prediction).

Therefore, you can access lxml's methods for post-processing.
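For instance, assuming each prediction is a (node, metric) pair, which is how the snippets below index into the results, the node part behaves like any other lxml HtmlElement:

node, metric = textnodes[0]       # assumption: predictions are (node, metric) pairs
print(node.tag)                   # tag name of the top-ranked element
links = node.xpath('.//a/@href')  # XPath and the rest of the lxml API work as usual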

>>> print(textnodes[0][0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...

Tabular-data extraction is just as easy.

from libextract.api import tabular

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = tabular(height_data.content)

To convert an HT/XML element to a Python dict (and, you know, use it with Pandas and stuff):

>>> from libextract import clean
>>> clean.to_dict(tabs[0][0])
{'Entity': ['Monaco', 'Macau', 'Japan', 'Singapore', 'San Marino', ...}
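As a follow-on, assuming to_dict returns column headers mapped to equal-length lists (as the output above suggests), the result can be fed straight into a Pandas DataFrame:

>>> import pandas as pd
>>> pd.DataFrame(clean.to_dict(tabs[0][0])).head()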

Project homepage: http://www.open-open.com/lib/view/home/1431939836677