libextract: an open-source Python module for main-content extraction
It extracts the main text of HT/XML documents based on statistical features.
Libextract ships with two ready-made extractors: api.articles and api.tabular.
libextract.api.articles(document, encoding='utf-8', count=5)
Given an HTML document, and optionally the encoding and the number of predictions (count) to return (in descending rank), articles returns a list of HTML nodes likely containing the article text of the given website.
The extraction algorithm is based on text length. Refer to rodricios.github.io/eatiht for an in-depth explanation.
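As a rough illustration of that idea (a sketch only, not libextract's actual implementation), the following ranks the elements of a parsed page by the length of the text they directly contain and keeps the top count candidates:

# Illustrative sketch of the "rank by text length" idea; not libextract's code.
from lxml import html

def rank_by_text_length(raw_html, count=5):
    tree = html.fromstring(raw_html)
    scored = []
    for node in tree.iter():
        # Text directly under this node: its own text plus the tails of its children.
        text = (node.text or '') + ''.join(child.tail or '' for child in node)
        scored.append((node, len(text.strip())))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:count]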
libextract.api.tabular(document, encoding='utf-8', count=5)
Given an HTML document, and optionally the encoding and the number of predictions (count) to return (in descending rank), tabular returns a list of HTML nodes likely containing "tabular" data (i.e. tables and table-like elements).
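For example, to ask each extractor for only its single best prediction (the file name below is just a placeholder for whatever raw HTML you have on hand):

from libextract.api import articles, tabular

# 'saved_page.html' is a hypothetical file; any raw HTML bytes will do.
page = open('saved_page.html', 'rb').read()
best_article = articles(page, count=1)  # list with the single top-ranked article prediction
best_table = tabular(page, count=1)     # list with the single top-ranked table-like prediction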
Installation
pip install libextract
Usage
Extracting text nodes from a Wikipedia page:
from requests import get
from libextract.api import articles

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = articles(r.content)
Libextract uses Python's de facto HT/XML processing library, lxml.
The predictions returned by both api.articles and api.tabular are lxml HtmlElement objects (along with the associated metric used to rank each prediction).
Therefore, you can access lxml's methods for post-processing.
>>> print(textnodes[0][0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...
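Since each prediction appears to pair the HtmlElement with its ranking metric (which is why the examples index with [0][0]), you can walk the results and apply any lxml method you like:

# Assuming each prediction is a (node, metric) pair, as the indexing above suggests.
for node, metric in textnodes:
    print(node.tag, metric, node.text_content()[:60])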
Tabular-data extraction is just as easy.
from libextract.api import tabular

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = tabular(height_data.content)
To convert an HT/XML element to a Python dict (and, you know, use it with Pandas and stuff):
>>> from libextract import clean
>>> clean.to_dict(tabs[0][0])
{'Entity': ['Monaco', 'Macau', 'Japan', 'Singapore', 'San Marino', ...}
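If you do have pandas installed, that dict maps table headers to columns, so (assuming equal-length columns) it feeds straight into a DataFrame:

import pandas as pd

# clean.to_dict gives {header: [values, ...]}, which DataFrame accepts directly.
df = pd.DataFrame(clean.to_dict(tabs[0][0]))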