Automatic Text Summarization
JerHma
9 years ago
Source: https://github.com/miso-belica/sumy
Automatic text summarizer
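A simple library and command-line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries. The implemented summarization methods are: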
- Luhn - heuristic method, reference
- Edmundson - heuristic method building on previous statistical research, reference
- Latent Semantic Analysis, LSA - one of the algorithms from http://scholar.google.com/citations?user=0fTuW_YAAAAJ&hl=en I think the author is using more advanced algorithms now. Steinberger, J. and Ježek, K. Using latent semantic analysis in text summarization and summary evaluation. In Proceedings ISIM '04. 2004. pp. 93-100.
- LexRank - Unsupervised approach inspired by the PageRank and HITS algorithms, reference
- TextRank - some sort of combination of a few resources that I found on the internet. I really don't remember the sources. Probably Wikipedia and some papers on the 1st page of Google :)
- SumBasic - Method that is often used as a baseline in the literature. Source: Read about SumBasic
- KL-Sum - Method that greedily adds sentences to a summary so long as it decreases the KL Divergence. Source: Read about KL-Sum
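The greedy idea behind KL-Sum described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code sumy ships: the tokenizer, the additive smoothing constant, and the function names (`words`, `distribution`, `kl_divergence`, `kl_sum`) are all made up for this example.

```python
import math
import re
from collections import Counter


def words(text):
    """Lowercased word tokens; a deliberately naive tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def distribution(tokens):
    """Relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(doc_dist, summary_tokens, smoothing=1e-3):
    """KL(document || summary), smoothed over the document vocabulary."""
    counts = Counter(summary_tokens)
    total = len(summary_tokens) + smoothing * len(doc_dist)
    divergence = 0.0
    for word, p in doc_dist.items():
        q = (counts[word] + smoothing) / total
        divergence += p * math.log(p / q)
    return divergence


def kl_sum(sentences, max_sentences):
    """Greedily add the sentence that lowers the divergence the most."""
    doc_dist = distribution([w for s in sentences for w in words(s)])
    chosen = []          # indices of the selected sentences
    chosen_tokens = []   # all tokens of the summary built so far
    current = float("inf")
    while len(chosen) < max_sentences:
        best_index, best_score = None, current
        for i, sentence in enumerate(sentences):
            if i in chosen:
                continue
            score = kl_divergence(doc_dist, chosen_tokens + words(sentence))
            if score < best_score:
                best_index, best_score = i, score
        if best_index is None:  # no remaining sentence lowers the divergence
            break
        chosen.append(best_index)
        chosen_tokens += words(sentences[best_index])
        current = best_score
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `kl_sum(list_of_sentences, 3)` returns at most three sentences in document order. The loop stops early once no remaining sentence brings the summary's word distribution closer to the document's.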
Here are some other summarizers:
- https://github.com/thavelick/summarize/ - Python, TF (very simple)
- Reduction - Python, TextRank (simple)
- Open Text Summarizer - C, TF without normalization
- Simple program that summarize text - Python, TF without normalization
- Intro to Computational Linguistics - Java, LexRank
- Sumtract: Second project for UW LING 572 - Python
- TextTeaser - Scala
- PyTeaser - TextTeaser port in Python
- Automatic Document Summarizer - Java, Bipartite HITS (no sources)
- Pythia - Python, LexRank & Centroid
- SWING - Ruby
- Topic Networks - R, topic models & bipartite graphs
- Almus: Automatic Text Summarizer - Java, LSA (without source code)
- Musutelsa - Java, LSA (always freezes)
- http://mff.bajecni.cz/index.php - C++
- MEAD - Perl, various methods + evaluation framework
Installation
Make sure you have Python 2.7/3.3+ and pip (Windows, Linux) installed. The preferred way is simply to run:
$ [sudo] pip install sumy
Or, for the latest development version:
$ [sudo] pip install git+https://github.com/miso-belica/sumy.git
Usage
Sumy contains a command-line utility for quick summarization of documents.
$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy luhn --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
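The commands above pass `--length` both as a plain number (`10`) and as a percentage (`3%`). The text does not say how the value is interpreted, so the sketch below assumes that a plain number is a sentence count and that a percentage is taken relative to the number of sentences in the document. The helper `sentences_to_keep` is hypothetical and not part of sumy.

```python
def sentences_to_keep(length, total_sentences):
    """Hypothetical helper: turn '10' or '3%' into a number of sentences."""
    length = str(length).strip()
    if length.endswith("%"):
        percentage = float(length[:-1])
        count = int(total_sentences * percentage / 100)
        # Assumption: keep at least one sentence of a non-empty document.
        return max(1, count) if total_sentences else 0
    # Assumption: never ask for more sentences than the document has.
    return min(int(length), total_sentences)


if __name__ == "__main__":
    print(sentences_to_keep("10", 200))
    print(sentences_to_keep("3%", 200))
```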
The evaluation methods for a given summarization method can be run with the commands below:
$ sumy_eval lex-rank reference_summary.txt --url=http://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
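The text does not list which measures `sumy_eval` reports. As an illustration of what comparing a generated summary against `reference_summary.txt` can mean, here is a minimal sketch of sentence-level precision, recall and F-score. The function name and the exact-match comparison of sentences are assumptions made for this example, not sumy's implementation.

```python
def precision_recall_f1(evaluated_sentences, reference_sentences):
    """Co-selection scores between a generated and a reference summary."""
    evaluated = set(evaluated_sentences)
    reference = set(reference_sentences)
    if not evaluated or not reference:
        raise ValueError("Both summaries need at least one sentence.")

    common = len(evaluated & reference)    # sentences present in both
    precision = common / len(evaluated)    # share of generated sentences that are correct
    recall = common / len(reference)       # share of reference sentences that were found
    if common == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    generated = ["Sentence A.", "Sentence B.", "Sentence C."]
    reference = ["Sentence B.", "Sentence C.", "Sentence D.", "Sentence E."]
    print(precision_recall_f1(generated, reference))
```

Exact sentence matching is the simplest possible choice and only works for extractive summaries. Measures based on word overlap are more forgiving when the reference summary was written by hand.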
Python API
Alternatively, you can use sumy as a library in your project.
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "czech"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "http://www.zsstritezuct.estranky.cz/clanky/predmety/cteni/jak-naucit-dite-spravne-cist.html"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
Tests
Setup:
$ pip install pytest pytest-cov
Run the tests via:
$ py.test-2.7 && py.test-3.3 && py.test-3.4 && py.test-3.5