newspaper: an open-source Python library for full-text and article-metadata extraction
newspaper is an open-source Python library for extracting news articles, their full text, and their metadata. It supports many natural languages, including Chinese, can extract keywords, images, summaries, and other metadata, and provides a multi-threaded download framework.
- Full Python 3 and Python 2 support
- Multi-threaded article download framework (see the sketch after this list)
- News URL identification
- Text extraction from HTML
- Top image extraction from HTML
- All image extraction from HTML
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google trending terms extraction (see the example at the end of this post)
- Works in 10+ languages (English, Chinese, German, Arabic, ...)
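Before the usage walkthrough below, here is what the multi-threaded article download framework from the list looks like in practice. This is a minimal sketch based on newspaper's news_pool helper; the two source sites, the thread count, and the output line are only illustrative:

>>> import newspaper
>>> from newspaper import news_pool

>>> slate_paper = newspaper.build('http://slate.com')
>>> tc_paper = newspaper.build('http://techcrunch.com')
>>> papers = [slate_paper, tc_paper]

>>> # 2 threads per source, so 2 x 2 = 4 download threads in total
>>> news_pool.set(papers, threads_per_source=2)
>>> news_pool.join()

>>> # once join() returns, every article in both sources has been downloaded
>>> papers[0].articles[0].html
'<!DOCTYPE html><html ...'

Keeping the per-source thread count small spreads requests across sites instead of hammering a single server, which is why the pool is configured per source rather than with one global thread count.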
>>> from newspaper import Article

>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)

>>> article.download()
>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> article.parse()
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']

>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

>>> article.text
'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'

>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

>>> article.movies
['http://youtube.com/path/to/link.com', ...]

>>> article.nlp()
>>> article.keywords
['New Years', 'resolution', ...]

>>> article.summary
'The study shows that 93% of people ...'

>>> import newspaper
>>> cnn_paper = newspaper.build('http://cnn.com')

>>> for article in cnn_paper.articles:
>>>     print(article.url)
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...

>>> for category in cnn_paper.category_urls():
>>>     print(category)
http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...

>>> cnn_article = cnn_paper.articles[0]
>>> cnn_article.download()
>>> cnn_article.parse()
>>> cnn_article.nlp()
...

>>> import requests
>>> from newspaper import fulltext

>>> html = requests.get(...).text
>>> text = fulltext(html)

Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto-detect the language.

>>> from newspaper import Article

>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> a = Article(url, language='zh')  # Chinese

>>> a.download()
>>> a.parse()

>>> print(a.text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建筑(僭建)问题到立法会接受质询,并向香港民众道歉。梁振英在星期二(12月10日)的答问大会开始之际在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的意图和动机。一些亲北京阵营议员欢迎梁振英道歉,且认为应能获得香港民众接受,但这些议员也质问梁振英有

>>> print(a.title)
港特首梁振英就住宅违建事件道歉

If you are certain that an entire news source is in one language, go ahead and use the same API :)

>>> import newspaper

>>> sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')

>>> for category in sina_paper.category_urls():
>>>     print(category)
http://health.sina.com.cn
http://eladies.sina.com.cn
http://english.sina.com
...

>>> article = sina_paper.articles[0]
>>> article.download()
>>> article.parse()

>>> print(article.text)
新浪武汉汽车综合 随着汽车市场的日趋成熟,传统的“集全家之力抱得爱车归”的全额购车模式已然过时,另一种轻松的新兴车模式——金融购车正逐步成为时下消费者购买爱车最为时尚的消费理念,他们认为,这种新颖的购车模式既能在短期内
...

>>> print(article.title)
两年双免0手续0利率 科鲁兹掀背金融轻松购_武汉车市_武汉汽车网_新浪汽车_新浪网
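The Google trending terms extraction mentioned in the feature list is exposed through module-level helpers rather than through Article or build(). A short sketch, assuming the hot(), popular_urls(), and languages() helpers from newspaper's top-level API; no sample output is shown since the results change over time:

>>> import newspaper

>>> newspaper.hot()           # returns a list of terms currently trending on Google
>>> newspaper.popular_urls()  # returns a built-in list of popular news source URLs
>>> newspaper.languages()     # prints the language codes newspaper supports

These helpers work on their own, without building a Source or downloading any article first.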