Grab: An Open-Source Python Web Scraping Framework
jopen
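9 years ago
Grab is an open-source web scraping framework for Python. It provides many practical methods for scraping websites and processing the scraped content: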
- Automatic cookies (session) support
- HTTP and SOCKS proxy support, with or without authorization
- Keep-Alive support
- IDN support
- Tools to work with web forms
- Easy multipart file uploading
- Flexible customization of HTTP requests
- Automatic charset detection
- Powerful API for extracting information from HTML documents with XPath queries
- Asynchronous API for making thousands of simultaneous requests. This part of the library is called Spider, and it has too many features to list in this README.
- Python 3 ready
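The asynchronous feature in the list above is about keeping many requests in flight at once. The sketch below is not Grab's Spider API. It is a standard-library illustration of the general idea, assuming a hypothetical `fake_fetch` coroutine that stands in for a real HTTP request so the example runs without network access. The `example.com` URLs and the `limit` parameter are likewise placeholders.

```python
import asyncio


async def fake_fetch(url: str) -> tuple[str, int]:
    """Hypothetical stand-in for an HTTP request; returns (url, fake body length)."""
    await asyncio.sleep(0.01)  # simulate latency without touching the network
    return url, len(url)


async def crawl(urls: list[str], limit: int = 100) -> dict[str, int]:
    """Fetch all URLs concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> tuple[str, int]:
        # The semaphore caps how many fetches run at the same time
        async with semaphore:
            return await fake_fetch(url)

    results = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(results)


def run_demo(count: int = 1000) -> dict[str, int]:
    # Placeholder URLs, not pages from the original article
    urls = [f"https://example.com/page/{i}" for i in range(count)]
    return asyncio.run(crawl(urls))


if __name__ == "__main__":
    pages = run_demo()
    print(len(pages))
```

Capping concurrency with a semaphore is one common design choice. A real crawler would also need retries, error handling, and per-host rate limits, which is presumably part of what makes Spider too large to summarize in a feature list.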
Grab Example
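    import logging

    from grab import Grab

    # Enable debug logging to see what Grab is doing
    logging.basicConfig(level=logging.DEBUG)

    g = Grab()

    # Open the GitHub login page, fill in the form, and submit it
    g.go('https://github.com/login')
    g.set_input('login', '***')
    g.set_input('password', '***')
    g.submit()

    # Save the response to disk for inspection
    g.doc.save('/tmp/x.html')

    # Raise an error if the sign-out icon is missing, i.e. the login failed
    g.doc('//span[contains(@class, "octicon-sign-out")]').assert_exists()

    # Build the URL of the repositories tab from the profile link
    home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
    repo_url = home_url + '?tab=repositories'
    g.go(repo_url)

    # Print each repository name together with its absolute URL
    for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
        print('%s: %s' % (elem.text(), g.make_url_absolute(elem.attr('href'))))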
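The final loop of the example (select links with an XPath query, then turn each relative `href` into an absolute URL) can be tried offline without Grab. The sketch below uses only the standard library: `xml.etree.ElementTree`, which supports a limited subset of XPath and requires well-formed markup, and `urllib.parse.urljoin` as a stand-in for `g.make_url_absolute`. The sample markup, the `octocat` paths, `BASE_URL`, and the `list_repos` helper are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed markup shaped like the repository list queried above
SAMPLE_PAGE = """
<html>
  <body>
    <h3 class="repo-list-name"><a href="/octocat/hello-world">hello-world</a></h3>
    <h3 class="repo-list-name"><a href="/octocat/spoon-knife">spoon-knife</a></h3>
    <h3 class="other"><a href="/ignored">ignored</a></h3>
  </body>
</html>
"""

BASE_URL = "https://github.com/octocat?tab=repositories"  # assumed page URL


def list_repos(page: str, base_url: str) -> list[tuple[str, str]]:
    """Return (link text, absolute URL) pairs for every repo-list-name link."""
    root = ET.fromstring(page)
    # ElementTree implements only part of XPath; this predicate form is supported
    links = root.findall(".//h3[@class='repo-list-name']/a")
    return [(a.text.strip(), urljoin(base_url, a.get("href"))) for a in links]


if __name__ == "__main__":
    for name, url in list_repos(SAMPLE_PAGE, BASE_URL):
        print("%s: %s" % (name, url))
```

Real pages are rarely well-formed XML, so this approach only works on clean input. Tolerating messy HTML and supporting full XPath expressions such as `contains(@class, ...)` is where a dedicated scraping library earns its keep.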