A Roundup of Python Web Scraping Libraries
jopen
9 years ago
Network
- General
- urllib - network library (stdlib)
- requests - network library
- grab - network library (pycurl based)
- pycurl - network library (binding to libcurl)
- urllib3 - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
- httplib2 - network library
- RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
- MechanicalSoup - A Python library for automating interaction with websites.
- mechanize - Stateful programmatic web browsing.
- socket - low-level networking interface (stdlib)
- Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages
- hyper - HTTP/2 Client for Python
- PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.
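As a rough illustration of the stdlib end of this list, the sketch below builds a request with `urllib.request` and attaches a custom User-Agent header. The URL `https://example.com/`, the header value, and the helper names `build_request` and `fetch` are placeholders chosen for this example; they do not come from the original list.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="example-crawler/0.1"):
    """Return a GET request carrying a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10.0):
    """Download a page and return its body decoded as text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch() does.
req = build_request("https://example.com/")
print(req.get_method(), req.full_url)
```

Third-party clients such as requests wrap the same steps behind a shorter API and add conveniences like sessions and connection pooling.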
- Asynchronous
- treq - requests like API (twisted based)
- aiohttp - HTTP client/server for asyncio (PEP 3156)
Web-Scraping Frameworks
- Full Featured Crawlers
- Other
- portia - Visual scraping for Scrapy.
- restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
- demiurge - PyQuery-based scraping micro-framework.
HTML/XML Parsing
- General
- lxml - efficient HTML/XML processing library. Supports XPath. Written in C.
- cssselect - working with DOM tree with CSS selectors
- pyquery - working with DOM tree with jQuery-like selectors
- BeautifulSoup - slow HTML/XML processing library, written in pure Python
- html5lib - builds DOM of HTML/XML document according to WHATWG spec. That spec is used in all modern browsers.
- feedparser - parsing of RSS/ATOM feeds.
- MarkupSafe - Implements an XML/HTML/XHTML markup-safe string for Python.
- xmltodict - Makes working with XML feel like working with JSON.
- xhtml2pdf - HTML/CSS to PDF converter.
- untangle - Converts XML documents to Python objects for easy access.
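To give a feel for the tree-and-selector style these parsers share, here is a minimal sketch that uses the stdlib `xml.etree.ElementTree` as a stand-in. lxml's `lxml.etree` module offers a largely compatible API with full XPath 1.0 support, whereas ElementTree only understands a limited XPath subset. The sample document is made up for this example and is deliberately well-formed; real-world HTML usually is not, which is where lxml, BeautifulSoup, or html5lib come in.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document used only for this example.
SAMPLE = """
<html>
  <body>
    <a class="item" href="/a">First</a>
    <a class="other" href="/b">Second</a>
    <a class="item" href="/c">Third</a>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# ElementTree supports a limited XPath subset, including attribute predicates.
links = [(a.get("href"), a.text) for a in root.findall(".//a[@class='item']")]
print(links)
```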
- Sanitizing
- Bleach - cleaning of HTML (requires html5lib)
- sanitize - Bringing sanity to world of messed-up data.
Text Processing
- General
- difflib - (Python standard library) Helpers for computing deltas.
- Levenshtein - Fast computation of Levenshtein distance and string similarity.
- fuzzywuzzy - Fuzzy String Matching.
- esmre - Regular expression accelerator.
- ftfy - Makes Unicode text less broken and more consistent automagically.
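Of the entries above, difflib ships with Python, so it is easy to try out. The sketch below shows two of its helpers: `get_close_matches` for fuzzy lookup and `unified_diff` for computing a delta between two sequences of lines. The sample titles and package names are invented for illustration.

```python
import difflib


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


titles = ["Python Web Scraping", "Python Web Crawling", "Java Networking"]

# Fuzzy lookup: find the known title closest to a misspelled query.
best = difflib.get_close_matches("Python Web Scrapping", titles, n=1, cutoff=0.6)
print(best)

# Delta between two versions of a line-based list.
old = ["requests", "lxml", "scrapy"]
new = ["requests", "lxml", "scrapy", "aiohttp"]
delta = list(difflib.unified_diff(old, new, fromfile="old", tofile="new", lineterm=""))
print("\n".join(delta))
```

Libraries such as Levenshtein and fuzzywuzzy cover the same fuzzy-matching ground with different scoring functions and, in the case of Levenshtein, a C implementation.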
- Transliteration
- unidecode - ASCII transliterations of Unicode text.
- Character encoding
- uniout - Print readable chars instead of the escaped string.
- chardet - Python 2/3 compatible character encoding detector.
- xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
- pangu.py - Spacing texts for CJK and alphanumerics.
- Slugify
- awesome-slugify - A Python slugify library that can preserve unicode.
- python-slugify - A Python slugify library that translates unicode to ASCII.
- unicode-slugify - A slugifier that generates unicode slugs.
- pytils - Simple tools for processing strings in Russian (including pytils.translit.slugify).
- General Parser
- PLY - Implementation of lex and yacc parsing tools for Python
- pyparsing - A general purpose framework for generating parsers.
- Human names
- python-nameparser - Parsing human names into their individual components.
- Phone Number
- phonenumbers - Parsing, formatting, storing and validating international phone numbers.
- User-agent string
- python-user-agents - Browser user agent parser.
- HTTP Agent Parser - Python HTTP Agent Parser
Specific Formats Processing
- General
- tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
- textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
- messytables - Tools for parsing messy tabular data
- rows - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)
- Office
- python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
- xlwt / xlrd - Writing and reading data and formatting information from Excel files.
- XlsxWriter - A Python module for creating Excel .xlsx files.
- xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
- openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
- Marmir - Takes Python data structures and turns them into spreadsheets.
- PDF
- PDFMiner - A tool for extracting information from PDF documents.
- PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
- ReportLab - Allows rapid creation of rich PDF documents.
- pdftables - Extract tables from PDF files directly.
- Markdown
- Python-Markdown - A Python implementation of John Gruber’s Markdown.
- Mistune - Fastest and full featured pure Python parsers of Markdown.
- markdown2 - A fast and complete Python implementation of Markdown.
- YAML
- PyYAML - YAML implementations for Python.
- CSS
- cssutils - A CSS library for Python.
- ATOM/RSS
- feedparser - Universal feed parser.
- SQL
- sqlparse - A non-validating SQL parser.
- HTTP
- http-parser - HTTP request/response parser for Python, written in C.
- Microformats
- opengraph - A Python module to parse the Open Graph Protocol tags.
- Portable Executable
- pefile - A multi-platform module to parse and work with Portable Executable (aka PE) files.
- PSD
- psd-tools - Reads Adobe Photoshop PSD files (as described in the specification) into Python data structures.
Natural Language Processing
- NLTK - A leading platform for building Python programs to work with human language data.
- Pattern - A web mining module for Python. It has tools for natural language processing, machine learning, among others.
- TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
- jieba - Chinese Words Segmentation Utilities.
- SnowNLP - A library for processing Chinese text.
- loso - Another Chinese segmentation library.
- genius - Chinese word segmentation based on Conditional Random Fields.
- langid.py - Stand-alone language identification system.
- Korean - A library for Korean morphology.
- pymorphy2 - Morphological analyzer (POS tagger + inflection engine) for Russian language.
- PyPLN - A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.
Browser automation and emulation
- Browsers
- Headless tools
- xvfbwrapper - Python wrapper for running a display inside X virtual framebuffer (Xvfb).
Multiprocessing
- threading - Python standard library module for running threads. Effective for I/O-bound tasks, but of little use for CPU-bound tasks because of Python's GIL.
- multiprocessing - Python standard library module for running processes.
- celery - An asynchronous task queue/job queue based on distributed message passing.
- concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
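The point about threads suiting I/O-bound work can be sketched with `concurrent.futures.ThreadPoolExecutor`. To keep the example runnable offline, `fake_download` is a hypothetical stand-in that sleeps instead of making a real network call, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Stand-in for a network call: wait briefly, then return a result."""
    time.sleep(0.1)  # simulates waiting on I/O; the GIL is released while sleeping
    return url, len(url)


urls = [f"https://example.com/page/{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() returns results in input order even though the calls overlap in time.
    results = list(pool.map(fake_download, urls))
elapsed = time.perf_counter() - start

print(f"{len(results)} pages in {elapsed:.2f}s")
```

Run sequentially, the eight calls would take about 0.8 seconds; with eight worker threads the waits overlap. For CPU-bound work, swapping in `ProcessPoolExecutor` sidesteps the GIL at the cost of process startup and pickling overhead.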
Asynchronous
- asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
- Twisted - An event-driven networking engine.
- Tornado - A Web framework and asynchronous networking library.
- pulsar - Event-driven concurrent framework for Python.
- diesel - Greenlet-based event I/O Framework for Python.
- gevent - A coroutine-based Python networking library that uses greenlet.
- eventlet - Asynchronous framework with WSGI support.
- Tomorrow - Magic decorator syntax for asynchronous code.
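For comparison with the thread-based approach, here is a minimal asyncio sketch in which coroutines wait concurrently on a single thread. `fake_fetch` is a hypothetical placeholder that sleeps instead of performing real network I/O, and `asyncio.run` assumes Python 3.7 or newer.

```python
import asyncio


async def fake_fetch(url, delay=0.1):
    """Stand-in for an async HTTP call: yield to the event loop, then return."""
    await asyncio.sleep(delay)
    return f"fetched {url}"


async def crawl(urls):
    # gather() schedules all coroutines at once and returns results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


pages = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
print(pages)
```

In a real crawler the sleep would be replaced by an awaitable request from an asyncio-aware client such as aiohttp.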
Queue
- celery - An asynchronous task queue/job queue based on distributed message passing.
- huey - Little multi-threaded task queue.
- mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
- RQ - Lightweight task queue manager based on Redis.
- simpleq - A simple, infinitely scalable, Amazon SQS based queue.
- python-gearman - python API for Gearman
Cloud Computing
- picloud - executing Python code in the cloud
- dominoup.com - executing R, Python and MATLAB code in the cloud
Email
- flanker - An email address and MIME parsing library.
- Talon - Mailgun library to extract message quotations and signatures.
URL and Network Address Manipulation
- URL
- furl - A small Python library that makes manipulating URLs simple.
- purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
- urllib.parse - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
- tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
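The three operations named in the urllib.parse entry (splitting a URL into components, recombining them, and resolving a relative URL against a base) can be sketched as follows. The example URLs are placeholders.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

url = "https://example.com:8080/docs/index.html?page=2#top"

# Break the URL into scheme, network location, path, query and fragment.
parts = urlsplit(url)
print(parts.scheme, parts.hostname, parts.port, parts.path, parts.query, parts.fragment)

# Combine the components back into a URL string.
rebuilt = urlunsplit(parts)

# Resolve a relative link found on a page against that page's URL.
absolute = urljoin("https://example.com/docs/index.html", "../images/logo.png")
print(absolute)
```

Resolving relative links this way is a common step in a crawler before queuing newly discovered URLs.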
- Network Address
- netaddr - A Python library for representing and manipulating network addresses.
Web Content Extracting
- Text and Meta Data from HTML pages
- newspaper - News extraction, article extraction and content curation in Python.
- html2text - Convert HTML to Markdown-formatted text.
- python-goose - HTML Content/Article Extractor.
- lassie - Web Content Retrieval for Humans.
- micawber - A small library for extracting rich content from URLs.
- sumy - A module for automatic summarization of text documents and HTML pages.
- Haul - An Extensible Image Crawler.
- python-readability - Fast Python port of arc90's readability tool.
- scrapely - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
- Video
- youtube-dl - A small command-line program to download videos from YouTube.
- you-get - A YouTube/Youku/Niconico video downloader written in Python 3.
- Wiki
- WikiTeam - Tools for downloading and preserving wikis.
WebSocket
- Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
- AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
- WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.
DNS Resolving
- dnsyo - Check your DNS against over 1500 global DNS servers.
- pycares - interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously.
Computer Vision
- OpenCV - Open Source Computer Vision Library.
- SimpleCV - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
- mahotas - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.
Proxy Server
- shadowsocks - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
- tproxy - tproxy is a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routine logic in Python
Misc
- user_agent - this module is for generating random, valid web navigator's configs & User-Agent HTTP headers.
Sections in this list
- Network
- Web-Scraping Frameworks
- HTML/XML Parsing
- Text Processing: libraries for parsing and manipulating plain texts.
- Specific Formats Processing: libraries for parsing and manipulating specific text formats.
- Natural Language Processing: libraries for working with human languages.
- Browser automation and emulation
- Multiprocessing
- Asynchronous: libraries for asynchronous networking programming.
- Queue
- Cloud Computing
- Email: libraries for parsing email.
- URL and Network Address Manipulation: libraries for parsing/modifying URLs and network addresses.
- Web Content Extracting: libraries for extracting web contents.
- WebSocket: libraries for working with WebSocket.
- DNS Resolving
- Computer Vision
- Proxy Server
- Misc
Other Python lists
Source: https://github.com/lorien/awesome-web-scraping/blob/master/python.md