Java-based open-source HTML parser: jsoup 1.7.3 released

jopen, 11 years ago

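jsoup is an HTML parser for Java that provides a very convenient API for extracting and manipulating HTML documents. It can locate nodes and read their data using DOM traversal, CSS selectors, and jQuery-like methods.
Its select and pipeline API is as powerful as jQuery's.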

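jsoup 1.7.3 has been released. This version introduces improved form handling, more robust character-set detection, speed and memory optimizations in parsing and CSS selector matching, and a number of bug fixes.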

The detailed changes are as follows:

Improvements:
- Added the element type FormElement, to facilitate simple form submissions. Find forms in a doc using Elements.forms(), then prepare it for submission with FormElement.submit().
- Improved the reliability of HTTP character-set recognition from response headers, particularly for when servers return out-of-spec responses.
- Added Document.location() to retrieve the document's location URL. Handy if the request was redirected from the original URL.
- Large decrease in the amount of temporary objects created during parsing, leading to less GC load (helpful particularly on Android), and faster parsing.
- Improved the time to match elements with common CSS selectors by ~27%.

Bug Fixes:
- Fixed support for self-closing script tags.
- Fixed a crash when reading an unterminated CDATA section.
- Fixed an issue where elements added via the adoption agency algorithm did not preserve their attributes.
- Fixed an issue when cloning a document with extremely nested elements that could cause a stack-overflow.
- Fixed an issue when connecting or redirecting to a URL that contains a space.
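
To make the character-set item above more concrete, here is a minimal sketch of tolerant charset extraction from a Content-Type response header. It is not jsoup's actual implementation. The class name `CharsetSniffer`, the method `fromContentType`, and the regular expression are all assumptions made for illustration, and the sketch uses only the Java standard library.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of jsoup. */
final class CharsetSniffer {
    // Matches "charset=VALUE" case-insensitively, tolerating spaces around '='
    // and optional single or double quotes, which some servers send.
    private static final Pattern CHARSET =
        Pattern.compile("(?i)\\bcharset\\s*=\\s*[\"']?([^\\s,;\"']+)");

    /**
     * Returns the canonical name of the first supported charset declared in a
     * Content-Type header value, or null if none is declared or usable.
     */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name).name();
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: skip it and keep looking.
            }
        }
        return null;
    }
}
```

Under these assumptions, `CharsetSniffer.fromContentType("text/html;charset = \"utf-8\"")` returns `"UTF-8"`, while a header with no charset, or with an unknown one, returns `null` so the caller can fall back to another detection method.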

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
This site's Chinese translation of the cookbook: http://www.open-open.com/Jsoup/