Creeper: A Next-Generation Open-Source Crawler Framework Based on Simple Scripts
fjlvjie
8 years ago
<p style="text-align: center;"><img src="https://simg.open-open.com/show/b81cc15d8a320ed618ed5f2aae21e7b6.png"></p>
<h2>About</h2>
<p>Creeper is a <em>next-generation</em> crawler that fetches web pages according to a Creeper script. As a cross-platform embedded crawler, it can be used in a news app, a subscription program, and so on.</p>
<p>Warning: At present, this project is still in stage-1 development. Please do not use it in a production environment.</p>
<h2>Get Started</h2>
<p>Installation</p>
<pre>$ go get github.com/wspl/creeper</pre>
<p>Hello World!</p>
<p>Create hacker_news.crs</p>
<pre>page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href</pre>
<p>Then, create main.go</p>
<pre>package main

import "github.com/wspl/creeper"

func main() {
	// Load the crawler definition from the script file.
	c := creeper.Open("./hacker_news.crs")
	// Visit each item of the news[] array and print its fields.
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}</pre>
<p>Build and run. The console will print something like:</p>
<pre>title: Samsung chief Lee arrested as S.Korean corruption probe deepens
site: reuters.com
link: http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title: ReactOS 0.4.4 Released
site: reactos.org
link: https://reactos.org/project-news/reactos-044-released
===
title: FeFETs: How this new memory stacks up against existing non-volatile memory
site: semiengineering.com
link: http://semiengineering.com/what-are-fefets/</pre>
<h2>Script Spec</h2>
<h3>Town</h3>
<p>A town is a lambda-like expression for storing an (im)mutable string. 
Most of the time, it is used to store a URL.</p>
<pre>page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"</pre>
<p>To use a town, call it as if it were a function:</p>
<pre>news[]: page(ext="Hello World!") -> $("tr.athing")</pre>
<p>You might have noticed that the @page parameter is not passed in this call. That is because it is a special parameter.</p>
<p>In a town definition, an expression such as name="something" means that the parameter name has the default value "something".</p>
<p>Incidentally, @page is a parameter that increases automatically when the current page has no more content.</p>
<h3>Node</h3>
<p>Nodes form a tree that represents the data structure you are going to crawl.</p>
<pre>news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href</pre>
<p>As in YAML, nodes express their hierarchy through indentation.</p>
<p>Node Name</p>
<p>Every node has a name. title is a field name and represents a plain string value. news[] is an array name and represents a parent structure with multiple child items.</p>
<p>Page</p>
<p>Page indicates where to fetch the field data from. It can be a town expression or a field reference.</p>
<p>Field references are an advanced use of nodes. You can find the details in <a href="/misc/goto?guid=4959737794635644266" rel="nofollow,noindex">./eh.crs</a>.</p>
<p>If a node has both a page and a fun, the page goes on the left of -> and the fun goes on the right of -> . 
That is, page -> fun.</p>
<p>Fun</p>
<p>A fun represents a data-processing step.</p>
<p>These are all the supported funs:</p>
<table>
<thead>
<tr> <th>Name</th> <th>Parameters</th> <th>Description</th> </tr>
</thead>
<tbody>
<tr> <td>$</td> <td>(selector: string)</td> <td>CSS selector</td> </tr>
<tr> <td>html</td> <td> </td> <td>inner HTML</td> </tr>
<tr> <td>text</td> <td> </td> <td>inner text</td> </tr>
<tr> <td>outerHTML</td> <td> </td> <td>outer HTML</td> </tr>
<tr> <td>attr</td> <td>(attr: string)</td> <td>attribute value</td> </tr>
<tr> <td>style</td> <td> </td> <td>style attribute value</td> </tr>
<tr> <td>href</td> <td> </td> <td>href attribute value</td> </tr>
<tr> <td>src</td> <td> </td> <td>src attribute value</td> </tr>
<tr> <td>calc</td> <td>(prec: int)</td> <td>calculate an arithmetic expression</td> </tr>
<tr> <td>match</td> <td>(regexp: string)</td> <td>match the first substring via a regular expression</td> </tr>
<tr> <td>expand</td> <td>(regexp: string, target: string)</td> <td>expand matched strings into the target string</td> </tr>
</tbody>
</table>
<h2>Author</h2>
<p>Plutonist</p>
<p><a href="/misc/goto?guid=4959737794729556554" rel="nofollow,noindex">impl.moe</a> · GitHub <a href="/misc/goto?guid=4959737794813216466" rel="nofollow,noindex">@wspl</a></p>
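<p>As a footnote to the Town section of the Script Spec, the way placeholders and default values combine can be pictured with a small Go sketch. This is only an illustration under stated assumptions: expandTown, its parameters, and the placeholder pattern are hypothetical names invented here, not part of the Creeper API, and the real implementation may differ (for example, in how it escapes values or advances @page).</p>

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {name} or {@name} inside a town template.
// The pattern is an assumption made for this sketch.
var placeholder = regexp.MustCompile(`\{(@?\w+)\}`)

// expandTown is a hypothetical helper, not a Creeper function.
// It fills each placeholder from args, falls back to defaults,
// and leaves placeholders with no known value untouched.
func expandTown(template string, defaults, args map[string]string) string {
	return placeholder.ReplaceAllStringFunc(template, func(m string) string {
		name := m[1 : len(m)-1] // strip the surrounding braces
		if v, ok := args[name]; ok {
			return v
		}
		if v, ok := defaults[name]; ok {
			return v
		}
		return m
	})
}

func main() {
	tmpl := "https://news.ycombinator.com/news?p={@page}&ext={ext}"
	defaults := map[string]string{"@page": "1"} // mirrors page(@page=1, ext)
	args := map[string]string{"ext": "hello"}   // mirrors page(ext="hello")

	// First request: @page falls back to its default value.
	fmt.Println(expandTown(tmpl, defaults, args))

	// Simulate the automatic @page increment for the next request.
	args["@page"] = "2"
	fmt.Println(expandTown(tmpl, defaults, args))
}
```

<p>The caller only supplies ext, while @page comes from the default and is later overridden to imitate the automatic increment described above.</p>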