IPProxyTool: an open-source proxy framework for crawlers
epimetheus
8 years ago
<p style="text-align:start">IPProxyTool uses scrapy spiders to crawl proxy-listing sites and collect a large number of free proxy IPs. It filters out the IPs that actually work and stores them in a database for later use.</p>
<h2>Runtime environment</h2>
<p style="text-align:start">python 2.7.12</p>
<h3 style="text-align: start;">Dependencies</h3>
<ul> <li>scrapy</li> <li>BeautifulSoup</li> <li>requests</li> <li>mysql-connector-python</li> <li>web.py</li> <li>scrapydo</li> <li>lxml</li> </ul>
<h3 style="text-align: start;">MySQL configuration</h3>
<ul> <li>Install MySQL and start it</li> <li>Install mysql-connector-python (<a href="/misc/goto?guid=4959737167257814816" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">installation guide</a>)</li> <li>Update the database settings in config.py</li> </ul>
<pre style="text-align:start"> <code>database_config = {
    'host': 'localhost',
    'port': 3306,
    'user': 'root',
    'password': '123456',
}
</code></pre>
<h2 style="text-align:start">Download and usage</h2>
<p style="text-align:start">Clone the project locally:</p>
<pre style="text-align:start"> <code>$ git clone https://github.com/awolfly9/IPProxyTool.git </code></pre>
<p style="text-align:start">Enter the project directory:</p>
<pre style="text-align:start"> <code>$ cd IPProxyTool </code></pre>
<p style="text-align:start">Run the crawling, validation, and server scripts separately:</p>
<pre style="text-align:start"> <code>$ python runspider.py </code></pre>
<pre style="text-align:start"> <code>$ python runvalidator.py </code></pre>
<pre style="text-align:start"> <code>$ python runserver.py </code></pre>
<h2 style="text-align:start">Project overview</h2>
<p>Crawling proxy sites</p>
<p style="text-align:start">All the code that crawls proxy sites lives in <a href="/misc/goto?guid=4959737167341230598" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">proxy</a>.</p>
<p>Adding spiders for other proxy sites</p>
<p style="text-align:start">1. Create a new script in the proxy directory with a class that inherits from BaseSpider.<br> 2. Set name, urls, and headers.<br> 3. Override the parse_page method to extract the proxy data.<br> 4. Store the data in the database. For concrete examples, see <a href="/misc/goto?guid=4959737167416302141" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">ip181</a> and <a 
href="/misc/goto?guid=4959737167506104245" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">kuaidaili</a>.<br> 5. To crawl a particularly complex proxy site, see <a href="/misc/goto?guid=4959737167586577773" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">peuland</a>.</p>
<p>Edit runspider.py to import the new spider and add it to the crawl queue.</p>
<p style="text-align:start">Run the runspider.py script to start crawling the proxy sites:</p>
<pre style="text-align:start"> <code>$ python runspider.py </code></pre>
<p>Validating proxy IPs</p>
<p style="text-align:start">Current validation method: each crawled proxy IP is set as the proxy for a scrapy request to a target site. If the target site returns a successful response within a reasonable time, the proxy IP is considered valid; otherwise it is considered invalid.<br> Each target site has its own script, and all the validation code lives in <a href="/misc/goto?guid=4959737167666763031" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">validator</a>.</p>
<p>Adding validation for other sites</p>
<p style="text-align:start">1. Create a new script in the validator directory with a class that inherits from Validator.<br> 2. Set name, timeout, urls, and headers.<br> 3. Then call the init method.<br> 4. For a particularly complex validation scheme, see <a href="/misc/goto?guid=4959737167756078255" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">assetstore</a>.</p>
<p>Edit runvalidator.py to import the new validator and add it to the validation queue.</p>
<p style="text-align:start">Run the runvalidator.py script to start validating the proxy IPs:</p>
<pre style="text-align:start"> <code>$ python runvalidator.py </code></pre>
<h3 style="text-align:start">Proxy IP data server</h3>
<p style="text-align:start">Set the server port with data_port in config.py (default: 8000), then start the server:</p>
<pre style="text-align:start"> <code>$ python runserver.py </code></pre>
<p style="text-align:start">The server exposes the following endpoints.</p>
<p>Fetch</p>
<p style="text-align:start"><a href="/misc/goto?guid=4959737167833208235" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">http://127.0.0.1:8000/select?name=douban</a></p>
<p style="text-align:start">Parameters</p>
<table style="-webkit-text-stroke-width:0px; 
border-collapse:collapse; border-spacing:0px; box-sizing:border-box; color:rgb(51, 51, 51); display:block; font-family:-apple-system,blinkmacsystemfont,segoe ui,helvetica,arial,sans-serif,apple color emoji,segoe ui emoji,segoe ui symbol; font-size:16px; font-style:normal; font-variant-caps:normal; font-variant-ligatures:normal; font-weight:normal; letter-spacing:normal; margin-bottom:16px; margin-top:0px; orphans:2; overflow:auto; text-align:start; text-indent:0px; text-transform:none; white-space:normal; widows:2; width:888px; word-spacing:0px"> <thead> <tr> <th>Name</th> <th>Type</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>name</td> <td>str</td> <td>Database name</td> </tr> </tbody> </table>
<p>Delete</p>
<p style="text-align:start"><a href="http://127.0.0.1:8000/delete?name=free_ipproxy&ip=27.197.144.181" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">http://127.0.0.1:8000/delete?name=free_ipproxy&ip=27.197.144.181</a></p>
<p style="text-align:start">Parameters</p>
<table style="-webkit-text-stroke-width:0px; border-collapse:collapse; border-spacing:0px; box-sizing:border-box; color:rgb(51, 51, 51); display:block; font-family:-apple-system,blinkmacsystemfont,segoe ui,helvetica,arial,sans-serif,apple color emoji,segoe ui emoji,segoe ui symbol; font-size:16px; font-style:normal; font-variant-caps:normal; font-variant-ligatures:normal; font-weight:normal; letter-spacing:normal; margin-bottom:16px; margin-top:0px; orphans:2; overflow:auto; text-align:start; text-indent:0px; text-transform:none; white-space:normal; widows:2; width:888px; word-spacing:0px"> <thead> <tr> <th>Name</th> <th>Type</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>name</td> <td>str</td> <td>Database name</td> </tr> <tr> <td>ip</td> <td>str</td> <td>IP to delete</td> </tr> </tbody> </table>
<p>Insert</p>
<p style="text-align:start"><a 
href="http://127.0.0.1:8000/insert?name=douban&ip=555.22.22.55&port=335&country=%E4%B8%AD%E5%9B%BD&anonymity=1&https=yes&speed=5&source=100" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">http://127.0.0.1:8000/insert?name=douban&ip=555.22.22.55&port=335&country=%E4%B8%AD%E5%9B%BD&anonymity=1&https=yes&speed=5&source=100</a></p>
<p style="text-align:start">Parameters</p>
<table style="-webkit-text-stroke-width:0px; border-collapse:collapse; border-spacing:0px; box-sizing:border-box; color:rgb(51, 51, 51); display:block; font-family:-apple-system,blinkmacsystemfont,segoe ui,helvetica,arial,sans-serif,apple color emoji,segoe ui emoji,segoe ui symbol; font-size:16px; font-style:normal; font-variant-caps:normal; font-variant-ligatures:normal; font-weight:normal; letter-spacing:normal; margin-bottom:16px; margin-top:0px; orphans:2; overflow:auto; text-align:start; text-indent:0px; text-transform:none; white-space:normal; widows:2; width:888px; word-spacing:0px"> <thead> <tr> <th>Name</th> <th>Type</th> <th>Description</th> <th>Required</th> </tr> </thead> <tbody> <tr> <td>name</td> <td>str</td> <td>Database name</td> <td>Yes</td> </tr> <tr> <td>ip</td> <td>str</td> <td>IP address</td> <td>Yes</td> </tr> <tr> <td>port</td> <td>str</td> <td>Port</td> <td>Yes</td> </tr> <tr> <td>country</td> <td>str</td> <td>Country</td> <td>No</td> </tr> <tr> <td>anonymity</td> <td>int</td> <td>1: high anonymity, 2: anonymous, 3: transparent</td> <td>No</td> </tr> <tr> <td>https</td> <td>str</td> <td>yes: https, no: http</td> <td>No</td> </tr> <tr> <td>speed</td> <td>float</td> <td>Access speed</td> <td>No</td> </tr> <tr> <td>source</td> <td>str</td> <td>IP source</td> <td>No</td> </tr> </tbody> </table>
<h2 style="text-align:start">TODO</h2>
<ul> <li>Add more filter conditions to the server's fetch endpoint</li> <li>Add https support</li> <li>Add detection of a proxy IP's anonymity level</li> <li>Crawl more free proxy sites</li> <li>Deploy the project in a distributed setup</li> </ul>
<h2 style="text-align:start">References</h2>
<ul> <li><a href="/misc/goto?guid=4959737168081228665" style="box-sizing: border-box; background-color: transparent; 
color: rgb(64, 120, 192); text-decoration: none;">IPProxyPool</a></li> </ul>
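<p>The server endpoints described earlier (select, delete, insert) are plain GET requests, so a client mostly needs to build the right query string. The following is a minimal sketch, assuming the server runs at the address used in the example URLs; the helper <code>build_url</code> and the constant <code>BASE_URL</code> are hypothetical names for illustration and are not part of IPProxyTool.</p>

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7


# Assumed server address: the host and default data_port from the example URLs.
BASE_URL = 'http://127.0.0.1:8000'


def build_url(action, params):
    """Build a request URL for the select, delete, or insert endpoint.

    `params` is a list of (key, value) pairs so the query order stays stable.
    This is a hypothetical helper, not a function provided by IPProxyTool.
    """
    if action not in ('select', 'delete', 'insert'):
        raise ValueError('unknown action: %s' % action)
    return '%s/%s?%s' % (BASE_URL, action, urlencode(params))


if __name__ == '__main__':
    # Fetch the proxies stored under the name "douban".
    print(build_url('select', [('name', 'douban')]))
    # Delete one proxy IP from "free_ipproxy".
    print(build_url('delete', [('name', 'free_ipproxy'),
                               ('ip', '27.197.144.181')]))
```

<p>A URL built this way could then be passed to <code>requests.get</code>, since requests is already among the project's dependencies. Using <code>urlencode</code> also takes care of percent-encoding non-ASCII values such as the country field.</p>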