Dissecting a Web Crawler (Fetching Pages with HtmlUnit)
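The first problem any web crawler faces is how to fetch a page. Fetching is actually easy, nowhere near as complicated as you might think: with the open-source HtmlUnit package, four lines of code are enough. Here is an example: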
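
    // Package names as used by HtmlUnit 2.x
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    final WebClient webClient = new WebClient();
    final HtmlPage page = webClient.getPage("http://www.yanyulin.info");
    System.out.println(page.asText());   // print the page as plain text
    webClient.closeAllWindows();         // release the client's resources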

To skip CSS and JavaScript processing, turn both off in the client options before fetching:

    final WebClient webClient = new WebClient();
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(false);
    final HtmlPage page = webClient.getPage("http://www.yanyulin.info");
    System.out.println(page.asText());
    webClient.closeAllWindows();

To emulate a specific browser:

    // Emulate Chrome; for another browser, change the constant after "BrowserVersion."
    WebClient webClient = new WebClient(BrowserVersion.CHROME);

To pick out a single element by its id:

    HtmlPage page = webClient.getPage("http://www.yanyulin.info");
    // Get the element with id "hed" from the Yanyulin blog
    HtmlDivision div = (HtmlDivision) page.getElementById("hed");
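
Elements can also be located with XPath:

    // In "//div", the leading "//" searches the whole document for div elements.
    // getByXPath returns them as a list, and get(0) takes the first one.
    // This prints the content of hed as well, provided hed is the first div on the page.
    final HtmlDivision div = (HtmlDivision) page.getByXPath("//div").get(0);
    System.out.println(div.asXml());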
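
To see which element `//div` actually puts first, here is a minimal sketch that uses the JDK's built-in `javax.xml.xpath` API rather than HtmlUnit, so it runs without any extra dependency. The class name `XPathDivSketch`, the helper `matchedIds`, and the sample markup (including the `wrap` and `foot` ids) are made up for illustration; only the id `hed` comes from the example above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDivSketch {

    // Hypothetical helper: evaluates the XPath expression against a
    // well-formed XML string and returns the id attribute of every match.
    static List<String> matchedIds(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(((Element) nodes.item(i)).getAttribute("id"));
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page: "hed" is nested inside an outer div.
        String xml = "<html><body>"
                + "<div id=\"wrap\"><div id=\"hed\">header</div></div>"
                + "<div id=\"foot\">footer</div>"
                + "</body></html>";
        // Every div in the document, nested ones included.
        System.out.println(matchedIds(xml, "//div"));
        // Only the div whose id is "hed".
        System.out.println(matchedIds(xml, "//div[@id='hed']"));
    }
}
```

In this sample the first match for `//div` is the outer `wrap` div, not `hed`. When the target is not guaranteed to be the first div on the page, a predicate such as `//div[@id='hed']` is the safer expression, and the same expression string can be passed to `page.getByXPath`.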

To fetch through a proxy that requires authentication, pass the proxy host and port to the constructor and register the credentials:

    // The last two constructor arguments are the proxy host and the proxy port
    final WebClient webClient = new WebClient(BrowserVersion.CHROME, "http://127.0.0.1", 8087);
    final DefaultCredentialsProvider credentialsProvider =
            (DefaultCredentialsProvider) webClient.getCredentialsProvider();
    credentialsProvider.addCredentials("username", "password");

To fill in and submit a form:

    // Get the form
    final HtmlForm form = page.getFormByName("form");
    // Get the submit button
    final HtmlSubmitInput button = form.getInputByName("submit");
    // Get the text field to fill in
    final HtmlTextInput textField = form.getInputByName("userid");
    textField.setValueAttribute("test");
    // Click the button to submit the form; the result is the page that comes back
    // (named page2 because "page" is already in use above)
    final HtmlPage page2 = button.click();
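
To list every link on the page:

    // achList is assumed to come from page.getAnchors(), which returns all anchors on the page
    java.util.List<HtmlAnchor> achList = page.getAnchors();
    for (HtmlAnchor ach : achList) {
        System.out.println(ach.getHrefAttribute());
    }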
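
The href values printed above are the raw attribute text, so they can be relative paths, in-page fragments, or `javascript:` and `mailto:` pseudo-links. Before a crawler can fetch them it usually needs absolute URLs. The following is a rough sketch of that step using only `java.net.URI` from the JDK. The class name `LinkResolveSketch`, the method `resolveLinks`, and the sample paths are hypothetical and not part of HtmlUnit.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkResolveSketch {

    // Hypothetical helper: turns raw href values into absolute http(s) URLs,
    // skipping fragments, non-web schemes, malformed values, and duplicates.
    // Assumes pageUrl has a non-empty path (for example it ends in "/"),
    // because java.net.URI resolves relative paths poorly against an empty path.
    static List<String> resolveLinks(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        Set<String> seen = new LinkedHashSet<>(); // keeps first-seen order
        for (String href : hrefs) {
            String trimmed = href.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // empty or in-page anchor
            }
            URI resolved;
            try {
                resolved = base.resolve(trimmed);
            } catch (IllegalArgumentException e) {
                continue; // href is not a valid URI
            }
            String scheme = resolved.getScheme();
            if (!"http".equals(scheme) && !"https".equals(scheme)) {
                continue; // javascript:, mailto:, and similar
            }
            seen.add(resolved.toString());
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Sample hrefs are invented for illustration.
        List<String> hrefs = Arrays.asList(
                "/about", "post.html", "#top", "javascript:void(0)",
                "mailto:someone@example.com", "http://example.com/x", "/about");
        for (String url : resolveLinks("http://www.yanyulin.info/blog/index.html", hrefs)) {
            System.out.println(url);
        }
    }
}
```

In a crawler built on the loop above, each `ach.getHrefAttribute()` value would be collected into the `hrefs` list and the page's own URL passed as `pageUrl`. The returned list is then a reasonable candidate for the queue of pages to visit next.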