Scraping HTML page content with HtmlCleaner

One of our small tools needs to extract content from existing HTML pages.

We used to search the pages with regular expressions, but the patterns were hard to write.

Later we switched to HtmlCleaner + XPath.

Consider the following code snippet:[code]TagNode tagNode = new HtmlCleaner().clean(message.getBody());
try {
	// Convert HtmlCleaner's tag tree into a standard W3C DOM document
	w3cDoc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
	xpath = XPathFactory.newInstance().newXPath();

	// Evaluate each XPath pattern against the DOM and read the result as a string
	firstName = (String) xpath.evaluate(t_FirstNameXPathPattern, w3cDoc, XPathConstants.STRING);
	email = (String) xpath.evaluate(t_EmailXPathPattern, w3cDoc, XPathConstants.STRING);
	phone = (String) xpath.evaluate(t_PhoneXPathPattern, w3cDoc, XPathConstants.STRING);
	mlsNumber = (String) xpath.evaluate(t_MlsNumberXPathPattern, w3cDoc, XPathConstants.STRING);
	comment = (String) xpath.evaluate(t_CommentXPathPattern, w3cDoc, XPathConstants.STRING);

} catch (Exception ex) {
	// Pass the exception as the last argument so its stack trace is logged
	logger.error("HTML XPATH PROCESS ERROR", ex);
}

[/code]In the code above, message.getBody() returns the HTML string to be processed.

The XPath patterns are defined as follows:

private String r_FirstNameXPathPattern = "/html/body/div/div/text()[3]";
private String r_LastNameXPathPattern = "/html/body/div/div/text()[4]";
private String r_MlsNumberXPathPattern = "/html/body/div/div/text()[18]";
private String r_EmailXPathPattern = "/html/body/div/div/a[1]";
private String r_PhoneXPathPattern = "/html/body/div/div/a[2]";
private String r_CommentXPathPattern = "/html/body/div/div/text()[11]";

This uses the native XPath API (javax.xml.xpath).
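
To illustrate how positional expressions such as text()[3] and a[1] resolve, here is a minimal sketch that uses only the JDK. It is not the tool's actual code: the class name XPathTextNodeDemo, the extract helper, and the sample markup are all made up for illustration, and the markup is already well-formed, so the HtmlCleaner step is skipped and the string is parsed directly with DocumentBuilder.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathTextNodeDemo {

    // Hypothetical sample page: the <br/> elements split the inner div's
    // text into separate text nodes, which text()[n] then selects by position.
    static final String SAMPLE =
        "<html><body><div><div>"
        + "Lead<br/>"
        + "Name:<br/>"
        + "Alice<br/>"
        + "<a href=\"mailto:alice@example.com\">alice@example.com</a>"
        + "</div></div></body></html>";

    // Parses well-formed markup into a W3C DOM and evaluates one XPath
    // expression, returning the trimmed string value of the first match.
    static String extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        String value = (String) xpath.evaluate(expression, doc, XPathConstants.STRING);
        return value.trim();
    }

    public static void main(String[] args) throws Exception {
        // Third text node directly under the inner div
        System.out.println(extract(SAMPLE, "/html/body/div/div/text()[3]")); // prints "Alice"
        // Text content of the first <a> element
        System.out.println(extract(SAMPLE, "/html/body/div/div/a[1]"));      // prints "alice@example.com"
    }
}
```

With XPathConstants.STRING, an expression that matches nothing yields an empty string rather than an exception, so index-based patterns like these fail silently when the page layout changes. It may be worth checking for empty results after extraction.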