A Web page deduplication algorithm based on body-text structure and long-sentence extraction


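  Abstract: Based on the characteristics of duplicated Web pages and the structural features of Web page body text, a dynamic, hierarchical, and highly robust Web page deduplication algorithm is proposed. The method represents the body text of a Web page as a text structure tree, and on that basis implements a dynamic feature extraction algorithm and a hierarchical-fingerprint similarity computation algorithm. Feature extraction uses a long-sentence extraction algorithm, which ensures strong robustness. Experiments show that the method accurately detects both mirror Web pages and near-mirror Web pages.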

  Key words: Web page deduplication; text structure tree; long-sentence extraction; hierarchical fingerprint

  CLC number: TP391    Document code: A

  Article ID: 1001-3695(2010)07-2489-03

  doi:10.3969/j.issn.1001-3695.2010.07.024

  Detection and elimination of similar Web pages based on text structure and extraction of long sentences

  HUANG Ren,FENG Sheng,YANG Ji-yun,LIU Yu,AO Min

  (College of Computer Science, Chongqing University, Chongqing 400044, China)

  Abstract: Considering the characteristics of similar Web pages and the text structure of Web pages, this paper proposes a dynamic, layered, and robust algorithm to detect and eliminate similar Web pages. The method expresses the text of a Web page as a text structure tree, and on that basis implements a dynamic algorithm for extracting text features and a layered fingerprint algorithm for calculating similarity. Because feature extraction uses a long-sentence extraction algorithm, the method is robust. The experimental results show that the method accurately detects both completely similar Web pages and partly similar ones.
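
  The abstract does not give the details of the tree construction or the similarity formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a page is already split into paragraphs (standing in for the nodes of a text structure tree), that the features of a paragraph are its k longest sentences, and that similarity is the Jaccard overlap of hashed sentences. The function names, the parameter k, the use of MD5, and the Jaccard measure are all illustrative assumptions.

```python
import hashlib
import re


def long_sentences(text, k=2):
    """Return the k longest sentences of `text` (assumed feature choice)."""
    # Split on common Chinese and Western sentence terminators.
    parts = [s.strip() for s in re.split(r"[。!?!?.;;]+", text) if s.strip()]
    # sorted() is stable, so earlier sentences win ties on length.
    return sorted(parts, key=len, reverse=True)[:k]


def fingerprint(sentences):
    """Hash each sentence to a short hex digest and collect them in a set."""
    return {hashlib.md5(s.encode("utf-8")).hexdigest()[:16] for s in sentences}


def page_fingerprints(paragraphs, k=2):
    """Return (page-level fingerprint, list of paragraph-level fingerprints)."""
    layers = [fingerprint(long_sentences(p, k)) for p in paragraphs]
    top = set().union(*layers) if layers else set()
    return top, layers


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(page_a, page_b, k=2):
    """Compare two pages by their page-level fingerprints (assumed measure)."""
    top_a, _ = page_fingerprints(page_a, k)
    top_b, _ = page_fingerprints(page_b, k)
    return jaccard(top_a, top_b)
```

  Under these assumptions, a change confined to a short sentence leaves the fingerprint untouched, which is one plausible reading of why long-sentence features would make detection robust to small edits. The paragraph-level fingerprints returned by page_fingerprints could support a finer, layer-by-layer comparison.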

......
The full text is not available here. To read it, contact the publisher of the journal 《计算机应用研究》 (Application Research of Computers).