Key words: HITS algorithm; anchor text; Web page title; topic relevance; vector model; topic training set
Effective strategy of topic distillation and retrieval
WANG Yuxin^a, LIU Haifeng^a, GUO He^b, CHEN Xin^b
(a. School of Electronic & Information Engineering, b. School of Software, Dalian University of Technology, Dalian, Liaoning 116023, China)
Abstract: Topic distillation and retrieval on the Internet is a key problem in vertical search engine research. The HITS algorithm is an early, classical method for this problem, and other researchers have since proposed improvements to it. Nevertheless, both the topic relevance and the accuracy of such engines still leave room for improvement. This paper proposes a strategy for topic distillation and retrieval that filters Web pages by their anchor texts and titles, combined with the topic relevance of the pages. Judging relevance against a topic training set overcomes the shortcomings of relying on query strings. Experimental results show that the strategy improves the accuracy of topic distillation and retrieval and reduces the amount of information downloaded from unrelated URLs.