
A Method for Sentence Segmentation and Punctuation Marking of Ancient Chinese Texts Based on Cascaded CRF


ZHANG He (张合), WANG Xiao-dong (王晓东), YANG Jian-yu (杨建宇), ZHOU Wei-dong (周卫东)

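  Abstract: Data sparseness is the main challenge in applying natural language understanding techniques to sentence segmentation and punctuation marking of ancient Chinese. To address it, a six-tag character-position set is designed, and a method based on a cascaded CRF model is proposed for segmenting and punctuating ancient texts. Using the six-tag set, the low-level model determines sentence boundaries from the observation sequence, and the high-level model assigns punctuation marks using both the observation sequence and the sentence-boundary information from the low level. Experiments on a 5M mixed corpus of ancient texts comprised a closed test and an open test. In the closed test, the F-measures for sentence segmentation and punctuation marking reached 96.48% and 91.35%, respectively; in the open test, they reached 71.42% and 67.67%, respectively.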
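The two-layer data flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not list the six tags, so the names `B`, `B2`, `B3`, `M`, `E`, `S` are hypothetical, as are the helper functions, the feature names `char` and `boundary`, and the punctuation inventory. The sketch only builds training sequences for the two layers; fitting the CRF models themselves is left to a CRF toolkit and is not shown.

```python
# Hypothetical punctuation inventory; the paper's actual set is not given in the abstract.
PUNCT = set(",。?!;:、")

def split_clauses(text):
    """Split punctuated text into (clause, closing_mark) pairs."""
    clauses, buf = [], []
    for ch in text:
        if ch in PUNCT:
            if buf:
                clauses.append(("".join(buf), ch))
                buf = []
        else:
            buf.append(ch)
    if buf:
        clauses.append(("".join(buf), ""))  # trailing text with no closing mark
    return clauses

def position_tags(n):
    """Assumed six-tag scheme: position of each character inside a clause of length n."""
    if n == 1:
        return ["S"]
    head = ["B", "B2", "B3"][: n - 1]
    middle = ["M"] * (n - 1 - len(head))
    return head + middle + ["E"]

def low_level_example(text):
    """Low layer: observation sequence (characters) -> boundary tags."""
    chars, tags = [], []
    for clause, _ in split_clauses(text):
        chars.extend(clause)
        tags.extend(position_tags(len(clause)))
    return chars, tags

def high_level_example(text):
    """High layer: (character, low-layer boundary tag) -> punctuation label.

    The label is the mark that follows a clause-final character, or "O" otherwise.
    """
    feats, labels = [], []
    for clause, mark in split_clauses(text):
        tags = position_tags(len(clause))
        last = len(clause) - 1
        for i, (ch, tag) in enumerate(zip(clause, tags)):
            feats.append({"char": ch, "boundary": tag})
            labels.append(mark if (i == last and mark) else "O")
    return feats, labels

if __name__ == "__main__":
    chars, tags = low_level_example("學而時習之,不亦說乎?")
    print(list(zip(chars, tags)))
    feats, labels = high_level_example("學而時習之,不亦說乎?")
    print(labels)
```

At prediction time the high layer would receive the boundary tags predicted by the low layer rather than gold tags, which is what makes the arrangement a cascade.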

  Keywords: ancient Chinese; cascaded conditional random fields; data sparseness; sentence segmentation; punctuation marking

  CLC number: TP391    Document code: A

  Article ID: 1001-3695(2009)09-3326-04

  doi:10.3969/j.issn.1001-3695.2009.09.036

  Method of sentence segmentation and punctuating for ancient Chinese literatures based on cascaded CRF

  ZHANG He, WANG Xiao-dong, YANG Jian-yu, ZHOU Wei-dong³

  (1. College of Computer & Information Technology, Henan Normal University, Xinxiang Henan 453007, China; 2. Beijing d-Ear Technologies Co., Ltd., Beijing 100085, China; 3. Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China)

  Abstract: Data sparseness is a primary challenge in sentence segmentation and punctuation of ancient Chinese literature using natural language processing technology. To overcome this difficulty, this paper designs a 6-tag set and proposes a method based on cascaded conditional random fields. The main idea is as follows: based on the 6-tag set, a low-level model determines sentence boundaries from the observation sequence, and a high-level model punctuates sentences using both the observation sequence and the low-level results. A closed test and an open test were carried out on a mixed corpus of approximately 5M. The F-measures of sentence segmentation and punctuation were 96.48% and 91.35%, respectively, in the closed test, and 71.42% and 67.67%, respectively, in the open test.
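For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F-measure over predicted sentence boundaries could be computed. It assumes the balanced F1 (harmonic mean of precision and recall) and represents boundaries as character offsets; the abstract specifies neither, and the function name `f_measure` is hypothetical.

```python
def f_measure(gold, predicted):
    """Balanced F-measure over two collections of boundary positions.

    Assumption: a predicted boundary counts as correct only if it
    matches a gold boundary position exactly.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)  # share of predictions that are right
    recall = correct / len(gold)          # share of gold boundaries recovered
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Three gold boundaries, four predicted, three of them correct.
    print(f_measure({5, 9, 14}, {5, 9, 12, 14}))
```

For punctuation marking, the same computation could be applied to (position, mark) pairs instead of bare positions, so that a boundary with the wrong mark counts as an error.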

......
The full text is not available here. To read it, contact the publisher of the journal 《计算机应用研究》 (Application Research of Computers).