Named Entity Recognition with CRF - CodingPark
Published: 2021-06-29 15:46:41
Category: Technical Articles
Introduction
This article walks through a complete project:
raw corpus -> corpus cleaning -> corpus splitting -> building training and test data -> CRF++ training -> named entity retrieval -> model evaluation
Raw Corpus
The January 1998 People's Daily corpus is used as the running example.
Corpus Cleaning
(1) Convert all full-width (SBC) characters in the corpus to half-width (DBC).
(2) Replace triple spaces with double spaces: adjacent annotations are supposed to be separated by exactly two spaces, but some triple spaces occur.
(3) Replace single spaces with double spaces, for the same reason.
(4) Merge bracketed compounds. For example, in '学生/n ,/w 奔波/v 于/p 两/m 个/q 课堂/n 。/w [上海市/ns 房屋/n 土地/n 管理局/n]nt 为/p', the bracketed tokens become the single token '上海市房屋土地管理局/nt'.
(5) Merge person names. For example, '金/nr 正日/nr' becomes '金正日/nr'.
(6) Merge time expressions. For example, '1月/t 26日/t' becomes '1月26日/t'.
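The full-to-half-width conversion in step (1) relies on a fixed Unicode offset: the full-width ASCII variants U+FF01..U+FF5E sit exactly 0xFEE0 (65248) above their half-width counterparts, and the ideographic space U+3000 maps to the ordinary space. A minimal per-character sketch (the function name is illustrative; the full cleaning script appears later in the article):

```python
def sbc2dbc_char(ch: str) -> str:
    """Map one full-width (SBC) character to its half-width (DBC) form."""
    code = ord(ch)
    if code == 0x3000:                # ideographic (full-width) space
        code = 0x20
    elif 0xFF01 <= code <= 0xFF5E:    # full-width ASCII variants
        code -= 0xFEE0                # fixed offset to the half-width form
    return chr(code)                  # other characters pass through unchanged

print(''.join(sbc2dbc_char(c) for c in '（Ｈｅｌｌｏ）'))  # -> (Hello)
```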
Corpus Splitting
Randomly split cleaned_data.txt 8:2 to obtain the raw training and test corpora.
Building Training and Test Data
Training and test data are generated with the "BMEWO" tagging scheme:
'B': Begin
'M': Middle
'E': End
'W': a single-character entity
'O': Other
Mapping between the recognized entity types and the corpus part-of-speech tags:
Time: TIME, /t
Person: PERSON, /nr
Location: LOCATION, /ns
Organization: ORGANIZATION, /nt
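To illustrate the scheme, labeling the characters of one tagged word can be sketched as follows (label_word is an illustrative helper name; the full generation script appears later in the article):

```python
def label_word(word_meta: str, entity: str) -> list:
    """Assign BMEWO tags to the characters of one entity word."""
    if len(word_meta) == 1:
        return [(word_meta, 'W')]          # single-character entity
    tags = (['B_' + entity]
            + ['M_' + entity] * (len(word_meta) - 2)
            + ['E_' + entity])
    return list(zip(word_meta, tags))

print(label_word('上海市', 'LOCATION'))
# -> [('上', 'B_LOCATION'), ('海', 'M_LOCATION'), ('市', 'E_LOCATION')]
```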
CRF++ Training

```
./CRF++-0.58/crf_learn -f 2 -c 3.0 /usr/cellar/NER/CRF++-0.58/example/seg/template labeled_train_data.txt model -t
```

../CRF++-0.58/crf_learn is the path to the crf_learn binary.
-f 2 -c 3.0 are training parameters (the feature frequency cutoff and the C hyperparameter).
/usr/cellar/NER/CRF++-0.58/example/seg/template is the path to the feature template.
labeled_train_data.txt is the training data.
-t additionally dumps the model in text format.
When training finishes, model and model.txt are generated.
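The template file tells crf_learn which character-window features to expand. The seg example template shipped with CRF++-0.58 looks roughly like the following (shown for orientation; check example/seg/template in your own installation for the exact contents):

```
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]

# Bigram
B
```

%x[row,col] refers to column col of the token row lines away from the current one; the single B line adds label-bigram features over adjacent output tags.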
CRF++ Testing
Test command: ../crf_test -[options] -m model test.data
For example:

```
../CRF++-0.58/crf_test -m model labeled_test_data.txt >> Fin_testdata.txt
```

In Fin_testdata.txt, the last column is the model's prediction and is the part to focus on. The second (gold-label) column mainly serves model evaluation; if no evaluation were needed it could arguably hold arbitrary values. The named entity retrieval below also reads only the predicted column.
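Since the original screenshot is not reproduced here, the shape of Fin_testdata.txt is as follows (illustrative lines): one character per line with the gold label and the predicted label as the last two tab-separated columns, and a blank line between sentences:

```
在	O	O
北	B_LOCATION	B_LOCATION
京	E_LOCATION	E_LOCATION
```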
Named Entity Retrieval
Person names are extracted from the predicted column of Fin_testdata.txt; the full script is given in the Complete Code section below.
Model Evaluation
Accuracy, precision, recall, and F1 are easily confused concepts in evaluating machine-learning models.
TP: positive sample predicted positive; FP: negative sample predicted positive; FN: positive sample predicted negative; TN: negative sample predicted negative.
Accuracy = (TP + TN) / (TP + FP + FN + TN). Accuracy is the fraction of all samples that are classified correctly, reflecting overall predictive quality.
Precision = TP / (TP + FP). Precision is the fraction of predicted positives that are truly positive.
Recall = TP / (TP + FN). Recall is the fraction of actual positives that the model finds (recalls).
F1 = 2 * Precision * Recall / (Precision + Recall). F1 is the harmonic mean of precision and recall. The two generally trade off against each other (higher recall tends to mean lower precision), so F1 serves as a combined measure; the larger the F1, the better the classifier.
Accuracy vs. precision: both grow with classifier quality, but accuracy is only meaningful when the classes are balanced. With severely imbalanced samples, accuracy is no longer informative and precision (together with recall) should be used instead.
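The four formulas above can be checked with a small worked example (the confusion-matrix counts are made up purely for illustration):

```python
def metrics(TP: int, FP: int, FN: int, TN: int):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (TP + TN) / (TP + FP + FN + TN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# e.g. 80 true positives, 20 false positives, 40 false negatives, 860 true negatives
acc, p, r, f1 = metrics(80, 20, 40, 860)
print(acc, p, r, f1)  # accuracy=0.94, precision=0.8, recall≈0.667, F1≈0.727
```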
Complete Code
Corpus Cleaning
```python
import copy
import re

# Read the January 1998 People's Daily corpus
def read_srcdata(path):
    with open(path, 'r', encoding='UTF-8') as fr:
        srcdata = fr.readlines()
    return srcdata

# Convert full-width (SBC) characters in the corpus to half-width (DBC)
def sbc2dbc(data):
    temp_data = copy.deepcopy(data)
    for i, sentence in enumerate(data):
        temp_sentence = ''
        for word in sentence:
            word_code = ord(word)
            if word_code == 12288:                # full-width space
                word_code = 32
            elif 65281 <= word_code <= 65374:     # other full-width characters
                word_code -= 65248
            temp_sentence += chr(word_code)
        temp_data[i] = temp_sentence
    return temp_data

# Triple spaces -> double spaces
def three_space_2_double_space(data):
    for i, sentence in enumerate(data):
        data[i] = '  '.join(sentence.split('   '))
    return data

# Single spaces -> double spaces
def single_space_2_double_space(data):
    for i, sentence in enumerate(data):
        tokens = sentence.split('  ')
        for j in range(len(tokens)):
            if ' ' in tokens[j]:
                tokens[j] = tokens[j].replace(' ', '  ')
        data[i] = '  '.join(tokens)
    return data

# Merge bracketed compounds, e.g.
# '[上海市/ns 房屋/n 土地/n 管理局/n]nt' -> '上海市房屋土地管理局/nt'
def merge_bracket_content(data):
    pattern = re.compile(r'\[([^\[\]]+)\]([a-z]{1,5})')

    def fold(match):
        # join the bare words inside the brackets, keep the tag after ']'
        chars = ''.join(w.split('/')[0] for w in match.group(1).split())
        return chars + '/' + match.group(2)

    for i, sentence in enumerate(data):
        data[i] = pattern.sub(fold, sentence)
    return data

# Merge person names, e.g. '金/nr 正日/nr' -> '金正日/nr'
def merge_name(data):
    for i, sentence in enumerate(data):
        sentence = sentence.split('  ')
        len_sen = len(sentence)
        temp_j = len_sen - 1
        for j in range(len_sen)[::-1]:
            if j - 1 >= 0:
                tag1 = sentence[j].split('/')[-1]
                tag2 = sentence[j - 1].split('/')[-1]
                if tag1 == 'nr' and tag2 == 'nr':
                    # join at most two /nr tokens (surname + given name):
                    # skip a token that was itself just produced by a merge
                    if j == temp_j and j != len_sen - 1:
                        continue
                    sentence[j - 1] = sentence[j - 1].split('/')[0] + sentence[j]
                    temp_j = j - 1
                    del sentence[j]
        data[i] = '  '.join(sentence)
    return data

# Merge time expressions, e.g. '1月/t 26日/t' -> '1月26日/t'
def merge_time(data):
    for i, sentence in enumerate(data):
        sentence = sentence.split('  ')
        for j in range(len(sentence))[::-1]:
            if j - 1 >= 0:
                tag1 = sentence[j].split('/')[-1]
                tag2 = sentence[j - 1].split('/')[-1]
                if tag1 == 't' and tag2 == 't':
                    sentence[j - 1] = sentence[j - 1].split('/')[0] + sentence[j]
                    del sentence[j]
        data[i] = '  '.join(sentence)
    return data

def main():
    # Clean the data in the following order
    data0 = read_srcdata('199801.txt')   # path to the raw corpus
    data1 = sbc2dbc(data0)
    data2 = three_space_2_double_space(data1)
    data3 = single_space_2_double_space(data2)
    data4 = merge_bracket_content(data3)
    data5 = merge_name(data4)
    data6 = merge_time(data5)
    with open('cleaned_data.txt', 'w', encoding='utf-8') as fw:
        for meta in data6:
            if meta == '\n':
                continue
            fw.write(meta)

if __name__ == '__main__':
    main()
```
Corpus Splitting
```python
# -*- coding: utf-8 -*-
"""Split the corpus 8:2 into a training set and a test set."""
import random

def read_srcdata(path):
    with open(path, 'r', encoding='UTF-8') as fr:
        srcdata = fr.readlines()
    return srcdata

def train_test_segment(datapath, train_txt, test_txt, ratio):
    srcdata = read_srcdata(datapath)
    del srcdata[-1]                      # drop the trailing '\n' line
    src_len = len(srcdata)
    max_len = int(src_len * ratio)
    index_set = list(range(src_len))
    random.shuffle(index_set)
    train_fw = open(train_txt, 'w', encoding='utf-8')
    test_fw = open(test_txt, 'w', encoding='utf-8')
    for i in range(len(index_set)):
        if i <= max_len:
            train_fw.write(srcdata[index_set[i]])
        else:
            test_fw.write(srcdata[index_set[i]])
    train_fw.close()
    test_fw.close()
    return True

if __name__ == '__main__':
    datapath = 'cleaned_data.txt'
    trainpath = 'train_data.txt'
    testpath = 'test_data.txt'
    ratio = 0.8                          # 0.8 of the lines go to the training set
    result = train_test_segment(datapath, trainpath, testpath, ratio)
    print(result)
```
Building Training and Test Data
```python
# -*- coding: utf-8 -*-
# Generate training data with the "BMEWO" tagging scheme.
"""
'B': Begin    'M': Middle    'E': End
'W': a single-character entity
'O': Other
Time: TIME, /t            Person: PERSON, /nr
Location: LOCATION, /ns   Organization: ORGANIZATION, /nt
"""
import codecs

TAG2ENTITY = {'t': 'TIME', 'nr': 'PERSON', 'ns': 'LOCATION', 'nt': 'ORGANIZATION'}

# Read the cleaned January 1998 People's Daily corpus
def read_srcdata(path):
    with open(path, 'r', encoding='UTF-8') as fr:
        return fr.readlines()

def generate_train_data(input_path, output_path):
    data0 = read_srcdata(input_path)
    delimiter = '\t'
    with codecs.open(output_path, 'w', encoding='utf-8') as fw:
        for sentence in data0:
            words = sentence.split('  ')
            for j, word in enumerate(words):
                if j == 0 or word == '\n':   # skip the leading article ID and line ends
                    continue
                split_word = word.split('/')
                tag = split_word[-1]
                word_meta = split_word[0]
                meta_len = len(word_meta)
                if tag in TAG2ENTITY:
                    entity = TAG2ENTITY[tag]
                    if meta_len == 1:        # single-character entity
                        fw.write(word_meta + delimiter + 'W' + '\n')
                        continue
                    for k, char in enumerate(word_meta):
                        if k == 0:
                            fw.write(char + delimiter + 'B_' + entity + '\n')
                        elif k == meta_len - 1:
                            fw.write(char + delimiter + 'E_' + entity + '\n')
                        else:
                            fw.write(char + delimiter + 'M_' + entity + '\n')
                else:
                    for char in word_meta:
                        fw.write(char + delimiter + 'O' + '\n')
            fw.write('\n')                   # blank line between sentences
    return True

if __name__ == '__main__':
    input_file = 'test_data.txt'
    output_file = 'labeled_test_data.txt'
    result = generate_train_data(input_file, output_file)
    print(result)
```
Named Entity Retrieval
```python
"""Person-name retrieval"""

def readtxt(path):
    with open(path, 'r', encoding='UTF-8') as fr:
        content = fr.readlines()   # note: readlines(), not readline()
    return content

def findit(content):
    c1 = c2 = c3 = ''
    name = []
    for line in content:
        if line == '\n':
            continue
        line = line.strip('\n').split('\t')
        predicted_value = line[-1]   # last column: the model's prediction
        if predicted_value == 'O':
            continue
        if predicted_value == 'B_PERSON':
            c1, c2, c3 = line[0], '', ''   # reset buffers on a new name
        if predicted_value == 'M_PERSON':
            c2 += line[0]                  # accumulate: names may have several middle characters
        if predicted_value == 'E_PERSON':
            c3 = line[0]
            C = c1 + c2 + c3
            print(C)
            name.append(C)
            c1 = c2 = c3 = ''

    return name

if __name__ == '__main__':
    raw = 'Fin_testdata.txt'
    content = readtxt(raw)
    name = findit(content)
    with open('FindName.txt', 'w', encoding='utf-8') as fw:
        fw.write(str(name))
    print()
    print("--- done ---")
```
Model Evaluation
```python
# -*- coding: utf-8 -*-
# Compute accuracy, precision, recall and F1 for the B_LOCATION tag.

def count_confusion(data, tag):
    """Count TP / FN / FP / TN for one tag in a single pass over the CRF++ output."""
    TP = FN = FP = TN = 0
    for line in data:
        if line == '\n':
            continue
        line = line.strip('\n').split('\t')
        actual_value = line[-2]      # gold label (second-to-last column)
        predicted_value = line[-1]   # predicted label (last column)
        if actual_value == tag and predicted_value == tag:
            TP += 1
        elif actual_value == tag:
            FN += 1
        elif predicted_value == tag:
            FP += 1
        else:
            TN += 1
    print('TP ->', TP, ' FN ->', FN, ' FP ->', FP, ' TN ->', TN)
    return TP, FN, FP, TN

def cal_main():
    with open('Fin_testdata.txt', 'r', encoding='UTF-8') as fr:
        data = fr.readlines()
    tag = 'B_LOCATION'
    TP, FN, FP, TN = count_confusion(data, tag)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    accuracy = (TP + TN) / (TP + TN + FN + FP)
    f1 = 2 * precision * recall / (precision + recall)
    # round to 3 decimal places
    return round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3)

if __name__ == '__main__':
    accuracy, precision, recall, f1 = cal_main()
    print(accuracy, precision, recall, f1)
```
Special Thanks
📍 The ./configure, make, make install commands explained: https://www.100txy.com/article/207.html
📍 Installing the CRF++ toolkit: https://www.jianshu.com/p/9a98701799af
📍 Using the CRF++ toolkit: https://www.jianshu.com/p/1fdece7f7c41
📍 Evaluating CRF++ results: https://www.jianshu.com/p/13f8792bfbe4
Original article: https://codingpark.blog.csdn.net/article/details/106618471