---
tags: AI Sharing
---
# Fast NER Toolkit
## Content
[TOC]
## 1 A versatile fast NER toolkit
### 1.0 Architecture
```mermaid
graph TD
AC((Multi-pattern string matching<br>Trie class))
RegExp[Regex automaton re]
SDAG[Shortest-path DAG]
HDAG[HMM DAG]
CRF[CRF CRFSuite]
TrieIndex[Prefix-tree index Trie]
DecisionIndex[Decision-tree index Sorted Array]
Sol1([Multi-feature entity normalization for companies/schools/...])
Sol2([Multi-level entity normalization for places/addresses/...])
Sol3([Tokenizer with user-defined vocabulary])
Sol4([Fast NER with lexicon])
AC--> RegExp --> DecisionIndex --> Sol1
RegExp --> TrieIndex --> Sol2
AC --> HDAG --> Sol3
AC --> SDAG --> CRF --> Sol4
SDAG --> Sol3
```
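The foundation of the diagram is multi-pattern matching over a trie. As a rough illustration of what the `Trie` layer's `cut_all` does (the real class is not shown in this note; `MiniTrie` below is a hypothetical stand-in):

```python
# Minimal sketch of trie-based multi-pattern matching, the AC layer at
# the root of the diagram. MiniTrie is illustrative, not the real Trie.
class MiniTrie:
    def __init__(self, words):
        self.root = {}
        for w in words:
            node = self.root
            for ch in w:
                node = node.setdefault(ch, {})
            node[None] = w  # end-of-word marker

    def cut_all(self, s):
        """Yield (start, end, word) for every dictionary word found in s."""
        for i in range(len(s)):
            node = self.root
            for j in range(i, len(s)):
                node = node.get(s[j])
                if node is None:
                    break
                if None in node:
                    yield i, j + 1, node[None]

trie = MiniTrie(["上海", "软件", "有限公司"])
print(list(trie.cut_all("上海谷露软件有限公司")))
# [(0, 2, '上海'), (4, 6, '软件'), (6, 10, '有限公司')]
```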
### 1.1 Multi-feature entity normalization
```python
# Multi-pattern string matching
posseg_trie = Trie.from_posseg(folder)
# Regex automaton
reg_automaton = RegAuto(tokenizer=posseg_trie, pos_mode=True)
reg_automaton.add_rule(rule)
# Decision-tree index (sorted array)
index = SortedArray.from_npy(file)

def solution(s: str):
    for token in reg_automaton(s):
        yield index.search(token)
```
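The `SortedArray` index trades hashing for binary search, which maps cleanly onto a flat `.npy` layout. A hypothetical sketch of such a lookup with `bisect` (the key/value layout and all names are assumptions, not the real `SortedArray` API):

```python
import bisect

# Hypothetical stand-in for SortedArray.search: keys kept sorted so a
# lookup is one binary search over a flat array. Data is illustrative.
keys = ["上海谷露软件有限公司", "谷露", "谷露软件"]  # sorted keys
values = ["COMP:001", "COMP:001", "COMP:001"]        # aligned with keys

def search(token: str):
    i = bisect.bisect_left(keys, token)
    if i < len(keys) and keys[i] == token:
        return values[i]
    return None

print(search("谷露软件"))  # COMP:001
```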
### 1.2 Multi-level entity normalization
```python
# Multi-pattern string matching
data_trie = Trie.from_csv(csv)
# Regex automaton
reg_automaton = RegAuto(tokenizer=data_trie)
reg_automaton.add_rule(rule)
# Prefix-tree index
def get_index(idx: str):
    return tuple(idx[i: i + 3] for i in range(0, len(idx), 3))

index_trie = Trie.create_from_index(key=get_index)

def solution(s: str):
    for (tokens,) in reg_automaton(s):
        yield from index_trie.merge(tokens)
```
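The `get_index` key splits an entity id into fixed-width levels, so the index trie can resolve a mention level by level (e.g. city, then district). A hypothetical sketch of such a level-keyed trie (`LevelTrie` and its `merge` are illustrative, not the real `Trie.create_from_index` API):

```python
# Illustrative level-keyed trie: an id like "310115" is chunked into
# ("310", "115") via get_index above, so a mention resolves to the
# deepest level its tokens match. Names and layout are assumptions.
class LevelTrie:
    def __init__(self):
        self.root = {}

    def add(self, idx: str, name_by_level):
        node = self.root
        for part, name in zip(get_index(idx), name_by_level):
            node = node.setdefault((part, name), {})

    def merge(self, tokens):
        """Walk the levels matching tokens; yield each resolved id prefix."""
        node, code = self.root, ""
        for tok in tokens:
            nxt = next(((k, v) for k, v in node.items() if k[1] == tok), None)
            if nxt is None:
                break
            (part, _), node = nxt
            code += part
            yield code

index_trie = LevelTrie()
index_trie.add("310115", ["上海", "浦东"])
print(list(index_trie.merge(["上海", "浦东"])))  # ['310', '310115']
```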
### 1.3 Tokenizer with user-defined vocabulary
```python
# Shortest path
vocab_trie = Trie.from_csv(vocabulary)
shortest_path_dag = DAG.load_shortest_path(tokenizer=vocab_trie.cut_all)
solution = shortest_path_dag

# HMM
# Train
# TODO
# Predict
hmm_trie = Trie.from_csv(hmm_login)
hmm_dag = DAG.load_hmm(tokenizer=hmm_trie.cut_all, hmm_model=hmm_non_login)  # keyword name assumed
solution = hmm_dag
```
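For intuition, the shortest-path layer can be read as dynamic programming over the word DAG produced by `cut_all`: choose the covering path with the fewest tokens, falling back to single characters. A sketch under that assumption, reusing `MiniTrie` from 1.0 (the real `DAG.load_shortest_path` is not shown in this note):

```python
# Sketch of shortest-path segmentation over the cut_all word DAG.
def shortest_path_cut(s, trie):
    n = len(s)
    edges = [[] for _ in range(n)]      # edges[i] = word end positions from i
    for i, j, _ in trie.cut_all(s):
        edges[i].append(j)
    INF = float("inf")
    cost = [0] + [INF] * n              # cost[i] = min tokens covering s[:i]
    prev = [0] * (n + 1)
    for i in range(n):
        if cost[i] == INF:
            continue
        for j in edges[i] + [i + 1]:    # dictionary edges plus 1-char fallback
            if cost[i] + 1 < cost[j]:
                cost[j], prev[j] = cost[i] + 1, i
    out, i = [], n                      # backtrack the cheapest path
    while i > 0:
        out.append(s[prev[i]:i])
        i = prev[i]
    return out[::-1]

trie = MiniTrie(["上海", "软件", "有限公司", "谷露软件"])
print(shortest_path_cut("上海谷露软件有限公司", trie))
# ['上海', '谷露软件', '有限公司']
```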
### 1.4 Fast NER combined with the lexicon
```python
# Train
# TODO
sentences = [[('上', 'loc', 'B-COMP'),
('海', 'loc', 'I-COMP'),
('谷', 'abbr', 'I-COMP'),
('露', 'abbr', 'I-COMP'),
('软', 'ind', 'I-COMP'),
('件', 'ind', 'I-COMP'),
('有', 'type', 'I-COMP'),
('限', 'type', 'I-COMP'),
('公', 'type', 'I-COMP'),
('司', 'type', 'E-COMP'),
('的', 'O', 'O'),
('算', 'O', 'O'),
('法', 'O', 'O'),
('工', 'O', 'O'),
('程', 'O', 'O')]]
CRFSuite.train(sentences, algorithm="lbfgs")
# Predict
fast_ner = CRFSuite.load(tokenizer=shortest_path_dag.origin_cut, crf_model=crf_model)
solution = fast_ner
```
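The `CRFSuite` wrapper itself is not documented in this note. As an illustration of the same training format, here is the per-character `(char, lexicon tag, BIOES label)` data above pushed through the open-source `sklearn-crfsuite` package, with the lexicon tag used as a CRF feature:

```python
import sklearn_crfsuite

def featurize(sent):
    """Turn (char, lexicon_tag, label) triples into CRF feature dicts."""
    feats = []
    for i, (ch, pos, _) in enumerate(sent):
        f = {"char": ch, "pos": pos}
        if i > 0:
            f["-1:char"], f["-1:pos"] = sent[i - 1][0], sent[i - 1][1]
        if i < len(sent) - 1:
            f["+1:char"], f["+1:pos"] = sent[i + 1][0], sent[i + 1][1]
        feats.append(f)
    return feats

X = [featurize(s) for s in sentences]
y = [[label for *_, label in s] for s in sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])  # BIOES sequence, e.g. starting B-COMP, I-COMP, ...
```

Feeding the lexicon tag in as a feature is what "combined with the lexicon" buys: the CRF can lean on dictionary evidence while the BIOES labels remain free to disagree with it.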