---
tags: AI Sharing
---

# Fast NER Toolkit

## Content

[TOC]

## 1 A Multi-purpose Fast NER Toolkit

### 1.0 Architecture

```mermaid
graph TD
    AC(("Multi-pattern string matching<br>Trie class"))
    RegExp["Regex automaton (re)"]
    SDAG["Shortest-path DAG"]
    HDAG["HMM DAG"]
    CRF["CRF (CRFSuite)"]
    TrieIndex["Prefix-tree index (Trie)"]
    DecisionIndex["Decision-tree index (Sorted Array)"]
    Sol1(["Multi-feature entity normalization: companies/schools/..."])
    Sol2(["Multi-level entity normalization: place names/addresses/..."])
    Sol3(["Tokenizer with user-defined vocabularies"])
    Sol4(["Fast NER combined with a vocabulary"])
    AC --> RegExp --> DecisionIndex --> Sol1
    RegExp --> TrieIndex --> Sol2
    AC --> HDAG --> Sol3
    AC --> SDAG --> CRF --> Sol4
    SDAG --> Sol3
```

### 1.1 Multi-feature Entity Normalization

```python
# Multi-pattern string matching: build a POS-tagging trie from a dictionary folder
posseg_trie = Trie.from_posseg(folder)

# Regex automaton: rules run over the trie's tokens, with POS tags enabled
reg_automaton = RegAuto(tokenizer=posseg_trie, pos_mode=True)
reg_automaton.add_rule(rule)

# Decision-tree index backed by a sorted array loaded from a .npy file
index = SortedArray.from_npy(file)

def solution(s: str):
    for token in reg_automaton(s):
        yield index.search(token)
```

### 1.2 Multi-level Entity Normalization

```python
# Multi-pattern string matching: build the entity trie from a CSV
data_trie = Trie.from_csv(csv)

# Regex automaton on top of the trie tokenizer
reg_automaton = RegAuto(tokenizer=data_trie)
reg_automaton.add_rule(rule)

# Prefix-tree index: split an index string into 3-character levels
def get_index(idx: str):
    return tuple(idx[i:i + 3] for i in range(0, len(idx), 3))

index_trie = Trie.create_from_index(key=get_index)

def solution(s: str):
    for (tokens,) in reg_automaton(s):
        yield from index_trie.merge(tokens)
```

### 1.3 Tokenizer with User-defined Vocabularies

```python
# Shortest path: segment by the shortest path through the DAG of all
# vocabulary matches
vocab_trie = Trie.from_csv(vocabulary)
shortest_path_dag = DAG.load_shortest_path(tokenizer=vocab_trie.cut_all)
solution = shortest_path_dag

# HMM
# Train
# TODO

# Predict: in-vocabulary words come from the trie, OOV spans from the HMM
hmm_trie = Trie.from_csv(hmm_login)
hmm_dag = DAG.load_hmm(hmm_non_login, tokenizer=hmm_trie.cut_all)
solution = hmm_dag
```

### 1.4 Fast NER Combined with a Vocabulary

```python
# Train
# TODO
# Each character carries a vocabulary feature (loc/abbr/ind/type/...) and a BIOES tag
sentences = [[('上', 'loc', 'B-COMP'), ('海', 'loc', 'I-COMP'),
              ('谷', 'abbr', 'I-COMP'), ('露', 'abbr', 'I-COMP'),
              ('软', 'ind', 'I-COMP'), ('件', 'ind', 'I-COMP'),
              ('有', 'type', 'I-COMP'), ('限', 'type', 'I-COMP'),
              ('公', 'type', 'I-COMP'), ('司', 'type', 'E-COMP'),
              ('的', 'O', 'O'), ('算', 'O', 'O'), ('法', 'O', 'O'),
              ('工', 'O', 'O'), ('程', 'O', 'O')]]
CRFSuite.train(sentences, algorithm="lbfgs")

# Predict
fast_ner = CRFSuite.load(tokenizer=shortest_path_dag.origin_cut, crf_model=crf_model)
```
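
The training step above is still marked TODO. As a point of reference, here is a minimal sketch of what it could look like with the open-source `sklearn-crfsuite` package; the `char2features` helper and its feature names are illustrative assumptions, not the toolkit's actual `CRFSuite.train` implementation.

```python
# Hypothetical stand-in for the CRFSuite training wrapper, built on the
# public sklearn-crfsuite package; feature names are illustrative assumptions.
import sklearn_crfsuite

def char2features(sent, i):
    """Turn (char, vocab_tag, label) triples into per-character feature dicts."""
    char, vocab_tag, _ = sent[i]
    features = {'bias': 1.0, 'char': char, 'vocab': vocab_tag}
    if i > 0:
        features.update({'-1:char': sent[i - 1][0], '-1:vocab': sent[i - 1][1]})
    else:
        features['BOS'] = True
    if i < len(sent) - 1:
        features.update({'+1:char': sent[i + 1][0], '+1:vocab': sent[i + 1][1]})
    else:
        features['EOS'] = True
    return features

# `sentences` is the training data from the block above
X_train = [[char2features(s, i) for i in range(len(s))] for s in sentences]
y_train = [[label for _, _, label in s] for s in sentences]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
```

The vocabulary feature (`loc`/`abbr`/`ind`/`type`) is what makes this "NER combined with a vocabulary": the trie's dictionary knowledge is injected into the CRF as an observation, so the model can generalize beyond exact dictionary hits.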
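
`DAG.load_shortest_path` in §1.3 hides the core trick: collect every dictionary match into a word DAG over the sentence, then take the path with the fewest words. Below is a self-contained sketch of that idea, with a plain `set` standing in for the `Trie` class; it is an assumption about the internals, not the toolkit's code.

```python
# Shortest-path segmentation over a word DAG: a minimal, self-contained sketch.
def cut_all(sentence, vocab):
    """Yield (start, end) for every vocabulary word in the sentence;
    single characters are always allowed as a fallback."""
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            if sentence[i:j] in vocab or j == i + 1:
                yield i, j

def shortest_path_cut(sentence, vocab):
    """Dynamic programming over the word DAG: fewest words wins."""
    n = len(sentence)
    edges = [[] for _ in range(n)]
    for i, j in cut_all(sentence, vocab):
        edges[i].append(j)
    best = [float('inf')] * (n + 1)   # best[k] = min #words covering sentence[:k]
    prev = [0] * (n + 1)
    best[0] = 0
    for i in range(n):
        for j in edges[i]:
            if best[i] + 1 < best[j]:
                best[j], prev[j] = best[i] + 1, i
    cuts, k = [], n                   # backtrack from the end of the sentence
    while k > 0:
        cuts.append(sentence[prev[k]:k])
        k = prev[k]
    return cuts[::-1]

print(shortest_path_cut('上海谷露软件有限公司', {'上海', '谷露', '软件', '有限公司'}))
# -> ['上海', '谷露', '软件', '有限公司']
```

Because single characters always form fallback edges, the DAG is never disconnected, and preferring fewer words naturally favors the longest dictionary matches.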
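
Likewise, the multi-pattern string matching node in the §1.0 diagram boils down to a character trie scanned from every start position. A minimal sketch follows; the toolkit's real `Trie` class presumably adds the POS tags, CSV/posseg loaders, and index constructors seen above.

```python
# A bare-bones character trie with an all-matches scan, mirroring `cut_all`.
class MiniTrie:
    def __init__(self, words):
        self.root = {}
        for w in words:
            node = self.root
            for ch in w:
                node = node.setdefault(ch, {})
            node[None] = w                    # terminal marker stores the word

    def cut_all(self, s):
        """Yield (start, word) for every dictionary word occurring in s."""
        for i in range(len(s)):
            node = self.root
            for j in range(i, len(s)):
                node = node.get(s[j])
                if node is None:
                    break
                if None in node:
                    yield i, node[None]

trie = MiniTrie(['上海', '谷露', '软件', '软件有限公司'])
print(list(trie.cut_all('上海谷露软件有限公司')))
# -> [(0, '上海'), (2, '谷露'), (4, '软件'), (4, '软件有限公司')]
```

Scanning every start position is O(n · L) for maximum word length L, which is what makes trie matching fast enough to sit at the root of the whole pipeline.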