NLP Tools: Another Roundup

This post surveys the tools involved in building a KG for a specific domain, covering setup and basic usage, with the aim of getting familiar with them.

Topics covered

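Phrase mining: quite useful in IE tasks such as keyword extraction and attribute extraction, where the mined phrases can serve as a reference for calibrating results.
Building a knowledge classification system
NLP tools (a single tool usually offers several functions, such as tokenization, sentence splitting, dependency parsing, NER, and so on). Sample code: project_hj_py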

Stanford CoreNLP
spaCy: lets you add patterns, similar to re. If you are already fluent in re, you can use re directly.
NLTK: includes stopwords files; on my machine they are stored at C:\Users\ASUS\AppData\Roaming\nltk_data\corpora\stopwords
…

KG tools

Cayley
Neo4j

Web information extraction
Domain PLM training
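
The spaCy entry in the list above notes that its patterns resemble re and that re can be used directly. As a rough illustration of the re route, here is a minimal sketch using only the standard library. The pattern, the function name `find_tool_versions`, and the sample sentence are assumptions made up for this example, not part of the original project.

```python
import re

# Hypothetical pattern: a tool name followed by a dotted version number,
# e.g. "Neo4j 5.12". Adjust the expression to the target domain.
TOOL_VERSION = re.compile(r"\b([A-Za-z][A-Za-z0-9]*)\s+(\d+(?:\.\d+)+)\b")


def find_tool_versions(text):
    """Return a list of (tool, version) pairs found in the text."""
    return TOOL_VERSION.findall(text)


sample = "We tested Neo4j 5.12 and spaCy 3.7 on the same corpus."
matches = find_tool_versions(sample)
print(matches)
```

A token-based matcher such as spaCy's may be the better fit when the pattern depends on part-of-speech tags or lemmas rather than on surface characters.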

Stopword removal

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus and the Punkt tokenizer data,
# both of which can be fetched with nltk.download().


def Delete_stopwords(example_sent):
    """Tokenize a sentence; return (all tokens, tokens without stopwords)."""
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(example_sent)
    # The NLTK English stopword list is lowercase, so compare in lowercase;
    # otherwise capitalized stopwords such as "This" slip through.
    filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
    return word_tokens, filtered_sentence


example_sent = "This is a sample sentence, showing off the stop words filtration."
word_tokens, filtered_sentence = Delete_stopwords(example_sent)
print(word_tokens, filtered_sentence)

Stopword files:

https://www.nltk.org/nltk_data/ (item 73)
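
Once a stopword file is on disk, it can also be read without going through `nltk.corpus`. The following is a minimal sketch that assumes a UTF-8 text file with one stopword per line; the helper names `load_stopwords` and `remove_stopwords` and the tiny demo file are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stop_words]


# Demo with a throwaway file; in practice, point load_stopwords at a real
# stopword file such as the ones under nltk_data/corpora/stopwords.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "english"
    demo_file.write_text("this\nis\na\nthe\n", encoding="utf-8")
    stop_words = load_stopwords(demo_file)

tokens = ["This", "is", "a", "sample", "sentence"]
filtered = remove_stopwords(tokens, stop_words)
print(filtered)
```

Loading the list into a set keeps each membership check cheap, which matters when filtering a large corpus.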

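Web information extraction:

Regular expressions are not the only option. For extraction across a large number of websites, there appears to be a dedicated research area: web information extraction.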

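Google Sheets: https://cn.gijn.org/2022/07/22/data-extraction-tools/
It can capture the text content inside tags; the IMPORTXML formula imports elements from a web page.
Google Sheets tutorial: https://blog.coupler.io/importhtml-function-google-sheets/
Recommendation: if you can write a crawler, skip Google Sheets. It is awkward to use and not as good as existing scraping software.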
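
For readers who go the scripting route, capturing the text inside a given tag can be sketched with the standard library alone. This is only an illustration under stated assumptions: the class `TagTextExtractor`, the helper `extract_text`, and the inline sample HTML are made up for this example, and a real crawler would also need to fetch pages and cope with messy markup.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.texts = []     # finished text snippets
        self._buffer = []   # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self._buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self._buffer).strip())

    def handle_data(self, data):
        # Text in nested tags (e.g. <b> inside <p>) is kept as well.
        if self.depth > 0:
            self._buffer.append(data)


def extract_text(html, tag):
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return parser.texts


sample_html = (
    "<html><body><h1>Tools</h1>"
    "<p>Neo4j is a <b>graph</b> database.</p>"
    "<p>Cayley too.</p></body></html>"
)
paragraphs = extract_text(sample_html, "p")
print(paragraphs)
```

This is roughly what an XPath query such as `//p` selects in a spreadsheet formula, except that here the results can be fed straight into the rest of a Python pipeline.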

Original article: https://blog.csdn.net/Hekena/article/details/126136641