Simhash and Hamming Distance
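Compute the simhash value of a Chinese document. Simhash is an algorithm that Google uses for text deduplication, and it is now widely used in text processing. The Simhash engine first segments the text and extracts keywords, then computes the simhash value and the Hamming distance.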
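Note: the Simplified and Traditional Chinese dictionaries differ slightly. The results shown on this page are for Simplified Chinese. The same applies below.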
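The pipeline described above (weighted keywords in, fingerprint out) can be sketched in base R. This is only an illustration of the general simhash idea under stated assumptions: `toy_hash` and `toy_simhash` are hypothetical helpers, not part of the package, and the 31-bit polynomial hash is a stand-in for whatever hash the real engine uses. The second document and its weights are made up for the example.

```r
# Toy string hash (hypothetical): polynomial hash reduced modulo 2^31
# so that the result always fits in an R integer.
toy_hash <- function(word) {
  h <- 0
  for (code in utf8ToInt(word)) {
    h <- (h * 31 + code) %% 2^31
  }
  as.integer(h)
}

# Toy simhash: every keyword votes on every bit position.
# A 1 bit adds the keyword weight, a 0 bit subtracts it.
# The fingerprint bit is 1 where the total vote is positive.
toy_simhash <- function(weights, words, nbits = 31L) {
  votes <- numeric(nbits)
  for (i in seq_along(words)) {
    bits <- as.integer(intToBits(toy_hash(words[i])))[seq_len(nbits)]
    votes <- votes + ifelse(bits == 1L, weights[i], -weights[i])
  }
  as.integer(votes > 0)
}

# Two small keyword sets with illustrative weights.
lhs <- toy_simhash(c(11.7392, 11.7392), c("hello", "world"))
rhs <- toy_simhash(c(11.7392, 11.7392), c("hello", "there"))

# Hamming distance: number of bit positions where the fingerprints differ.
sum(lhs != rhs)
```

Documents that share heavily weighted keywords tend to agree on most bits, which is why a small Hamming distance suggests near-duplicate text.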
words = "hello world!"
simhasher = worker("simhash",topn=2)
simhasher <= "江州市长江大桥参加了长江大桥的通车仪式"
$simhash
[1] "12882166450308878002"
$keyword
22.3853 8.69667
"长江大桥" "江州"
distance(words, "江州市长江大桥参加了长江大桥的通车仪式",simhasher)
$distance
[1] "23"
$lhs
11.7392 11.7392
"hello" "world"
$rhs
22.3853 8.69667
"长江大桥" "江州"
What's new in v0.7
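1. Added: get_idf() computes IDF values from the segmented terms of multiple documents. The input is a list of text vectors, where each vector represents one document. A custom stop word list can be supplied.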
get_idf(a_big_list, stop = "stop word list", path = "IDF output directory")
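To make the idea concrete, here is a minimal base R sketch of computing IDF from a list of tokenized documents. It assumes the textbook formula log(N / df), which may differ from what get_idf() actually uses. `toy_idf` is a hypothetical helper, not a package function, and it takes the stop words as a character vector.

```r
# Hypothetical helper: IDF from a list of character vectors,
# one vector of terms per document.
toy_idf <- function(docs, stop_words = character(0)) {
  n_docs <- length(docs)
  # Keep each term once per document and drop stop words.
  per_doc <- lapply(docs, function(d) setdiff(unique(d), stop_words))
  # Document frequency: in how many documents each term occurs.
  df <- table(unlist(per_doc))
  idf <- log(n_docs / as.numeric(df))
  names(idf) <- names(df)
  idf
}

docs <- list(
  c("weather", "is", "nice", "today"),
  c("weather", "is", "bad"),
  c("simhash", "is", "fast")
)
toy_idf(docs, stop_words = "is")
```

Terms that occur in fewer documents get a higher value, so they carry more weight during keyword extraction.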
2. Added: vector_simhash and vector_distance compute the simhash value and the Hamming distance directly on text vectors.
sim = worker("simhash")
cutter = worker()
vector_simhash(cutter["这是一个比较长的测试文本。"],sim)
$simhash
[1] "9679845206667243434"
$keyword
8.94485 7.14724 4.77176 4.29163 2.81755
"文本" "测试" "比较" "这是" "一个"
vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim)
$simhash
[1] "13133893567857586837"
$keyword
6.45994 6.18823 5.64148 5.63374 4.99212
"天气" "不错" "感觉" "真的" "今天"
vector_distance(c("今天","天气","真的","十分","不错","的","感觉"),c("今天","天气","真的","十分","不错","的","感觉"),sim)
$distance
[1] "0"
$lhs
6.45994 6.18823 5.64148 5.63374 4.99212
"天气" "不错" "感觉" "真的" "今天"
$rhs
6.45994 6.18823 5.64148 5.63374 4.99212
"天气" "不错" "感觉" "真的" "今天"
3. Added: tobin converts a simhash value to its binary representation.
res = vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim)
tobin(res$simhash)
[1] "0000000000000000000000000000000000010101111100000111001010010101"
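Once two fingerprints are available as binary strings, the Hamming distance is the number of positions at which the strings differ. A small base R sketch, where `hamming_bin` is a hypothetical helper and not part of the package:

```r
# Hypothetical helper: Hamming distance between two equal-length
# strings made of "0" and "1" characters.
hamming_bin <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  # Split both strings into single characters and count mismatches.
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

hamming_bin("0101", "0011")  # positions 2 and 3 differ, so the result is 2
```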