Simhash and Hamming Distance

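Compute the simhash value of a Chinese document. Simhash is an algorithm that Google has used for text deduplication, and it is now widely used in text processing. The simhash engine first segments the text and extracts keywords, then computes the simhash value and the Hamming distance.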

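Note: the Simplified and Traditional Chinese dictionaries differ slightly. The results shown on this page are for Simplified Chinese. The same applies below.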

words = "hello world!"
simhasher = worker("simhash",topn=2)
simhasher <= "江州市长江大桥参加了长江大桥的通车仪式"
$simhash
[1] "12882166450308878002"

$keyword
   22.3853    8.69667 
"长江大桥"     "江州"
distance(words, "江州市长江大桥参加了长江大桥的通车仪式",simhasher)
$distance
[1] "23"

$lhs
11.7392 11.7392 
"hello" "world" 

$rhs
   22.3853    8.69667 
"长江大桥"     "江州"

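The pipeline described in this section (weighted keywords, then a fingerprint, then a Hamming distance) can be illustrated with a minimal base R sketch. It is not jiebaR's implementation: toy_hash_bits, toy_simhash and hamming are hypothetical names, and the 16-bit polynomial hash is a stand-in for the 64-bit hash a real simhash engine would use.

```r
# Toy 16-bit hash of a word: polynomial rolling hash over Unicode code points.
# (Hypothetical stand-in; a real simhash engine uses a 64-bit hash function.)
toy_hash_bits <- function(word, nbits = 16L) {
  h <- 0
  for (cp in utf8ToInt(word)) {
    h <- (h * 31 + cp) %% 2^nbits
  }
  # Expand the hash into a vector of 0/1 bits, most significant bit first.
  as.integer((h %/% 2^((nbits - 1L):0)) %% 2)
}

# Simhash voting: each keyword adds its weight to the positions where its
# hash has a 1 bit and subtracts it where the hash has a 0 bit. A fingerprint
# bit is 1 when the total for that position is positive.
toy_simhash <- function(keywords, weights, nbits = 16L) {
  votes <- numeric(nbits)
  for (i in seq_along(keywords)) {
    bits <- toy_hash_bits(keywords[i], nbits)
    votes <- votes + weights[i] * ifelse(bits == 1L, 1, -1)
  }
  as.integer(votes > 0)
}

# Hamming distance: number of positions where two fingerprints differ.
hamming <- function(a, b) sum(a != b)

# Keywords and weights taken from the example output above.
lhs <- toy_simhash(c("hello", "world"), c(11.7392, 11.7392))
rhs <- toy_simhash(c("长江大桥", "江州"), c(22.3853, 8.69667))
hamming(lhs, rhs)
```

Because the hash and the fingerprint width differ from the real engine, the distance printed by this sketch will not match the value returned by distance(). Only the voting and bit-counting steps are the point.
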
What's new in v0.7

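1. Added: get_idf() computes IDF values from the segmented terms of multiple documents. The input is a list of character vectors, where each vector represents one document. A custom stop word list can be supplied.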

get_idf(a_big_list,stop="stop word list",path="IDF output directory")
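
To make the idea of computing IDF from a list of segmented documents concrete, here is a minimal base R sketch. It assumes the textbook definition idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing term t. The exact formula and smoothing used by get_idf() may differ, and toy_idf is a hypothetical name.

```r
# Compute a textbook IDF for every term in a list of tokenised documents.
# docs:       list of character vectors, one vector per document
# stop_words: character vector of terms to ignore
toy_idf <- function(docs, stop_words = character(0)) {
  n_docs <- length(docs)
  # Keep each term once per document and drop stop words.
  per_doc_terms <- lapply(docs, function(d) unique(d[!d %in% stop_words]))
  # Document frequency: in how many documents each term occurs.
  doc_freq <- table(unlist(per_doc_terms))
  idf <- log(n_docs / as.numeric(doc_freq))
  names(idf) <- names(doc_freq)
  idf
}

docs <- list(
  c("今天", "天气", "不错"),
  c("今天", "的", "测试", "文本"),
  c("天气", "的", "感觉")
)
toy_idf(docs, stop_words = "的")
```

Terms that occur in fewer documents receive a larger value, which is why rare keywords dominate the keyword weights used by the simhash engine.
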

2. Added: vector_simhash and vector_distance compute the simhash value and the Hamming distance directly on character vectors of words.

sim = worker("simhash")
cutter = worker()
vector_simhash(cutter["这是一个比较长的测试文本。"],sim)
$simhash
[1] "9679845206667243434"

$keyword
8.94485 7.14724 4.77176 4.29163 2.81755 
 "文本"  "测试"  "比较"  "这是"  "一个"
vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim)
$simhash
[1] "13133893567857586837"

$keyword
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天"
vector_distance(c("今天","天气","真的","十分","不错","的","感觉"),c("今天","天气","真的","十分","不错","的","感觉"),sim)
$distance
[1] "0"

$lhs
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天" 

$rhs
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天"

3. Added: tobin converts a simhash value to its binary representation.

res = vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim) 
tobin(res$simhash)
[1] "0000000000000000000000000000000000010101111100000111001010010101"
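
Simhash values are returned as decimal strings because R has no native unsigned 64-bit integer type. As a rough illustration of what a decimal-to-binary conversion involves, the following base R sketch divides the decimal digits by 2 repeatedly and collects the remainders. dec_to_bin64 is a hypothetical helper, not tobin's implementation, and it has not been verified to produce the same output as tobin for every input.

```r
# Convert a non-negative decimal string to a 64-character binary string.
# Values of 2^64 or more are silently truncated to their low 64 bits.
dec_to_bin64 <- function(dec) {
  digits <- as.integer(strsplit(dec, "")[[1]])
  bits <- integer(64)
  # Fill the bits from least significant (position 64) to most significant.
  for (pos in 64:1) {
    # Long division of the digit vector by 2; the remainder is the next bit.
    rem <- 0L
    for (i in seq_along(digits)) {
      cur <- rem * 10L + digits[i]
      digits[i] <- cur %/% 2L
      rem <- cur %% 2L
    }
    bits[pos] <- rem
  }
  paste(bits, collapse = "")
}

dec_to_bin64("5")
```

Working on the digit vector avoids the precision loss that as.numeric() would introduce for values above 2^53.
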