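1. Added: get_tuple() returns the frequency of each combination of up to n consecutive strings in a segmentation result. The counts can serve as a reference when building a custom dictionary.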
get_tuple(c("sd","sd","sd","rd"),size=3)
# name count
# 4 sdsd 2
# 1 sdrd 1
# 2 sdsdrd 1
# 3 sdsdsd 1
get_tuple(list(
c("sd","sd","sd","rd"),
c("新浪","微博","sd","rd")
), size = 2)
# name count
# 2 sdrd 2
# 3 sdsd 2
# 1 微博sd 1
# 4 新浪微博 1
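The counting that get_tuple() reports can be approximated in base R. The sketch below is not jiebaR's implementation: `count_tuples` is a hypothetical helper, and it assumes that every run of 2 to `size` consecutive tokens is joined and tallied, which is consistent with the size=3 output shown above. Row order may differ from what get_tuple() prints.

```r
# Hypothetical helper (not part of jiebaR): tally joined n-grams of 2..size tokens.
count_tuples <- function(tokens, size = 2) {
  stopifnot(size >= 2)
  grams <- character(0)
  for (n in 2:size) {
    if (length(tokens) < n) next            # vector too short for this n
    for (i in seq_len(length(tokens) - n + 1)) {
      # Join tokens i..(i + n - 1) into one candidate string.
      grams <- c(grams, paste(tokens[i:(i + n - 1)], collapse = ""))
    }
  }
  if (length(grams) == 0) {
    return(data.frame(name = character(0), count = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- table(grams)
  data.frame(name = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}

res <- count_tuples(c("sd", "sd", "sd", "rd"), size = 3)
print(res)
```

Strings with a high count are candidates for entries in a user dictionary, since they are tokens that often appear side by side.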
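2. Added: get_idf() computes IDF values from the segmented terms of multiple documents. The input is a list of character vectors, each vector representing one document. A custom stop word list can be supplied.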
get_idf(a_big_list,stop="stop word list",path="IDF output directory")
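To make the input shape and the idea behind IDF concrete, here is a minimal base-R sketch. It assumes the common definition log(N / df), where N is the number of documents and df is the number of documents containing the term; the exact formula and smoothing used by get_idf() may differ. `idf_sketch` and the sample data are hypothetical.

```r
# Hypothetical helper (not part of jiebaR): IDF as log(N / document frequency).
idf_sketch <- function(docs, stop_words = character(0)) {
  n_docs <- length(docs)
  # setdiff() drops stop words and duplicates, so each term
  # is counted at most once per document.
  terms <- unlist(lapply(docs, function(d) setdiff(d, stop_words)))
  df <- table(terms)
  idf <- log(n_docs / as.numeric(df))
  names(idf) <- names(df)
  idf
}

# A list of character vectors, one vector per document.
docs <- list(c("a", "b", "a"), c("b", "c"), c("b", "的"))
idf <- idf_sketch(docs, stop_words = "的")
print(idf)
```

A term that occurs in every document ("b" here) gets an IDF of 0, while terms confined to a single document get the largest value.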
3. Added: vector_simhash() and vector_distance() compute the simhash and the Hamming distance directly on text vectors.
sim = worker("simhash")
cutter = worker()
vector_simhash(cutter["这是一个比较长的测试文本。"],sim)
$simhash
[1] "9679845206667243434"
$keyword
8.94485 7.14724 4.77176 4.29163 2.81755
"文本" "测试" "比较" "这是" "一个"
vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim)
$simhash
[1] "13133893567857586837"
$keyword
6.45994 6.18823 5.64148 5.63374 4.99212
"天气" "不错" "感觉" "真的" "今天"
vector_distance(c("今天","天气","真的","十分","不错","的","感觉"),c("今天","天气","真的","十分","不错","的","感觉"),sim)
$distance
[1] "0"
$lhs
6.45994 6.18823 5.64148 5.63374 4.99212
"天气" "不错" "感觉" "真的" "今天"
$rhs
6.45994 6.18823 5.64148 5.63374 4.99212
"天气" "不错" "感觉" "真的" "今天"
4. Added: tobin() converts a simhash value to its binary representation.
res = vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim)
tobin(res$simhash)
[1] "0000000000000000000000000000000000010101111100000111001010010101"
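The Hamming distance mentioned in item 3 is the number of bit positions at which two values differ. As a rough illustration in base R, assuming two equal-length binary strings of the kind tobin() returns, it can be counted as follows. `hamming_bits` is a hypothetical helper, not a jiebaR function.

```r
# Hypothetical helper (not part of jiebaR): count differing bit positions.
hamming_bits <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  bits_a <- strsplit(a, "")[[1]]
  bits_b <- strsplit(b, "")[[1]]
  sum(bits_a != bits_b)
}

d_same <- hamming_bits("0101", "0101")  # identical strings
d_diff <- hamming_bits("0101", "0111")  # differ in the third position only
print(c(d_same, d_diff))
```

Identical inputs give a distance of 0, which matches the vector_distance() example above where both text vectors are the same.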