jiebaR 中文分词

GitHub v0.7

一、增加:get_tuple() 返回分词结果中 n 个连续的字符串组合的频率情况,可以作为自定义词典的参考。

get_tuple(c("sd","sd","sd","rd"),size=3)
#     name count
# 4   sdsd     2
# 1   sdrd     1
# 2 sdsdrd     1
# 3 sdsdsd     1
get_tuple(list(
        c("sd","sd","sd","rd"),
        c("新浪","微博","sd","rd"),
    ), size = 2)
#       name count
# 2     sdrd     2
# 3     sdsd     2
# 1   微博sd     1
# 4 新浪微博     1

二、增加:get_idf() 根据多文档词条结果计算 IDF 值。输入一个包含多个文本向量的 list,每一个文本向量代表一个文档,可自定义停止词列表。

get_idf(a_big_list,stop="停止词列表",path="输出IDF目录")

三、增加:可以使用 vector_simhash vector_distance 直接对文本向量计算 simhash 和 海明距离。

sim = worker("simhash")
cutter = worker()
vector_simhash(cutter["这是一个比较长的测试文本。"],sim)
$simhash
[1] "9679845206667243434"

$keyword
8.94485 7.14724 4.77176 4.29163 2.81755 
 "文本"  "测试"  "比较"  "这是"  "一个"
vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim)
$simhash
[1] "13133893567857586837"

$keyword
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天"
vector_distance(c("今天","天气","真的","十分","不错","的","感觉"),c("今天","天气","真的","十分","不错","的","感觉"),sim)
$distance
[1] "0"

$lhs
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天" 

$rhs
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天"

四、增加:可以使用 tobin 进行 simhash 数值的二进制转换。

res = vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim) 
tobin(res$simhash)
[1] "0000000000000000000000000000000000010101111100000111001010010101"