idf

Tips for computing IDF

Suppose there are N distinct words, M documents, and the longest single document has L words.

Consider the following two ways of counting document frequencies.

The first builds a dictionary of all words, then loops over every word and checks how many documents it appears in. The complexity is N×M:

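from collections import Counter
# train_que is assumed to be a list of whitespace-tokenized strings
# Word frequency: total occurrences of each word in the corpus
w_freq = Counter(' '.join(train_que).split())
# Tokenize once so membership tests match whole words, not substrings
doc_words = [set(q.split()) for q in train_que]
# Document frequency: number of documents that contain each word
d_freq = {}
for word in w_freq:
    for words in doc_words:
        if word in words:
            # Start at 0 and add 1, so the first hit counts as 1
            d_freq[word] = d_freq.get(word, 0) + 1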

The second loops over all documents and records the words that appear in each one. The complexity is M×L:

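from collections import Counter
# Word frequency: total occurrences of each word in the corpus
w_freq = Counter(' '.join(train_que).split())
# Document frequency: number of documents that contain each word
d_freq = {}
for text in train_que:
    # set() makes each word count at most once per document
    for word in set(text.split()):
        try:
            d_freq[word] += 1
        except KeyError:
            # First document containing this word
            d_freq[word] = 1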
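
The same single pass can also be written with `Counter.update`, which removes the explicit try/except. This is only a sketch: the function name `document_frequency` and the toy corpus are made up for illustration, and whitespace tokenization is assumed, as in the code above.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    d_freq = Counter()
    for text in docs:
        # set() collapses repeats so a word counts once per document
        d_freq.update(set(text.split()))
    return d_freq

# Hypothetical toy corpus standing in for train_que
toy_docs = ["a b a", "b c", "a c c"]
toy_d_freq = document_frequency(toy_docs)
```

Each of the three words appears in exactly two of the toy documents, even though `a` and `c` are repeated inside a single document.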

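Clearly, the two differ enormously in efficiency. L is usually only around 10, while N can easily exceed ten thousand, so the second approach does roughly N/L, about a thousand times, less work.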
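
Once the document frequencies are available, the IDF values follow from one more pass over the vocabulary. The sketch below assumes the plain log(M / df) form of IDF; smoothed variants are also common, and the text above does not say which one is intended. The helper name `compute_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def compute_idf(docs):
    """Return {word: log(M / df)}, where df is the word's document frequency."""
    m = len(docs)
    d_freq = Counter()
    for text in docs:
        # Single pass over the documents, as in the second approach
        d_freq.update(set(text.split()))
    return {word: math.log(m / df) for word, df in d_freq.items()}

# Hypothetical sample corpus
sample_docs = ["how to cook rice", "how to fix a bike", "rice price today"]
idf = compute_idf(sample_docs)
```

Words that occur in fewer documents get a larger weight: here `cook` (one document) scores higher than `how` (two documents).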