Text Preprocessing

Keep only Chinese characters

re.findall(u'[\u4e00-\u9fff ]+', text)
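For example, a minimal sketch (the sample string is made up for illustration):

import re

text = "Hello世界,这是一个测试123"               # made-up sample mixing English, Chinese and digits
re.findall(u'[\u4e00-\u9fff ]+', text)           # ['世界', '这是一个测试']
''.join(re.findall(u'[\u4e00-\u9fff ]+', text))  # '世界这是一个测试'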

Keep only English letters and digits

re.sub('[^A-Za-z0-9]+', ' ', mystring)
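A quick check with a made-up string:

import re

mystring = "Email: test@example.com, 电话123!"   # made-up sample
re.sub('[^A-Za-z0-9]+', ' ', mystring)           # 'Email test example com 123 '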

Remove English punctuation

import re
import string

# l is the input string to clean
re.sub('[{}]+'.format(string.punctuation), '', l)
# Alternatively, print string.punctuation to see the full punctuation list and write it into the pattern directly
print(string.punctuation)
re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+',' ',l)

Remove Chinese punctuation

re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*():;《)《》“”()»〔〕-]+", " ",line)

English tokenization and stop-word removal

# English stop words
import nltk
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))

line = "I'm walking with a dog. You hasn't a cat!"

[x for x in nltk.word_tokenize(line) if x not in stops]
Out[39]: ['I', "'m", 'walking', 'dog', '.', 'You', "n't", 'cat', '!']

[x for x in nltk.word_tokenize(line) if x.lower() not in stops]
Out[40]: ["'m", 'walking', 'dog', '.', "n't", 'cat', '!']

Note that NLTK's built-in stop-word list is all lowercase, so tokens should be lowercased before filtering.

Whether nltk.word_tokenize handles contractions such as I'm the way you want depends on your use case.
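If you would rather keep contractions together as a single token, one option is a regex-based tokenizer; the pattern below is only an illustrative sketch, not the only choice:

from nltk.tokenize import RegexpTokenizer

# Keep word'word contractions intact; punctuation is dropped entirely
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+")
tokenizer.tokenize("I'm walking with a dog.")   # ["I'm", 'walking', 'with', 'a', 'dog']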

If the code raises an error, the stop-word corpus may not have been downloaded yet. Download it as follows:

import nltk
nltk.download("stopwords")

English stemming (reducing words to their stems)

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
text = [stemmer.stem(word) for word in text.split()]
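For example (the words are arbitrary):

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
[stemmer.stem(word) for word in "cats running fairly".split()]   # ['cat', 'run', 'fair']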

Chinese word segmentation and stop-word removal

import jieba

# Load the stop-word list, segment the text with jieba, then drop stop words
stops = [line.strip() for line in open("../../file/stopword.txt", encoding="utf-8")]
segs = jieba.cut(text)
segs = [word for word in list(segs) if word not in stops]
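A minimal end-to-end sketch, using an inline stop-word set instead of the file above (the sample sentence and stop words are made up):

import jieba

stops = {'的', '了', '是'}      # placeholder stop words; normally loaded from stopword.txt
text = "我爱北京的天安门"        # made-up sample
[word for word in jieba.cut(text) if word not in stops]
# typically ['我', '爱', '北京', '天安门']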
