Adding a Tf-idf Function Module and Using NLTK

This week we added a tf-idf function module and started using NLTK to help us lemmatize the text.
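The tf-idf module itself is not reproduced in this post. As a minimal sketch of the idea, assuming the common formulation (term frequency within a document multiplied by log(N / document frequency)), something like the following could score lists of tokens. The function name `tf_idf` and the sample documents are illustrative only and are not taken from our module; the code assumes Python 3.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights.

    docs: a list of documents, each a list of tokens.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        # tf = count / document length, idf = log(N / df)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-lemmatized documents.
docs = [["cat", "sit", "mat"], ["dog", "sit", "log"], ["cat", "chase", "dog"]]
scores = tf_idf(docs)
```

A term that occurs in every document gets a weight of zero under this formulation, which is why rarer terms such as "mat" score higher than shared ones such as "sit".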

The following function lemmatizes the text.

Lemmatization function
import nltk
from nltk.stem import WordNetLemmatizer

def tokenize(text):
    # text is an iterable of strings; join the pieces into one string
    text = ' '.join(text)
    tokens = nltk.tokenize.TreebankWordTokenizer().tokenize(text)
    # lemmatize words: try the noun lemmatization first, then the verb one
    lmtzr = WordNetLemmatizer()
    for i in range(len(tokens)):
        res = lmtzr.lemmatize(tokens[i])
        if res == tokens[i]:
            # the noun lemmatization changed nothing, so try the verb form
            tokens[i] = lmtzr.lemmatize(tokens[i], 'v')
        else:
            tokens[i] = res

    # don't return single characters or purely numeric tokens
    tokens = [t for t in tokens if len(t) > 1 and not t.isdigit()]
    return tokens

We also use NLTK Data to help with the text analysis. Here are the English names (male and female) provided by NLTK Data.
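As a rough sketch of how these name lists might be loaded and used: NLTK exposes them through `nltk.corpus.names` as the files `male.txt` and `female.txt`, once the corpus has been fetched with `nltk.download('names')`. The helper names `load_nltk_names` and `name_gender` are made up for illustration and are not part of our module.

```python
def load_nltk_names():
    """Load the male and female name lists from NLTK Data as two sets.

    Returns two empty sets if NLTK or the 'names' corpus is not installed.
    """
    try:
        from nltk.corpus import names
        male = set(names.words('male.txt'))
        female = set(names.words('female.txt'))
    except (ImportError, LookupError):
        male, female = set(), set()
    return male, female


def name_gender(token, male, female):
    """Classify a token as 'male', 'female', 'both', or None (not a listed name)."""
    in_male, in_female = token in male, token in female
    if in_male and in_female:
        return 'both'
    if in_male:
        return 'male'
    if in_female:
        return 'female'
    return None
```

Calling `male, female = load_nltk_names()` once and passing the two sets to `name_gender` avoids re-reading the corpus for every token. Some names appear in both lists, which is why the helper has a separate 'both' result.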