Adding a tf-idf module and using NLTK

This week we added a tf-idf module and started using NLTK to lemmatize the text.

The following function tokenizes and lemmatizes the text.

Lemmatization function

import nltk
from nltk.stem import WordNetLemmatizer

def tokenize(text):
    # text is expected to be a sequence of strings; join it into one string
    text = ' '.join(text)
    tokens = nltk.tokenize.TreebankWordTokenizer().tokenize(text)
    # lemmatize words: try the noun lemma first, then fall back to the verb lemma
    lmtzr = WordNetLemmatizer()
    for i in range(len(tokens)):
        res = lmtzr.lemmatize(tokens[i])
        if res == tokens[i]:
            # the noun lemma left the word unchanged, so try it as a verb
            tokens[i] = lmtzr.lemmatize(tokens[i], 'v')
        else:
            tokens[i] = res

    # drop single-character tokens and purely numeric tokens
    tokens = [t for t in tokens if len(t) > 1 and not t.isdigit()]
    return tokens
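
The token lists returned by tokenize can then be scored with tf-idf. The module itself is not shown in this post, so the following is only a minimal sketch of one common tf-idf variant (term frequency normalized by document length, multiplied by log(N / document frequency)). The function name tf_idf and the input format (one token list per document) are assumptions made for illustration, not the project's actual code.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score terms with tf-idf (hypothetical helper, not the project's module).

    docs: a list of token lists, one per document.
    Returns a list of {term: score} dicts, one per document.
    """
    n_docs = len(docs)

    # document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = float(len(tokens)) or 1.0  # avoid dividing by zero on empty documents
        scores.append({
            term: (count / total) * math.log(float(n_docs) / df[term])
            for term, count in counts.items()
        })
    return scores


docs = [["list", "outage", "mail"], ["list", "server"]]
scores = tf_idf(docs)
```

With this variant, a term that occurs in every document (here "list") scores zero, while terms unique to one document get the highest weight.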

We also use NLTK Data to help with the text analysis. NLTK Data provides a corpus of English names (male and female).
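
As a rough sketch of how the names corpus can be loaded and used: the post does not say what the names are used for, so filtering personal names out of the token list is only an assumed use case, and the helpers load_name_set and drop_names are hypothetical. Loading requires the corpus to be installed first, for example with nltk.download('names').

```python
def load_name_set():
    """Return the NLTK names corpus as a lowercase set (hypothetical helper).

    Requires the 'names' corpus from NLTK Data to be installed.
    """
    # imported lazily so drop_names below can be used without NLTK installed
    from nltk.corpus import names

    name_set = set()
    for fileid in ('male.txt', 'female.txt'):
        name_set.update(n.lower() for n in names.words(fileid))
    return name_set


def drop_names(tokens, name_set):
    """Remove tokens that appear in name_set, ignoring case (assumed use case)."""
    return [t for t in tokens if t.lower() not in name_set]
```

For example, drop_names(tokens, load_name_set()) would strip first names from a tokenized message. Some names in the corpus are also ordinary English words (such as "Will" or "Mark"), so a filter like this can remove more than intended.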

CSE534 Fundamental of Network Project

Real-time measurement for mailing list outage