This week we added a tf-idf function module and used NLTK to help us lemmatize the text.
The following function lemmatizes the text.
Lemmatization function
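    import nltk
    from nltk.stem import WordNetLemmatizer

    def tokenize(text):
        # text is a list of strings; join them into one string before tokenizing
        text = ' '.join(text)
        tokens = nltk.tokenize.TreebankWordTokenizer().tokenize(text)
        #print tokens

        # lemmatize words. try both noun and verb lemmatizations
        lmtzr = WordNetLemmatizer()
        for i in range(0, len(tokens)):
            #tokens[i] = tokens[i].strip("'")
            res = lmtzr.lemmatize(tokens[i])
            if res == tokens[i]:
                # noun lemmatization changed nothing, so try the verb form
                tokens[i] = lmtzr.lemmatize(tokens[i], 'v')
            else:
                tokens[i] = res

        # don't return any single letters (or tokens made up only of digits)
        tokens = [t for t in tokens if len(t) > 1 and not t.isdigit()]
        return tokens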
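The tf-idf module itself is not shown in this post. As a rough sketch of how lemmatized token lists like the ones returned above could be weighted, the example below assumes the plain tf × log(N / df) formula; the function name `tf_idf` and the sample documents are made up for illustration and may differ from what our module actually does.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (for example, lemmatized tokens).
    Assumed weighting: (term count / document length) * log(N / df),
    where df is the number of documents that contain the term.
    """
    n_docs = len(docs)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-lemmatized documents.
docs = [
    ["cat", "sit", "mat"],
    ["dog", "sit", "log"],
    ["cat", "chase", "dog"],
]
scores = tf_idf(docs)
```

With this weighting, a term that appears in only one document (such as "mat") scores higher than one shared by several documents (such as "sit"), and a term that appears in every document scores zero.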
We also use NLTK Data to help with the text analysis.
Here are the English names (male and female) provided by NLTK Data.
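For reference, these name lists can be read through NLTK's `names` corpus reader. The snippet below is only a sketch: the helper name `load_name_lists` is made up for illustration, and it assumes NLTK is installed and that the "names" corpus can be downloaded.

```python
def load_name_lists():
    """Return (male_names, female_names) from the NLTK names corpus."""
    # Imported inside the function so the download and file reads
    # only happen when the function is actually called.
    import nltk
    from nltk.corpus import names

    # Downloads the "names" corpus on first use (needs network access).
    nltk.download("names", quiet=True)
    return names.words("male.txt"), names.words("female.txt")
```

Calling `male, female = load_name_lists()` gives two lists of strings, one per file in the corpus.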