Add Tf-idf Function Module and Using NLTK
This week we added a tf-idf function module and started using NLTK to help us lemmatize the text.
The following is the function that lemmatizes the text.
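A minimal sketch of what such a lemmatization function can look like, assuming NLTK's `WordNetLemmatizer` and a simple regular-expression tokenizer. The function name `lemmatize_text` and the optional `lemmatizer` parameter are illustrative choices for this sketch, not necessarily our exact code.

```python
import re


def lemmatize_text(text, lemmatizer=None):
    """Lowercase the text, split it into alphabetic tokens, and lemmatize each one."""
    if lemmatizer is None:
        # Imported lazily; this needs the NLTK "wordnet" data,
        # which can be fetched once with nltk.download("wordnet").
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
    # Assumption: tokens are runs of ASCII letters; digits and punctuation are dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatizer.lemmatize(token) for token in tokens]
```

Without a part-of-speech argument, `WordNetLemmatizer.lemmatize` treats every word as a noun, so verbs are mostly left unchanged unless a POS tag is passed as the second argument.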
We also use NLTK Data to help with the text analysis. Here are the English names (male and female) provided by NLTK Data.
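For the tf-idf module, here is a rough sketch of one common variant, assuming each post has already been turned into a list of tokens. It uses relative term frequency and an unsmoothed `log(N / df)` inverse document frequency; the function name `tf_idf` is illustrative, and other weighting variants are equally valid.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf = count / document length, idf = log(N / df)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant, a term that occurs in every document gets a weight of zero, which is the behavior we want for words that carry no information about a particular thread.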
Deal With Problems Last Week and New Tasks
What we have done:
- Combined the posts that belong to the same thread (original posts and reply posts).
- Used Python to remove punctuation and some unrelated words (stopwords). The original stopword list we are using is here.
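As an illustration of the thread-combining step, here is a minimal sketch that groups posts by a normalized subject line. It assumes each message is a dict with hypothetical `subject` and `body` keys and that replies differ from the original post only by leading "Re:" or "Fwd:" markers; real mailing-list data may need a more robust key such as the `In-Reply-To` header.

```python
import re
from collections import defaultdict

# Leading reply/forward markers such as "Re:", "RE:", "Fwd:" (possibly repeated).
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def thread_key(subject):
    """Normalize a subject line so an original post and its replies share a key."""
    return _PREFIX.sub("", subject).strip().lower()


def collate_threads(messages):
    """Group message bodies by normalized subject, keeping the input order."""
    threads = defaultdict(list)
    for message in messages:
        threads[thread_key(message["subject"])].append(message["body"])
    return dict(threads)
```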
Problems and New tasks:
The stopwords alone are not enough. We should also remove people's names, place names, etc., so we grouped the unrelated words into the following 9 categories.
- Spurious data.
- Links. We ignore URLs, website links and email links in the posts.
- Punctuations and Numbers.
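The categories listed above can be sketched as a single cleaning function. This is an assumption-laden illustration, not our full pipeline: the regular expressions are deliberately simple, and `unrelated_words` is a hypothetical parameter that would hold the stopwords plus any names or places to drop.

```python
import re

URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\S+@\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_post(text, unrelated_words=frozenset()):
    """Strip links, punctuation and numbers, then drop unrelated words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # website links
    text = EMAIL_RE.sub(" ", text)       # email addresses
    text = NON_LETTER_RE.sub(" ", text)  # punctuation and digits
    return [word for word in text.split() if word not in unrelated_words]
```

Links are removed before punctuation on purpose; otherwise a URL would be broken into fragments such as "http" and "com" that survive as ordinary words.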
Task for This Week (03/02-03/09)
Today was our team's regular meeting.
In this meeting, we arranged the detailed tasks for this week. This week, we will try to complete the first two steps of Data Preprocessing.
- Collate threads. We contacted the author of our reference paper, a PhD student at our university, who shared his experience with this project and gave us some guidance; we would like to thank him here. We will classify the dataset at the level of threads.
- Remove spurious data and stop-words. In this part, we will first learn some basic tools. What we need to learn: the SMART information retrieval system, the Stanford CoreNLP toolkit, and tf-idf. After that, we will try to remove spurious data and stop-words (e.g., articles, prepositions and pronouns).
Submit Proposal
Today we submitted our proposal, but we still have some questions about this project.
Questions about real-time
- Is the outage mail sent in real time by someone?
- If the outage has already been resolved, what should we do for the diagnosis in that situation?
- What does real-time mean here?
This project is based on the paper Internet Outages, the Eyewitness Accounts: Analysis of the Outages Mailing List.
Discuss Which Project Should We Choose
Today we discussed the projects provided by the professor. We went through all the topics and talked about the details of each project. Based on our interests, we finally picked out topic 5 and topic 8 (refer to Projects). Then we met with our professor and talked with her to learn what these two projects involve. In the end, we all agreed on topic 8 (mailing list outage) as our project topic.