Dealing With Last Week's Problems and New Tasks
What we have done:
- Combined the posts that belong to the same thread (original posts and reply posts).
- We used Python to remove punctuation and some unrelated words (stopwords). The original stopword list we are using is here.
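The two steps above can be sketched roughly as follows. This is only a minimal illustration, not our actual script: the sample posts, the `(thread_id, text)` layout, the tiny `STOPWORDS` set, and the function names `combine_threads` and `clean` are all assumptions made for the example.

```python
import string
from collections import defaultdict

# Hypothetical sample data: (thread_id, text) pairs, in posting order.
posts = [
    (1, "Is anyone seeing packet loss to the west coast?"),
    (1, "Yes, we are seeing it too."),
    (2, "The route leak has been fixed."),
]

# Tiny illustrative stopword set; the real list is much longer.
STOPWORDS = {"is", "the", "to", "we", "are", "it", "too",
             "has", "been", "yes", "anyone"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def combine_threads(posts):
    """Join the original post and its replies into one text per thread."""
    threads = defaultdict(list)
    for thread_id, text in posts:
        threads[thread_id].append(text)
    return {tid: " ".join(texts) for tid, texts in threads.items()}


def clean(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, and drop stopwords."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [t for t in tokens if t not in stopwords]


cleaned = {tid: clean(text) for tid, text in combine_threads(posts).items()}
print(cleaned)
```

Replacing punctuation with spaces, rather than deleting it, keeps words such as "loss,packet" from being glued together.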
Problems and New tasks:
The stopword list alone is not enough. We should also remove people's names, place names, etc. We therefore grouped the unrelated words into the following 9 categories.
- Spurious data.
- Links. We ignore URLs, website links, and email links in the posts.
- Punctuation and numbers.
- Traceroute measurements.
- Stop words. We use a list of stop words obtained from the SMART information retrieval system.
- Organization and personal names.
- Time-related and place-related words, such as day, night, NYC, San Jose, etc.
- Some unrelated abbreviations, such as ICS, ISP, etc.
- Others. This includes some entity words (like issue, information, etc.) or phrases (like in order to).
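A rough sketch of how several of these categories could be filtered in one pass is shown below. It is an illustration under stated assumptions, not our final pipeline: the regular expressions, the small word and phrase lists, the sample sentence, and the name `filter_post` are all made up for the example. Spurious data, traceroute measurements, and organization or personal names are not covered here, since they need their own detection rules or name lists.

```python
import re
import string

# Assumed patterns for links; real posts may need stricter rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

# Multi-word entries must be removed before tokenizing.
PHRASES = ["in order to", "san jose"]
PHRASE_RE = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in PHRASES) + r")\b")

# Small illustrative word lists, one per category.
STOP_WORDS = {"the", "in", "see", "or", "about", "last"}
TIME_PLACE_WORDS = {"day", "night", "nyc"}
ABBREVIATIONS = {"ics", "isp"}
OTHER_WORDS = {"issue", "information"}
REMOVE_WORDS = STOP_WORDS | TIME_PLACE_WORDS | ABBREVIATIONS | OTHER_WORDS

PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def filter_post(text):
    """Remove links, phrases, punctuation, numbers, and listed words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # links
    text = EMAIL_RE.sub(" ", text)    # email addresses
    text = PHRASE_RE.sub(" ", text)   # multi-word phrases and places
    tokens = text.translate(PUNCT_TABLE).split()  # punctuation
    return [t for t in tokens
            if not t.isdigit()        # numbers
            and t not in REMOVE_WORDS]


sample = ("In order to debug the issue in San Jose, see "
          "http://example.com/trace or mail noc@example.com "
          "about 3 ISP outages last night.")
result = filter_post(sample)
print(result)
```

Links are removed before punctuation is stripped, because otherwise a URL would be broken into fragments such as "http" and "com" that are harder to recognize.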