Dealing With Last Week's Problems and New Tasks

What we have done:

  • Combined the posts that belong to the same thread (original posts and reply posts).
  • We used Python to remove punctuation and some unrelated words (stopwords). The original stopword list we are using is here.
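The two steps above can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: the `(thread_id, text)` input format, the function names `merge_threads` and `clean`, and the tiny `STOPWORDS` set are all assumptions made for the example.

```python
import string
from collections import defaultdict

# Hypothetical stand-in for the real stopword list, which is much longer.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in", "we"}


def merge_threads(posts):
    """Join every post that shares a thread id into one document.

    `posts` is assumed to be an iterable of (thread_id, text) pairs,
    with the original post and its replies in reading order.
    """
    threads = defaultdict(list)
    for thread_id, text in posts:
        threads[thread_id].append(text)
    return {tid: " ".join(texts) for tid, texts in threads.items()}


def clean(text, stopwords=STOPWORDS):
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in stopwords]


if __name__ == "__main__":
    posts = [
        ("t1", "Link is down."),
        ("t1", "Re: the link is up again!"),
        ("t2", "BGP leak"),
    ]
    for tid, doc in merge_threads(posts).items():
        print(tid, clean(doc))
```

Replacing punctuation with a space, rather than deleting it, keeps words such as "up/down" from being glued together into one token.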

Problems and New tasks:

The stopword list alone is not enough. We should also remove people's names, place names, etc. We therefore grouped the unrelated words into the following 9 categories.

  1. Spurious data.
  2. Links. We ignore URLs, website links, and email links in the posts.
  3. Punctuation and numbers.
  4. Traceroute measurements.
  5. Stop words. We use a list of stop words obtained from the SMART information retrieval system.
  6. Organization and personal names.
  7. Time-related and place-related words, such as day, night, NYC, San Jose, etc.
  8. Some unrelated abbreviations, such as ICS, ISP, etc.
  9. Others. This includes some entity words (like issue, information, etc.) and phrases (like in order to).
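A few of these categories (links, punctuation and numbers, and the word-list categories 7 to 9) could be filtered along the following lines. This is only a sketch under assumptions: the regular expressions are deliberately simple, `EXTRA_WORDS` and `PHRASES` hold just the examples named above rather than our full lists, and `filter_post` is a hypothetical helper name. Spurious data, traceroute output, and name detection are not covered here.

```python
import re

# Simple patterns; real posts may need stricter or broader ones.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# Anything that is not an ASCII lowercase letter or whitespace
# (this also drops accented letters, which may or may not be wanted).
NON_LETTER_RE = re.compile(r"[^a-z\s]+")

# Hypothetical samples taken from the category examples above.
EXTRA_WORDS = {"day", "night", "nyc", "ics", "isp", "issue", "information"}
PHRASES = ["in order to", "san jose"]


def filter_post(text, extra_words=EXTRA_WORDS, phrases=PHRASES):
    """Strip links, phrases, punctuation, numbers, and listed words."""
    text = URL_RE.sub(" ", text)       # category 2: URLs and website links
    text = EMAIL_RE.sub(" ", text)     # category 2: email links
    text = text.lower()
    for phrase in phrases:             # multi-word entries are removed first
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    text = NON_LETTER_RE.sub(" ", text)  # category 3: punctuation, numbers
    return [t for t in text.split() if t not in extra_words]


if __name__ == "__main__":
    sample = ("See http://example.com or mail noc@example.net: 3 routes "
              "dropped in San Jose last night in order to fix the ISP issue.")
    print(filter_post(sample))
```

Multi-word entries such as "San Jose" are removed before tokenizing, because once the text is split into single words they can no longer be matched as a unit. The general stop word list (category 5) would be applied to the returned tokens afterwards.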