The course project is meant to give students hands-on experience in solving novel text mining problems. The project thus emphasizes either research-oriented problems or "deliverables." It is preferred that the outcome of your project be publishable, e.g., your (unique) solution to some (interesting/important/new) problem, or tangible, e.g., some kind of prototype system that can be demonstrated. Teamwork is required.
Your project will be graded based on the following required components:
An official rubric for the final report and a rubric for the project presentation are provided for your reference.
Note that you are required to use the provided templates for your project proposal and final report. See the Resources page for the template and example file. Please name your submitted document "CompID[-CompID]+-Proposal.PDF" or "CompID[-CompID]+-Report.PDF" accordingly, where "CompID[-CompID]+" refers to the list of your group members' computing IDs. Each team only needs to provide one submission on Collab, and unless specifically requested, the same grade will be applied to all team members.
You can either pick from a list of sample topics provided by the instructor or choose your own topic. You are encouraged to start thinking about the topic for your course project from the first day of class, and to discuss it with your fellow students. This is a good way to identify opportunities for collaboration.
Leveraging existing resources is especially encouraged, as it allows you to minimize the amount of work you have to do and focus on developing your own ideas.
When picking a topic, try to ask yourself the following questions:
Keep in mind that you are required to address the above questions in your project proposal and final report.
You are required to work with other students as a team. Teams may consist of up to four students, and three students per team is recommended. Teamwork not only gives you experience working with others, but also allows you to work on a larger (presumably more important) topic.
Note that it is your responsibility to figure out how to contribute to your group project, so you will need to act proactively and in a timely manner if your group leader has not assigned a task to you. The instructor will assume that all team members contribute actively to the project, and the same grade will be applied to all group members (unless special treatment is requested by the group members).
While choosing a topic, it is very important to be aware of whether the problem you would like to tackle has already been solved. If so, you should figure out where exactly your novelty lies and whether that novelty provides any benefit to others. Your goal is to go beyond, rather than simply duplicate, existing work. To minimize your effort, you are encouraged to leverage existing algorithms, toolkits, and other useful resources as much as possible. The instructor can also help you check related work. Please feel free to discuss your plan with the instructor before finalizing your proposal.
You are required to write a two-page proposal before you actually go in depth on a topic. In the proposal, you should address the following questions and include the names of all team members as authors. The order of the authors' names does not matter.
Intuitively, the proposal should read like the introduction of a regular research paper. Briefly state the background/motivation, what has been done, what is missing, how you plan to solve the problem, how you plan to demonstrate the usefulness of your method, and summarize your contribution(s).
You should leverage existing tools and methods as much as possible. For example, consider using the Lucene toolkit for indexing and searching a large text corpus; the Stanford NLP parser or the OpenNLP toolkit for text analysis; or MALLET or WEKA for classification or clustering. Many other tools are available on the Internet; see the Resources page for some useful pointers. Discuss any problems or issues with your teammates or classmates. If you need special support, please let the instructor know.
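As a rough illustration of how little code such a baseline requires, here is a minimal sketch of a document classification pipeline using scikit-learn, a Python alternative to the Java toolkits (MALLET, WEKA) mentioned above; the documents and labels are toy placeholders, not course data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy documents and labels, purely for illustration
docs = ["great camera, sharp images", "battery died after a week",
        "superb lens and build quality", "screen cracked on arrival"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features + logistic regression: a common baseline comparable to what
# MALLET or WEKA would give you for document classification
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(docs, labels)
print(clf.predict(["the image resolution is amazing"]))
```

With such a baseline in place, most of your effort can go into the novel part of your project rather than the plumbing.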
Consider documenting your work regularly. This way, you will already have a lot of things written down by the end of the semester. In addition, we strongly suggest using version control for your project! Nothing is more frustrating than losing a lot of your hard work, especially if it's close to a deadline.
To help you better manage your time on the course project, every team is required to send an email to the instructor every month briefly reporting their progress. In the email, please summarize your achievements over the past month, the milestones you have reached, and your plan for the next month. Please feel free to discuss with the instructor and TAs any difficulties and challenges you have encountered during the project.
At the end of the semester, each project team is expected to present their project in class. The purpose of this presentation is
In general, your presentation should be structured like a conference presentation. It should touch on all of the following aspects (the text in parentheses states the instructor's expectations):
Think about how you can best present your work so as to make it as easy as possible for your audience to understand your main messages. Try to be concise and to the point. Pictures, illustrations, and examples are generally more effective than text for explaining your project. Try to show screenshots and/or plots of your experimental results. Watching some top conference presentations (e.g., KDD, SIGIR, ICML) on VideoLectures will be beneficial.
To be fair to all members of the same group, the instructor will randomly pick team members to answer questions during the presentation.
You should write your report as if you were writing a regular conference paper. You should address the same questions as those in your proposal and presentation, only with more details. Pay special attention to the challenges you have solved and your detailed solutions. The basic sections of the report should be the same as those of a conference paper, e.g., abstract, introduction, related work, method, experiments, and conclusion. If you are developing a demo system or toolkit, your report should follow the format of a demo paper.
You are required to use LaTeX for your project report. See the Resources page for the template and example file. The project report must be at most six pages with the required template (there is no minimum length requirement, as long as you feel it is sufficient to prove the merit of your work, and there is no page limit on the references).
Automatic tutor for English writing: Improving English writing skills is a significant challenge for non-native speakers, and even native speakers can find specific scenarios stressful, e.g., formal scientific writing. This project aims to develop automatic tools based on language models to polish an amateur's English writing. For example, language models trained on Twitter data could make an ordinary user's tweet look as if it were written by an experienced Twitter user, and language models trained on scientific publications could make an amateur's paper read like an expert's work. In Gmail, neural language models are used to help complete our messages, but they can only suggest completions of the current sentence, not revise the sentence based on its context. Can we do better there?
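One possible starting point (a sketch, not the intended solution) is to use a pretrained language model to score candidate rewrites of a sentence and keep the most fluent one; the snippet below assumes the Hugging Face transformers library and the public "gpt2" checkpoint, whereas the project described above would fine-tune on tweets or scientific papers.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_neg_log_likelihood(sentence):
    # lower values mean the language model finds the sentence more natural
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return loss.item()

# toy candidate rewrites; a real system would also need to generate the candidates
candidates = [
    "We done experiments on two dataset.",
    "We conducted experiments on two datasets.",
]
print(min(candidates, key=avg_neg_log_likelihood))
```

The harder and more interesting part of the project is generating good candidate revisions in the first place, rather than merely ranking them.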
Spatial-temporal analysis of opinions: Social opinions provide a gold mine for researchers to understand and explore the public's opinion toward a specific entity, e.g., products and celebrities, or a service, e.g., hotels and restaurants. This project aims to extend an existing aspect-based opinion mining system, ReviewMiner, to support spatial-temporal analysis of opinions. Specifically, we want to visualize the opinions: display the temporal dynamics of opinion across different entities (e.g., from a Twitter stream or reviews), render the opinions on a map, and support user interaction with such spatial-temporal analysis of opinions.
Temporal topic analysis: Documents generated over time, although potentially large in volume, are never independent of each other. Both temporal and textual information strongly manifest the underlying dependency structure of document streams. Effectively modeling and analyzing these unstructured document streams is increasingly important for service providers seeking to improve users' experience and maximize their service utility. This project focuses on developing a systematic solution to perform temporal analysis of topics in document streams and capture the temporal and semantic dependency among the documents.
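A very simple baseline for this kind of analysis (a sketch with toy data, not the systematic solution the project asks for) is to fit one topic model over the whole collection and then track the average topic proportions per time slice:

```python
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy documents with toy timestamps, purely for illustration
docs = ["stock market rally", "election debate tonight", "market volatility rises",
        "new phone release", "candidates debate the economy", "smartphone camera review"]
months = ["2019-01", "2019-01", "2019-02", "2019-02", "2019-03", "2019-03"]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(X)  # per-document topic proportions

# average topic proportions within each time slice
by_month = defaultdict(list)
for month, dist in zip(months, theta):
    by_month[month].append(dist)
for month in sorted(by_month):
    print(month, (sum(by_month[month]) / len(by_month[month])).round(2))
```

Such a per-slice summary ignores the dependency between consecutive documents, which is exactly the gap this project is meant to address.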
Social influence vs. homophily: Users on Yelp write reviews about businesses and make friends who share similar tastes and preferences. However, it is unknown whether users become friends because they visited the same restaurants (i.e., homophily), or whether they visited the same restaurants because they were already friends (i.e., influence). Distinguishing these two factors is very important for social-network-based recommendation.
Query intent classification: Current product search systems can only support simple keyword search, e.g., "canon 5d3". It would be preferable if the system could support some simple semantic search, e.g., "cheap digital camera with high resolution." The system should be able to correctly map the modifiers "cheap" and "high" to the corresponding aspects of the product, e.g., price and image quality, and return all results matching such criteria. One can view this as a translation process, and opinionated review documents provide a nice resource for estimating such a translation model.
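To make the "translation" idea concrete, here is a minimal sketch (with hypothetical, hand-made data) that estimates how strongly an opinion modifier is associated with each product aspect by counting co-occurrences in aspect-tagged review sentences:

```python
from collections import Counter, defaultdict

# toy review sentences, each tagged with the aspect it discusses
reviews = [
    ("price", "very cheap and worth the money"),
    ("price", "a bit expensive but still cheap for what you get"),
    ("image quality", "high resolution and sharp pictures"),
    ("image quality", "photos look great even in low light"),
]

# count how often each word co-occurs with each aspect
cooccur = defaultdict(Counter)
for aspect, sentence in reviews:
    for word in sentence.split():
        cooccur[word][aspect] += 1

def map_modifier(word):
    # pick the aspect the modifier co-occurs with most often,
    # i.e., an empirical argmax of P(aspect | word)
    counts = cooccur.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(map_modifier("cheap"))  # expected: "price"
print(map_modifier("high"))   # expected: "image quality"
```

A real system would need much more data and a proper translation or embedding model, but the underlying estimation problem is the same.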
Active learning for sequential text labeling: Manually annotating text documents for supervised machine learning is generally time consuming and expensive. The situation becomes even worse for sequential text labeling, e.g., part-of-speech tagging and named entity recognition. However, the availability and quality of manual labels directly limit the effectiveness of the learned models. Active learning is a natural remedy for this challenge. Traditional work on active learning mostly focuses on simple learning tasks, e.g., multi-class classification or regression, while little attention has been paid to structured prediction problems, e.g., sequential text labeling. Instead of selecting a whole sequence for labeling, can we actively label only a subsequence of the input to improve model training? How should a structured prediction model be updated when only partial labels are available?
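For reference, here is the basic pool-based active-learning loop in its simplest, non-sequential form (a sketch with toy data using least-confidence uncertainty sampling); the project itself would replace the plain classifier with a sequence model such as a CRF and score token-level marginals instead of whole examples:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# a tiny labeled seed set and a pool of unlabeled examples (toy data)
labeled = ["book a flight", "play some music"]
labels = ["travel", "media"]
pool = ["reserve a plane ticket", "turn up the volume", "cheapest flights to Rome"]

vec = CountVectorizer().fit(labeled + pool)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(labeled), labels)

# least-confidence strategy: query the pool example the model is least sure about
probs = clf.predict_proba(vec.transform(pool))
query_idx = int(np.argmin(probs.max(axis=1)))
print("ask an annotator to label:", pool[query_idx])
```

The research questions above concern what changes when the queried unit is a subsequence of tokens rather than a whole example.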
Learning a text classifier with unreliable annotations only: Oftentimes, complete and fully trustworthy manual annotations are hard to obtain, but partial and noisy annotations, often referred to as weak or distant supervision, can be acquired easily and at scale. How to model and take advantage of such weak supervision has become an important and emerging research topic. In text mining, and especially when handling social media data, being able to handle weak supervision is extremely important. How can we estimate the reliability of the weak annotations and model the dependency between the weak and true labels? If we can acquire the true labels on the fly, how should we design the query strategy to best improve the classifier over time?
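The crudest baseline for this setting (a sketch with made-up data, not a proposed solution) simply aggregates several unreliable annotation sources by majority vote and trains on the resulting weak labels; the project would go further by modeling each source's reliability and its dependency on the true label:

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["free prize click now", "meeting moved to 3pm", "win money fast"]
# each row holds votes from three unreliable annotators/heuristics for one document
votes = [["spam", "spam", "ham"],
         ["ham", "ham", "ham"],
         ["spam", "ham", "spam"]]

# majority vote over the noisy sources serves as the weak training label
weak_labels = [Counter(v).most_common(1)[0][0] for v in votes]

X = TfidfVectorizer().fit_transform(docs)
clf = LogisticRegression(max_iter=1000).fit(X, weak_labels)
print(weak_labels)
```

Majority vote treats every source as equally reliable, which is exactly the assumption this project is meant to relax.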
We will use the same evaluation system page for peer evaluation of the project presentations. Please note that you will not evaluate your own presentation, so do not be surprised if you cannot find your name in the evaluation system.
We will follow the schedule and presentation order below for our project presentations on April 30th and May 2nd.
Name | Date | Project Title |
Xinzuo Wang, Jiayang Liu and Hao Gu | April 30 | Hotel Recommendation Based on Opinion Analysis |
Wanyu Du and Xinyu Yang | April 30 | Alter the style of the text: A neural style transfer model |
Austin Chen, Quinn Dawkins, Danial Hussain and Jihyeong Lee | May 2 | Predicting Lines In Movie Scripts |
Zheng Chen, Yumeng Jiang, Runze Yan and Yingying Chen | May 2 | Yelp Recommendation System Based On Sequence Tagging |
Jinyu Chen, Runnan Yang, Xiaoxi Lin and Jie Yang | May 2 | Personalized Recommendation System for Restaurant |
Yu Du and Haochuan Zhang | May 2 | Restaurant Recommendation System Based on Yelp |
Chuanhao Li, Ruizhong Miao, Rongrong Liu and Mengyu Gong | May 2 | Recommendation System Using Knowledge Graph |
Wen Ying and Teng Li | May 2 | Query Intent Classification In Online-shopping |
Yinqiao Xiong, Aobo Yang, Zixi Qi and Kechen Liu | May 2 | Text Summarization System for Articles |
Anna Baglione and Abraham Gebru Tesfay | May 2 | Mining Text From Twitter Users To Identify Political Affiliation |
Akanksha Nichrelay and Arjun Malhotra | May 2 | Detecting questions with same intent in Question-Answering Platforms |
Guangxu Xun, Mengdi Huai, Jianhui Sun and Kishlay Jha | May 2 | Knowledge-Base Enriched Word Embeddings for Biomedical Domain |
Yichen Jiang, Wen Ding, Shenghao Ye and Xiang Guo | May 2 | Emoji Usage Prediction and Its Application in Sentimental Analysis |
Andrea Zhang, Eamon Collins, Mike Song and Monique Mezher | May 2 | Cross-lingual Sentiment Comparison in Wikipedia |
Tyler Handley, Hunter Murphy, Sile Shu and Rachel Wicks | May 2 | Word Sense Disambiguation for Double Meanings |