We will highlight the basic structure and major topics of this course, and go over some logistical issues and course requirements.
Lecture II: Document Representation
We will discuss how to represent unstructured text documents in a format and structure suited to the automated text mining algorithms covered later.
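As a rough illustration of what such a representation can look like, the sketch below builds a bag-of-words term-frequency vector for each document. The choice of bag-of-words, the regex tokenizer, the toy corpus, and the helper names (`tokenize`, `build_vocabulary`, `to_vector`) are all assumptions made for this example, not material taken from the lecture.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Map each distinct token in the corpus to a column index, sorted for determinism."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def to_vector(doc, vocab):
    """Return the term-frequency vector of one document over the vocabulary."""
    counts = Counter(tokenize(doc))
    # Iterating a dict yields keys in insertion order, which matches the column indices.
    return [counts.get(tok, 0) for tok in vocab]

# Hypothetical two-document corpus.
docs = ["Text mining finds patterns in text.", "Mining data requires structure."]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]

for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each document becomes a fixed-length numeric vector, which is the kind of input that categorization and clustering algorithms typically expect.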
We will briefly introduce computational linguistics, from morphology (word formation) and syntax (sentence structure) to semantics (meaning), as the first step in processing and analyzing text data. Public natural language processing (NLP) toolkits will be introduced so that you can understand and practice these techniques.
Day 1: Get familiar with NLP pipelines (slides, PDF)
Document categorization refers to the task of assigning a text document to one or more classes or categories. We will discuss several basic supervised text categorization algorithms.
Text clustering refers to the task of identifying the clustering structure of a corpus of text documents and assigning documents to the identified cluster(s). We will discuss two typical types of clustering algorithms: centroid-based clustering (e.g., k-means clustering) and connectivity-based clustering (a.k.a. hierarchical clustering).
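To make the centroid-based idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It assumes documents have already been converted to numeric vectors, uses squared Euclidean distance, and takes the initial centroids from the caller; the toy data and the function names are hypothetical, chosen only for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the assignments stop changing."""
    centroids = [list(c) for c in centroids]  # copy so the caller's data is untouched
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = mean(members)
    return labels, centroids

# Hypothetical 2-dimensional document vectors (e.g., counts of two terms).
points = [[2, 0], [3, 1], [0, 2], [1, 3]]
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
print(labels, centroids)
```

Results depend on the initial centroids, which is why practical implementations usually run several random restarts; this sketch fixes the initialization to keep the output reproducible.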
Day 1: Basic Concepts & Evaluations (slides, PDF)
Day 4: Word clustering
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. We will introduce the general idea of topic modeling; two basic topic models, Probabilistic Latent Semantic Indexing (pLSI) and Latent Dirichlet Allocation (LDA); and their variants for different application scenarios, including classification, image annotation, collaborative filtering, and hierarchical topical structure modeling.
Day 1: Topic models (slides, PDF)
Lecture VIII: Text Mining Applications
We will introduce some modern text mining applications, including sentiment analysis, document summarization, recommendation, and document visualization.
Day 4: Document visualization
