Welcome to CS 498 Data Mining  
Data Mining
This course provides an introduction to the computational processes of discovering patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, information retrieval, and scalable database systems.

The overall goal of data mining is to extract useful information from a data set and transform it into an understandable structure for further use. To achieve this goal, data mining involves a knowledge discovery (KDD) process that includes database and data management aspects, data preprocessing and feature identification, machine learning model and inference considerations, post-processing of discovered structures, and visualization.

Course topics and laboratory exercises will cover each aspect of the KDD process, emphasize methods for predictive analytics applied to large data sets, and include real-world data sets and applications from social media, science, engineering, and business.

Prerequisite:
MA-262 Probability and Statistics, junior standing, and programming maturity in Java, Python, R, or Matlab.
Helpful: CS-386 Database Systems and MA-383 Linear Algebra.

2-2-3 (class hours/week, laboratory hours/week, credits)
Required text:
  • Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann, Jiawei Han, Micheline Kamber, Jian Pei. (DM)

Recommended text:
  • Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (Morgan Kaufmann Series in Data Management Systems), Ian H. Witten.

References and resources:
  • Python for Data Analysis (O'Reilly), William McKinney.

  • Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites (O'Reilly), Matthew A. Russell.

  • Mining of Massive Datasets, Anand Rajaraman and Jeffrey David Ullman. http://infolab.stanford.edu/~ullman/mmds.html

  • Data Mining with R: Learning with Case Studies (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series), Luis Torgo.

  • Programming Collective Intelligence: Building Smart Web 2.0 Applications (O'Reilly), Toby Segaran.

  • Some materials from: CS4881, CS386, CS4230, and CS3851.

Tools:
  • Java/Weka, MySQL, Python and Pandas, R, Matlab/Octave.

Upon successful completion of this course, the student will:
  • Understand the concepts of data mining and the relationships of data mining with database systems, statistics, machine learning, and information retrieval.

  • Be able to iteratively and interactively apply the KDD process.

  • Understand the objectives of data preprocessing and be able to apply basic methods of data cleaning, data integration and transformation, and data reduction.

  • Understand the concept of a data warehouse and its associated dimensional data model.

  • Understand the concepts and application techniques for association, correlation, and frequent pattern analysis.

  • Understand the concepts and application techniques for classification analysis.

  • Understand the concepts and application techniques for cluster and outlier analysis.

  • Understand the concepts and application techniques for text mining and web mining.

  • Understand the concepts and application techniques for sequence analysis.

  • Understand the concepts and application techniques for data visualization.