COMPSCI 397F: Introduction to Data Science

What is Data Science?

Data Science is the study of how we can transform raw observations (data) into meaningful information. The products of Data Science, such as models, reports, graphs and charts convey information in a concise way so that informed decisions can be made and new hypotheses can be formulated at an increasing rate.

Data Science involves three main areas of focus: computer science, statistics, and knowledge of the domain under study.

About this course:

In this course we provide an introduction into the concepts, tools and techniques to perform the following steps in the data science process:

  1. data acquisition, preparation
  2. data exploration
  3. data modeling
  4. data visualization
Course materials:

All software is open source and freely available. Students will install and use the required software on their machines.

We will use the R software environment including R-studio for data analysis and visualization:

The required text: Doing Data Science by Cathy O'Neil and Rachel Schutt:

Course topics:

We will cover basic statistics and progress to clustering, and classification tasks. Machine learning techniques such as K-Means, Partitioning Around Medioids, K-Nearest Neighbor, Naive Bayes, Logistic Regression, Decision Trees, Random Forests, Principle Component Analysis and more will be covered.

We will apply basic machine learning analysis steps: data manipulation, model selection, k-fold cross validation, model evaluation.

The emphasis will be on the suitability and application of these modeling techniques to specific data sets, although the mathematical basis of these algorithms will be explained.

We will study data sets from several domains, including Education, Science, Finance, Health, and Social Media.

Course format:

The course meets twice weekly. Each meeting consists of a presentation of new material followed by hands-on work that applies the concepts presented.

Several projects will be assigned.

A midterm and final exam will be administered.

Who should take this course?

This course is designed for Informatics students following the data science track, as well as students from other disciplines who want to gain experience applying data science concepts to the analysis of real-world data sets. This course is NOT intended for Computer Science majors and does not count as an elective towards the CS major.

Programming experience and knowledge of basic statistics is helpful but not required.

To register for this course, fill out the online override form by clicking on the button in the middle of this page:


Gordon Anderson, PhD
College of Information and Computer Sciences
University of Massachusetts, Amherst