CICS 397A: Predictive Analytics with Python, Fall 2019

https://www.cics.umass.edu/content/fall-19-course-descriptions

 


Course Number: CICS 397A
Instructor: Swarna Reddy
Teaching Assistants: To be announced
Location: To be announced

Time: To be announced (tentatively Tuesdays and Thursdays, 4-5:15pm)
Instructor office hours: To be announced

 

Link to Piazza: contains schedule, assignments, etc.

 

Course Description:

 

Twenty-first-century technological advances are generating ever-greater volumes of data. Examples of these sources include the ubiquitous internet and the notorious smartphone. There is an astounding number of opportunities to use these data for good (and bad) in the applied sciences, business, social media, politics, and cybersecurity, to name a few areas. Gaining insight from these data requires a firm understanding of the mathematics and computational methods on which data analytics is based. That said, the elements of data science are accessible to a fairly broad audience, and so our goal is to provide course participants with an understanding of these elements through application.

 

The specific course objectives are to educate participants in some of the most commonly used data analytic methods, including methods for reducing massively large data sets to informative statistics, data visualization, and cluster analysis. Practical data science demands the ability to program in a scripting language, and therefore students in this course will learn and use the most popular of these languages, Python. The first learning goal is to understand these central data analytic methods, and the second learning goal is to know how to use them with Python. Our approach is close to the metal: you'll create the Python scripts from the ground up and apply them to real and fascinating data sets.

 

The course will use a new approach built around in-class tutorials. The tutorials introduce students who are new to the area to practical data analytics. The topic-by-topic tutorials are written in Python and use actual data sets, including political campaign contributions and the complex CDC BRFSS (Behavioral Risk Factor Surveillance System) data. The BRFSS data were chosen for their complexity, not only to benefit those who are interested in the healthcare industry but also to introduce the experience of information retrieval in data science and big data analytics. The course also teaches how to identify and analyze stylistic features of writing; a well-known special case of this analysis is plagiarism detection.

 

 

Required Background:

 

This course requires a mathematical background in probability and statistics and in calculus; a background in linear algebra is desirable. A general awareness of current big data applications will give better insight into the course. The official prerequisites are either COMPSCI 190 or STAT 240 (or equivalent), and COMPSCI 119.

 

Override questions:

 

If you'd like to take this course but cannot register, please submit an override request through the online system. Above all, please describe your background in statistics/mathematics and/or computer science. Please list any courses you've taken in those areas, as well as any other relevant training or experience you might have.

 

Textbooks:

 

The course readings will primarily be based on the following textbook:

 

Algorithms for Data Science (ISBN-10: 3319457950)

https://www.springer.com/us/book/9783319457956

 

About textbook: This textbook on practical data analytics unites fundamental principles, algorithms, and data. Algorithms are the keystone of data analytics and the focal point of this textbook. Clear and intuitive explanations of the mathematical and statistical foundations make the algorithms transparent. But practical data analytics requires more than just the foundations. Problems and data are enormously variable and only the most elementary of algorithms can be used without modification. Programming fluency and experience with real and challenging data is indispensable and so the reader is immersed in Python and R and real data analysis.

 

Note: Chapter previews are available on the publisher's website.

 

Course Format:

 

Class meetings are divided between lectures and working in small groups on programming and data analytics.

 

Course requirements:

 

50% In-class tutorials and homework

Homework and Tutorials: Homework exercises emphasizing applications of the algorithms will be assigned biweekly. Homework usually includes both written math questions and programming problems to be submitted.

Tutorials are oriented toward gaining proficiency in programming by guiding the student through the creation of a Python script. Students are responsible for completing 4 tutorials per month (due at the beginning of each month except September).

 

20% Midterm

30% Final exam

 

Major topics:

 

1. Data mappings and the concepts of data reduction. Similarity measures and distance metrics.

2. List, set, and dictionary comprehension.

3. Scalable algorithms and associative statistics. Computing univariate and multivariate statistics using big data.

4. Introduction to distributed computing and the Map/Reduce algorithm.

5. Data visualization and ggplot2.

6. Predictive analytics. K-nearest neighbor methods and regression.

7. Cluster analysis. Hierarchical and k-means methods.
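
To give a flavor of topics 2 through 4, here is a minimal Python sketch. It is an illustration only and is not taken from the course tutorials or the textbook. It assumes the data arrive in separate blocks (the `blocks` list and all function names are hypothetical) and shows how associative statistics (counts, sums, and sums of squares) can be computed per block and then combined, which is the idea behind Map/Reduce-style computation.

```python
from functools import reduce


def partial_stats(block):
    """Map step: reduce one block to (count, sum, sum of squares)."""
    return (len(block), sum(block), sum(x * x for x in block))


def combine(a, b):
    """Reduce step: associative statistics combine by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def mean_and_variance(blocks):
    """Compute the mean and sample variance from per-block statistics."""
    n, s, ss = reduce(combine, (partial_stats(b) for b in blocks))
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance


# Hypothetical data split into blocks, as if read from separate files.
blocks = [[2.0, 4.0], [4.0, 4.0, 5.0], [5.0, 7.0, 9.0]]
mean, variance = mean_and_variance(blocks)

# Dictionary comprehension: block index -> block mean.
block_means = {i: sum(b) / len(b) for i, b in enumerate(blocks)}

print(mean, variance, block_means)
```

Because `combine` is associative, the blocks can be processed in any order or on different machines before the final statistics are computed. The sum-of-squares formula is used here for brevity; it can lose precision on data with a large mean and small spread.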

Academic Honesty:

 

We follow the university's Academic Honesty Policy and Procedures.

 

If you have questions about a particular situation, please ask.