Assignment 04: Understanding and Visualizing Data

  1. Summarize the reading from Coderre (Chapters 5–8). Your summary can be brief: about a page or two for this week, 1,000 words maximum. Go over the limit if you need to, but please don’t write me a book! Use a style you are comfortable with (text, outline, etc.).

  2. The following zip contains four CSV files of X and Y data: [assignment-04-02.zip]. For each one, compute:

    • mean of x
    • mean of y
    • variance of x
    • variance of y
    • (optional) correlation between x and y
    • (optional) line of best fit (slope, intercept) via linear regression

    Next, graph each dataset using a simple scatterplot (X vs Y).

    If you do this correctly, you will note something unusual about your results; this observation on the limits of descriptive statistics was popularized by Francis Anscombe in 1973.
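
If you would rather script the calculations than use a spreadsheet, the following is a minimal sketch in plain Python. It makes several assumptions that are not stated in the assignment: each CSV has a header row, the columns are named `x` and `y`, and "variance" means the sample variance (dividing by n - 1). The helper names `load_xy` and `describe` are made up for illustration, and the four toy points at the bottom are not taken from the assignment files.

```python
import csv
import math


def load_xy(path, x_col="x", y_col="y"):
    """Read two numeric columns from a CSV with a header row.

    The column names are assumptions; adjust them to match the real files.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    xs = [float(r[x_col]) for r in rows]
    ys = [float(r[y_col]) for r in rows]
    return xs, ys


def describe(xs, ys):
    """Return means, sample variances, correlation, and least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),  # sample variance (assumption)
        "var_y": syy / (n - 1),
        "correlation": sxy / math.sqrt(sxx * syy),  # Pearson's r
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


# Toy data for illustration only (not from the assignment files).
summary = describe([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Running `describe(*load_xy(path))` on each of the four files and laying the results side by side makes the comparison easier; the scatterplots still have to be drawn separately with the tool of your choice.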

  3. The following zip contains a CSV file of baseball salary data: [assignment-04-03.zip]

    Using the material from the readings and the analysis and visualization tools of your choice, I want you to address each of the following, supporting your answers with tables, visualizations, or brief descriptions of how you obtained each answer (if it was done through statistical analysis):

    a. Are there any anomalies, duplicates, or missing values in the data? If so, did you correct them for your subsequent analyses, and how?

    b. For each year, which team had the highest-paid player? The lowest-paid player? The highest mean salary? The highest median salary?

    c. Which player earned the most salary over the years of data in this file?

    d. For the team with the highest payroll, and on the basis of the data in this file, what do you expect their payroll to be in the next year? Explain your answer (there’s definitely more than one way you might choose to go about this one).

    e. Stratify the salary ranges in a reasonable way, and show the number of players in each stratum over time.

    f. Do the salaries in this file obey Benford’s law? Explain why you think they do or do not.

    g. Perform one other analysis or visualization of the data of your choice; tell me something interesting you learn.
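
For the Benford's law question (3f), one possible approach is to compare the observed share of each leading digit with the share Benford's law predicts, log10(1 + 1/d) for d = 1 through 9. The sketch below assumes salaries are whole, positive dollar amounts; the function names are hypothetical and the sample values are invented for illustration, not drawn from the salary file.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First digit of a number, assuming its magnitude is at least 1."""
    return int(str(int(abs(value)))[0])


def first_digit_shares(values):
    """Observed share of each leading digit; values below 1 are skipped."""
    digits = [leading_digit(v) for v in values if abs(v) >= 1]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Invented sample values for illustration only.
observed = first_digit_shares([100, 150, 2000, 35, 0])
expected = benford_expected()
for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.3f}  expected {expected[d]:.3f}")
```

A side-by-side bar chart of the two columns is usually enough to support an explanation; a formal goodness-of-fit test is optional. Keep in mind that a salary floor or cap can distort the leading digits regardless of any other pattern in the data.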