Assignment 05: Numeric tests

  1. Summarize the reading from Coderre (Chapters 10 and 12). Your summary can be brief (about a page or two for this week / 1,000 words maximum; go over the limit if you need to, but please don’t write me a book!). Use a style you are comfortable with (text, outline, etc.).

  2. Suppose you were going to use the analysis technique described in Chapter 10 of Coderre to profile numerical data and detect outliers that might exist. Assume your data is a single list of numerical values in a simple format (Excel: A single column; Python: either a list or a numpy array; R: either a vector or a data frame with a single column, etc.). Describe the operations (either the formula, or a function/method you’d write, etc.) to perform each of the following analyses:

    • Finding the minimum, maximum, mean, and median values.
    • Finding the n greatest and n smallest values, for a value of n provided elsewhere (either in another cell of your spreadsheet, or as a parameter to your function, etc.).
    • Finding round numbers. In particular, finding values that are multiples of 10^n, for a given value of n.
    • Finding ratios. In particular, finding the max/min ratio and the max1/max2 ratio (that is, the largest value divided by the second-largest). A rough Python sketch of the operations above follows this list.
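
    If Python is your tool of choice, a rough sketch of the kinds of operations in view is below. It is only a starting point, not a required structure: it assumes the data is already in a one-dimensional NumPy array, and the function names are purely illustrative.

        import numpy as np

        def basic_stats(values):
            """Minimum, maximum, mean, and median of a 1-D numeric array."""
            values = np.asarray(values, dtype=float)
            return {
                "min": values.min(),
                "max": values.max(),
                "mean": values.mean(),
                "median": np.median(values),
            }

        def n_extremes(values, n):
            """The n smallest and n largest values, with n passed as a parameter."""
            ordered = np.sort(np.asarray(values, dtype=float))
            return ordered[:n], ordered[-n:]

        def round_numbers(values, n):
            """Values that are exact multiples of 10**n (assumes whole-number data)."""
            values = np.asarray(values)
            return values[values % (10 ** n) == 0]

        def ratios(values):
            """The max/min ratio and the max1/max2 (largest over second-largest) ratio."""
            ordered = np.sort(np.asarray(values, dtype=float))
            return ordered[-1] / ordered[0], ordered[-1] / ordered[-2]
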
  3. Coderre discusses directed sampling and gives a very brief overview of statistical sampling in Chapter 12. We’re going to discuss statistical sampling and tests in more detail next week, but for now, we’ll get our feet wet with this and the next question.

    Suppose you had a dataset of transactions consisting of two sub-groups. In the first sub-group, there are 9,000 small transactions (say, with a mean value of $100), and in the second, there are 1,000 large transactions (with a mean value of $500).

    • Which sub-group contains the larger potential liability in the worst case? Explain your answer.
    • Suppose you believed there to be at most 50 erroneous transactions distributed uniformly at random across all 10,000 transactions, and you were going to audit by choosing transactions uniformly at random. What is the smallest number of samples you need to draw to have at least a 90% probability of selecting at least one such erroneous transaction? (There is no trick here: Apply the formula from Coderre.)
    • Suppose you are more concerned with finding an error, rather than the dollar value of the error. How (if at all) would you modify your sampling technique to most efficiently sample where you believed the errors to be? Depending upon your mathematical skills, you may prefer to do a little math, game this one out on the computer, or some of each, to support your answer; a small simulation sketch follows this list as one possible starting point.
    • Suppose you are more concerned with finding expensive errors (that is, of high dollar value), rather than just finding an error. How (if at all) would you modify your sampling technique to most efficiently sample where you believed the errors to be?
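
    If you want to sanity-check your answer to the second bullet, or to “game out” the third and fourth, a small Monte Carlo simulation is one way to do it. The sketch below is only a starting point: it assumes NumPy, uses the counts from the problem statement (10,000 transactions, at most 50 errors), and estimates the probability that a uniform sample drawn without replacement contains at least one error. The formula you actually apply should still be the one from Coderre.

        import numpy as np

        rng = np.random.default_rng(0)

        N_TOTAL = 10_000   # all transactions in the dataset
        N_ERRORS = 50      # erroneous transactions, placed uniformly at random
        TRIALS = 5_000     # Monte Carlo repetitions per sample size

        def hit_probability(sample_size):
            """Estimate the probability that a uniform random sample (drawn
            without replacement) of the given size contains at least one
            erroneous transaction."""
            hits = 0
            for _ in range(TRIALS):
                sample = rng.choice(N_TOTAL, size=sample_size, replace=False)
                # Treat transactions 0..N_ERRORS-1 as the erroneous ones; since
                # the errors are spread uniformly at random, any fixed labeling
                # is equivalent.
                if (sample < N_ERRORS).any():
                    hits += 1
            return hits / TRIALS

        # Scan candidate sample sizes and watch for the estimate to cross 90%.
        for size in range(100, 1001, 100):
            print(size, round(hit_probability(size), 3))

    The same harness can be adapted to compare uniform sampling against, say, a stratified scheme that samples the small- and large-transaction sub-groups at different rates, which is one way to explore the last two bullets.
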
  4. Read about the Standard Score (aka the z-score) on Wikipedia or in the statistics textbook of your choice. The z-score is a simple way to compute the distance of a sample from the mean of the underlying population; it is the number of standard deviations from the mean. Standard deviations are often written as the Greek letter sigma (σ); if you’ve heard of the “six sigma” methodology, that’s the same sigma.

    Statistics nerds-in-training take note: The use of the z-score is not ideal when you don’t know the underlying population mean (as opposed to the mean of your sample). And while the z-score is well-defined for any distribution of underlying data (regardless of “shape”: bell curve, flat, exponential, etc.), it often makes the most sense for a normal distribution (that is, a bell-curve shape).
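
    If you are working in Python, a minimal sketch along these lines may be a useful starting point. It assumes the salaries are already loaded into a NumPy array; the function and variable names are illustrative only.

        import numpy as np

        def z_scores(values):
            """Number of standard deviations each value lies from the mean:
            z = (x - mean) / sigma. Uses the population standard deviation
            (ddof=0); switch to ddof=1 if you prefer the sample version."""
            values = np.asarray(values, dtype=float)
            return (values - values.mean()) / values.std()

        # Illustrative use: given an array `salaries` and a matching list
        # `players`, flag everyone with an absolute z-score of at least 4.
        #   z = z_scores(salaries)
        #   flagged = [p for p, zi in zip(players, z) if abs(zi) >= 4]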

    Consider the 2005 salary data from last week’s assignment. Name the players whose salaries have an absolute z-score of at least four.