Assignment 06: Outliers and Statistical Tests
-
There is no reading from the textbook this week. Rather than requiring you to read and summarize a series of online articles, I ask that you read the suggested reading to the extent that you have time. Read it to the level that you’re comfortable with: if your background includes no statistics, just stop when it starts to get too mathematical and move on to the next item. If you already have significant background, skip over the introductory material and move to the details. Use your best judgment.
Briefly summarize one item that you found interesting in your reading, either in the assigned reading itself or in your follow-up reading if you chased down one of the references or related articles it mentions.
-
This file [boston.csv] contains Boston-area housing prices from around 1978, along with several other variables. (For a fuller description, including descriptions of the columns, see here: http://lib.stat.cmu.edu/datasets/boston and here: http://lib.stat.cmu.edu/datasets/boston_corrected.txt). There is no fraud (that I know of!) in the data, but it’s large and noisy enough to be useful for practicing your outlier detection and statistical skills.
Using the tools of your choice, address the following tasks in the context of outlier / anomaly detection.
a. At least one column (dimension) is an obvious candidate to be removed, as it imparts no useful intrinsic information about the data. Which one(s) and why?
b. At least one column is perfectly correlated with another column, and thus one of the pair could be removed. Which one(s) and why?
c. Are there any other columns you might remove, either because you (as a human) can identify them as redundant, or because you applied another technique for dimensionality reduction? Explain which, if any.
d. Which column(s) represent the value of the houses?
e. Find the three instances with the highest assigned value; arguably, these are outliers. Using visualizations, sorting and ranking, manual analysis, or other techniques, try to determine why these three instances have such high values on the basis of the other columns (dimensions) in the data. Explain your results. (There is not a single correct answer here; I’m more interested in your thought process than the exact result. Perhaps you will look for outliers in other columns that correlate. Maybe you’ll visualize the data in some way. Maybe you’ll apply some domain knowledge if you know the Boston area. Ideally some combination of the above.)
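If you are unsure how to begin on parts b and e, the following Python sketch shows one possible approach using only the standard library: a Pearson correlation to spot (near-)perfectly correlated columns, and z-scores to see how unusual the top-valued rows are in the other columns. The rows and the column names (value, rooms, crime) are made up for illustration and are not taken from boston.csv; you would substitute the real columns after loading the file with the tool of your choice.

```python
import math
import statistics

# Made-up rows for illustration only. These are NOT from boston.csv,
# and the column names are hypothetical.
rows = [
    {"id": 1, "value": 21.0, "rooms": 6.1, "crime": 0.30},
    {"id": 2, "value": 19.5, "rooms": 5.9, "crime": 0.45},
    {"id": 3, "value": 50.0, "rooms": 8.4, "crime": 0.02},
    {"id": 4, "value": 23.2, "rooms": 6.3, "crime": 0.25},
    {"id": 5, "value": 17.8, "rooms": 5.7, "crime": 0.90},
    {"id": 6, "value": 48.5, "rooms": 8.0, "crime": 0.05},
    {"id": 7, "value": 22.0, "rooms": 6.2, "crime": 0.28},
    {"id": 8, "value": 46.0, "rooms": 7.9, "crime": 0.04},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def z_scores(rows, column):
    """Map each row id to its z-score in one column (sample std dev)."""
    values = [r[column] for r in rows]
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {r["id"]: (r[column] - mean) / sd for r in rows}


# Rank by value and keep the three highest-valued rows.
top_three = sorted(rows, key=lambda r: r["value"], reverse=True)[:3]

# For each of those rows, report how far it sits from the mean
# of every other column, in standard deviations.
for column in ("rooms", "crime"):
    z = z_scores(rows, column)
    for r in top_three:
        print(f"row {r['id']}: {column} z-score = {z[r['id']]:+.2f}")

# A correlation very close to +1 or -1 suggests a redundant column.
r_value_rooms = pearson([r["value"] for r in rows], [r["rooms"] for r in rows])
print(f"correlation(value, rooms) = {r_value_rooms:.3f}")
```

Large positive or negative z-scores for the top-valued rows point to columns worth plotting or examining by hand; they are a prompt for further analysis, not an explanation in themselves.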
-
Let’s look again at the data from the previous question, and ask a question involving statistical hypothesis testing. Choose two towns (each with at least five entries in the table). Does the mean value of housing (as represented by the samples in this data) differ significantly (in a statistical sense) between those two towns? (You can imagine applications of this technique in fraud detection, when comparing, say, sales teams in similar markets, or expense reports across similar groups, and so on.)
What is the null hypothesis and the alternative hypothesis you are considering?
Use a method of your choice to perform the test. Feel free to use an existing canned method; a t-test (which could be done, for example, in Excel, R, or Python with SciPy) might be an appropriate choice, or you might perform a randomization test if you have access to an environment that provides one or have sufficient programming skills to write one yourself.
Document the method you use and its results. What assumptions does it make? (It’s OK if the data don’t exactly meet the assumptions; I just want you to be mindful of them.) What is the result of the test, in terms of your hypotheses?
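For those who want to write a randomization test themselves, here is a minimal Python sketch using only the standard library. It is one possible implementation, not the required one. The two samples (town_a, town_b) are made-up numbers rather than values from boston.csv, and the function names, the resample count, and the seed are arbitrary choices for illustration. The sketch also computes Welch's t statistic (the unequal-variance form of the t-test) for comparison; turning that statistic into a p-value needs the t distribution, which a canned routine such as scipy.stats.ttest_ind(a, b, equal_var=False) handles for you.

```python
import math
import random
import statistics


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided randomization test for a difference in means.

    Returns (observed difference in means, approximate p-value).
    """
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the town labels are exchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Small tolerance guards against floating-point rounding.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Adding one to both counts keeps the estimate away from exactly zero.
    return observed, (extreme + 1) / (n_resamples + 1)


def welch_t(a, b):
    """Welch's t statistic (does not assume equal variances)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Made-up samples for illustration only; NOT taken from boston.csv.
town_a = [20.1, 22.4, 19.8, 21.5, 23.0, 20.7]
town_b = [30.2, 28.9, 31.5, 29.4, 32.1, 30.8]

observed_diff, p_value = permutation_test(town_a, town_b)
print(f"difference in means: {observed_diff:.2f}")
print(f"randomization p-value: {p_value:.4f}")
print(f"Welch t statistic: {welch_t(town_a, town_b):.2f}")
```

The randomization test assumes only that the labels are exchangeable under the null hypothesis, whereas the t-test additionally leans on approximate normality of the sample means; that difference is worth noting when you document your assumptions.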
Finally, do you believe your method had sufficient statistical power, given your choice of data and the effect size you were measuring? Explain your answer.
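One way to reason about power is by simulation: repeatedly draw samples with a known true difference, run the test, and count how often it rejects the null hypothesis. The Python sketch below does this under stated assumptions: both groups are normally distributed with the same standard deviation, the test is a small Monte Carlo randomization test, and the sample size, effect sizes, and simulation counts are arbitrary values chosen for illustration. All function names are hypothetical.

```python
import random


def perm_p_value(a, b, rng, n_resamples=300):
    """Approximate two-sided randomization p-value for a difference in means."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_resamples + 1)


def estimated_power(n_per_group, mean_diff, sd, alpha=0.05, n_sims=200, seed=1):
    """Fraction of simulated experiments in which the test rejects at level alpha.

    Assumes normal data with a common standard deviation `sd` and a true
    difference in means of `mean_diff`.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [rng.gauss(mean_diff, sd) for _ in range(n_per_group)]
        if perm_p_value(a, b, rng) < alpha:
            rejections += 1
    return rejections / n_sims


# Five entries per town, as in the minimum the assignment allows.
power_large_effect = estimated_power(n_per_group=5, mean_diff=3.0, sd=1.0)
power_no_effect = estimated_power(n_per_group=5, mean_diff=0.0, sd=1.0)
print(f"estimated power, difference of 3 sd: {power_large_effect:.2f}")
print(f"rejection rate with no true difference: {power_no_effect:.2f}")
```

With no true difference the rejection rate should sit near alpha; with your own data you would plug in your sample sizes and a plausible effect size and standard deviation to see whether the test had a reasonable chance of detecting the difference.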
If something is unclear or seems needlessly difficult, please ask a question on the forums. Almost certainly you’re not alone, and either a fellow student or I will try to help you out!