Assignment 09
This assignment is due at 1700 on Wednesday, 26 November.
The goal of this assignment is to implement k-means clustering.
I will be updating the assignment with questions (and their answers) as they are asked.
Problem
According to Wikipedia,
[C]lustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
This seems a reasonable definition to work from.
In this assignment, you will write a program that implements the k-means clustering algorithm. It will take as input a value k and a set of instances. Each instance will consist of a label, and one or more values; each value will be a real number in the range [0.0, 1.0]. Your program will then on the basis of these values compute a clustering: an assignment of instances to clusters.
Input data format
Your program should expect input data in the following text-based format.
Each instance will consist of at least two whitespace-separated values. For a given list of instances, each instance will have the same number of values as any other.
The first value in each line will be a label, represented as a string of one or more non-whitespace characters. The second value (and any thereafter) will be a floating-point number in the range [0.0, 1.0]. These numerical values will be parseable by Java’s double.parseDouble()
method.
That is, there will be n (one or more) numerical values per instance; each instance represents a point in an n-dimensional space.
For example:
1 2 3 4 5 |
|
is a valid input representing five points in a 1d space.
The value of k will be provided as an argument to the program (see What to submit, below).
Output data format
Your program should cluster its input data into k clusters, breaking ties in the distance metric in favor of the cluster with the lower number. It should then output a list of integers, one per line, in the range [0, k-1]. Each line corresponds, in order, to a line in the input; each value denotes the cluster number to which that instance was assigned.
For example, for the above input, a valid output might be:
1 2 3 4 5 |
|
Other items of import
The label is for human use only. Do not use it as input to the k-means algorithm, and don’t make any other assumptions about it. For example, don’t assume it maps to the true cluster labels. Of course, you can use it as such in your own testing if it helps you debug.
The clusters your program creates will depend upon the points initially chosen as cluster centers. To ensure the output is deterministic and easily auto-gradeable, your program must use the first k instances in the input as the initial cluster centers.
The clusters your program creates will also depend upon the distance metric you use. Use Euclidean distance to find the closest points.
For at least 1d and 2d data, it should be straightforward to use the graphing program of your choice to visualize your results; it should also be fairly straightforward to generate test labeled test data to check your work. I strongly suggest you do so.
I got 99 problems but k-means ain’t one
Several students have emailed me to ask if they’re overlooking something: This assignment seems too easy.
All I can say is that it is intended to be straightforward.
But, let’s say you’d prefer to spend some time on a more sophisticated clustering algorithm that’s less than 50 years old. Your mission, should you choose to accept it, is to implement a more modern clustering algorithm: DBSCAN.
Wikipedia’s writeup (and its pseudocode), linked to above, is sufficient to implement the algorithm. Treat ε (eps) as its second command-line argument; set MinPts to the number of variables (dimensions) in the graph, plus one.
If you decide to implement DBSCAN, do so after getting k-means working, first. Submit it in addition to your k-means clusterer, in a file named DBSCAN.java
or the equivalent.
What to submit
You should submit two things: a program to cluster data, and a readme.txt
.
- Your clusterer should use its first command line argument as the path to file containing the input data data and its second command line argument as the value of k. If, for example, your classifier’s main method is in a Java class named
KMeans
, we should be able to usejava KMeans /Users/liberato/clusterer.data 2
to direct your program to read the input data in/Users/liberato/clusterer
and cluster it with a k of 2. - Your program should print the cluster assignments to standard output, in exactly the format described above.
Submit the source code of your programs, written in the language of your choice. Name the files containing the main()
methods KMeans.java
or your language’s equivalent. If the file(s) you submit depend(s) upon other files, be sure to submit these other files as well.
As in the previous assignments, while you may use library calls for parsing, data structures and the like, you must implement the clusterer yourself. Do not use a library for clustering. We will consider it plagiarism if you do. Check with us if you think there’s any ambiguity.
Your readme.txt
should contain the following items:
- your name
- if the language of your choice is not Java, Python, Ruby, node.js-compatible JavaScript, ANSI C or C++, or Mono-compatible C# (or if you’re concerned it’s not completely obvious to us how to compile and execute it), a description of how to compile and execute the submitted files
- a description of what you got working, what is partially working and what is completely broken
If you’re using language features that require a specific version of your language or runtime, check for that version at program start and fail if it’s not present, emitting an understandable error message indicating this fact. Your program must compile and execute on the Edlab Linux machines.
If your program does not compile or execute, you will receive no credit. Check with us in advance if you’re concerned.
Grading
We will run your program on a variety of test cases. The exact test cases will not be available to you before grading. You are welcome to write and distribute your own test cases.
If your readme.txt
is missing or judged insufficient, your overall score may be penalized by up to ten percent.
We’re not going to feed your program incorrectly formatted input, so you need only concern yourself with handling input in the format described in the assignment.
We expect valid output. Generating output that is not in the format described in the assignment will result in a failed test case. We also require you to initialize cluster starting points deterministically and to use Euclidean distance. If you do not do so, your program will likely fail the tests.
I do not expect anything in a solution to this assignment to be particularly memory or CPU intensive. But as usual, if your program exceeds available heap memory (which we’ll set to 1 GB in Java, using the -Xmx1024M
argument if necessary), or if it does not terminate in twenty seconds, we will consider the test case failed.
Questions and answers
How do you update the cluster center if, after reassigning the instances, there are no instances left in the cluster? Alternatively, should this never happen and I’m just assigning/updating wrong?
You’re probably not wrong, as this event can happen with the assignment as written. You’re more likely to see it happen if your input data is “sorted,” in the sense that you expect several (or all) of the first k entries to end up in the same cluster at algorithm termination.
For this assignment, ignore it. That is, don’t move or reset the cluster center just because it becomes empty.
This event means the algorithm has found a perhaps-undesirable local minima; in practice, k-means implementations will usually choose a new point at random as new center for the currently empty cluster. In the interest of autograder sanity, do not do this.
What should we do if a point is equidistant from two cluster centers?
Assign it to the lower-numbered cluster.
Are we supposed to change the cluster center and run the clusterer until convergence, or do we simply run through it once based on the initial k clusters and print off that?
What I implemented:
- I made an algorithm that takes the first k clusters, and stores them with center and label.
- Then I have a list of values which is the inputs which are from [0,1]. I iterate through the list of values, and calculate the closest Cluster center, and then print that Cluster center.
I ran my algorithm on the test input that you had in the assignment, and my output was correct. Perhaps this because this was a very simple test case and so it did not need to iterate
Run until convergence, that is, until the cluster assignments remain constant across updates of the cluster centers. In principle, it’s possible that the assignments will oscillate back and forth, but I will not provide an input that results in oscillation. In practice you can watch for it, or just stop after a fixed number of iterations.
As you observed, the sample input does not require iteration; other inputs may.
I had a question about the labels in the inputs, you say to initialize the clusters to the first k instances and I was wondering if the the first k instances will always have different labels as input or if some of them will be the same.
Assume nothing about the labels, other than their existence.
I have a quick clarification question on Assignment 09. Are we supposed to change the center clusters or should the center clusters just be the initial points and never change?
The k cluster centers will be initially be the first k examples. Each iteration, first assign each example to the cluster center that has the shortest Euclidean distance from it. Next, update each cluster center to be the average of all examples associated with it. Repeating this iteration until no examples change cluster assignments.