01: Welcome and Java Review

Welcome

Hello and welcome!

I’m Marc Liberatore liberato@cs.umass.edu and I’m your instructor for this course, COMPSCI 190D.

The most important thing to know today: the course web site is at https://people.cs.umass.edu/~liberato/courses/2017-spring-compsci190d/. It is the syllabus for this class and you are expected to read it in its entirety.

What is this course?

What is this course? It’s a course about the who, what, when, and why of commonly-used data structures – we’ll learn their names and behavior well enough to know when to use which structures. We’ll only briefly touch on the how – how data structures are implemented – as 187 concerns itself deeply with this topic and is the next course in the COMPSCI sequence.

Speaking of 187, let’s confront the elephant in the room right now. 190D is a new, optional course between 121 and 187. Why does it exist, and why does it exist now? (And by implication, why are you here?) Two big reasons:

  • The “casualty rate” recently between 121 and 187 has been unacceptably high. Students who get less than an A- in 121 are not likely to succeed in 187 – by which I mean, they are likely to get less than a C on their first attempt at 187. The only option they have is to re-take 187, which nobody (us or you) likes.
  • Much of the content (the “how”) in 187 is, perhaps, overkill for students intending to pursue the Informatics minor.

190D is an attempt to kill these two birds with one stone.

First, we conjecture that many students who do OK in 121 could pass 187 – if only they had a little more practice programming and exposure to various parts of the computer science ecosystem (especially the practical bits of Java). 190D is intended to guide you on a path toward programming mastery that’s more gentle than the current trajectory of 187. 187 has a bit of what I call the “eat your vegetables first” problem. It also has a bit of the “pie eating contest, where first prize is more pie” problem. Together this is a recipe for a lot of spinach pies, which maybe isn’t great if you’ve not been training in competitive spinach pie eating since you were a kid. 190D is an attempt to provide more reasonable portion sizes and to balance the diet.

Moving on from mixed metaphors, 187 is a prerequisite for all of the 200-level COMPSCI classes. Some upper-level COMPSCI classes are reasonable fits for Informatics majors (e.g., 326: Web Programming), but have prerequisites of either 187 or 220/230. The reason for these prerequisites is, in some cases, programming maturity: we think you need more than just 121 to be ready for them. 190D will be, we hope, an alternative to 187 in giving you the experience you need to be ready for these courses. (This is not set in stone yet, but it is the tentative plan.)

We’ll start off with a review of material from 121, and work our way up to more complicated programs. We’ll learn about and use many of the data structures available in the Java API to build these programs. Along the way, we’ll learn about the tooling you can expect to use as a working Java programmer (in 187, or in Informatics courses, or in internships). This is the part where I’m also supposed to say, “and we’ll have fun doing it,” but let’s not over-egg the pudding.

121 review: variables and values

Bits and bytes

We’ll spend most of the next two weeks on a review of some 121 material. Let’s get started.

The fundamental unit of information is the bit – a single, binary digit, of value 0 or 1.

Bits are organized into bytes: 8 bits in a byte. Bits (the smaller unit) are abbreviated b; bytes (the bigger unit) are abbreviated B.

When we talk about computer memory and RAM, we’re talking about a large number of bytes that we use to store data. Data is what enables nontrivial, “stateful” programs: without (changeable) state, behavior is predetermined. But computers are generally useful to us because we can vary their behavior: we need data, and we need to be able to manipulate it.

How many bytes are we talking about? How much memory does your PC or Mac have? 4 GB? What’s a giga-?

Kilo-, mega-, and giga- are metric prefixes: multipliers of 10^3 (1,000), 10^6, and 10^9, respectively. Confusingly, in the land of computer science, they’re multipliers of 2^10 (1,024), 2^20, and 2^30 – each slightly larger than the corresponding metric prefix. Even more confusingly, that only applies to memory; network bandwidth and disk sizes usually use the metric meaning. (This is one of the reasons why, when you buy a 500 GB drive, it usually shows up as significantly smaller: 500 x 10^9 bytes ~= 465 x 2^30 bytes.)
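You can check that last bit of arithmetic yourself; here’s a quick sketch in Java (the class name is invented for this example):

```java
public class PrefixDemo {
    public static void main(String[] args) {
        long metricGiga = 1_000_000_000L; // 10^9, the metric "giga-"
        long binaryGiga = 1L << 30;       // 2^30 = 1,073,741,824

        long driveBytes = 500L * metricGiga; // a "500 GB" drive, as sold
        // How many binary gigabytes does the operating system report?
        System.out.println(driveBytes / (double) binaryGiga); // about 465.66
    }
}
```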

These several billion bytes are like an enormous canvas that you, the programmer, can paint upon. Except the computer isn’t you, and can’t see them all at once: you have to precisely name the place in memory you care about. Modern CPUs number the bytes, starting at zero (‘cuz we start at zero in CS, of course) and working their way up to 4 x 2^30 - 1 (if you have 4GB of “addressable” memory). This number is called the byte’s address.

Modern CPUs usually work on larger units of information, called words. Words on modern CPUs are typically 32 or 64 bits (4 or 8 bytes), and some special instructions can work on larger units still. A computer can have only as much addressable memory as a word can name: the address space is usually 2^(word size) bytes.

Right about now you’re probably like, “did I accidentally sign up for a Computer Systems Engineering course?” And the answer is no. But I do want to make sure you have some intuition for ideas that are going to come up later in this course and in 187. I promise this is about as deep as we’ll go into computer organization.

Variables, data types, and assignment

OK, so another thing you might be thinking is, “Huh, I’ve never written any Java where I’ve worried about addressing memory directly.” To which I respond, “Eff yes! Ain’t it great?”

(Arguably) one of the greatest success stories of computer science is the development of high-level languages and runtimes to free programmers from worrying (too much) about nitty-gritty details like those above. Of course, sometimes you will need to do so, but for problems that don’t push the boundaries of what a computer can do, and that don’t need to scale to millions of machines, and so on, you can effectively ignore many little details and still be an extremely productive programmer.

For example, suppose we’re writing a web app for the PVTA: http://bustracker.pvta.com/infopoint/

We don’t need to worry about painstakingly laying out four bytes to represent a bus number, then eight bytes to represent a distance, and then remember their memory address each time we want to use them. Instead we write:

int busNumber;
double distanceTraveled;

And the computer (the compiler and the runtime) generates code that lays out memory for us, gives each variable a name, and even knows something about the variable – its type. The compiler can do “typechecking” to prevent us from attempting some impossible things, like, say, adding together a boolean and an integer. It can also use type information to make our lives easier: adding a floating-point number and an integer is actually non-trivial, but it’s transparent in Java. Similarly, we can “add” integers to strings to build a new string with the integer inserted.
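As a small illustration of the compiler using type information on our behalf (the class name here is just for the example):

```java
public class TypeDemo {
    public static void main(String[] args) {
        int busNumber = 31;            // a 32-bit integer
        double distanceTraveled = 2.5; // a 64-bit floating-point number

        // The compiler quietly converts the int to a double before adding:
        double total = busNumber + distanceTraveled;
        System.out.println(total); // prints 33.5

        // "Adding" an int to a String builds a new String:
        String label = "Bus " + busNumber;
        System.out.println(label); // prints Bus 31
    }
}
```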

There are roughly two kinds of types in Java: primitive types and objects. (Arrays are kinda in between, but every array is actually an Object.)

What are the primitive types you know about?

  • byte: 8-bit signed
  • short: 16-bit signed
  • int: 32-bit signed
  • long: 64-bit signed
  • float: 32-bit floating point
  • double: 64-bit floating point
  • boolean: true or false
  • char: 16-bit Unicode character

Whenever you declare a variable of one of these types, Java lays out memory of the correct size and remembers the address, using the value stored in that memory address whenever you reference the variable. We blur the difference sometimes when speaking, but keeping the idea of a variable (a particular location in memory) and a value (in this context, the contents of a memory location) separate is very important.

A fundamental thing you can do with variables is assign to them. You can assign a literal value, like i = 3; to write a value directly to a memory location. You can assign from one variable to another, like i = j;. But let’s be clear about what’s happening: the computer isn’t “copying j into i,” even though that’s how you might say it. It’s looking up the value stored at the address of j, then storing that value in the address of i. This is more clear if you think about the result of a computation, like “i = j + k;”.

And obviously then the primitive types support various kinds of computation using “operators”, which are built into the language (things like addition and subtraction).
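To make the variable/value distinction concrete (again, the class name is just for illustration):

```java
public class AssignmentDemo {
    public static void main(String[] args) {
        int i = 3;
        int j = 7;
        int k = 10;

        i = j;  // the value stored at j's address (7) is copied to i's address
        j = 99; // changing j afterward has no effect on i
        System.out.println(i); // prints 7

        i = j + k; // the values of j and k are read, added, and the sum
                   // (109) is stored at i's address
        System.out.println(i); // prints 109
    }
}
```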

Arrays

Arrays are a built-in form of “container” type – the first (non-primitive) data structure you likely learned about. In particular, arrays are a linear sequence of values, all of the same type. In our examples today, they’ll be primitive types, but arrays can hold objects as well, as we’ll see. An array type is denoted with the [] suffix after a type. For example, an array of ints might be declared as int[] busNumbers;

The array type doesn’t tell you how many elements are in the array. That’s determined only once the array is instantiated – the memory allocated for it: busNumbers = new int[10]; What’s happening here? First, new int[10] creates a new space in memory for ten integers, one after another. Then, the address of this memory space is stored in busNumbers. If you want to access a particular piece of the array, you must address it correctly, indexing from zero.

So, for example, busNumbers[2] = busNumbers[6]; means, in English, to look up the value stored in the slot at index 6 of the array (remember, we count slots starting from zero), and store it in the slot at index 2.

(Arrays also have a useful built-in property, .length, which tells you how many slots the array has.)
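Putting the pieces together, a sketch of declaring, instantiating, and indexing an array:

```java
public class ArrayDemo {
    public static void main(String[] args) {
        int[] busNumbers;         // declares the variable; no slots exist yet
        busNumbers = new int[10]; // allocates ten ints, each initialized to 0

        busNumbers[6] = 45;            // indices run from 0 through 9
        busNumbers[2] = busNumbers[6]; // copy the value at index 6 into index 2

        System.out.println(busNumbers[2]);     // prints 45
        System.out.println(busNumbers.length); // prints 10
    }
}
```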

You can have arrays of arrays: for example, int[][] sudoku declares an array of arrays of ints; this kind of two (or more) dimensional array is sometimes a more intuitive way to represent a problem than a one-dimensional array.
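A sketch of a two-dimensional array, using the sudoku example:

```java
public class SudokuDemo {
    public static void main(String[] args) {
        int[][] sudoku = new int[9][9]; // nine rows, each an array of nine ints

        sudoku[0][0] = 5; // row 0, column 0 (the top-left cell)
        sudoku[8][8] = 9; // row 8, column 8 (the bottom-right cell)

        System.out.println(sudoku.length);    // prints 9 (the number of rows)
        System.out.println(sudoku[0].length); // prints 9 (ints in row 0)
    }
}
```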

Methods and scope

Nothing lasts forever. Variables (the names, not the values) only live for as long as they are “in scope.” Suppose we have two methods:

int add(int i, int j) {
  return i + j;
}

void print() {
  System.out.println(i + " " + j);
}

Will this compile? No. Why not? Because i and j are not defined within the print() method – in other words, they are not in scope. Scope means the portion of the program where a variable (again, not necessarily the value) is valid.

Within methods, any parameters are in scope for the entire method. A variable that’s declared inline:

void aMethod() {
  ...some stuff...
  int x = 12;
  ...some more stuff...
}

is only valid from where it’s declared until the end of its current block – which as you may recall from 121 is denoted by the next closing curly brace }.

Variables declared by some constructs (such as for loops) are only valid within the body of those constructs.
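For example, a variable declared in a for loop’s header (or body) can’t be used after the loop’s closing brace; uncommenting either line below would be a compile error:

```java
public class ScopeDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            int squared = i * i; // in scope only within the loop body
            System.out.println(squared); // prints 0, then 1, then 4
        }
        // System.out.println(i);       // won't compile: i is out of scope
        // System.out.println(squared); // won't compile: so is squared
    }
}
```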

The stack and the heap

When the JVM is executing some code, for example:

int compute(int x, int y) {
  return doubled(x) + doubled(y);
}

int doubled(int x) {
  return 2 * x;
}

How does it keep track of which x is which, and how do values move around? I’m going to simplify somewhat here, but essentially there are two regions of memory used to store values (and whose sections are named by variables): the stack and the heap.

As the JVM executes the code, it goes line-by-line, expression-by-expression, evaluating each expression and performing each statement. If someone somewhere called compute(3, 4), the first thing that would happen is the JVM would lay out some memory in the stack for the result of the computation, and for each of the parameters, and for any variables that are in-scope for the whole method (none here, it turns out). Then the parameters would be copied into their spaces. Then the method would start.

Next, doubled(x) would be called. “On top” of the stack, more memory would be laid out, for the return value and for x, which would be copied in. Then doubled would execute, and copy the result into the right spot on the stack. Then control would return to compute, which would copy the value off the top of the stack and into the right spot. And so on. Notice that variables and values are automatically removed/reclaimed on the stack, and that we need only look at the top of the stack to find the current variable and value it holds.
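Tracing compute(3, 4) through the code above, with the stack bookkeeping written out as comments:

```java
public class StackDemo {
    static int compute(int x, int y) {
        return doubled(x) + doubled(y);
    }

    static int doubled(int x) {
        return 2 * x;
    }

    public static void main(String[] args) {
        // compute(3, 4): 3 and 4 are copied into compute's space on the stack.
        // doubled(3) gets its own x (a copy of compute's x); it returns 6,
        // and its stack space is reclaimed. doubled(4) likewise returns 8.
        // compute then adds the two results and returns 14.
        System.out.println(compute(3, 4)); // prints 14
    }
}
```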

“But Marc,” you might be thinking, “can a value that’s not the return value exist after a method ends?” Yes: those values live on the heap, and the JVM is responsible for managing them dynamically via its garbage collection system. We’ll talk about this (again, at a high level) later.

An in-class exercise

Clickers

I brought my clicker to class today.

  1. yup
  2. nope

Stack allocation

double quadratic(double a, double b, double x) {
  double firstTerm = 0.0;
  double result = 0.0;

  firstTerm = a * Math.pow(x, 2); // first computation
  result = firstTerm + b; // second computation
  return result;
}

How many doubles’ worth of space is allocated on the stack?

Note our use of the Math class and its static method pow to square x.

Some administrivia

Let’s pause the course material to discuss some administrative stuff.

First, some words about assignments and grading.

There will be:

  • in-class exercises, like what we just did. These serve to give you a self-check of material we’re covering, as well as give me a sense of how much of the class is following what I’m doing. These are graded pass/fail, and you should bring paper and a writing implement to each class to complete them. You will also do exercises in some discussion sections.
  • labs, which are guided exercises to get you up to speed with particular course-related things, like installing an IDE or submitting assignments.
  • written assignments (“homework”), which are short (<30 minute) worksheets due at the start of each class, handed in electronically.
  • programming assignments, where you’ll be asked to engage in the practice of programming, and will be able to submit your work to an online grader for immediate feedback.
  • about seven quizzes and a final exam. Some discussion section meetings will have quizzes (about one every other week) which will be written to be taken in 25 minutes (but I’ll give you the full amount of time), and there will be a final exam. You cannot pass the course unless you pass the final.

Several things to note:

Discussion and lecture attendance is not optional! Absences will only be excused with written documentation. But I will drop your lowest three in-class exercise grades.

Assignments (labs, written or programming assignments) have a due date, clearly marked on the course web site. Late assignments will not be accepted. Requests for extensions need to be made at least a day in advance. If you want to request an extension after a due date, I will expect a reasonable and well-documented excuse.

Of the above, you can collaborate on everything except programming assignments, quizzes, and exams. Even though programming assignments are take-home, you are expected to work on them alone. 187 routinely starts the academic dishonesty process with something like 15% of its enrolled students, who apparently never believe us when we tell them we check for cheating on the programming assignments. Please help us break this trend!

End-of-class reminders

Read the pages on the course web site titled Overview, Policies, and Schedule; together, these constitute the course syllabus.

Assignments and their due dates will go up on the web site as they become available. This includes the first few labs, the first programming assignment, and the first written assignment, all of which are now available!

Suggested reading and lecture notes will also be posted to the course web site.