Lexical analysis refers to the process of transforming something written as a sequence of characters into a sequence of tokens. You have already done lexical analysis in some of your programs. For example, in your command-driven programs, you converted the command line into a command and its arguments. In the very first program, when you removed comments from a C source file, you did lexical analysis to a limited extent. Your code recognized comments and string constants. A comment is a kind of token, as is a string constant.
Let's talk about C source programs, what the tokens are, and how one could approach writing a program to do the lexical analysis of a C program. What do you think constitutes a token in a C program? Yes, there are many of them. Some are single characters, such as `;', `&', or `.'. Some are pairs of characters, such as `<=', `||', or `/='. Some tokens can be quite long, such as identifiers (variable, function, or field names), numeric constants, or string constants. By performing lexical analysis, a program produces a sequence of tokens, which is a higher-level input stream for subsequent processes, such as syntactic analysis and code generation.
A common approach to writing such a program is to implement a set of transitions through a state space, exactly as was done in the posted solution for the first homework. At any moment the program is in some particular state of the token-recognition process. Recall that one of the states was being inside a comment. When one draws a picture of the possible states, and which input characters cause a transition to which states, one has a transition diagram. Let's look at a transition diagram for the first homework solution.
                          s2
                          |^
                       any||\
                          v| other
                          s1-----+
                          |^<----+        ++          ++
                         "||"             ||/         ||other
             '            v|       /      v|    *     v|
 +-q1------------------>normal----------->c1--------->c2<-+
 | |^                   |^  ^^            |           |   |
\| |+-------------------+|  |+------------+           |*  |other
 | |          '          |  |    other                |   |
 | | other             ' |  |                         v   |
 | +------->q3-----------+  +-------------------------c3--+
 |          ^                           /             ^|
 v          |any                                      ||*
 q2---------+                                         ++
One starts in the state `normal'. Then each time a character is read
from the input stream, one determines the next state to be in, based
on the present state, the character just read, and the possible
transitions from that state. Notice that it is possible to stay in
the same state by taking a transition to the same state.
Let's see how this works on the following input: `This is /* verb */ good.' [Illustrate].
It is worth noting that lexical analysis can be stated in a rather formulaic manner when resorting to transition diagrams. Tools exist, e.g. lex, to automate the process of writing a lexical analyzer. (Its companion yacc generates the parser that consumes the tokens, not the lexical analyzer itself.) However, we will work on a direct implementation of a lexical analyzer. We'll talk in terms of a transition diagram, knowing that it can be implemented quite directly in a set of switch statements, as the solution for the first homework demonstrates.
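To make the "set of switch statements" idea concrete, here is a minimal sketch of how the transition diagram for the first homework might be coded. It is not the posted solution; the names `next_state' and `run_machine' are made up for illustration, and the diagram shows no `other' edge out of q3, so staying in q3 in that case is an assumption.

```c
#include <stddef.h>

/* One enumerator per state in the transition diagram. */
enum state { NORMAL, C1, C2, C3, S1, S2, Q1, Q2, Q3 };

/* Return the state entered from state s when character ch is read. */
enum state next_state(enum state s, int ch)
{
    switch (s) {
    case NORMAL:
        switch (ch) {
        case '/':  return C1;
        case '"':  return S1;
        case '\'': return Q1;
        default:   return NORMAL;
        }
    case C1:                        /* seen '/', may begin a comment */
        switch (ch) {
        case '*':  return C2;
        case '/':  return C1;
        default:   return NORMAL;
        }
    case C2:                        /* inside a comment */
        return ch == '*' ? C3 : C2;
    case C3:                        /* inside a comment, just seen '*' */
        switch (ch) {
        case '/':  return NORMAL;
        case '*':  return C3;
        default:   return C2;
        }
    case S1:                        /* inside a string constant */
        switch (ch) {
        case '"':  return NORMAL;
        case '\\': return S2;
        default:   return S1;
        }
    case S2:                        /* character after a backslash */
        return S1;
    case Q1:                        /* just after an opening quote */
        switch (ch) {
        case '\\': return Q2;
        case '\'': return NORMAL;
        default:   return Q3;
        }
    case Q2:                        /* character after a backslash */
        return Q3;
    case Q3:                        /* waiting for the closing quote */
        /* Assumption: the diagram has no `other' edge here; we stay put. */
        return ch == '\'' ? NORMAL : Q3;
    }
    return NORMAL;                  /* not reached for valid states */
}

/* Feed every character of text to the machine; return the final state. */
enum state run_machine(const char *text)
{
    enum state s = NORMAL;
    size_t i;

    for (i = 0; text[i] != '\0'; i++)
        s = next_state(s, (unsigned char)text[i]);
    return s;
}
```

Calling `run_machine' on `This is /* verb */ good.' passes through c1, c2, and c3 and finishes back in `normal'. A real comment remover would also attach an output action to each transition; this sketch tracks only the state.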
Let's extend this transition diagram so that it can recognize more tokens from the C programming language. Let's add some of the single character tokens. [Do several of the single character tokens.] Now let's do some of the two-character tokens. [Do `||'.] What shall we do for sequences such as `<5' versus `<='? (In the first case there are two tokens, and in the second there is just one.) Yes, we already know how to handle tokens of more than one character, e.g. the `/*' token that begins a comment. So, who has a suggestion? Yes, we can make a state corresponding to `<', and then transition appropriately depending on what the next character may be. Let's do that. [Illustrate - maybe do some others]
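The `<' state can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names `scan_op' and `enum op' are hypothetical, and the sketch deliberately ignores other C operators that begin the same way, such as `<<', `<<=', and `|='.

```c
#include <stddef.h>

/* Hypothetical operator kinds, for illustration only. */
enum op { OP_NONE, OP_LT, OP_LE, OP_OR, OP_OROR };

/*
 * Recognize an operator at the start of s and store its length in *len.
 * After seeing '<' we are in the "seen <" state: the next character
 * decides whether the token is `<=' or just `<'.  In the latter case the
 * next character is not consumed; it belongs to the following token.
 */
enum op scan_op(const char *s, size_t *len)
{
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return OP_LE; }
        *len = 1;
        return OP_LT;
    case '|':
        if (s[1] == '|') { *len = 2; return OP_OROR; }
        *len = 1;
        return OP_OR;
    default:
        *len = 0;
        return OP_NONE;
    }
}
```

On `<5' this reports a one-character token and leaves `5' for the next call, while on `<=' it reports a single two-character token.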
How might one recognize a numeric constant? Well, this is somewhat trickier. Let's ignore a leading `-' sign for the moment. Indeed, let's just consider positive integers. How would you do that? Yes, a digit starts a positive integer, and a nondigit ends it. How about a positive real value? Well, there might be a decimal point. There may or may not be digits before the decimal point. Any ideas? Yes, one could attempt to recognize the portion before the decimal point, then the decimal point itself, then any portion after it. I'll leave this one to you. What about that leading `-'? Well, this is a matter of choice because it may or may not be a unary `-'. The simplest choice for the lexical phase is to treat the `-' as a separate token. After one has gotten a stream of tokens, it will be easier to determine whether it is binary or unary `-', and deal with it then.
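If you want something to compare your own diagram against, here is one possible sketch of the digits, decimal point, digits idea. It is an assumption-laden simplification: `scan_number' and `enum num_kind' are made-up names, and exponents, suffixes, and hexadecimal or octal constants are all ignored. A leading `-' is not accepted, in keeping with treating it as a separate token.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result kinds, for illustration only. */
enum num_kind { NUM_NONE, NUM_INT, NUM_REAL };

/*
 * Recognize an unsigned numeric constant at the start of s and store its
 * length in *len.  Digits may appear before the decimal point, after it,
 * or both, but a lone '.' is not a number.
 */
enum num_kind scan_number(const char *s, size_t *len)
{
    size_t i = 0;        /* characters accepted so far */
    size_t digits = 0;   /* digits seen on either side of the point */
    int saw_point = 0;

    while (isdigit((unsigned char)s[i])) {       /* portion before the point */
        i++;
        digits++;
    }
    if (s[i] == '.') {
        size_t j = i + 1;
        size_t frac = 0;

        while (isdigit((unsigned char)s[j])) {   /* portion after the point */
            j++;
            frac++;
        }
        if (digits + frac > 0) {                 /* accept the '.' only with digits */
            i = j;
            digits += frac;
            saw_point = 1;
        }
    }
    if (digits == 0) {
        *len = 0;
        return NUM_NONE;
    }
    *len = i;
    return saw_point ? NUM_REAL : NUM_INT;
}
```

Given `-7', this returns `NUM_NONE', so the caller would emit `-' as its own token and then scan `7' as an integer.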
The last major class of tokens is the set of identifiers. How would you recognize an identifier? Yes, a letter or an underscore starts an identifier, and any letter, digit, or underscore continues the identifier.
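The identifier rule translates almost word for word into C. The following is a small sketch, with `scan_identifier' being a hypothetical name; it does not distinguish keywords such as `while' from ordinary identifiers, which a full lexical analyzer would probably do with a table lookup afterwards.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return the length of the identifier at the start of s, or 0 if s does
 * not begin with one.  A letter or underscore starts an identifier; any
 * letter, digit, or underscore continues it.
 */
size_t scan_identifier(const char *s)
{
    size_t i;

    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    i = 1;
    while (isalnum((unsigned char)s[i]) || s[i] == '_')
        i++;
    return i;
}
```

Note that a leading digit is rejected here, which is what lets the numeric-constant state claim it instead.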
Your final homework assignment is to write a lexical analyzer for C source code. You can get the details from the homework specification. I recommend working on the transition diagram first. After you are satisfied that it is correct, then proceed with the implementation.