CSE 423 Homework 2: A Lexical Analyzer

Due: Friday February 7, 11:59pm

In this assignment you will write a lexical analyzer in flex(1) for a subset of Kotlin known as the k0 language.

Engineering Requirements

In this and all subsequent assignments in 423, please meet the following engineering requirements. Points will be assigned in grading for them.

Mandatory .zip file unpacks to the current directory.

Your turnin must be a .zip file: a valid compressed archive that can be uncompressed on Linux via the command unzip. It may not be a .tar or a .rar or a .bzip or whatever, whether disguised or renamed or not. The .zip must unpack into the current directory, not a subdirectory. Subdirectories are fine, but there must be a top-level makefile that builds an executable named k0 in the top-level directory where the archive was unzipped. That is what my test script will attempt to run.

"make" just works on login.cs.nmt.edu; it builds an executable named k0.

You have to supply a makefile that contains build rules described below. Also: historically, a good number of students will lose points at some point in the semester for not bothering to test their work on the machine that I will grade on. Testing in your own working directory will not protect you against accidentally omitting required files or other surprises. Test in a separate directory.

Fully Separate Compilation.

The compiler must be invoked separately on each source file. For example, all .c files must be compiled to .o object files and linked together; do not #include .c files, and do not put any code (function bodies) in .h files.

No Warnings.

The gcc compiler must be invoked with -Wall on all compilation lines. If you are using another language, which must be approved by Dr. J, you must also seek to use all its warning options, or get any omissions approved. Points will be lost if you don't fix warnings. Some common lex/flex warnings, such as those about input() not being used, are no big deal, but use

%option noinput
%option nounput

to shut them up. See the instructor if you are unable to fix a warning.

Valgrind validation

You should test your work on login.cs.nmt.edu both with and without valgrind. Valgrind output should be free of memory errors. You will also become more experienced with gdb in this class, but valgrind is your first line of defense. For the purposes of this class, a "memory error" is a message from valgrind indicating a read or write of one or more bytes of illegal, out-of-bounds, or uninitialized memory. Other (non-memory) valgrind messages may be useful to you, or you can ignore them; they will not cost you points.

Feature Requirements

In this document, the term "must" indicates a feature that is required for a passing grade, while the term "should" indicates a feature that is expected for a grade of "A" or "B". If you do not know what something means, or don't know how to do it, you are encouraged to ask and find out rather than turning in a homework that does not meet the specifications.

Your program executable must be named k0. Your program should read in source file(s) named on the command line and write output with one line for each token, described below. Your program must accept source files with the extension .kt. The compiler should automatically add .kt to the end of filenames if no other extension is given. (Eventually in a later homework, the compiler will automatically name the executable the same name as the first argument. For this assignment there is no output executable.)
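The extension rule above can be sketched as a small C helper. This is only an illustration: the function name with_kt_extension is hypothetical, and treating a name that starts with a dot as having no extension is an assumption, not something this assignment specifies.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly malloc'ed copy of name, with ".kt"
 * appended when the final path component has no extension.
 * The caller frees the result. Returns NULL if malloc fails. */
char *with_kt_extension(const char *name)
{
    const char *slash = strrchr(name, '/');
    const char *base = slash ? slash + 1 : name;   /* ignore dots in directory names */
    const char *dot = strrchr(base, '.');
    /* assumption: a leading dot (as in ".hidden") does not count as an extension */
    int has_ext = (dot != NULL && dot != base);
    char *result = malloc(strlen(name) + (has_ext ? 0 : 3) + 1);

    if (result == NULL)
        return NULL;
    strcpy(result, name);
    if (!has_ext)
        strcat(result, ".kt");
    return result;
}
```

With this sketch, a call such as with_kt_extension("hello") yields "hello.kt", while "hello.kt" is returned unchanged. Remember to free the result so that valgrind stays quiet.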

Compilers and related tools are used by programs such as make(1) that read the process exit status to tell whether all is well. Your program's exit status should be 0 if there are no errors, and a nonzero number to indicate errors. For lexical errors, return 1.

Language Details

The k0 language is (not) described (yet) at http://www.cs.nmt.edu/~jeffery/courses/423/k0.html. As this is a new language this semester, these details will be filled in and corrected and amended as needed in response to student questions.

Starting Points

You must write a k0lex.l that matches the tokens of the Kotlin language.
You will need to make up a set of integer codes for yylex() to return. In C this is done as a set of #define's or an enum, in a .h file. These #define symbols must employ names corresponding exactly to those used in the Kotlin "Lexical grammar", section 1.2 of the Kotlin language specification.
Classic yacc used the name y.tab.h for the file containing the #define symbols; it can be named otherwise in your HW.
You will need to read Kotlin references and get precise definitions of the literal constants, including escape characters in strings, etc.
A full Kotlin lexer would have to account for dual-byte Unicode or UTF-8 multi-byte characters. For k0, our subset of Kotlin, we will skip the Unicode lexical rules.
For further notes, see the discussion in the section "yylex() and main()", below.

Kotlin Lex Tokens

These are from the Kotlin lexical grammar, but are not in Flex format. For some of them we will have to ascertain precise definitions. For some of them we might end up omitting them if we can't identify a valid Kotlin token that they denote.

Comments and Whitespace

Recognize and consume/ignore/discard the following sequences when they occur:

LF: an ASCII linefeed
CR: an ASCII carriage return
NL: ASCII linefeed (UNIX), or carriage return (old MacOS), or carriage return followed by line feed (MS-DOS and old Windows)
WS: ASCII space, tab, and formfeed characters
SHEBANG_LINE: #!... followed by all other characters up to a carriage return/newline
DelimitedComment: These look like C comments, except they can be nested
LineComment: //... followed by all other characters up to a carriage return/newline
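Nesting is what makes DelimitedComment harder than a C comment: a single regular expression cannot count matching openers and closers, so some depth counter is needed. The sketch below shows the counting logic in plain C; the function name nested_comment_length is hypothetical, and in a real k0lex.l the same idea would more likely be expressed with a flex start condition and a depth variable.

```c
#include <stddef.h>

/* Hypothetical helper: s points at the opening slash-star of a comment.
 * Returns the number of characters in the whole comment, including any
 * comments nested inside it, or 0 if the comment is never closed. */
size_t nested_comment_length(const char *s)
{
    size_t i = 0;
    int depth = 0;

    while (s[i] != '\0') {
        if (s[i] == '/' && s[i + 1] == '*') {          /* opener: go one level deeper */
            depth++;
            i += 2;
        } else if (s[i] == '*' && s[i + 1] == '/') {   /* closer: come back up a level */
            depth--;
            i += 2;
            if (depth == 0)
                return i;                              /* outermost comment is closed */
        } else {
            i++;
        }
    }
    return 0;                                          /* hit end of input while still inside */
}
```

A return value of 0 corresponds to an unterminated comment, which should be reported as a lexical error. A real lexer would also count the newlines inside the comment so that later line numbers stay correct.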

Keywords and Operators

Your lexer must return an integer code to tell what kind of thing it finds for anything that's not whitespace, i.e. source code lexemes such as operators, reserved words, variable names, literal constants etc. The integer codes will have mnemonic names as specified in the Kotlin specification, and those names will be shown here.

"Fixing" the Literal Constants

We will give various examples of regular expressions for literal constants in lecture, but your mission is to get the literal constants for Kotlin as correct as you can manage.

If some legal Kotlin token is supposed to be in k0, add regexes for it to k0lex.l, or correct the ones that are there.
If some legal Kotlin token is not supposed to be in k0, have your lexical analyzer report a lexical error and stop execution.
Place a special focus on literal constants (strings, integers, reals...)
Catch lexical errors related to literal constants and report them (with filename and line number) and stop, instead of just returning bogus output.
Examples: what does your lexical analyzer do with the following inputs?
```
     "hello
     /* world
     12e
```
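One way to report such errors consistently is to route them all through a single helper. The sketch below is an assumption-laden illustration: the names format_lexical_error and lexical_error and the exact message layout are hypothetical, and only the filename, the line number, and the exit status of 1 come from this assignment.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: writes "filename:lineno: lexical error: message"
 * into buf and returns what snprintf returns. */
int format_lexical_error(char *buf, size_t size, const char *filename,
                         int lineno, const char *message)
{
    return snprintf(buf, size, "%s:%d: lexical error: %s",
                    filename, lineno, message);
}

/* Hypothetical helper: reports the error on stderr and stops execution
 * with exit status 1, the status required for lexical errors. */
void lexical_error(const char *filename, int lineno, const char *message)
{
    char buf[512];

    format_lexical_error(buf, sizeof buf, filename, lineno, message);
    fprintf(stderr, "%s\n", buf);
    exit(1);
}
```

A flex rule that matches an unterminated string could then call something like lexical_error(filename, lineno, "unterminated string literal") instead of returning a token.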

Lexical Attributes

In your yylex(), you should compute attributes for each token, and store them in a global variable named yytoken. Note that this is not part of the lex/yacc public interface, although it is named so as to be a recognizable extension of said interface. You should use the following token type, or a compatible extension of it.

struct token {
   int category;   /* the integer code returned by yylex */
   char *text;     /* the actual string (lexeme) matched */
   int lineno;     /* the line number on which the token occurs */
   char *filename; /* the source file in which the token occurs */
   int ival;       /* for integer constants, store binary value here */
   double dval;    /* for real constants, store binary value here */
   char *sval;     /* for string constants, malloc space, de-escape, store */
                   /*    the string (less quotes and after escapes) here */
   };
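The sval field is the least mechanical of these attributes, so here is a minimal sketch of the de-escaping step. The function name deescape is hypothetical, and the set of escapes handled (\t, \b, \n, \r, \', \", \\ and \$) is an assumption based on Kotlin's simple escapes; Unicode escapes are left out on the assumption that k0 skips them. Check the k0 reference before relying on this list.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: lexeme is the matched text of a string literal,
 * including its surrounding double quotes. Returns a newly malloc'ed
 * string without the quotes and with escapes replaced, or NULL on a
 * malformed lexeme, an unrecognized escape, or malloc failure. */
char *deescape(const char *lexeme)
{
    size_t len = strlen(lexeme);
    size_t i, j = 0;
    char *out;

    if (len < 2 || lexeme[0] != '"' || lexeme[len - 1] != '"')
        return NULL;
    out = malloc(len - 1);            /* at most len - 2 characters plus the NUL */
    if (out == NULL)
        return NULL;
    for (i = 1; i < len - 1; i++) {
        char c = lexeme[i];
        if (c == '\\') {
            i++;
            if (i >= len - 1) {       /* backslash directly before the closing quote */
                free(out);
                return NULL;
            }
            switch (lexeme[i]) {
            case 't':  c = '\t'; break;
            case 'b':  c = '\b'; break;
            case 'n':  c = '\n'; break;
            case 'r':  c = '\r'; break;
            case '\'': c = '\''; break;
            case '"':  c = '"';  break;
            case '\\': c = '\\'; break;
            case '$':  c = '$';  break;
            default:                  /* assumption: any other escape is an error */
                free(out);
                return NULL;
            }
        }
        out[j++] = c;
    }
    out[j] = '\0';
    return out;
}
```

A NULL result for an unrecognized escape gives yylex() a natural place to report a lexical error rather than store a bogus sval.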

In this homework your main() procedure should build a LINKED LIST of all the token structs, each of which is created by yylex(). In the next assignment, we will discard the linked list and instead insert all these tokens into a tree.

Example linked list structure:

   struct tokenlist {
      struct token *t;
      struct tokenlist *next;
      };

Use the malloc() function to allocate chunks of memory for struct token and struct tokenlist.
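As a rough sketch of how main() might build the list, the code below appends each token at the tail so that the list stays in source order. The function names tokenlist_append and tokenlist_free are hypothetical, and the two struct declarations are repeated only so that the example compiles on its own.

```c
#include <stdlib.h>

struct token {
    int category;
    char *text;
    int lineno;
    char *filename;
    int ival;
    double dval;
    char *sval;
};

struct tokenlist {
    struct token *t;
    struct tokenlist *next;
};

/* Hypothetical helper: appends t at the end of the list in constant time
 * by keeping a tail pointer. Returns 0 on success, -1 if malloc fails. */
int tokenlist_append(struct tokenlist **head, struct tokenlist **tail,
                     struct token *t)
{
    struct tokenlist *node = malloc(sizeof *node);

    if (node == NULL)
        return -1;
    node->t = t;
    node->next = NULL;
    if (*tail != NULL)
        (*tail)->next = node;   /* link after the current last node */
    else
        *head = node;           /* first node in an empty list */
    *tail = node;
    return 0;
}

/* Hypothetical helper: frees the list nodes only; the tokens they point
 * to are assumed to be owned and freed elsewhere. */
void tokenlist_free(struct tokenlist *head)
{
    while (head != NULL) {
        struct tokenlist *next = head->next;
        free(head);
        head = next;
    }
}
```

Appending at the tail rather than pushing at the head avoids having to reverse the list before printing it.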

`yylex()` and `main()`

Your yylex() should return a different unique integer > 257 for each reserved word, and for each other token category (identifier, integer literal constant, string literal constant, addition operator, etc). Numbers > 257 are required for the sake of compatibility with the parser generator tool. For each such number, you must #define a symbol, as in

#define IDENTIFIER 260

This is required for the sake of readability. Your yylex() should return -1 when it hits end of file. In this homework, your yylex() should recognize lines beginning with # and treat them as comments, i.e. delete the line contents silently. In later homework, treatment of preprocessor directives will become more interesting.

In this assignment, there should be (at least) two separately-compiled .c files, a .h file and a makefile. The yylex() function must be called by a main() function in a loop. For each token, the main() function should write out a line containing the token category (an integer > 257) and lexical attributes.

Turn in...

An electronic copy via Canvas. The electronic copy should be a compressed archive .zip file, containing makefile, flex k0lex.l file, main.c file, and ytab.h file. If you add any other source files to your program, be sure you add it/them to the makefile rules and .zip containing the set of files that you turn in.

Example

For an example input file named hello.kt that contains:

fun main(args : Array<String>) {
   println("Hello,\tWorld!")
}

your output should look something like the following. Integer categories are for illustration purposes; your integer codes may be different.

Category   Text               Lineno   Filename   Ival/Sval
-------------------------------------------------------------------------
270        fun                1        hello.kt
271        main               1        hello.kt
290        (                  1        hello.kt
271        args               1        hello.kt
294        :                  1        hello.kt
271        Array              1        hello.kt
295        <                  1        hello.kt
271        String             1        hello.kt
296        >                  1        hello.kt
291        )                  1        hello.kt
292        {                  1        hello.kt
271        println            2        hello.kt
290        (                  2        hello.kt
272        "Hello,\tWorld!"   2        hello.kt   Hello, World!
291        )                  2        hello.kt
293        }                  3        hello.kt