Uflex: A Lexical Analyzer Generator for Unicon

Clinton Jeffery, Susie Jeffery, Katrina Ray, and Ray Pereda
Unicon Technical Report #2b
September 2, 2021




Abstract


Uflex is a lexical analyzer generator for Unicon. It is as compatible as possible with the original Lex language, and explicitly supports development of lexical specifications that can be shared with a tool for Java called JFlex, from http://jflex.de.







http://unicon.org/utr/utr2b.html

Department of Computer Science and Engineering
New Mexico Institute of Mining and Technology
Socorro, NM 87801









Table of Contents

1. Introduction
2. Command Line Invocation
3. Uflex Program Structure
3.1 Header Section
3.2 Rules Section
3.3 Procedures Section
4. Examples
4.1 Word Count
4.2 Desktop Calculator
4.3 Lexical Analyzer for Unicon
5. Evaluation
6. Conclusions
Acknowledgements
References
Appendix






1. Introduction

Building a language processor such as a compiler is a complex task. A language processor must extract the grammatical structure of any input written in the language, a task known as parsing. The first step in parsing is scanning to determine the sequence of lexical items (words, tokens, or lexemes) in the input.

Unicon has multiple built-in string processing facilities that can be used for lexical analysis. These include string scanning and SNOBOL-style patterns with regular expressions as their literals. However, for serious large-scale language processing tasks such as writing compilers and interpreters, declarative specifications of lexical and syntax rules still give better maintainability and modifiability than handwritten code.

Uflex is a tool for building scanners that perform lexical analysis. Uflex stands for Unicon's Farcical Lexical Analyzer. It functions like the classic UNIX program lex(1) except that it generates Unicon code as output. Lex dates back to 1975 and is documented in [Lesk75]. Uflex is also influenced by Flex and JFlex, the modern open-source descendants of Lex for C and Java, respectively.

Uflex uses regular expression notation to specify lexical analysis; the lexical structure of many languages can be concisely and precisely stated using this notation. The regular expressions supported in Uflex are summarized in Table 1.

Operator Description
a ordinary non-operator symbols match themselves
. a period matches any single character except newline
| alternation matches either the preceding or following expression
bc concatenation is an implicit binary operator
* matches zero or more occurrences of the preceding expression
[ ] matches any one character within the brackets
+ matches one or more occurrences of the preceding expression
? matches zero or one occurrences of the preceding expression
"…" matches characters in quotes literally
(e) groups regular expressions, overriding operator precedence
(?# comment) a comment
Table 1: Regular Expressions in Uflex.

The unary postfix operators *, +, and ? have higher precedence than the binary operators, and alternation has lower precedence than concatenation. In Table 1, the square-bracketed character set notation has several extensions, including ranges such as [a-z], negation such as [^abc], and a set of predefined named csets. The predefined named csets correspond to C's <ctype.h> macros and appear as square-bracketed items within a cset definition; for example, [[:alnum:]] denotes any one alphanumeric character. Do not blame us for this interesting notation; it comes from Flex.

This document assumes you are familiar with regular expressions; if not, you may also wish to read flex & bison [Levine09]. Uflex is usually used with Iyacc, a parser generator tool documented in [Pereda18]. Uflex and Iyacc are additionally described in Programming with Unicon [Jeffery03].

2. Command Line Invocation

Uflex is invoked like this:
uflex [options] lexfile[.l]
The lex specification file normally ends in .l. This suffix is added automatically if it is omitted and appending .l yields a valid filename.
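For example, if a specification is stored in a hypothetical file calc.l, then
uflex calc
reads calc.l and writes the generated analyzer to calc.icn, the flex-style default output name, while
uflex -o scanner.icn calc
writes it to scanner.icn instead.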

The uflex options are as follows:

    option      meaning
-automaton   write out the finite automaton built from the uflex regular expressions
-nfa         generate a non-deterministic finite automaton, which is slower but sometimes less buggy than the default deterministic finite automaton
-nopp        skip the macro preprocessor step
-o file      write output to file (default is flex-style: basename.icn)
-token       write out tokens
-tree        write out the parse tree of the uflex regular expressions
-version     write the uflex version and stop

3. Uflex Program Structure

The Uflex tool takes a lexical specification and produces a lexical analyzer that corresponds to that specification. The specification consists of a list of regular expressions, with auxiliary code fragments, variables, and helper functions. The resulting generated analyzer is in the form of a procedure named yylex().

All Uflex programs consist of a file with an extension of .l that contains three sections, separated by lines consisting of two percent signs. The three sections are the definitions section or header, the rules section, and the procedures section.
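For example, here is a minimal but complete specification illustrating the layout. It is only a sketch; it counts the runs of digits in its input:

%{
# header code fragment, copied verbatim into the generated analyzer
global n
%}
digit [0-9]
%%
{digit}+ { n +:= 1 }
. { }
\n { }
%%
# procedures section, also copied verbatim
procedure main()
   n := 0
   yylex()
   write("saw ", n, " digit runs")
end

The three sections are described in detail below.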

3.1 Header Section

The definitions section has two kinds of components: macros and code fragments. Macros define shorthand for regular expressions that will be used in the next section. A macro is given by a line starting with the macro name in plain alphanumeric form, followed by a space, followed by a regular expression. Beware this deliberately simple macro syntax: everything after the first space is taken to be the regular expression. Code fragments enclosed by %{ and %} are copied verbatim to the generated lexical analyzer. Usually they are used to declare global variables and types, or to include files.
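For example, the definitions section of the word count program in section 4.1 defines two macros, ws and nonws, and then declares three global variables inside a verbatim code fragment:

ws [ \t]
nonws [^ \t\n]
%{
global cc, wc, lc
%}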

3.2 Rules Section

The rules section contains the actual regular expressions that specify the lexical analysis that is to be performed. Each regular expression may be followed by an optional semantic action enclosed in curly brackets, which is a segment of Unicon code that will be executed whenever that regular expression is matched.
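For example, the following hypothetical rule matches a run of digits and prints it, using the yytext variable described in section 3.3:

[0-9]+ { write("integer: ", yytext) }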

Here is an expanded description of some of the regular expressions that are supported in the rules section.
[a-z]{+}[aeiou]   character set union
[a-z]{-}[aeiou]   character set difference/subtraction
r{2,5}            between 2 and 5 repetitions of r (range counts may not be fully supported yet)
r{4}              exactly 4 repetitions of r
r{2,}             2 or more repetitions of r
\0                the NUL character
\123              octal escapes
\xHH              hexadecimal escapes
[ [:...:] ]       predefined csets:
  [:alnum:]  &letters ++ &digits
  [:cntrl:]  [\1-\32]
  [:lower:]  &lcase
  [:space:]  [ \t\f\v\n\r]
  [:alpha:]  [a-zA-Z]
  [:digit:]  [0-9]
  [:print:]  [\24-\176]
  [:upper:]  &ucase
  [:blank:]  [ \t]
  [:graph:]  [:print:]{-}[ ]
  [:punct:]  [:graph:]{-}([0-9a-zA-Z])
  [:xdigit:] [0-9a-fA-F]
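For example, combining the set difference operator with grouping and repetition, the following hypothetical rule matches maximal runs of lowercase consonants:

([a-z]{-}[aeiou])+ { write("consonant run: ", yytext) }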

3.3 Procedures Section

The procedures section is also copied out verbatim into the generated lexical analyzer. It may include a main() procedure for standalone Uflex programs, although more frequently it may include helper functions called from within the semantic actions in the rules section. The following public interface is used to communicate with Uflex-generated lexical analyzers when they are called from other modules.

The generated yylex() function and its return value constitute the primary interface between the lexical analyzer and the rest of the program. yylex() returns -1 if it consumes the entire input; returning different integer values from within semantic actions in the rules section allows yylex() to break the input up into multiple chunks of one or more characters (called tokens), and to identify different kinds of tokens using different integer codes. In addition to the return value, the generated lexical analyzer also makes use of several global variables. The names and meanings of these are summarized in Table 2; a minimal driver loop that uses this interface is sketched after the table.

Variable Name Description
yyin File from which characters will be read; default: &input
yytext String of characters matched by a regular expression
yyleng Length of yytext
yychar Integer category of the most recent token
yylval Lexical values (often an object or record) of the most recent token
Table 2: Uflex global variables.
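For instance, a standalone program might drive the analyzer with a loop like the following sketch, which assumes token codes are positive and that yyin defaults to &input:

procedure main()
   local t
   # report each token until the analyzer stops returning positive codes
   while (t := yylex()) > 0 do
      write("token code ", t, ": ", image(yytext))
end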

4. Examples

The best way to explore the capabilities of uflex is to look at examples. This section includes a couple of traditional toy standalone examples. The evaluation section includes a discussion of a Uflex specification written for a Unicon lexical analyzer. See [Jeffery21] for an example of a Uflex specification for a subset of Java, which also happens to compile and run under JFlex.

4.1 A Word Count Program

There is a UNIX program called wc, short for word count, that counts the number of lines, words, and characters in a file. This example demonstrates how to build such a program using Uflex. A short, albeit simplistic, definition of a word is any sequence of non-white space characters, where white space characters are blanks and tabs. See Listing 1 for a Uflex program that operates like wc.

ws [ \t]
nonws [^ \t\n]
%{
global cc, wc, lc
%}
%%
{nonws}+ { cc +:= yyleng; wc +:= 1 }
{ws}+ { cc +:= yyleng }
\n { lc +:= 1; cc +:= 1 }
%%
procedure main()
cc := wc := lc := 0
yylex()
write(right(lc, 8), right(wc, 8), right(cc, 8))
end
Listing 1. wc using uflex.

In the word count program, the definitions section consists of two definitions, one for white space characters (ws) and one for non-white space characters (nonws). These definitions are followed by code to declare three global variables: cc, wc, and lc. These are the counters for characters, words, and lines, respectively. The rules section in this example contains three rules. White space, words, and newlines each have a rule that matches and counts their occurrences. The procedures section has one procedure, main(). It calls the lexical analyzer once to process the entire input and then prints out the counted values.
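To build and run the program, assuming the specification above is saved as wc.l, a hypothetical session looks like this (details such as extra link dependencies may vary):

uflex wc
unicon wc.icn
./wc < input.txt

The first command generates wc.icn, the second compiles it with the Unicon compiler, and the third runs it on a sample input file read via &input, the default value of yyin.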

4.2 A Lexical Analyzer for a Desktop Calculator

The previous example demonstrates using Uflex to create standalone programs. However, yylex() is typically called from a parser. The yylex() function can be used to produce a sequence of words so that a parser such as that generated by the iyacc program can combine those words into sentences. Thus it makes sense to study how uflex is used in this context. One obvious difference is that in the earlier example, yylex() was only called once to process the entire file. In contrast, when a parser uses yylex(), it calls the analyzer repeatedly, and yylex() returns with each word that it finds. This will be demonstrated in the example that follows.

A calculator program is simple enough to understand in one sitting and complex enough to give a sense of how to use Uflex with its parser generator counterpart, Iyacc. In a typical desktop calculator program, the user types in formulas, which the calculator evaluates, printing the results. The generated lexical analyzer must recognize the words of this language and pass them to the parser. In this case the words are numbers, math operators, and variable names.

A number is one or more digits, optionally followed by a decimal point and one or more further digits. In regular expressions, this may be written as:

[0-9]+(\.[0-9]+)?
The math operators are simple words composed of one character. Variable names can be any combination of letters, digits, and underscores. To avoid confusing them with numbers, the definition is refined so that a variable name cannot begin with a digit. This definition of variable names corresponds to the following regular expression:
[a-zA-Z_][a-zA-Z0-9_]*
Recall that there are three sections to every Uflex program: a definitions section, a rules section, and a procedures section. The Uflex program for matching the words of a calculator is given in Listing 2.
%{
# y_tab.icn contains the integer codes for representing the
# terminal symbols NAME, NUMBER, and ASSIGNMENT.
$include "y_tab.icn"
%}
letter [a-zA-Z_]
digiletter [a-zA-Z0-9_]

%%
{letter}{digiletter}* { yylval := yytext; return NAME }
[0-9]+(\.[0-9]+)? { yylval := numeric(yytext); return NUMBER }
\n {
   return 0 # logical end-of-file
   }
":=" { return ASSIGNMENT }
[ \t]+ {
   # ignore white space
   }
. { return ord(yytext) }
%%
Listing 2. Uflex program for recognizing the lexical elements of a calculator.

The definitions section has both a component that is copied directly into the generated lexical analyzer and a set of macros. The first rule matches variable names; the second rule matches numbers. The third rule returns 0 to indicate to the parser that it should evaluate the expression. The fourth rule lets the parser know that there was an assignment operator, and the fifth is used to ignore white space. The last rule matches everything else, including the other mathematical operators; the character's numeric code (e.g. ASCII) is returned directly to the parser.
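For reference, the y_tab.icn file included at the top of Listing 2 is generated by iyacc from the companion grammar and consists of $define directives that map terminal names to integer codes. A hypothetical sketch (the actual codes are assigned by iyacc):

$define NAME 257
$define NUMBER 258
$define ASSIGNMENT 259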

yylval is used to store the result whenever we match either a variable name or a number. This way, when the lexical analyzer returns the integer code for a name or a number, the parser knows to look in yylval for the actual name or number that was matched. Since Unicon allows variables to hold values of any type, there is no need for a complicated construct to handle the fact that different tokens have different types of lexical values.

Notice that the matches allowed by this set of regular expressions are somewhat ambiguous. For example, count10 may match as a variable name followed by an integer, or as one variable name. The Uflex tool is designed to match the longest substring of input that the regular expressions can match, so count10 is matched as one word, a variable name in this case. In the case where two different expressions match the same number of characters, the first rule listed in the specification wins; this is why, in the Unicon lexical analyzer of section 4.3, the keyword rules precede the general identifier rule.

4.3 A Lexical Analyzer for Unicon

The following Unicon lexical analyzer is not quite finished, complete, or correct. It does, however, produce output identical to that of Unicon's handwritten lexical analyzer unilex.icn on the largest source file in the Unicon distribution, ipl/gprogs/gpxtest.icn. It is thus close enough to give the reader a feel for how ready uflex is for real-world use at this point.
%{
# A Uflex Unicon lexer

$include "ytab_h.icn"
$define Beginner 1
$define Ender 2
$define Newline 4
record token(tok, s, line, column, filename)
%}

O [0-7]
D [0-9]
L [[:alpha:]_]
H [[:xdigit:]]
R [[:alnum:]]
FS [fFlL]
IS [uUlL]
W [ \t\v]
idchars [[:alnum:]_]
global yylineno, yycolno, yyfilename, tokflags, lastid, buffer, lastchar

%%

"abstract"  { return ABSTRACT }
"break"     { tokflags +:= Beginner+Ender; return BREAK }
"by"        { return BY }
"case"      { tokflags +:= Beginner; return CASE }
"class"     { return CLASS }
"create"    { tokflags +:= Beginner; return CREATE }
"critical"  { tokflags +:= Beginner; return CRITICAL }
"default"   { tokflags +:= Beginner; return DEFAULT }
"do"        { return DO }
"else"      { return ELSE }
"end"       { tokflags +:= Beginner; return END }
"every"     { tokflags +:= Beginner; return EVERY }
"fail"      { tokflags +:= Beginner+Ender; return FAIL }
"global"    { return GLOBAL }
"if"        { tokflags +:= Beginner; return IF }
"import"    { return IMPORT }
"initial"   { tokflags +:= Beginner; return iconINITIAL }
"initially" { tokflags +:= Ender; return INITIALLY }
"invocable" { return INVOCABLE }
"link"      { return LINK }
"local"     { tokflags +:= Beginner; return LOCAL }
"method"    { return METHOD }
"next"      { tokflags +:= Beginner+Ender; return NEXT }
"not"       { tokflags +:= Beginner; return NOT }
"of"        { return OF }
"package"   { return PACKAGE }
"procedure" { return PROCEDURE }
"record"    { return RECORD }
"repeat"    { tokflags +:= Beginner; return REPEAT }
"return"    { tokflags +:= Beginner+Ender; return RETURN }
"static"    { tokflags +:= Beginner; return STATIC }
"suspend"   { tokflags +:= Beginner+Ender; return SUSPEND }
"then"      { return THEN }
"thread"    { tokflags +:= Beginner; return THREAD }
"to"        { return TO }
"until"     { tokflags +:= Beginner; return UNTIL }
"while"     { tokflags +:= Beginner; return WHILE }
{L}{idchars}* { tokflags +:= Beginner+Ender; return IDENT }

\'[^'\\\n]*\' { tokflags +:= Beginner + Ender; return CSETLIT }
\"([^"\\\n]|"_\n"|"\\"[n\\])*\" { tokflags +:= Beginner + Ender; mls(); return STRINGLIT }
"!"    { tokflags +:= Beginner; return BANG }
"%"    { return MOD }
"%:="  { return AUGMOD }
"&"    { tokflags +:= Beginner; return AND }
"&&"   { return PAND }
"&:="  { return AUGAND }
"*"    { tokflags +:= Beginner; return STAR }
"**"   { tokflags +:= Beginner; return INTER }
"**:=" { return AUGINTER }
"*:="  { return AUGSTAR }
"+"    { tokflags +:= Beginner; return PLUS }
"++"   { tokflags +:= Beginner; return UNION }
"++:=" { return AUGUNION }
"+:"   { return PCOLON }
"+:="  { return AUGPLUS }
"-"    { tokflags +:= Beginner; return MINUS }
"->"   { return PASSNONMATCH }
"--"   { tokflags +:= Beginner; return DIFF }
"--:=" { return AUGDIFF }
"-:"   { return MCOLON }
"-:="  { return AUGMINUS }
"."    { tokflags +:= Beginner; return DOT }
".$"   { return PSETCUR }
".>"   {
   # missing logic here if next_gt_is_ender === 1 then ...
   return PSETCUR
   }
".|" { return POR }
"."[0-9]+ { tokflags +:= Beginner + Ender; return REALLIT }
"/"     { tokflags +:= Beginner; return SLASH }
"/:="   { return AUGSLASH }
":"     { return COLON }
"::"    { tokflags +:= Beginner; return COLONCOLON }
":="    { return ASSIGN }
":=:"   { return SWAP }
"<"     { return NMLT }
"<="    { return NMLE }
"<=:="  { return AUGNMLE }
"<<"    { return SLT }
"<<="   { return SLE }
"<<=:=" { return AUGSLE }
"<<:=" { return AUGSLT }
"<<@" { tokflags +:= Beginner + Ender; return RCVBK }
"<-" { return REVASSIGN }
"<->" { return REVSWAP }
"<:=" { return AUGNMLT }
"<@" { tokflags +:= Beginner + Ender; return RCV }
"=" { tokflags +:= Beginner; return NMEQ }
"=>" { return PIMDASSN }
"=:=" { return AUGNMEQ }
"==" { tokflags +:= Beginner; return SEQ }
"==:=" { return AUGSEQ }
"===" { tokflags  +:= Beginner; return EQUIV }
"===:=" { return AUGEQUIV }
">" {
   if \next_gt_is_ender then {
      tokflags +:= Ender
      next_gt_is_ender := &null
      }
   return NMGT
   }
">=" { return NMGE }
">=:=" { return AUGNMGE }
">>" { return SGT }
">>=" { return SGE }
">>=:=" { return AUGSGE }
">>:=" { return AUGSGT }
">:=" { return AUGNMGT }
"?" { tokflags +:= Beginner; return QMARK }
"??" { return PMATCH }
"?:=" { return AUGQMARK  }
"@" { tokflags +:= Beginner; return AT }
"@>>" { tokflags +:= Beginner + Ender; return SNDBK }
"@>"  { tokflags +:= Beginner + Ender; return SND }
"@:=" { return AUGAT }
"\\" { tokflags +:= Beginner; return BACKSLASH }
"^" { tokflags +:= Beginner; return CARET }
"^:=" { return AUGCARET }
"|" { tokflags +:= Beginner; return BAR }
"||" { tokflags +:= Beginner; return CONCAT }
"|||" { tokflags +:= Beginner; return LCONCAT }
"||:=" { return AUGCONCAT }
"|||:=" { return AUGLCONCAT }
"~" { tokflags +:= Beginner; return TILDE }
"~=" { tokflags +:= Beginner; return NMNE }
"~=:=" { return AUGNMNE }
"~==" { tokflags +:= Beginner; return SNE }
"~==:=" { return AUGSNE }
"~===" { tokflags +:= Beginner; return NEQUIV }
"~===:=" { return AUGNEQUIV }
"(" { tokflags +:= Beginner; return LPAREN }
")" { tokflags +:= Ender; return RPAREN }
"[" { tokflags +:= Beginner; return LBRACK }
"]" { tokflags +:= Ender; return RBRACK }
"{" { tokflags +:= Beginner; return LBRACE }
"}" { tokflags +:= Ender; return RBRACE }
"," { return COMMA }
";" { return SEMICOL }
"$(" { tokflags +:= Beginner; return LBRACE }
"$)" { tokflags +:= Ender; return RBRACE }
"$<" { tokflags +:= Beginner; return LBRACK }
"$>" { tokflags +:= Ender; return RBRACK}
"$$" { return PIMDASSN }
"$" { return DOLLAR }
"``"[^`]*"``" { tokflags +:= Ender; return PUNEVAL }
"`"[^`]*"`" { tokeflags +:= Ender; return PUNEVAL }
{D}+ { tokflags +:= Beginner+Ender; return INTLIT }
{D}+[rR]{R}+ { tokflags +:= Beginner+Ender; return INTLIT }
{D}+[kK] { tokflags +:= Beginner+Ender; yytext:=string(yytext[1:-1]*1024); return INTLIT }
{D}+[mM] { tokflags +:= Beginner+Ender; yytext:=string(yytext[1:-1]*1024^2); return INTLIT }
{D}+[gG] { tokflags +:= Beginner+Ender; yytext:=string(yytext[1:-1]*1024^3); return INTLIT }
{D}+[tT] { tokflags +:= Beginner+Ender; yytext:=string(yytext[1:-1]*1024^4); return INTLIT }
{D}+[pP] { tokflags +:= Beginner+Ender; yytext:=string(yytext[1:-1]*1024^5); return INTLIT }
{D}+[dD]{D}+ { tokflags +:= Beginner+Ender; return INTLIT }
{D}+((([eE][+\-]?){D}+)?) { tokflags +:= Beginner+Ender; return REALLIT }
{D}+\.{D}+ { tokflags +:= Beginner+Ender; return REALLIT }
{D}+\.{D}*((([eE][+\-]?){D}+)?) { tokflags +:= Beginner+Ender; return REALLIT }
"#line "[0-9]+ \"[^\"]*\".* {
   yytext ? { move(6)
      yylineno := integer(tab(many(&digits)));
      if =" \"" & (new_filename := tab(find("\""))) then
         yyfilename := new_filename
      }
   }
"#".* { }
"\n" { yylineno +:= 1; yycolno := 1;
       if tokflags < Newline then tokflags +:= Newline }
" " { yycolno +:= 1 }
[\v\014] { }
"\t"  { yycolno +:= 1; while(yycolno-1)%8~=0 do yycolno +:= 1 }

%%

procedure lex_error()
   yyerror("token not recognized")
end

procedure uni_error(s)
   /errors := 0
   /s := "token not recognized"
write("uni_error calls yyerror ", image(s))
   yyerror(s)
   errors +:= 1
end

# Semicolon insertion wrapper. To enable it, swap procedure values with
# yylex :=: yylex2; after the swap, callers of yylex() reach this wrapper,
# and the call to yylex2() below reaches the original generated analyzer.
procedure yylex2()
  static saved_tok, saved_yytext, seen_eof
  local rv, ender
  if \seen_eof then return 0
  if \saved_tok then {
    rv := saved_tok
    saved_tok := &null
    yytext := saved_yytext
    yylval := yytoken := token(rv, yytext, yylineno, yycolno, yyfilename)
    if \debuglex then
      write("yylex() : ", tokenstr(rv), "\t", image(yytext))
    return rv
  }
  # remember whether the previous token was an Ender, then reset the flags
  if /ender := iand(tokflags, Ender) then
      tokflags := 0
  if not (rv := yylex2()) then {
    yytext := ""
    if \debuglex then
      write("yylex() : EOFX")
    seen_eof := 1
    return EOFX
  }
  # the previous token could end an expression, the new token could begin
  # one, and a newline intervened: insert a semicolon
  if ender ~= 0 & iand(tokflags, Beginner) ~= 0 & iand(tokflags, Newline) ~= 0 then {
    saved_tok := rv
    saved_yytext := yytext
    yytext := ";"
    rv := SEMICOL
    }
   yylval := yytoken := token(rv, yytext, yylineno, yycolno, yyfilename)
   if \debuglex then
      write("yylex() : ", tokenstr(rv), "\t", image(yytext))
   return rv
end

# handle multi-line string, eating underscore-newline pairs and the leading
# spaces of the next line
procedure mls()
  local outs
  outs := ""
  yytext ? {
     while outs ||:= tab(find("_\n")) do {
       move(2); tab(many(' \t'))
       }
     outs ||:= tab(0)
     }
  yytext := outs
  yyleng := *yytext
end

5. Evaluation

A lexical analyzer generator ought to be reliable and correct enough for use in production tools. Uflex satisfies this requirement. Flaws and limitations in the tool should also be known and documented; this has not yet fully happened. Uflex has a test suite, in a subdirectory named test/, that runs a range of examples to validate working functionality and to identify areas of deficiency.

One basic question about any lexical analyzer is whether its speed is acceptable. Historically, the original UNIX lex(1) was famous for being slow. If a lex-generated lexical analyzer is much slower than a handwritten one, then the tool is unattractive despite the advantages it may convey in improved maintainability. Performance mattered far more back when compilers were glacial and every bit of speed counted; in those days, if doubling the speed of your lexical analyzer made your whole compiler 10% faster, it made a large difference in programmer productivity. Performance is less of a concern now, but it still matters.

One of the obvious tests for a Unicon lexical analyzer generator such as uflex is whether it is robust enough to support lexical analyzers for real programming languages. For example, can uflex be used to produce an operational lexical analyzer for the Unicon language itself? As seen in section 4.3, the answer is yes. A second test is whether the generated lexical analyzer runs fast enough to compete with a handwritten lexical analyzer. Unicon's handwritten lexical analyzer in unicon/uni/unicon/unilex.icn is written in about 750 lines of Unicon. It is well structured into about 39 procedures and uses string scanning. A test program that uses this lexical analyzer to process files named on the command line looks like this:

link unilex
procedure main(argv)
   t1 := &time 
   count := 0
   yyin := open(argv[1]) | stop("can't open ", argv[1]|"no file given")
   while (i := yylex()) > 0 do {
      count +:= 1
      }
   close(yyin)
   t2 := &time
   write("processed ", count, " tokens in ", t2-t1, "ms")
end
Linked with the handwritten Unicon lexical analyzer, the above program can be run on the largest .icn file in the Unicon distribution, the 9.8KLOC, 272KB file tests/graphics/gpxtest.icn, which is a fatter, self-contained version of ipl/gprogs/gpxtest.icn. In preliminary tests, the handwritten lexical analyzer processed 50,744 tokens in roughly 0.267 seconds on an i7-7700 running Windows 10. We compare our uflex-generated lexical analyzer against this result. At present it runs in 127 seconds, roughly 500x slower than the handwritten lexical analyzer.

If uflex did not generate a usable lexical analyzer for Unicon, it would not be ready for prime time. As it stands, uflex generates a lexical analyzer that runs the same test and produces the same output, but two or more orders of magnitude more slowly. Our interpretation is that uflex is not yet ready to be taken seriously, hence the term "farcical" in its name. On the other hand, the deterministic finite automata it generates may now be ready for some performance tuning, and it may be possible to generate a better representation of the finite automaton.

6. Conclusions

Despite Unicon's plethora of string processing facilities, a yacc-compatible lexical analyzer generator remains of high value to language developers. With Uflex, it now becomes more practical to write tools such as compilers and transpilers in Unicon for full-sized languages.

Uflex may be useful in language processing either as a standalone application or, as is more usual, in conjunction with a parser. Uflex is still immature but is now useful for experimental compiler prototyping.

Acknowledgements

Unicon itself came into being in no small part thanks to the development of iyacc, a port of byacc/j done by Ray Pereda. iyacc enabled the development of a full parser to replace Idol's line-oriented semi-parser. Neither Ray Pereda's iflex tool nor Katrina Ray's ulex tool was completed to a state that would allow it to handle typical programming language lexical specifications.

Uflex started with Clint Jeffery's small extension to Katrina Ray's ulex to handle lex macros; the extension was called ulpp, the ulex preprocessor. Clint's Unicon translation of Ulex was written later, followed by much debugging to get to the point where a lexical analyzer for a Java-like language (Jzero) ran correctly. A conversion from Katie Ray's NFA representation to DFA was added by Susan Jeffery. The code was further debugged and remains in development.

Katrina Ray's original development of ulex was supported by an NMSU fellowship and by the National Library of Medicine. Ray Pereda wrote the first version of UTR #2 to describe his planned iflex tool. The UTR #2 text was substantially altered to describe ulex and eventually uflex, and Ray bears none of the blame/responsibility for the end result.

References

[Jeffery03] C. Jeffery, S. Mohamed, R. Pereda, and R. Parlett. Programming with Unicon. http://unicon.sf.net/book/ub.pdf, 2003.

[Lesk75] M.E. Lesk and E. Schmidt. LEX – Lexical Analyzer Generator. Computer Science Technical Report No. 39, Bell Laboratories, Murray Hill, New Jersey, October 1975.

[Levine09] J.R. Levine. flex & bison, O'Reilly & Associates, Cambridge, Massachusetts, 2009.

[Pereda18] Ray Pereda. Iyacc – a Parser Generator for Icon, Unicon Technical Report 3a, http://unicon.org/utr/utr3.pdf, February 2018.

[Jeffery21] Clinton Jeffery. Build Your Own Programming Language. Packt. In preparation.

Appendix: Differences Between Uflex and UNIX lex(1) and flex(1)

This appendix summarizes the known differences between Uflex and other implementations of the UNIX lex(1) tool. Uflex still has several important limitations; in particular, the following Flex operators are not yet implemented.

(?r:patt) interpret regex patt with option r (i, s, or x)
(?-r:patt) (?r-s:patt) more options formats
r/s lookahead operator
^r beginning of line context
r$ end of line context
<s>r start conditions
<s1,s2,s3>r multiple start conditions
<*>r any start condition
<<EOF>> end-of-file
<s1,s2><<EOF>> end-of-file occurring in start conditions