PunY Implementation Notes

These notes are extracted from discussion of PunY in NMT CSE 423 in Spring 2023.

How do I distinguish newlines that continue a previous line from newlines that start a new line?

Well, in terms of regular expressions in your Flex spec, maybe:

"\\\n"		{ /* code to continue a line, no logical LINE token */ }
"\n"		{ return NEWLINE; }

When do I count indentation levels?

Use separate regular expressions for whitespace at the start of a line and whitespace elsewhere in a line.

^[ \t]+         { /* whitespace at the start of a line, calc indents */ }
[ \t]+          { /* whitespace not at the start of a line, discard */ }

So how do I count indentation levels?

Maybe something like:

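int indentlevel(char *s)
{
   int column = 1;   /* 1-based column of the next character */
   while (*s != '\0') {
      if (*s == '\t') {
         /* advance to the next tab stop: columns 9, 17, 25, ... */
         column++;
         while ((column % 8) != 1) column++;
      }
      else column++;
      s++;
   }
   return column;
}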

How do I return a "fake" token, e.g. INDENT or DEDENT?

For a token that is not directly based on regular expression content, you may allocate a struct/object instance for it, fill in its fields, put something (even an empty string?) as its lexeme/string, and return the integer code for it from the yylex() function. If the regex you matched did not (otherwise) have to return its own token, then you are good. If it did have a token of its own but the "fake" token has to be seen first, then you have to save the actual token in some global variable and return it later from a separate call to yylex(). See below.
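As a rough illustration of allocating such a token, here is a minimal C sketch. The struct token layout, the last_token global, and the fake_token() name are all assumptions made for this example; your own token struct and the way you hand it to the parser (e.g. through yylval) will differ.

```c
#include <stdlib.h>

/* Hypothetical token record; the field names are assumptions. */
struct token {
   int category;   /* integer code, e.g. INDENT or DEDENT */
   char *text;     /* lexeme; empty for a fake token */
   int lineno;     /* line where the token was produced */
};

/* Stand-in for wherever the scanner leaves the token for the parser. */
struct token *last_token = NULL;

/*
 * Build a token that has no lexeme of its own and return its integer
 * code, which is what yylex() would return. Returns -1 if out of memory.
 */
int fake_token(int category, int lineno)
{
   struct token *t = malloc(sizeof *t);
   if (t == NULL) return -1;
   t->text = calloc(1, 1);          /* one zero byte: the empty string */
   if (t->text == NULL) { free(t); return -1; }
   t->category = category;
   t->lineno = lineno;
   last_token = t;
   return category;
}
```

Inside a flex action this might be used as return fake_token(INDENT, yylineno); assuming INDENT comes from the Bison-generated header.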

How do I return several DEDENTs after matching some leading whitespace?

It is one thing to implement the stack of integers...it is another thing to enable yylex() to remember a bunch of saved DEDENT tokens and return them before moving on to the next thing after the whitespace. For that, you may want to use a wrapper function around yylex() that reads from the saved tokens before calling the "real" flex yylex() to move forward. I have literally written makefile rules that, after flex runs, rename its output function to yylex2(), and then a yylex() wrapper looks like:

  int yylex()
  {
     /* if saved tokens are on a saved token (DEDENT) stack, return top one */
     /* else return yylex2() */
  }
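One possible way to fill in that wrapper, together with the stack of integers, is sketched below in C. This is only a sketch under assumptions: the token codes, MAXINDENT, indent_change(), and pending_dedents are names invented for the example, and the yylex2() here is a stub standing in for the renamed flex scanner. Since every saved token is a DEDENT, a simple counter is used in place of a real token stack.

```c
/* Hypothetical token codes; in a real build they come from the Bison header. */
enum { NAME = 258, NEWLINE, INDENT, DEDENT, ERRORTOKEN };

#define MAXINDENT 100

static int indent_stack[MAXINDENT] = { 1 };  /* column 1 means no indentation */
static int indent_top = 0;                   /* index of the top entry */
static int pending_dedents = 0;              /* DEDENTs still owed to the parser */

/*
 * Called with the column of the first non-blank character on a line.
 * Returns INDENT, DEDENT, 0 (same level), or ERRORTOKEN. When a line closes
 * several blocks at once, the extra DEDENTs are counted in pending_dedents.
 */
int indent_change(int column)
{
   int count = 0;
   if (column > indent_stack[indent_top]) {
      if (indent_top + 1 >= MAXINDENT) return ERRORTOKEN;  /* nested too deep */
      indent_stack[++indent_top] = column;
      return INDENT;
   }
   while (indent_top > 0 && column < indent_stack[indent_top]) {
      indent_top--;
      count++;
   }
   /* the new column has to match some enclosing level exactly */
   if (column != indent_stack[indent_top]) return ERRORTOKEN;
   if (count == 0) return 0;
   pending_dedents = count - 1;
   return DEDENT;
}

/* Stub standing in for the flex-generated scanner after it is renamed. */
int yylex2(void)
{
   return NAME;
}

/* The wrapper the parser calls: saved DEDENTs first, then the real scanner. */
int yylex(void)
{
   if (pending_dedents > 0) {
      pending_dedents--;
      return DEDENT;
   }
   return yylex2();
}
```

In this arrangement the leading-whitespace action would call indent_change(indentlevel(yytext)) and return the result when it is nonzero. A real token stack would only be needed if tokens other than DEDENT had to be saved.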

Token	Meaning/Comment
ENDMARKER	Some grammars expect a token upon seeing EOF!
NAME	identifier
NUMBER	what, python doesn't distinguish reals from integers??! From realpython we note that numbers can have _ in them, as in 1_000. Leading - seems to be part of the number. Real numbers include scientific notation, like 1e6. Exponent can have + or - after the e or E.
STRING	all kinds: ' " ''' and """
NEWLINE	LOGICAL newline
INDENT	this line's initial whitespace larger than previous line
DEDENT	this line's initial whitespace smaller than previous line
LPAR	'('
RPAR	')'
LSQB	'['
RSQB	']'
COLON	':'
COMMA	','
SEMI	';'
PLUS	'+'
MINUS	'-'
STAR	'*'
SLASH	'/'
VBAR	'|'
AMPER	'&'
LESS	'<'
GREATER	'>'
EQUAL	'='
DOT	'.'
PERCENT	'%'
LBRACE	'{'
RBRACE	'}'
EQEQUAL	'=='
NOTEQUAL	'!='
LESSEQUAL	'<='
GREATEREQUAL	'>='
TILDE	'~'
CIRCUMFLEX	'^'
LEFTSHIFT	'<<'
RIGHTSHIFT	'>>'
DOUBLESTAR	'**'
PLUSEQUAL	'+='
MINEQUAL	'-='
STAREQUAL	'*='
SLASHEQUAL	'/='
PERCENTEQUAL	'%='
AMPEREQUAL	'&='
VBAREQUAL	'|='
CIRCUMFLEXEQUAL	'^='
LEFTSHIFTEQUAL	'<<='
RIGHTSHIFTEQUAL	'>>='
DOUBLESTAREQUAL	'**='
DOUBLESLASH	'//'
DOUBLESLASHEQUAL	'//='
AT	'@'
ATEQUAL	'@='
RARROW	'->'
ELLIPSIS	'...'
COLONEQUAL	':='
OP	don't know what this is yet. Grammar does not have it.
AWAIT	I guess this is a not-in-PunY new reserved word
ASYNC	I guess this is a not-in-PunY new reserved word
TYPE_IGNORE	# type: ignore, special comment, not in PunY
TYPE_COMMENT	e.g. # type:(str) -> str, not in PunY
ERRORTOKEN	turn lexical error into a syntax error

For-Loops

PunY has to support while loops, plus the following kinds of for-loops.

for v in range(stop): loop some variable from 0 to stop-1
for v in range(start,stop): loop some variable from start to stop-1
for v in range(start,stop,step): loop some variable from start toward stop (not including stop), changing value by step each time
for v in x: loop some variable through values in list or tuple x
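To pin down what the three range forms mean when generating code for them, here is a small C sketch. The function names range_length() and range_print() are made up for this example. It follows Python's rule that a positive step stops before reaching stop from below and a negative step stops before reaching stop from above; range(stop) is range(0,stop,1) and range(start,stop) is range(start,stop,1).

```c
#include <stdio.h>

/*
 * Number of values visited by range(start, stop, step).
 * A step of 0 is an error in Python; here it simply yields 0 and the
 * caller is assumed to report the error.
 */
long range_length(long start, long stop, long step)
{
   if (step > 0 && start < stop) return (stop - start + step - 1) / step;
   if (step < 0 && start > stop) return (start - stop - step - 1) / (-step);
   return 0;   /* empty range */
}

/* Print the values that "for v in range(start, stop, step)" would visit. */
void range_print(long start, long stop, long step)
{
   long n = range_length(start, stop, step);
   long i, v;
   for (i = 0, v = start; i < n; i++, v += step)
      printf("%ld\n", v);
}
```

For example, range_print(0, 10, 3) visits 0, 3, 6, 9, and range_print(10, 0, -3) visits 10, 7, 4, 1.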

So far, I have been hoping for PunY that we'd just get by with lists and not bother with tuples. How realistic is that for basic Python programming?

Syntax Issues

In this rule in the file:

import_from: ('from' (('.' | '...')* dotted_name | ('.' | '...')+) 'import' ('*' | '(' import_as_names ')' | import_as_names))

they say:

# note below: the ('.' | '...') is necessary because '...' is tokenized as ELLIPSIS

So should I put '...' as it is, or should I add a token called ELLIPSIS which takes in 3 dots?

Yes, Python has a DOT token and it has an ELLIPSIS token. The "from ..... import" grammar rule allows an arbitrary number of periods before a name, which means it has to allow any number of either DOT or ELLIPSIS. However, PunY does not have from...import.

In Bison I would translate the compound_stmt rule roughly as follows, calling not_puny() on the alternatives that are Python but not PunY:

compound_stmt:
     if_stmt
   | while_stmt
   | for_stmt
   | try_stmt { not_puny("try statement"); }
   | with_stmt { not_puny("with statement"); }
   | funcdef
   | classdef
   | decorated { not_puny("decorated statement"); }
   | async_stmt { not_puny("async statement"); }
;
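The notes do not show not_puny() itself. A minimal guess at what such a helper could look like in C follows; the message wording, the puny_errors counter, and the locally defined yylineno are assumptions for this sketch (in a real build yylineno would come from flex with %option yylineno).

```c
#include <stdio.h>

int yylineno = 1;      /* stand-in; normally maintained by flex */
int puny_errors = 0;   /* hypothetical count of not-in-PunY constructs seen */

/* Report a construct that is legal Python but outside the PunY subset. */
void not_puny(const char *construct)
{
   fprintf(stderr, "line %d: %s is Python but not PunY\n",
           yylineno, construct);
   puny_errors++;
}
```

Counting instead of exiting lets the parser keep going and report every unsupported construct in one run; exiting on the first one would also be a reasonable choice.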

Type Representation

For PunY, this might look like:

struct type {
   /*
    * Integer code that says what kind of type this is.
    * Includes all primitive types:
    * 1=int, 2=float, 3=string, 4=bool.
    * Also includes codes for compound types that then also
    * hold type information in a supporting union:
    * 5=list, 6=dict, 7=func, 8=class. */
   int base_type;
  /* gone away! for PunY */
  union {
   struct funcdef {
      struct type *return_type;
      int nparams;
      struct params **p;
      } f;
/* maybe we can get away with just "knowing" for only predefined class info
   struct classdef {
      struct methods **meth;
      struct members **mem;
      } c;
 */
   } u;
};

struct field {			/* members (fields) of structs */
   char *name;
   struct type *elemtype;
};
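To show how such a type struct might be used, here is a short C sketch. It repeats a trimmed copy of struct type so that it compiles on its own; the enum names, alloc_type(), and type_name() are inventions for this example, and only the integer codes 1 through 8 come from the comment above.

```c
#include <stdlib.h>

/* Hypothetical names for the base_type codes listed in the notes. */
enum { INT_TYPE = 1, FLOAT_TYPE, STRING_TYPE, BOOL_TYPE,
       LIST_TYPE, DICT_TYPE, FUNC_TYPE, CLASS_TYPE };

struct type {
   int base_type;
   union {
      struct funcdef {
         struct type *return_type;
         int nparams;
         struct params **p;   /* parameter info; layout not shown in the notes */
      } f;
   } u;
};

/* Allocate a zeroed type node with the given base type code. */
struct type *alloc_type(int base_type)
{
   struct type *t = calloc(1, sizeof(struct type));
   if (t != NULL) t->base_type = base_type;
   return t;
}

/* Printable name for a type, e.g. for type-error messages. */
const char *type_name(const struct type *t)
{
   static const char *names[] = { "unknown", "int", "float", "string",
                                  "bool", "list", "dict", "func", "class" };
   if (t == NULL || t->base_type < INT_TYPE || t->base_type > CLASS_TYPE)
      return names[0];
   return names[t->base_type];
}
```

A function returning int could then be described by allocating a FUNC_TYPE node and setting its u.f.return_type to alloc_type(INT_TYPE). Since the primitive types carry no extra data, one shared node per primitive type would also work and would save allocations.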