% Copyright 2012-2014, Alexander Shibakov
% Copyright 2002-2014 Free Software Foundation, Inc.
% This file is part of SPLinT
%
% SPLinT is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, either version 3 of the License, or
% (at your option) any later version.
%
% SPLinT is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
% GNU General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with SPLinT.  If not, see <http://www.gnu.org/licenses/>.

\input limbo.sty
\input frontmatter.sty
\def\optimization{5}
\input yy.sty
% multi-column output
\input dcols.sty

\let\currentparsernamespace\parsernamespace
\let\parsernamespace\mainnamespace
\let\currenttokeneq\tokeneq
\def\tokeneq#1#2{\prettytoken{#1}}
\input bo.tok % re-use token equivalence table to set the typesetting of tokens
\let\tokeneq\currenttokeneq
\input btokenset.sty
% index entries
\let\parsernamespace\indexpseudonamespace
\prettywordpair{emptyrhs}{$\circ$ {\rm(empty rhs)}}%
\prettywordpair{inline_action}{$\diamond$ {\rm(inline action)}}%
\prettywordpair{TOKEN}{{\tt TOKEN} {\rm(example)}}%
\prettywordpair{token}{{\tt "token"} {\rm(example)}}%
\let\parsernamespace\currentparsernamespace
\let\hostparsernamespace\mainnamespace % the namespace where tokens are looked up
                                       % for typesetting purposes
\immediate\openout\exampletable=\jobname.exl

\def\nontitle#1{{\ttl #1}}
\def\cite[#1]{%
    \def\next{#1}\setbox0=\hbox{l}%
    [\ifx\next\empty$\,$\hbox{\vrule width\wd0 height\ht0 depth\dp0}$\,$\else
        \locallink{#1bibref}#1\endlink\fi]%
}

\let\oldN\N
\let\N\textN
\let\M\textM
\defreserved{Y}{\.{Y}}
\showlastactiontrue

@**Introduction.
\setupfootnotes
\splint\footnote{I was tempted to call the package {\tt ParLALRgram}, which stands for Parsing {\sc LALR} Grammars, or {\tt PinT} for `Parsing in \TeX', but both sounded too generic.} (Simple Parsing and Lexing in \TeX, or, following the great GNU tradition of creating recursive names, \splint\ Parses Languages in \TeX) is a system (or rather a m\'elange of systems) designed to facilitate developing parsing macros in \TeX\ and (to a lesser degree) documenting parsers written in other languages. As an application, a parser for \bison\ input file syntax has been developed, along with a macro collection that makes it possible to design and pretty print \bison\ grammars using \CWEB.

Developing software in \CWEB\ involves two programs. The first of these is \CTANGLE, which outputs the actual code, intended to be in \Cee. In reality, \CTANGLE\ cares very little about the language it produces. Exceptions are \Cee\ comments and |@[#line@]| directives that might confuse lesser software, although \bison\ is all too happy to swallow them (there are also some \Cee-specific constructs that \CTANGLE\ tries to recognize). \CTANGLE's main function is to rearrange the text of the program as written by the programmer (in a way that, hopefully, emphasizes the internal logic of the code) into an appropriate sequence (e.g.~all variable declarations must textually precede their use). All that is required to adapt \CTANGLE\ to produce \bison\ output is some very rudimentary post- and pre-processing.

Our main concern is thus \CWEAVE, which not only pretty prints the program but also creates an index, cross-references all the sections, etc.
Getting \CWEAVE\ to pretty print a language other than \Cee\ requires some additional attention. A true digital warrior would probably try to decipher \CWEAVE's output `in the raw' but, alas, my WebFu is not that strong. The loophole comes in the form of a rarely (for a good reason) used \CWEB\ command: the verbatim (\.{@@=...@@>}) output. The material to be output by this construct undergoes minimal processing and is put inside \.{\\vb\{}$\ldots$\.{\}}. All that is needed now is a way to process this virtually straight text inside \TeX.

@*1 Using the \bison\ parser.
The process of using \splint\ for writing parsing macros in \TeX\ is treated in considerable detail later in this document. We begin, instead, by outlining how one such parser can be used to pretty print a \bison\ grammar. Following the convention mentioned above and putting all non-\Cee\ code inside \CWEAVE's verbatim blocks, consider the following (meaningless) code fragment. The fragment contains a mixture of \Cee\ and \bison\ code; the former appears outside of the verbatim blocks.
\begindemo
^@@= non_terminal: @@>
^@@=   term.1 term.2 {@@> a = b; @@=}@@>
^@@= **H term.3 other_term {@@> $$ = $1; @@=}@@>
^@@= **H still more terms {@@> f($1); @@=}@@>
^@@= ; @@>
\enddemo
The fragment above will appear as (the output of \CTANGLE\ can be examined in \.{sill.y})
@=
@G
non_terminal:
  term.1 term.2 {@> a = b; @=}
| term.3 other_term {@> $$ = $1; @=}
| still more terms {@> f($1); @=}
;
@g

@ $\ldots$ if the syntax is correct. In case it is a bit off, the parser will give up and you will see a different result. The code in the fragment below is easily recognizable, and some parts of it (all of the \Cee\ code, in fact) are still pretty printed by \CWEAVE. Only the verbatim portion is left unprocessed.
@=
@G
whoops
  term.1 term.2 {@>@+ a = b; @+@=}
| term.3 other_term {@>@+ $$ = $1; @+@=}
| still more terms {@>@+ f($1); @+@=}
;
@g

@ The \TeX\ header that makes such output possible is quite plain. In this example (i.e.\ this very file) it consists of
\begindemo
^\input limbo.sty
^\input frontmatter.sty
^\input yy.sty
\nooutput
\enddemo
The first two lines are presented here merely for completeness: there is no parsing-relevant code in them. The line that follows loads the macros that implement the parsing and scanning machinery. This is enough to set up all the basic mechanisms used by the parsing and lexing macros. The rest of the header provides a few definitions to fine-tune the typesetting of grammar productions. It starts with
\begindemo
^\let\currentparsernamespace\parsernamespace
^    \let\parsernamespace\mainnamespace
^    \let\currenttokeneq\tokeneq
^    \def\tokeneq#1#2{\prettytoken{#1}}
^    \input bo.tok % re-use token equivalence table to set the typesetting of tokens
^    \let\tokeneq\currenttokeneq
^    \input btokenset.sty
\nooutput
\enddemo
We will have a chance to discuss all the \.{\\}$\ldots$\.{namespace} macros later; at this point it will suffice to say that the lines above are responsible for controlling the typesetting of term names.
The file \.{bo.tok} consists of a number of lines like the ones below:
\begindemo
^\tokeneq {STRING}{{34}{115}{116}{114}{105}{110}{103}{34}}
^\tokeneq {PERCENT_TOKEN}{{34}{37}{116}{111}{107}{101}{110}{34}}
\nooutput
\enddemo
The cryptic-looking sequences of integers above are strings of {\sc ASCII} codes of the letters that form the name \bison\ uses when it needs to refer to the corresponding token (thus, the second one is \toksa{}\numberstochars{34}{37}{116}{111}{107}{101}{110}{34}\end \.{\the\toksa}, which might help explain why such an elaborate scheme has been chosen). The macro \.{\\tokeneq} is defined in \.{yymisc.sty}, which, in turn, is input by \.{yy.sty}. But what about the token names themselves? In this case they were extracted automatically from the \CWEB\ source file by the parser during the \CWEAVE\ processing stage. All of these definitions can be overwritten to get the desired output (say, one might want to typeset \.{ID} in a roman font, as `identifier'; all that needs to be done is a macro that says \.{\\prettywordpair\{ID\}\{\{\\rm identifier\}\}}). The file \.{btokenset.sty} input above contains a number of such definitions.

@ To round off this short overview, I must mention a caveat associated with using the macros in this collection: while one of the greatest advantages of using \CWEB\ is its ability to rearrange the code in a very flexible way, the parser will either give up or produce unintended output if this feature is abused while describing the grammar. For example, in the code below
@=
@G
next_term:
  stuff @> @ @={@> a = f( x ); @=}
@g
@@;

@ The line titled |@| is intended to be a rule defined later. Notice that while it seems that the parser was able to recognize the first code fragment as a valid \bison\ input, it misplaced the |@|, having erroneously assumed it to be a part of the action code for this grammar (later on we will go into the details of why it is necessary to collect all the non-verbatim output of \CWEAVE, even the parts that contain no interesting \Cee\ code; hint: it has something to do with money (\.{\$}), also known as math, and the way \CWEAVE\ processes the `gaps' between verbatim sections). The production line that follows did not fare as well: the parser gave up. Such a small language fragment is simply not valid input for the grammar the parser uses to process the verbatim output.
@=
@G
more stuff in this line {@> @[b = g(y);@]@=}
@g

@ Finally, if you forget that only the verbatim part of the output is looked at by the parser, you might get something unrecognizable, such as
@=
but not all of it

@ To correct this, one can provide a more complete grammar fragment to allow the parser to complete its task successfully. In some cases, this imposes too strict a constraint on the programmer. Instead, the parser that pretty prints \bison\ grammars allows one to add {\it hidden context\/} to the code fragments above. The context is added inside \.{\\vb} sections using \CWEB's \.{@@t}$\ldots$\.{@@>} facility. The \CTANGLE\ output is not affected by this, while the code above can now be typeset as:
@=
@G
next_term:
  stuff @> @t}\vb{\formatlocal{\let\peekstash\stashtoterm}}{@> @ @t}\vb{FAKE}{@> @={@> a = f( x ); @=}
@g
@@;

@ $\ldots$ even a single line can now be displayed properly.
@=
@G
@g
@g
@t}\vb{\formatlocal{\skipheader} FAKE:}{@>
@G
more stuff in this line {@> b = g( y ); @=}
@g

@ With enough hidden context, even a small rule fragment can be typeset as intended.
The `action star' was inserted to reveal some of the context.
@=
@G
@g
@g
@t}\vb{\formatlocal{\skipheader} FAKE:}{@>
@G
but not all of it
@g
@t}\vb{\{\stashed{$\star$}\}}{@>

@ What makes all of this even more confusing is that \CTANGLE\ will have no trouble outputting this as an (almost) valid \bison\ file (`almost' because of the intentionally bad \.{whoops} production above), as can be checked by looking into \.{sill.y}. The author happens to think that one should not fragment the software into pieces that are too small: \bison\ is not \Cee, so it makes sense to write \bison\ code differently. However, if the logic behind your code organization demands such fine fragmentation, hidden context provides you with a tool to show it off. A look inside the source of this document shows that adding hidden context can be a bit ugly, so it is not recommended for routine use. The short example above is output in the file below.
@(sill.y@>=
@@;

@*1 On debugging.
This concludes a short introduction to the pretty printing of \bison\ grammars using this macro collection. It would be incomplete, however, without any reference to debugging\footnote{Here we are talking about debugging the output produced by \CWEAVE\ when the included \bison\ parser is used, {\it not\/} about debugging parsers written with the help of this software: the latter topic is covered in more detail later on.}. There is a fair amount of debugging information that the macros can output; unfortunately, very little of it is tailored to the {\it use\/} of the macros in the \bison\ parser. Most of it is designed to help {\it build\/} a new parser. If you find that the parser gives up too often or even crashes (the latter is most certainly a bug in the parser itself), the first approach is to make sure that your code {\it compiles\/}, i.e.\ forget about the printed output and try to see if the `real' \bison\ accepts the code (just the syntax, no need to worry about conflicts and such). If this does not shed any light on why the macros seem to fail, turn on the debugging output by saying \.{\\trace$\ldots$true} for various trace macros. This can produce {\it a lot\/} of output, even for small fragments, so turn it on only for one section at a time. If you need still {\it more\/} details of the inner workings of the parser and the lexer, various other debugging conditionals are available. For example, \.{\\yyflexdebugtrue} turns on the debugging output for the scanner. There are a number of such conditionals that are discussed in the commentary for the appropriate \TeX\ macros. Remember, what you are seeing at this point is the parsing process of the \bison\ input file, not the one for {\it your\/} grammar (which might not even be complete at this point). However, if all of this fails, you are on your own: drop me a line if you figure out how to fix any bugs you find.

@*1 Terminology.
We now list a few definitions of the concepts used repeatedly in this documentation. Most of this terminology is rather standard. Formal precision is not the goal here, and intuitive explanations are substituted whenever possible.
{%
\def\aterm#1{\item{\sqebullet}{\ttl #1}: \ignorespaces}%
\setbox0=\hbox{\sqebullet\enspace}
\parindent=0pt
\advance\parindent by \wd0
\smallskip
\aterm{bison parser} while, strictly speaking, not a formally defined term, this combination will always stand for one of the parsers generated by this package, designed to parse a subset of the `official' grammar for \bison\ input files. All of these parsers are described later in this documentation.
The term {\it main parser\/} will be used as a substitute in example documentation for the same purpose.

\aterm{driver} a generic but poorly defined concept. In this documentation it is used predominantly to mean both the \Cee\ code and the resulting executable that outputs the \TeX\ macros that contain the parser tables, token values, etc., for the parsers built by the user. It is understood that the \Cee\ code of the `driver' is unchanged and the information about the parser itself is obtained by {\it including\/} the \Cee\ file produced by \bison\ in the `driver' (see the examples supplied with the package).

\aterm{lexer} a synonym for {\it scanner}, a subroutine that performs the {\it lexical analysis\/} phase of the parsing process, i.e.\ groups various characters from the input stream into parser {\it tokens}.

\aterm{namespace} this is an overused bit of terminology meaning a set of names grouped together according to some relatively well defined principle. In a language without a well developed type system (such as \TeX) it is usually accompanied by a specially designed naming scheme. {\it Parser namespaces\/} are commonly used in this documentation to mean a collection of all the data structures describing a parser and its state, including tables, stacks, etc., named by using the `root' name (say, \.{\\yytable}) and adding the name of the parser (for example, \.{[main]}). To support this naming scheme, a number of macros work in unison to create and rename the `data macros' accordingly.

\aterm{symbolic switch} a macro (or an associative array of macros) that lets the \TeX\ parser generated by the package associate {\it symbolic term names\/} with the terms. Unlike the `real' parser, the parser created with this suite requires some extra setup, as explained in the included examples (one can also consult the source for this documentation, which creates but does not use a symbolic switch).

\aterm{symbolic term name} a (relatively new) way to refer to stack values in \bison. In addition to using the `positional' names such as \.{\$}$n$ to refer to term values, one can utilize the new syntax: \.{\$}\.{[}{\it name\/}\.{]}. The `{\it name}' can be assigned by the user or can be the name of the nonterminal or token used in the productions.

\aterm{term} in a narrow sense, an `element' of a grammar. Instead of a long-winded definition, an example, such as \prodstyle{ID}, should suffice. Terms are further classified into {\it terminals\/} (tokens) and {\it nonterminals\/} (which can be intuitively thought of as composite terms).

\aterm{token} in short, an element of a set. Usually encoded as an integer by most parsers, an indivisible {\it term\/} produced for the parser by the scanner. \TeX's scanner uses a more sophisticated token classification, for example, $($character code, character category$)$ pairs, etc.
}

@** Languages, scanners, parsers, and \TeX.
% Or $\ldots$
$$\vbox{\halign to\hsize{\kern-1.5pt\it#\hfil\tabskip0pt plus1fil\cr
Tokens and tables keep macros in check.\cr
Make 'em with \bison, use \.{WEAVE} as a tool.\cr
Add \TeX\ and \CTANGLE, and \Cee\ to the pool.\cr
Reduce 'em with actions, look forward, not back.\cr
Macros, productions, recursion and stack!\cr
\noalign{\vskip2pt}
\omit\hfil\eightpoint Computer generated (most likely)\cr}}
$$
\def\recount#1{${}^{(#1)}$}%
In order to understand the parsing routines in this collection, it would help to gain some familiarity with the internals of the parsers produced by \bison\ for its intended target: \Cee.
A person looking inside a parser delivered by \bison\ would quickly discover that the parsing procedure itself (|yyparse|) occupies a rather small portion of the file. If (s)he were to further reduce the size of the file by removing all the preprocessor directives intended to anticipate every conceivable combination of the operating system, compiler, and \Cee\ dialect, as well as the various reporting and error logging functions, it would become very clear that the most valuable product of \bison's labor is a collection of integer {\it tables\/} that control the actions of the parser routine. Moreover, the routine itself is an extremely concise and well-structured loop composed of |goto|'s and a number of numerical conditionals. If one were to think of a way of accessing arrays and processing conditionals in the language of one's choice, then, once the tables produced by \bison\ have been converted into a form suitable for consumption by the appropriate language engine, the parser implementation becomes straightforward. Or nearly so.

The {\it scanning\/} (or {\it lexing\/}) step of this process---a way to convert a stream of symbols into a stream of integers---also deserves some attention here. There are a number of excellent tools written to automate this step in much the same fashion as \bison\ automates the generation of parsers. One such tool, \flex, though (in the opinion of this author) slightly lacking in simplicity and elegance compared to \bison, was used to implement the lexer for this software suite. Lexing in \TeX\ will be discussed in considerable detail later in this manual.

The language of interest in our case is, of course, \TeX, so our future discussion will revolve around the five elements mentioned above: \recount{1}data structures (mainly arrays and stacks), \recount{2}converting \bison's output into a form suitable for \TeX's consumption, \recount{3}processing raw streams of \TeX's tokens and converting them into streams of parser tokens, \recount{4}the implementation of \bison's |yyparse| in \TeX, and, finally, \recount{5}producing \TeX\ output via {\it syntax-directed translation} (which requires an appropriate abstraction to represent \bison's actions inside \TeX). We shall begin by discussing the parsing process itself.

@*1 Arrays, stacks and the parser.
Let us briefly examine the programming environment offered by \TeX. Designed for typesetting, \TeX's remarkable language provides a layer of macro processing atop a set of commands that produce the output fulfilling its primary mission: delivering page layouts. In The \TeX book, macro {\it expansion\/} is likened to mastication, whereas \TeX's main product, the typographic output, is the result of its `digestion' process. Not everything that goes through \TeX's digestive tract ends up leaving a trace on the final page: a file full of \.{\\relax}'s will produce no output, even though \.{\\relax} is not a macro, and thus would have to be processed by \TeX\ at the lowest level.

It is time to describe the details of defining suitable data structures in \TeX. At first glance, \TeX\ provides rather standard means of organizing and using general memory. At the core of its generic programming environment is an array of \.{\\count}$\,n$ {\it registers\/}, which may be viewed as general purpose integer variables that are randomly accessible by their indices. The integer arithmetic machinery offered by \TeX\ is spartan but is very adequate for the sort of operations a parser would perform: mostly additions and comparisons.
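To make `additions and comparisons' concrete, here is a minimal sketch of a typical parser-style step written with \.{\\count} registers. All of the names and register numbers below are made up for this illustration only (\.{\\yymaxstate}, \.{\\yyerror}, and \.{\\yynewstate} stand in for whatever the parser does next); they are {\it not\/} the ones used by the package.
\begindemo
^\countdef\yyn=25 \countdef\yylen=26 % illustrative register allocations
^\yyn=100 \yylen=2
^\advance\yyn by -\yylen % `pop' \yylen items
^\ifnum\yyn>\yymaxstate  % a comparison decides the next move
^    \let\next=\yyerror
^\else
^    \let\next=\yynewstate
^\fi \next
\nooutput
\enddemo
Nothing beyond \.{\\advance} and \.{\\ifnum} is ever required of the arithmetic.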
Is the \.{\\count} array a good way to store tables in \TeX? Probably not. The first factor is the {\it size\/} of this array: only 256 \.{\\count} registers exist in a standard \TeX\ (the actual number of such registers on a typical machine running \TeX\ is significantly higher but this author is a great believer in standards, and to his knowledge, none of the standardization efforts in the \TeX\ world has resulted in anything even close to the definitive masterpiece that is The \TeX book). The issue of size can be mitigated to some extent by using a number of other similar arrays used by \TeX\ (\.{\\catcode}, \.{\\uccode}, \.{\\dimen}, \.{\\sfcode} and others can be used for this purpose as long as one takes care to restore the `sane' values before control is handed off to \TeX's typesetting mechanisms). If a table has to span several such arrays, however, the complexity of the accessing code would increase significantly, and the issue of size would still haunt the programmer.

The second factor is the use of several registers by \TeX\ for special purposes (in addition, some of these registers can only store a limited range of values). Thus, the first 10 \.{\\count} registers are used by plain \TeX\ for (well, {\it intended\/} for, anyway) the purposes of page accounting: their values would have to be carefully saved and restored before and after each parsing call, respectively. Other registers (\.{\\catcode} in particular) have even more disruptive effects on \TeX's internal mechanisms. While all of this can be managed (after all, using \TeX\ as a mere arithmetic engine, as a parser does, suspends the need for any typographic or other specialized functions controlled by these arrays), the added complexity of using several memory banks simultaneously and the speed penalty caused by the need to store and restore register values make this approach much less attractive.

What other means of storing arrays are provided by \TeX? Essentially, only three options remain: \.{\\token} registers, macros holding whole arrays, and associative arrays accessed through \.{\\csname}$\,\ldots\,$\.{\\endcsname}. In the first two cases, if care is taken to store such arrays in an appropriate form, one can use \TeX's \.{\\ifcase} primitive to access individual elements. The trade-off is the speed of such access: it is {\it linear\/} in the size of the array for most operations, and worse than that for others, such as removing the last item of an array. Using clever ways of organizing such arrays, one can improve the linear access time to $O(\log n)$ by simply modifying the access macros but, at the moment, a straightforward \.{\\ifcase} is used after expanding a list macro or the contents of a \.{\\token}$\,n$ register in an {\it un\/}optimized parser. An {\it optimized\/} parser uses associative arrays.

The array discussion above is just as applicable to {\it stacks\/} (indeed, an array is the most common form of stack implementation). Since stacks pop up and disappear frequently (what else are stacks to do?), list macros are usually used to store them. The optimized parser uses a separate \.{\\count} register to keep track of the top of the stack in the appropriate associative array.
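Before moving on, here is a much simplified sketch of the two storage disciplines just described. All of the names below are made up for the occasion; the actual tables in the package follow the `root name plus parser name' convention described in the terminology section (e.g.\ \.{\\yytable} with \.{[main]}).
\begindemo
^% an `unoptimized' table: one macro holds the whole array, access is linear
^\newcount\yyvalue
^\def\yysmalltable{0\or 12\or 7\or 0\or 42}
^\def\yygetelement#1{\expandafter\yygetel\expandafter{\yysmalltable}{#1}}
^\def\yygetel#1#2{\yyvalue=\ifcase#2 #1\else 0\fi\relax}
^\yygetelement{2} % \yyvalue is now 7
^% an `optimized' table: one control sequence per entry, constant-time access
^\expandafter\def\csname yytable[main]2\endcsname{7}
^\yyvalue=\csname yytable[main]2\endcsname\relax
\nooutput
\enddemo
The first style pays for its compactness with a scan of the list on every access; the second trades one control sequence name per entry for constant-time access.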
Let us now switch our attention to the code that implements the parser and scanner {\it functions\/}. If one has spent some time writing \TeX\ macros of any sophistication (or any macros, for that matter) (s)he must be familiar with the general feeling of frustration and the desire to `just call a function here and move on'. Macros produce {\it tokens\/}, however, and tokens must either expand to nothing or stay and be contributed to your input, or, worse, be out of place and produce an error. One way to sustain a stream of execution with macros is {\it tail recursion\/} (i.e.~always expanding the {\it last token left standing}).

As we have already discussed, \bison's |yyparse()| is a well laid out loop organized as a sequence of |goto|'s (no reason to become religious about structured programming here). This fact, and the following well-known trick, make \Cee\ to \TeX\ translation almost straightforward.

% The macro mess below looks painful but this is the only place such layout is used
% The approach can be easily generalized and put in limbo.sty but it seems
% a bit redundant at this point.

\newcount\piccount
\newdimen\lasthsize

\setbox5=\vtop{
\demomargin=0pt
\let\demoastyle\empty
\begindemo
^label A: ...
\nooutput
^    if**L**Krm(condition)**N
^        goto C;
\nooutput
^label B: ...
\nooutput
^    goto A;
\nooutput
^label C: ...
\nooutput
\enddemo
}
\dp5=\z@@

\setbox3=\vtop{
\demomargin=0pt
\let\demoastyle\empty
\begindemo
^\if**L**Krm(condition)**N
^    \let\next=\labelC
^\else
^    \let\next=\labelAtail
\enddemo
}
\dp3=\z@@

\newdimen\lastdepth

\def\startfitpar{%
    \bgroup
    \lasthsize=\hsize
    \advance\lasthsize-1.5in
    \vsize=\baselineskip
    \topskip=\z@@
    \setbox0\box2 % empty it
    % this sounds good at first but there is no good way to pull the insertions out after the
    % box manipulations that follow;
    % insertions will thus be contributed to whatever page was being worked on when the
    % picture insertions {\it started}; hence, if these happen to start at the very top of the page,
    % any insertion that follows will be contributed to the previous page; we correct this for footnotes
    % below
    % \holdinginserts=1
    \output{%
        \global\setbox2=\vbox{
            \ifvoid2
            \else
                \prevdepth=\dp2
                \unvbox2
            \fi
            \lastdepth=\dp255
            \unvbox255
            % this would be tempting, however, the \eject that follows should disappear
            % in addition, one really should not be playing with page breaking in the middle of
            % such tricky insertions
            % \penalty\outputpenalty
            % \kern-\lastdepth % to make sure \baselineskip is accounted for
        }%
    }\eject
    \output{%
        \setbox0=\vbox{%
            \unvbox255%
        }% \lastbox would almost work ... if not for insertions
        \global\advance\piccount1
        \global\setbox2=\vbox{%
            \prevdepth=\dp2 \unvbox2
            \hbox to\hsize{%
                \ifnum\piccount<15
                    \hbox to1.5in{%
                        \ifnum\piccount=1
                            \ \box5
                        \fi
                        \hfill}%
                \fi
                \box0 \hfill
                \ifnum\piccount=1
                    \box3 \ %
                \fi
                \ifvoid\footins % reinsert footnotes
                \else
                    \insert\footins{\unvbox\footins}%
                \fi
            }%
        }%
    }%
    \parshape=15
    0pt 2.7in
    0pt 2.7in
    0pt 2.7in
    0pt 2.7in
    0pt 2.7in
    0pt 2.7in
    0pt 2.7in
    0pt \lasthsize
    0pt \lasthsize
    0pt \lasthsize
    0pt \lasthsize
    0pt \lasthsize
    0pt \lasthsize
    0pt \lasthsize
    0pt \hsize
}

\def\endfitpar{%
    \par
    \eject
    \egroup
    % see the comment above
    % \holdinginserts=0
    \prevdepth=\dp2
    \unvbox2
}

\startfitpar
\noindent Given the code on the left (where |goto|'s are the only means of branching but can appear inside conditionals), one way to translate it into \TeX\ is to define a set of macros (call them \.{\\labelA}, \.{\\labelAtail} and so forth for clarity) that end in \.{\\next} (a common name for this purpose). Now, \.{\\labelA} will implement the code that comes between \.{label A:} and \.{goto C;}, whereas \.{\\labelAtail} is responsible for the code after \.{goto C;} and before \.{label B:} (provided no other |goto|'s intervene, which can always be arranged).
The conditional which precedes \.{goto C;} can now be written in \TeX\ as presented on the right, where (condition) is an appropriate translation of the corresponding condition in the code being translated (usually, one of `$=$' or `$\not=$'). Further details can be extracted from the \TeX\ code that implements these functions, where the corresponding \Cee\ code is presented alongside the macros that mimic its functionality%
\footnote{Running the risk of overloading the reader with details, the author would like to note that the actual implementation follows a {\it slightly\/} different route in order to avoid any \.{\\let} assignments or changing the meaning of \.{\\next}.}.
This concludes an overview of the general approach. It is time to consider the way characters get consumed on the lower levels of the macro hierarchy and the interaction between the different layers of the package.
\endfitpar

@*1 \TeX\ into tokens.
Thus far we have covered the ideas behind items \recount{1} and \recount{4} on our list. It is time to discuss the lowest level of processing done by these macros: converting \TeX's tokens into the tokens consumed by the parser, i.e.\ part \recount{3} of the plan. Perhaps it would be most appropriate to begin by defining the term {\it token}. As commonly defined, a token is simply an element of a set. Depending on how much structure the said set possesses, a token can be represented by an integer or a more complicated data structure. In the discussion below, we will be dealing with two kinds of tokens: the tokens consumed by the parsers and the \TeX\ tokens seen by the input routines. The latter play the role of {\it characters\/} that combine to become the former. \bison's internal representation for its tokens is non-negative integers, so this is what a scanner must produce.

\TeX's tokens are a good deal more sophisticated: they can be either pairs $(c_{\rm ch}, c_{\rm cat})$, where $c_{\rm ch}$ is the character code and $c_{\rm cat}$ is \TeX's category code ($1$ and $2$ for group characters, $5$ for end of line, etc.), or {\it control sequences\/}, such as \.{\\relax}. Some of these tokens (control sequences and {\it active}, i.e.~category~13 characters) can have complicated internal structure (expansion). The situation is further complicated by \TeX's \.{\\let} facility, which can create `character-like' control sequences, and the lack of conditionals to distinguish them from the `real' characters. Finally, not all pairs can appear as part of the input (say, there is no $(n, 0)$ token for any $n$, in the terminology above).

The scanner expects to see {\it characters} in its input, which are represented by their {\sc ASCII} codes, i.e.~integers between $0$ and $255$ (actually, a more general notion of Unicode characters is supported but we will not discuss it further). Before character codes appear as the input to the scanner, however, and make its integer table-driven mechanism `tick', a lot of work must be done to collect and process the stream of \TeX\ tokens produced after \CWEAVE\ is done with your input. This work becomes further complicated when the typesetting routines that interpret the parser's output must sneak outside of the parsed stream of text (which is structured by the parser) and insert the original \TeX\ code produced by \CWEAVE\ into the page.
\splint\ comes with a customizable input routine of moderate complexity (\.{\\yyinput}) that classifies all \TeX\ tokens into seven categories: `normal' spaces (i.e.~category~10 tokens, skipped by \TeX's parameter scanning mechanism), `explicit' spaces (this includes control sequences \.{\\let} to \.{\ }, as well as \.{\\\ }), groups ({\it avoid\/} using \.{\\bgroup} and \.{\\egroup} in your input, but `real' \.{\{}$\ldots$\.{\}} groups are fine), active characters, normal characters (of all character categories that can appear in \TeX\ input, including \.{\$}, \.{\^}, \.{\#}, \.{a}--\.{Z}, etc.), single letter control sequences, and multi-letter control sequences. Each of these categories can be processed separately to `fine-tune' the input routine to the problem at hand. The input routine is not very fast; instead, flexibility was the main goal. Therefore, if speed is desirable, a customized input routine is a great place to start. As an example, a minimalistic \.{\\yyinputtrivial} macro is included.

When \.{\\yyinput} `returns' by calling \.{\\yyreturn} (which is a macro you design), your lexing routines have access to three registers: \.{\\yycp@@}, which holds the character value of the character just consumed by \.{\\yyinput}, \.{\\yybyte}, which most of the time holds the token just removed from the input, and \.{\\yybytepure}, which (again, with very few exceptions) holds a `normalized' version of the read character (i.e.~a character of the same character code as \.{\\yycp@@}, and category~11 (to be even more precise (and to use nested parentheses), `normalized' characters have the same category code as the current category code of \.{@@})).

Most of the time it is the character code one needs (say, in the case of \.{\\\{}, \.{\\\}}, \.{\\\&} and so on) but, under some circumstances, the distinction is important (outside of \.{\\vb\{}$\ldots$\.{\}}, the sequence \.{\\1} has nothing to do with the digit `\.{1}'). This mechanism makes it easy to examine the consumed token. It also forms the foundation of the `hidden context' passing mechanism described later.

The remainder of this section discusses the internals of \.{\\yyinput} and some of the design trade-offs one has to make while working on processing general \TeX\ token streams. It is typeset in `small print' and can be skipped if desired.
\smallskip
\begingroup
\abovedisplayskip=5pt%
\abovedisplayshortskip=2pt%
\belowdisplayskip=5pt%
\belowdisplayshortskip=2pt%
\fnotesstart=1
\fnotesspan=2
\noofcolumns=2
\icgap=1em%
\eightpoint
\linecount=73
\setmcparams
\def\.#1{{\chardef\\=`\\\chardef\&=`\&\tt #1}}%
\dsskip=0pt%
\begindoublecols
To examine every token in its path (including spaces that are easy to skip), the input routine uses one of the two well-known {\sc \TeX}nologies: \.{\\futurelet\\next\\examinenext} or the equally effective \hbox{\.{\\afterassignment\\next\\let={\tt\char"20}}}. Recursively inserting one of these sequences, \.{\\yyinput} can go through any list of tokens, as long as it knows where to stop (i.e.~return an end-of-file character). The signal to stop is provided by the \.{\\yyeof} primitive, which should not appear in any `ordinary' text presented for parsing, other than for the purpose of providing such a stop signal.
Even the dependence on \.{\\yyeof} can be eliminated if one is willing to invest the time in writing macros that juggle \TeX's \.{\\token} registers and limit oneself to input from such registers (which is, aside from an obvious efficiency hit, a strain on \TeX's memory, as you have to store multiple (3 in the general case) copies of your input to be able to back up when the lexer makes a wrong choice). There does not seem to be a way of doing this unless the text has been stored in a \.{\\token} register first (or the whole input has been stored as a {\it parameter\/} for the appropriate macro: this scheme is remarkably powerful and leads to {\it expandable\/} versions of very complicated macros, although the amount of effort required to write such macros grows at a frightening rate). All of these are non-issues for the text inside \.{\\vb\{}$\ldots$\.{\}}, and the care that \.{\\yyinput} takes in processing characters inside such lists is overkill. In a more `hostile' environment (such as the one encountered by the now obsolete \.{\\Tex} macros), this extra attention to detail pays off in the form of a more robust input mechanism.

One subtlety deserves a special mention here, as it can be important to the designer of `higher-level' scanning macros. Two types of tokens are extremely difficult to deal with whenever \TeX's own lexing mechanisms are used: (implicit) spaces and, even more so, braces. We will only discuss braces here; however, almost everything that follows applies equally well to spaces (category 10 tokens, to be precise), with a few simplifications (or complications, in a couple of places). To understand the difficulty, let's consider one of the approaches above:
$$
\.{\\futurelet\\next\\examinenext}.
$$
The macro \.{\\examinenext} usually looks at \.{\\next} and inserts another macro (usually also called \.{\\next}) at the very end of its expansion list. This macro takes one parameter, to consume the next token. This mechanism works flawlessly, until the lexer encounters a \.{\{}br\.{,}sp\.{\}}ace. The \.{\\next} sequence, seen by \.{\\examinenext}, contains a lot of information about the brace ahead: it reveals its category code (left brace, so $1$) and its character code (in case there was, say, a \.{\\catcode`\\[=1{\tt\char`\ }} earlier), but not whether it is a `real' brace (i.e.\ a character \.{\{}$_1$) or an implicit one (a \.{\\bgroup}). There is no way to find that out until the control sequence `launched' by \.{\\examinenext} sees the token as a parameter.

If the next token is a `real' brace, however, \.{\\examinenext}'s successor will never see the token itself: the braces are stripped by \TeX's scanning mechanism. Even if it finds a \.{\\bgroup} as the parameter, there is no guarantee that the actual input was not \.{\{\\bgroup\}}. One way to handle this is by using \.{\\string} ahead of any consumption of the next token. If, prior to expanding \.{\\string}, care has been taken to set the \.{\\escapechar} appropriately (remember, we know the character code in advance), then, as soon as one sees a character with \.{\\escapechar}'s character code, (s)he knows that an implicit brace has just been seen. One added complication to all this is that a very determined programmer can insert an {\it active\/} character (using, say, the \.{\\uccode} mechanism) that has the {\it same\/} character code as the {\it brace\/} token that it has been \.{\\let} to!
Setting this possibility aside, the \.{\\string} mechanism (or its cousin, \.{\\meaning}) is not perfect: both produce a sequence of category 12 and 10 tokens. If it is indeed a brace character that we just saw, we can consume the next token and move on, but what if this was a control sequence? After all, just as easily as \.{\\string} makes a sequence into characters, a \.{\\csname}$\,\ldots\,$\.{\\endcsname} pair will make any sequence of characters into a control sequence. Huh~$\ldots$

What we need is a backup mechanism: if one has a copy of the token sequence ahead, one can use \.{\\string} to see if it is a real brace first, and if it is, consume it and move on (the active character case can be handled as the implicit case below, with one extra backup to count how many tokens have been consumed). At this point one has to {\it reinsert\/} the brace in case, at some point, a future `back up' requires that the rest of the tokens are removed from the output (to avoid `\.{Too many \}'s}' complaints from \TeX). This can be done by using the \.{\\iftrue\{\\else\}\\fi} trick but, of course, some bookkeeping is needed to keep track of how far inside the brace groups we are.

If it is an implicit brace, more work is needed: read all the characters that \.{\\string} produced (and maybe more), then remember the number of characters consumed. Remove the rest of the input using the method described above and restart the scanning from the same point, knowing that the next token can be scanned as a parameter.

Another strategy is to design a general enough macro that counts tokens in a token register and simply recount the tokens after every brace has been consumed. Either way, it takes a lot of work. If anyone would like to pursue the counting strategy, simple counting macros are provided in \.{/examples/count/count.sty}. The macros in this example supply a very general counting mechanism that does not depend on \.{\\yyeof} (or {\it any\/} other token) being `special' and can count the tokens in any token register, as long as none of those tokens is an \.{\\outer} control sequence. In other words, if the macro is used immediately after the assignment to the token register, it should always produce a correct count.

Needless to say, if such a general mechanism is desired, one has to look elsewhere. The added complications of treating spaces (\TeX\ tends to ignore them most of the time) make this a torturous exercise in \TeX's macro wizardry. The included \.{\\yyinput} has two ways of dealing with braces: strip them or view the whole group as a token. Pick one or write a different \.{\\yyinput}. Spaces, implicit or explicit, are reported as a specially selected character code and consumed with a likeness of
$$
\hbox{\.{\\afterassignment\\moveon\\let\\next={\tt\char`\ }}}.
$$
Now that a steady stream of character codes is arriving at \.{\\yylex} after \.{\\yyreturn}, the job of converting it into numerical tokens is performed by the {\it scanner} (or {\it lexer\/}, or {\it tokenizer\/}, or even {\it tokener}), discussed in the next section.
\enddoublecols
\endgroup

@*1 Lexing in \TeX.
In a typical system that uses a parser to process text, the parsing pass is usually split into several stages: the raw input, the lexical analysis (or simply {\it lexing}), and the parsing proper. The {\it lexing\/} (also called {\it scanning}; we use these terms interchangeably) clumps various sequences of characters into {\it tokens\/} to facilitate the parsing stage.
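For a concrete illustration (in terms of the \bison\ grammar scanner included with this package and the token names from \.{bo.tok} shown earlier), the ten characters below are clumped into just two parser tokens, \.{PERCENT\_TOKEN} followed by \.{ID}; the intervening space produces no token at all.
\begindemo
^%token foo
\nooutput
\enddemo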
The reasons for this particular hierarchy are largely pragmatic and are partially historic (there is no reason that {\it parsing\/} cannot be done in multiple phases, as well, although it usually isn't). If one remembers a few basic facts from formal language theory, it becomes obvious that a lexer, which parses {\it regular\/} languages, can (theoretically) be replaced by an {\sc LALR} parser, which parses {\it context-free\/} ones (or some subset thereof, which is still a superset of all regular languages). A common justification given for creating specialized lexers is efficiency and speed. The reality is somewhat more subtle. While we do care about the efficiency of parsing in \TeX, having a specialized scanner is important for a number of different reasons.

The real advantage of having a dedicated scanner is the ease with which it can match incomplete inputs and back up. A parser can, of course, {\it recognize\/} any valid input that is also acceptable to a lexer, as well as {\it reject\/} any input that does not form a valid token. Between those two extremes, however, lies a whole realm of options that a traditional parser will have great difficulty exploring. Thus, to mention just one example, it is relatively easy to set up a DFA\footnote{Which stands for Deterministic Finite Automaton, a common (and mathematically unique) way of implementing a scanner for regular languages. Incidentally, {\sc LALR} mentioned above is short for Look Ahead Left to Right.} so that the {\it longest\/} matching input is accepted. The only straightforward way to do this with a traditional parser is to parse longer and longer inputs again and again. While this process can be optimized to a certain degree, the fact that a parser has a {\it stack\/} to maintain limits its ability to back up.

As an aside, the mechanism by which \CWEB\ assembles its `scraps' into chunks of recognized code is essentially iterative lexing, very similar to what a human does to make sense of complicated texts. Instead of trying to match the longest running piece of text, \CWEB\ simply looks for patterns to combine inputs into larger chunks, which can later be further combined. Note that this is not quite the same as the approach taken by, say, {\sc GLR} parsers, where the parser must match the {\it whole\/} input or declare a failure. Where a \CWEB-type parser may settle for the first available match (or the longest available), a {\sc GLR} parser must try {\it all\/} possible matches or use an algorithm to reject the majority of the ones that are bound to fail in the end. This `\CWEB\ way' is also different from a traditional `strict' {\sc LR} parser/scanner approach and certainly deserves serious consideration when the text to be parsed possesses some rigid structure but the parser is only allowed to process it one small fragment at a time.

Returning to the present macro suite, the lexer produced by \flex\ uses integer tables similar to those employed by \bison, so the usual {\sc\TeX}niques used in implementing \.{\\yyparse} are fully applicable to \.{\\yylex}. An additional advantage provided by having a \flex\ scanner implemented as part of the suite is the availability of the original \bison\ scanner written in \Cee\ for use by the macro package. This said, the code generated by \flex\ contains a few idiosyncrasies not present in the \bison\ output. These `quirks' mostly involve handling of end of input and error conditions.
A quick glance at the \.{\\yylex} implementation will reveal a rather extensive collection of macros designed to deal with end-of-input actions. Another difficulty one has to face in translating \flex\ output into \TeX\ is the somewhat unstructured namespace delivered in the final output (this is partially due to the \POSIX\ standard that \flex\ strives to follow). One consequence of this `messy' approach is that the writer of a \flex\ scanner targeted to \TeX\ has to declare \flex\ `states' (more properly called {\it subautomata}) twice: first for the benefit of \flex\ itself, and then again, in the {\it \Cee\ preamble\/} portion of the code, to output the states to be used by the action code in the lexer. The \.{Define\_State($\ldots$)} macro is provided for this purpose. This macro can be used explicitly by the programmer or be inserted by a specially designed parser. Using \CWEB\ helps to keep these declarations together.

The `hand-off' from the scanner to the parser is implemented through a pair of registers: \.{\\yylval}, a token register containing the value of the returned token, and \.{\\yychar}, a \.{\\count} register that contains the numerical value of the token to be returned.

Upon matching a token, the scanner passes one crucial piece of information to the user: the character sequence representing the token just matched (\.{\\yytext}). This is not the whole story, though. There are three more token sequences that are made available to the parser writer whenever a token is matched. The first of these is simply a `normalized' version of \.{\\yytext} (called \.{\\yytextpure}). In most cases it is a sequence of \TeX\ tokens with the same character codes as the ones in \.{\\yytext} but with their category codes set to 11. In cases when the tokens in \.{\\yytext} are {\it not} $(c_{\rm ch}, c_{\rm cat})$ pairs, a few simple conventions are followed, some of which will be explained below. This sequence is provided merely for convenience and its typical use is to generate a key for an associative array.
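In other words, a scanner action might (schematically) hand a matched identifier off to the parser as shown below. This is only a sketch: \.{\\idtokennumber} is a stand-in for whatever numerical code the generated token tables assign to the token in question, and the exact return convention is described together with the lexer macros.
\begindemo
^\yylval\expandafter{\the\yytextpure}% the value the parser will see
^\yychar=\idtokennumber % the numerical token code the parser acts upon
\nooutput
\enddemo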
The other two sequences are special `stream pointers' that provide access to the extended scanner mechanism in order to implement the passing of `formatting hints' to the parser without introducing any changes to the original grammar. As the mechanism itself and the motivation behind it are somewhat subtle, let me spend a few moments discussing the range of formatting options desirable in a generic pretty-printer.

Unlike the strict parsers employed by most compilers, a parser designed for pretty printing cannot afford to be too picky about the structure of its input (\cite[Go] calls such parsers `loose'). To provide a simple illustration, an isolated identifier, such as `\.{lg\_integer}', can be a type name, a variable name, or a structure tag (in a language like \Cee\ for example). If one expects the pretty printer to typeset this identifier in a correct style, some context must be supplied, as well. There are several strategies a pretty printer can employ to get a hold of the necessary context. Perhaps the simplest way to handle this, and to reduce the complexity of the pretty printing algorithm, is to insist on the user providing enough context for the parser to do its job. For short examples like the one above, this is an acceptable strategy. Unfortunately, it is easy to come up with longer snippets of grammatically deficient text that a pretty printer should be expected to handle.

Some pretty printers, such as the one employed by \CWEB\ and its ilk (the original \.{WEB}, \.{FWEB}), use a very flexible bottom-up technique that tries to make sense of as large a portion of the text as it can before outputting the result (see also \cite[Wo], which implements a similar algorithm in \LaTeX). The expectation is that this algorithm will handle the majority (about 90\%? it would be interesting to carry out a study in the spirit of the ones discussed in \cite[Jo] to find out) of the cases, with the remaining few left for the author to correct. The question is, how can such a correction be applied?

\CWEB\ itself provides two rather different mechanisms for handling these exceptions. The first uses direct typesetting commands (for example, \.{@@/} and \.{@@\#} for canceling and introducing a line break, resp.) to change the typographic output. The second (preferred) way is to supply {\it hidden context\/} to the pretty-printer. Two commands, \.{@@;} and \.{@@[}$\ldots$\.{@@]}, are used for this purpose. The former introduces a `virtual semicolon' that acts in every way like a real one, except it is not typeset (it is not output in the source file generated by \CTANGLE, either, but this has nothing to do with pretty printing, so I will not mention \CTANGLE\ anymore). For instance, from the parser's point of view, if the preceding text was parsed as a `scrap' of type {\it exp}, the addition of \.{@@;} will make it into a `scrap' of type {\it stmt\/} in \CWEB's parlance. The second construct (\.{@@[}$\ldots$\.{@@]}) is used to create an {\it exp\/} scrap out of whatever happens to be inside the brackets.

This is a powerful tool at the author's disposal. Stylistically, this is the right way to handle exceptions, as it forces the writer to emphasize the {\it logical\/} structure of the formal text. If the pretty printing style is changed extensively later, the texts with such hidden contexts should be able to survive intact in the final document (as an example, using a break after every statement in \Cee\ may no longer be considered appropriate, so any forced break introduced to support this convention would now have to be removed, whereas \.{@@;}'s would simply quietly disappear into the background).

The same hidden-context idea has another important advantage: with careful grammar fragmenting (facilitated by \CWEB's or any other literate programming tool's `hypertext' structure) and a more diverse hidden context (or even arbitrary hidden text) mechanism, it is possible to use a strict parser to parse incomplete language fragments. For example, the productions that are needed to parse \Cee's expressions form a complete subset of the grammar. If the grammar's `start' symbol is changed to {\it expression\/} (instead of the {\it translation-unit\/} as it is in the full \Cee\ grammar), a variety of incomplete \Cee\ fragments can now be parsed and pretty-printed. Whenever such granularity is still too `coarse', carefully supplied hidden context will give the pretty printer enough information to adequately process each fragment. A number of such {\it sub}-parsers can be tried on each fragment (this may sound computationally expensive; however, in practice, a carefully chosen hierarchy of parsers will finish the job rather quickly) until a correct parser produces the desired output (this approach is similar to, although not quite the same as, the one employed by {\it General LR parsers}).
This somewhat lengthy discussion brings us to the question directly related to the tools described in this article: how does one provide typographical hints or hidden context to the parser?

One obvious solution is to build such hints directly into the grammar. The parser designer can, for instance, add new tokens (say, \.{BREAK\_LINE}) to the grammar and extend the production set to incorporate the new additions. The risk of introducing new conflicts into the grammar is low (although not entirely non-existent, due to the lookahead limitations of LR(1) grammars) and the changes required are easy, although very tedious, to incorporate. In addition to being labor intensive, this solution has two other significant shortcomings: it alters the original grammar and hides its logical structure, and it `bakes' the pretty-printing conventions into the language structure (making the `hidden' context much less `stealthy'). It does avoid the `synchronicity problem' mentioned below, however.

A marginally better technique is to introduce a new regular expression recognizable by the scanner, which will then do all the necessary bookkeeping upon matching the sequence. All the difficulties with altering the grammar mentioned above apply in this case, as well, only at the `lexical analysis level'. At a minimum, the set of tokens matched by the scanner would have to be changed.

A much better approach involves inserting the hints at the input stage and passing this information to the scanner and the parser as part of the token `values'. The hints themselves can masquerade as characters ignored by the scanner (white space, for example) and be preprocessed by a specially designed input routine. The scanner then simply passes these values on to the parser. This makes the hints, in effect, invisible.

The difficulty lies in synchronizing the token production with the parser. This subtle complication is very familiar to anyone who has designed \TeX's output routines: the parser and the lexer are not synchronous, in the sense that the scanner might be reading several (in the case of the general LR$(n)$ parsers) tokens ahead of the parser before deciding how to proceed (the same way \TeX\ can consume a whole paragraph's worth of text before exercising its page builder). If we simple-mindedly let the scanner return every hint it has encountered so far, we may end up feeding the parser the hints meant for the token that appears {\it after\/} the fragment the parser is currently working on. In other words, when the scanner `backs up', it must correctly back up the hints as well.

This is exactly what the scanner produced by the tools in this package does: along with the main stream of tokens meant for the parser, it produces two hidden streams (called the \.{\\format} stream and the \.{\\stash} stream) and provides the parser with two strings (currently only strings of digits are used, although arbitrary sequences of \TeX\ tokens can be used as pointers) with the promise that {\it all the `hints' between the beginning of the corresponding stream and the point labeled by the current stream pointer appeared among the characters up to and, possibly, including the ones matched as the current token}. The macros to extract the relevant parts of the streams (\.{\\yyreadfifo} and its cousins) are provided for the convenience of the parser designer. The interested reader can consult the input routine macros for the details of the internal representation of the streams.
In the interest of full disclosure, let me point out that this simple technique introduces a significant strain on \TeX's computational resources: the lowest-level macros, the ones that handle character input and are thus executed (sometimes multiple times) for {\it every\/} character in the input stream, are rather complicated and, therefore, slow. Whenever the use of such streams is not desired, a simpler input routine can be written to speed up the process (see \.{\\yyinputtrivial} for a working example of such a macro).

Finally, while probably not directly related to the present discussion, this approach has one more interesting feature: after the parser is finished, the parser output and the streams exist `statically', fully available for any last-minute preprocessing or for debugging purposes, if necessary. Under most circumstances, the parser output is `executed' and the macros in the output are the ones reading the various streams using the pointers supplied at the parsing stage (at least, this is the case for all the parsers supplied with the package).

@*1 Inside semantic actions: switch statements and `functions' in \TeX.
Now you have a lexer for your input, and a grammar ready to be put into action (we will talk about actions a bit later). It is time to discuss how the tables produced by \bison\ get converted into \TeX\ {\it macros\/} that drive the parser in {\it \TeX}.

The tables that drive the \bison\ input parsers are collected in the various \.{\{b,d,f,g,n\}yytab.tex} files and in \.{small\_tab.tex}. Each one of these files contains the tables that implement a specific parser used during different stages of processing. Their exact function is well explained in the source file produced by \bison\ ({\it how} this is done is explained elsewhere; see \cite[Ah] for a good reference). It would suffice to mention here that there are three types of tables in this file: \recount{1}numerical tables such as \.{\\yytable} and \.{\\yycheck} (both are either \TeX's token registers in an unoptimized parser or associative arrays in an optimized version of such, as discussed below), \recount{2}a string array \.{\\yytname}, and \recount{3}an action switch. The action switch is what gets called when the parser does a {\it reduction}. It is easy to notice that the numerical tables come `premade' whereas the string array consisting of token names is difficult to recognize. This is intentional: this form of initialization is designed to allow the widest range of characters to appear inside names. The macros that do this reside in \.{yymisc.sty}. The generated table files also contain constant and token declarations used by the parser.

The description of the process used to output \bison\ tables in an appropriate form continues in the section about \locallink{bsfile}outputting \TeX\ tables\endlink; we pick it up here with the description of the syntax-directed translation and the actions. The line
$$
\.{\\switchon\\next\\in\\currentswitch}
$$
is responsible for calling an appropriate action in the current switch, as is easy to infer. A {\it switch\/} is also a macro that consists of strings of \TeX\ tokens intermixed with \TeX\ macros inside braces. Each group of macros gets executed whenever the character or the group of characters in \.{\\next} matches a substring preceding the braced group. If there are two different substrings that match, only the earliest group of macros gets expanded.
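Schematically (and this is only a sketch: the actual switches are prepared by the macros described next and carry some additional bookkeeping, so their exact layout differs), a switch and its invocation might look like the fragment below, with all of the names made up for the occasion (\.{\\raw} is explained momentarily):
\begindemo
^\def\yysampleswitch{%
^    0123456789 {\yydodigit}%      any decimal digit selects this action
^    ;          {\yydosemicolon}%  a single character case
^    \raw letter\raw {\yydoletter}% a named character class
^}
^\switchon\next\in\yysampleswitch
\nooutput
\enddemo
A token that matches none of the cases is handled by the current \.{\\default} action, as explained next.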
Before a state is used, a special control sequence, \.{\\setspecialcharsfrom\\switchname}, can be used to put the \TeX\ tokens in a form suitable for consumption by the \.{\\switchon} macros. The most important step it performs is to {\it turn every token in the list into a character with the same character code and category 12\/}. Thus \.{\\\{} becomes \.{\{}$_{12}$. There are other ways of inserting tokens into a state: enclosing a token or a string of tokens in \.{\\raw...\\raw} adds it to the state macro unchanged. If you have a sequence of category 12 characters you want to add to the state, put it after \.{\\classexpand} (such sequences are usually prepared by the \.{\\setspecialchars} macro that uses the token tables generated by \bison\ from your grammar).
You can give a case a readable label (say, \.{brackets}) and enclose this label in \.{\\raw}$\ldots$\.{\\raw}. A word of caution: an `a' inside of \.{\\raw}$\ldots$\.{\\raw} (which is most likely an \.{a}$_{11}$ unless you played with category codes before loading the \.{\\switchon} macros) and the one outside it are two different characters, as one is no longer a letter (category 11) in the eyes of \TeX\ whereas the other one still is. For this reason one should not use characters other than letters in h\.{\{}is\.{,}er\.{\}} state names: the way a state picks an action does not distinguish between, say, a `\.{(}' in `\.{(letter)}' and a stand-alone `\.{(}' and may pick an action that you did not intend. This applies even if `\.{(}' is not among the characters explicitly inserted in the state macro: if an action for a given character is not found in the state macro, the \.{\\switchon} macro will insert the current \.{\\default} action instead, which most often you would want to be \.{\\yylex} or \.{\\yyinput} (i.e.\ skip this token). If `\.{(}' or `\.{)}' matches the braced group that follows `\.{(letter)}', chaos may ensue (most likely \TeX\ will keep reading past the \.{\\end} or \.{\\yyeof} that should have terminated the input). Make the names of character categories as unique as possible: the \.{\\switchon} macro is simply a string matching mechanism, with the added distinction between characters of different categories.
Finally, the construct \.{\\statecomment}{\it anything\/}\.{\\statecomment} allows you to insert comments in the state sequence (note that the state {\it name\/} is put at the beginning of the state macro (by \.{\\setspecialcharsfrom}) in the form of a special control sequence that expands to nothing: this elaborate scheme is needed because another control sequence can be \.{\\let} to the state macro which makes the debugging information difficult to decipher). The debugging mode for the lexer implemented with these macros is activated by \.{\\tracedfatrue}.
The functionality of the \.{\\switchon} macros (for `historical' reasons, one can also use \.{\\action} as a synonym) has been implemented in a number of other macro packages (see \cite[Fi] that discusses the well-known and widely used \.{\\CASE} and \.{\\FIND} macros). The macros in this collection have the additional property that the only assignments that persist after the \.{\\switchon} completes are the ones performed by the user code inside the selected case.
This last property of the switch macros is implemented using another mechanism that is part of this macro suite: the `subroutine-like' macros, \.{\\begingroup}$\ldots$\.{\\tokreturn}. For examples, the interested reader can take a look at the macros included with the package.
A typical use is \.{\\begingroup}$\ldots$\.{\\tokreturn\{\}\{\\toks0 \}\{\}} which will preserve all the changes to \.{\\toks0} and have no other side effects (if, for example, in typical \TeX\ vernacular, \.{\\next} is used to implement tail recursion inside the group, after the \.{\\tokreturn}, \.{\\next} will still have the same value it had before the group was entered). This functionality comes at the expense of some computational efficiency.
This covers most of the routine computations inside semantic actions; all that is left is a way to `tap' into the stack automaton built by \bison\ using an interface similar to the special \.{\$$n$} variables utilized by the `genuine' \bison\ parsers (i.e.\ written in \Cee\ or any other target language supported by \bison).
This role is played by the several varieties of \.{\\yy$\,p$} command sequences (for the sake of completeness, $p$ stands for one of \.{($n$)}, \.{[{\rm name}]}, \.{]{\rm name}[} or $n$, where $n$ is a string of digits, and a `name' is any name acceptable as a symbolic name for a term in \bison). Instead of going into the minutiae of various flavors of \.{\\yy}-macros, let me just mention that one can get by with only two `idioms' and still be able to write parsers of arbitrary sophistication: \.{\\yy($n$)} can be treated as a token register containing the value of the $n$-th term of the rule's right hand side, $n>0$. The left hand side of a production is accessed through \.{\\yyval}. A convenient shortcut is \.{\\yy0\{{\rm \TeX\space material}\}} which will expand the `\TeX\ material' inside the braces. Thus, a simple way to concatenate the values of the first two production terms is \.{\\yy0\{\\the\\yy(1)\\the\\yy(2)\}}. The included \bison\ parser can also be used to provide support for `symbolic names', analogous to \bison's \.{{\$}[{\rm name}]}, but this requires a bit more effort on the user's part to initialize such support. It could make the parser more readable and maintainable, however.
Naturally, a parser writer may need a number of other data abstractions to complete the task. Since these are highly dependent on the nature of the processing the parser is supposed to provide, we refer the interested reader to the parsers included in the package as a source of examples of such specialized data structures.
One last remark about the parser operation is worth making here: the parser automaton itself does not make any \.{\\global} assignments. This (along with some careful semantic action writing) can be used to `localize' the effects of the parser operation and, most importantly, to create `reentrant' parsers that can, e.g.\ call {\it themselves\/} recursively.

@*1 `Optimization'.
By default, the generated parser and scanner keep all of their tables in separate token registers. Each stack is kept in a single macro (this description is further complicated by the support for parser {\it namespaces\/} that exists even for unoptimized parsers, but this subtlety will not be mentioned again---see the macros in the package for further details). Thus, every time a table is accessed, it has to be expanded, making the table access latency linear in {\it the size of the table}. The same holds for stacks and the action `switches', of course. While keeping the parser tables (which are immutable) in token registers does not have any better rationale than saving the control sequence memory (the most abundant memory in \TeX), this way of storing {\it stacks} does have an advantage when multiple parsers get to play simultaneously.
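For instance (the stack macro names below are invented for this illustration; the actual control sequences are internal to the package), putting one parser's stack aside and installing another's is just a pair of \.{\\let}s:
\begindemo
^\let\savedstack\yyparserstack
^\let\yyparserstack\otherparserstack
\nooutput
\enddemo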
All one has to do to switch from one parser to another is to save the state by renaming the stack control sequences accordingly. When the parser and scanner are `optimized', all these control sequences are `spread over' appropriate associative arrays. One caveat to be aware of: the action switches for both the parser and the scanner have to be output differently (a command line option is used to control this) for optimized and unoptimized parsers. While it is certainly possible to optimize only some of the parsers (if your document uses more than one) or even only some {\it parts\/} of a given parser (or scanner), the details of how to do this are rather technical and are left for the reader to discover by reading the examples supplied with the package. At least at the beginning it is easier to simply set the highest optimization level and use it consistently throughout the document.

@*1 {\it \TeX\/} with a different {\sl slant} or do you C an escape?.
%\def\texnspace{other}
Some \TeX\ productions below probably look like alien script. The authors of \cite[Er] cite a number of reasons why pretty printing of \TeX\ in general is a nearly impossible task. The macros included with the package follow a very straightforward strategy and do not try to be very comprehensive. Instead, the burden of presenting \TeX\ code in a readable form is placed on the programmer. Appropriate hints can be supplied by means of indenting the code, using assignments ($=$) where appropriate, etc. If you would rather look at straight \TeX\ instead, the line \.{\\def\\texnspace\{other\}} at the beginning of this section can be uncommented and
|TeX_( "/noexpand/inmath{/yy0{/yy1{}}}" );| becomes
\def\texnspace{other}%
|TeX_( "/noexpand/inmath{/yy0{/yy1{}}}" );|.
\def\texnspace{texline}%
There is, however, more to this story. A look at the actual file will reveal that the line above was typed as
$$
\.{TeX\_( "/noexpand/inmath\{/yy0\{/yy1\{\}\}\}" );}
$$
The `escape character' is leaning the other way! The lore of \TeX\ is uncompromising: `\.{\\}' is {\it the\/} escape character. What is the reason to avoid it in this case?
The mystery is not very deep: `\.{/}' was chosen as an escape character by the parser macros (a quick glance at \.{?yytab.tex} will reveal as much). There is, of course, nothing sacred (other than tradition, which this author is trying his hardest to follow) about what character code the escape character has. The reason for this choice is straightforward: `\.{\\}' is a special character in \Cee, as well (also an `escape', in fact). The line \.{TeX\_( "..." );} is a {\it macro-call\/} but $\ldots$ in \Cee. This function simply prints out (almost `as-is') the line in parentheses. An attempt at \.{TeX\_( "\\noexpand" );} would result in
\numberlinestrue
\begindemo
^
^oexpand
\enddemo
\numberlinesfalse
Other escape combinations\footnote{Here is a full list of {\it defined\/} escaped characters in \Cee: \.{\\a}, \.{\\b}, \.{\\f}, \.{\\n}, \.{\\r}, \.{\\t}, \.{\\v}, \.{\\}{$[$\it octal digit$]$}, \.{\\'}, \.{\\"}, \.{\\?}, \.{\\\\}, \.{\\x}, \.{\\u}, \.{\\U}. Note that the last three combinations must be followed by a specific string of characters to appear in the input without generating errors.} are even worse: most are simply undefined. If anyone feels trapped without an escape, however, the same line can be typed as
$$
\.{TeX\_( "\\\\noexpand\\\\inmath\{\\\\yy0\{\\\\yy1\{\}\}\}" );}
$$
Twice the escape! If one were to look closer at the code, another oddity stands out: there are no \.{\$}'s anywhere in sight.
The big money, \.{\$}, is a beloved character in \bison. It is used in action code to reference the values of the appropriate terms in a production. If mathematics pays your bills, use \.{\\inmath} instead.

@*1 The \bison\ parser(s).
Let's take a short break for a broad overview of the input file. The basic structure is that of an ordinary \bison\ file that produces plain \Cee\ output. The \Cee\ actions, however, are programmed to output \TeX.
@s TeX_ TeX

@(bg.yy@>=
@G Switch to generic mode.
%{@> @ @=%}
@g
@@;
@G
%union {@> @ @=}
%{@> @ @=%}
@g
@@;
@G
%%
@g
@@;
@@;
@@;
@G
%%
@g

@ Bootstrap mode is next. The reason for a separate bootstrap parser is to collect the minimal amount of information to `spool up' the `production' parsers. To understand the mechanics and the reasons behind it, consider what happens following a declaration such as \.{\%token TOKEN "token"} (or, as it would be typeset by the macros in this package, `\prodstyle{\%token} \.{TOKEN} \.{token}'; see the index entries for more details)%
\idxinline{TOKEN}\idxinline{token}.
The two names for the same token are treated very differently. \.{TOKEN} becomes an |enum| constant in the \Cee\ parser generated by \bison. Even when that parser becomes part of the `driver' program that outputs the \TeX\ version of the parser tables, there is no easy way to output the {\it names\/} of the appropriate |enum| constants. The other name (\.{"token"}) becomes an entry in the |yytname| array. These names can be output by either the `driver' or \TeX\ itself after the \.{\\yytname} table has been input. The scanner, on the other hand, will use the first version (\.{TOKEN}). Therefore, it is important to establish an equivalence between the two versions of the name. In the `real' parser, the token values are output in a special header file. Hence, one has to either parse the header file to establish the equivalences or find some other means to determine the numerical values of the tokens.
One approach is to parse the file containing the {\it declarations\/} and extract the equivalences between the names from it. This is the function of the bootstrap parser. Since the lexer is reused, some token values need to be known in advance (and the rest either ignored or replaced by some `made up' values). These tokens are `hard coded' into the parser file generated by \bison\ and output using a special function. The switch `|@[#define@]@; BISON_BOOTSTRAP_MODE|' tells the `driver' program to output the hard coded token values.
@q Bizarre looking way of typing #define is due to the awkward way@>
@q \CWEB\ treats switching in and out of $-mode in inline \Cee@>
Note that the equivalence of the two versions of token names would have to be established every time a `string version' of a token is declared in the \bison\ file and the `macro name version' of the token is used by the corresponding scanner. To establish this equivalence, however, the bootstrapping parser below is not always necessary (see the \.{xxpression} example, specifically, the file \.{xxpression.w} in the \.{examples} directory for an example of using a different parser for this purpose).
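To recapitulate the two faces of a token name, the relevant fragments of a \bison\ produced parser look approximately like the following (a schematic, not verbatim \bison\ output; the numerical value is arbitrary):
\begindemo
^enum yytokentype { TOKEN = 258 };
^static const char *const yytname[] = { ..., "\"token\"", ... };
\nooutput
\enddemo
The scanner sees only the first name, while only the second one makes it into \.{\\yytname}; tying the two together is precisely the job of the bootstrap parser.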
The reason the bootstrap parser is necessary here is that a parser for an appropriate subset of the \bison\ syntax is not yet available (indeed, {\it any\/} functional parser for a \bison\ syntax subset would have to use the same scanner (unless you want to write a custom scanner for it), which would need to know how to output tokens, for which it would need a parser for a subset of \bison\ syntax $\ldots$ a classic `chicken and egg' problem). Hence the name `bootstrap'. Once a functional parser for a large enough subset of the \bison\ input grammar is operational, {\it it\/} can be used to pair up the token names.
The second function of the bootstrap parser is to collect information about the scanner's states. The mechanism is slightly different for states. While the token equivalences are collected purely in `\TeX\ mode', the bootstrap parser collects all the state names into a special \Cee\ header file. The reason is simple: unlike the token values, the numerical values of the scanner states are not passed to the `driver' program in any data structure and are instead defined as ordinary macros. The header file carries the information the `driver' program needs to output the state values.
An additional subtlety in the case of state value output is that the main lexer for the \bison\ grammar utilizes states extensively and thus cannot be easily used with the bootstrap parser before the state values are known. The solution is to substitute a very simple scanner barely capable of lexing state declarations. Such a scanner is implemented in \.{ssffo.w} (the somewhat cryptic name stands for `{\bf s}imple {\bf s}canner {\bf f}or {\bf f}lex {\bf o}ptions').
\saveparseoutputtrue
@(bb.yy@>=
@G Switch to generic mode.
%{
@g
@@;
#define BISON_BOOTSTRAP_MODE
@G
%}
@g
@@;
@G
%union {@> @ @=}
%{@> @ @=%}
@g
@@;
@G
%%
@g
@@;
@@;
@<\flex\ options parser productions@>@;
@@;
@@;
@G
%%
@g

@ The prologue parser is responsible for parsing various grammar declarations as well as parser options.
\saveparseoutputfalse
%\traceparserstatestrue
%\tracestackstrue
%\tracerulestrue
%\traceactionstrue
\saveparseoutputtrue
@(bd.yy@>=
@G Switch to generic mode.
%{@> @ @=%}
@g
@@;
@G
%union {@> @ @=}
%{@> @ @=%}
@g
@@;
@G
%%
@g
@@;
@@;
@@;
@G
%%
@g

@ The full \bison\ input parser is used when a complete \bison\ file is expected. It is also capable of parsing a `skeleton' of such a file, similar to the one that follows this paragraph.
\traceparserstatesfalse
\tracestacksfalse
\tracerulesfalse
\traceactionsfalse
\checktablefalse
\saveparseoutputfalse
@(bf.yy@>=
@G Switch to generic mode.
%{@> @ @=%}
@g
@@;
@G
%union {@> @ @=}
%{@> @ @=%}
@g
@@;
@G
%%
@g
@@;
@@;
@@;
@@;
@G
%%
@g

@ The first two options are essential for the parser operation. The start symbol can be set implicitly by listing the appropriate production first.
@q %define lr.type canonical-lr @>
@q Make note on this and lexing too much lookahead and the \stashed trick@>
@q Explain other options @>
@=
@G
%token-table
%debug
%start input
@g

@*2 Grammar rules. Most of the original comments present in the grammar file used by \bison\ itself have been preserved and appear in {\it italics\/} at the beginning of each appropriate section.
To facilitate the {\it bootstrapping\/} of the parser (see above), some declarations have been separated into their own sections. Also, a number of new rules have been introduced to create a hierarchy of `subparsers' that parse subsets of the grammar. We begin by listing most of the tokens used by the grammar.
Only the string versions are kept in the |yytname| array, which, in part, is the reason for a special bootstrapping parser as explained earlier.
@=
@G
%token GRAM_EOF 0 "end of file"
%token STRING "string"
%token PERCENT_TOKEN "%token"
%token PERCENT_NTERM "%nterm"
%token PERCENT_TYPE "%type"
%token PERCENT_DESTRUCTOR "%destructor"
%token PERCENT_PRINTER "%printer"
%token PERCENT_LEFT "%left"
%token PERCENT_RIGHT "%right"
%token PERCENT_NONASSOC "%nonassoc"
%token PERCENT_PRECEDENCE "%precedence"
%token PERCENT_PREC "%prec"
%token PERCENT_DPREC "%dprec"
%token PERCENT_MERGE "%merge"
@g
@@;

@ We continue with the list of tokens below, following the layout of the original parser.
@=
@G
%token
  PERCENT_CODE "%code"
  PERCENT_DEFAULT_PREC "%default-prec"
  PERCENT_DEFINE "%define"
  PERCENT_DEFINES "%defines"
  PERCENT_ERROR_VERBOSE "%error-verbose"
  PERCENT_EXPECT "%expect"
  PERCENT_EXPECT_RR "%expect-rr"
  PERCENT_FLAG "%<flag>"
  PERCENT_FILE_PREFIX "%file-prefix"
  PERCENT_GLR_PARSER "%glr-parser"
  PERCENT_INITIAL_ACTION "%initial-action"
  PERCENT_LANGUAGE "%language"
  PERCENT_NAME_PREFIX "%name-prefix"
  PERCENT_NO_DEFAULT_PREC "%no-default-prec"
  PERCENT_NO_LINES "%no-lines"
  PERCENT_NONDETERMINISTIC_PARSER "%nondeterministic-parser"
  PERCENT_OUTPUT "%output"
  PERCENT_REQUIRE "%require"
  PERCENT_SKELETON "%skeleton"
  PERCENT_START "%start"
  PERCENT_TOKEN_TABLE "%token-table"
  PERCENT_VERBOSE "%verbose"
  PERCENT_YACC "%yacc"
;
%token BRACED_CODE "{...}"
%token BRACED_PREDICATE "%?{...}"
%token BRACKETED_ID "[identifier]"
%token CHAR "char"
%token EPILOGUE "epilogue"
%token EQUAL "="
%token ID "identifier"
%token ID_COLON "identifier:"
%token PERCENT_PERCENT "%%"
%token PIPE "|"
%token PROLOGUE "%{...%}"
%token SEMICOLON ";"
%token TAG "<tag>"
%token TAG_ANY "<*>"
%token TAG_NONE "<>"
%token INT "integer"
%token PERCENT_PARAM "%param";
@g

@ Extra tokens for typesetting \flex\ state declarations and options are declared in addition to the ones that a standard \bison\ parser recognizes.
@=
@G
%token FLEX_OPTION FLEX_STATE_X FLEX_STATE_S
@g

@ We are ready to describe the top levels of the parse tree. The first `subparser' we consider is a `full' parser, that is, the parser that expects a full grammar file, complete with the prologue, declarations, etc. This parser can be used to extract information from the grammar that is otherwise absent from the executable code generated by \bison. This includes, for example, the `name' part of \.{\$}\.{[}{\rm name}\.{]}. This parser is therefore used to generate the `symbolic switch' to provide support for symbolic term names similar to `genuine' \bison's \.{\$}\.{[}$\ldots$\.{]} syntax.
@=
@t}\vb{\inline}{@>
@G
input:
  prologue_declarations "%%" grammar epilogue.opt {@> @ @=}
;
@g

@ The action of the parser in this case is simply to separate the accumulated `parse tree' from the auxiliary information carried by the parser on the stack.
@=
@[TeX_( "/getsecond{/yy(3)}/to/toksa" );@]@; /* extract grammar contents */
@[TeX_( "/yy0{/the/toksa}/table=/yy(0)" );@]@;

@ Another subgrammar deals with the syntax of isolated \bison\ rules. This is the most commonly used `subparser' since a rules cluster is the most natural `unit' to include in a \CWEB\ file.
@=
@t}\vb{\inline}{@>
@G
input:
  grammar epilogue.opt {@> TeX_( "/getsecond{/yy(1)}/to/table" ); @=}
;
@g

@ The bootstrap parser has a very narrow set of goals: it is concerned with \prodstyle{\%token} and \prodstyle{\%nterm} declarations only in order to supply the token information to the lexer (since, as noted above, such information is not kept in the |yytname| array). It also extends the syntax of a \prodstyle{grammar\_declaration} by allowing a declaration with or without a semicolon at the end (the latter is only allowed in the prologue). This works since the token declarations have been carefully separated from the rest of the grammar in different \CWEB\ sections. The range of tokens understood by the bootstrap parser is limited; hence, most of the other rules are ignored.
@=
@t}\vb{\inline}{@>
@G
input:
  grammar_declarations {@> TeX_( "/table=/yy(1)" ); @=}
;
@g
@t}\vb{\resetf}{@>
@G
grammar_declarations:
  symbol_declaration semi.opt {@> @ @=}
| flex_declaration semi.opt {@> @ @=}
| grammar_declarations symbol_declaration semi.opt {@> TeX_( "/yy0{/the/yy(1)/the/yy(2)}" ); @=}
| grammar_declarations flex_declaration semi.opt {@> TeX_( "/yy0{/the/yy(1)/the/yy(2)}" ); @=}
;
@g
@t}\vb{\inline\flatten}{@>
@G
semi.opt: {} | ";" {};
@g

@ The following is perhaps the most common action performed by the parser. It is done automatically by the parser code but this feature is undocumented so we supply an explicit action in each case.
@=
@[TeX_( "/yy0{/the/yy(1)}" );@]@;

@ Next, a subgrammar for processing prologue declarations. Finer differentiation is possible but the `subparsers' described here work pretty well and impose a mild style on the grammar writer.
@=
@t}\vb{\inline}{@>
@G
input:
  prologue_declarations epilogue.opt {@> TeX_( "/getsecond{/yy(1)}/to/table" ); @=}
| prologue_declarations "%%" "%%" EPILOGUE {@> TeX_( "/getsecond{/yy(1)}/to/table" ); @=}
| prologue_declarations "%%" "%%" {@> TeX_( "/getsecond{/yy(1)}/to/table" ); @=}
;
@g

@ {\it Declarations: before the first \prodstyle{\%\%}}. We are now ready to deal with the specifics of the declarations themselves. The \.{\\grammar} macro is a `structure', whose first `field' is the grammar itself, whereas the second carries the type of the last declaration added to the grammar.
@=
@G
prologue_declarations:
  {@> TeX_( "/yy0{/nx/grammar{}{/nx/empty}}" ); @=}
| prologue_declarations prologue_declaration {@> @ @=}
;
@g

@ @=
@@;

@ Here is a list of most kinds of declarations that can appear in the prologue. The scanner returns the `stream pointers' for all the keywords so the declaration `structures' pass on those pointers to the grammar list. The original syntax has been left intact even though for the purposes of this parser some of the inline rules are unnecessary.
@=
@G
prologue_declaration:
  grammar_declaration {@> @ @=}
| "%{...%}" {@> TeX_( "/yy0{/nx/prologuecode/the/yy(1)}" ); @=}
| "%<flag>" {@> TeX_( "/yy0{/nx/optionflag/the/yy(1)}" ); @=}
| "%define" variable value {@> TeX_( "/yy0{/nx/vardef{/the/yy(2)}{/the/yy(3)}/the/yy(1)}" ); @=}
| "%defines" {@> TeX_( "/yy0{/nx/optionflag{defines}{}/the/yy(1)}" ); @=}
| "%defines" STRING {@> @[TeX_( "/toksa{defines}" );@]@+@ @=}
| "%error-verbose" {@> TeX_( "/yy0{/nx/optionflag{error verbose}{}/the/yy(1)}" ); @=}
| "%expect" INT {@> @[TeX_( "/toksa{expect}" );@]@+@ @=}
| "%expect-rr" INT {@> @[TeX_( "/toksa{expect-rr}" );@]@+@ @=}
| "%file-prefix" STRING {@> @[TeX_( "/toksa{file prefix}" );@]@+@ @=}
| "%glr-parser" {@> TeX_( "/yy0{/nx/optionflag{glr parser}{}/the/yy(1)}" ); @=}
| "%initial-action" "{...}" {@> TeX_( "/yy0{/nx/initaction/the/yy(2)}" ); @=}
| "%language" STRING {@> @[TeX_( "/toksa{language}" );@]@+@ @=}
| "%name-prefix" STRING {@> @[TeX_( "/toksa{name prefix}" );@]@+@ @=}
| "%no-lines" {@> TeX_( "/yy0{/nx/optionflag{no lines}{}/the/yy(1)}" ); @=}
| "%nondeterministic-parser" {@> TeX_( "/yy0{/nx/optionflag{nondet. parser}{}/the/yy(1)}" ); @=}
| "%output" STRING {@> @[TeX_( "/toksa{output}" );@]@+@ @=}
@g
@t}\vb{\flatten}{@>
@G
| "%param" {} params {@> TeX_( "/yy0{/nx/paramdef{/the/yy(3)}/the/yy(1)}" ); @=}
@g
@t}\vb{\fold}{@>
@G
| "%require" STRING {@> @[TeX_( "/toksa{require}" );@]@+@ @=}
| "%skeleton" STRING {@> @[TeX_( "/toksa{skeleton}" );@]@+@ @=}
| "%token-table" {@> TeX_( "/yy0{/nx/optionflag{token table}{}/the/yy(1)}" ); @=}
| "%verbose" {@> TeX_( "/yy0{/nx/optionflag{verbose}{}/the/yy(1)}" ); @=}
| "%yacc" {@> TeX_( "/yy0{/nx/optionflag{yacc}{}/the/yy(1)}" ); @=}
| ";" {@> TeX_( "/yy0{/nx/empty}" ); @=}
;
params:
  params "{...}" {@> TeX_( "/yy0{/the/yy(1)/nx/braceit/the/yy(2)}" ); @=}
| "{...}" {@> TeX_( "/yy0{/nx/braceit/the/yy(1)}" ); @=}
;
@g

@ This is a typical parser action: encapsulate the `type' of the construct just parsed and attach some auxiliary info, in this case the stream pointers.
@=
@[TeX_( "/yy0{/nx/oneparametricoption{/the/toksa}{/the/yy(2)}/the/yy(1)}" );@]@;

@ Some extra declarations to typeset \flex\ options and declarations. These are not part of the \bison\ syntax but their structure is similar enough that they can be included in the grammar.
@=
@G
prologue_declaration:
  flex_declaration {@> @ @=}
;
@g
@<\flex\ options parser productions@>@;

@ The syntax of \flex\ options was extracted from the \flex\ documentation, so it is not guaranteed to be correct.
@<\flex\ options parser productions@>=
@G
flex_declaration:
  FLEX_OPTION flex_option_list {@> @ @=}
| flex_state symbols.1 {@> @ @=}
;
flex_state:
  FLEX_STATE_X {@> TeX_( "/yy0{/nx/flexxstatedecls/the/yy(1)}" ); @=}
| FLEX_STATE_S {@> TeX_( "/yy0{/nx/flexsstatedecls/the/yy(1)}" ); @=}
;
flex_option_list:
  flex_option {@> @ @=}
| flex_option_list flex_option {@> @ @=}
;
flex_option:
  ID {@> TeX_( "/yy0{/nx/flexoptionpair{/the/yy(1)}{}}" ); @=}
| ID "=" symbol {@> TeX_( "/yy0{/nx/flexoptionpair{/the/yy(1)}{/the/yy(3)}}" ); @=}
;
@g

@ @=
@[TeX_( "/yy0{/nx/flexoptiondecls{/the/yy(2)}/the/yy(1)}" );@]@;

@ @=
@[TeX_( "/getfirst{/yy(1)}/to/toksa" );@]@;
@[TeX_( "/getsecond{/yy(1)}/to/toksb" );@]@;
@[TeX_( "/getthird{/yy(1)}/to/toksc" );@]@;
@[TeX_( "/yy0{/the/toksa{/the/yy(2)}{/the/toksb}{/the/toksc}}" );@]@;

@ @=
@[TeX_( "/getsecond{/yy(2)}/to/toksa" );@]@; /* the identifier */
@[TeX_( "/getfourth{/toksa}/to/toksb" );@]@; /* the format pointer */
@[TeX_( "/getfifth{/toksa}/to/toksc" );@]@; /* the stash pointer */
@[TeX_( "/yy0{/the/yy(1)/nx/hspace{/the/toksb}{/the/toksc}/the/yy(2)}" );@]@;

@ {\it Grammar declarations}. These declarations can appear in both the prologue and the rules sections. Their treatment is very similar to that of the prologue-only options.
@=
@G
grammar_declaration:
  precedence_declaration {@> @ @=}
| symbol_declaration {@> @ @=}
| "%start" symbol {@> @[TeX_( "/toksa{start}" );@]@+@ @=}
| code_props_type "{...}" generic_symlist {@> @ @=}
| "%default-prec" {@> TeX_( "/yy0{/nx/optionflag{default prec.}{}/the/yy(1)}" ); @=}
| "%no-default-prec" {@> TeX_( "/yy0{/nx/optionflag{no default prec.}{}/the/yy(1)}" ); @=}
| "%code" "{...}" {@> TeX_( "/yy0{/nx/codeassoc{code}{}/the/yy(2)/the/yy(1)}" ); @=}
| "%code" ID "{...}" {@> TeX_( "/yy0{/nx/codeassoc{code}{/the/yy(2)}/the/yy(3)/the/yy(1)}" ); @=}
;
code_props_type:
  "%destructor" {@> TeX_( "/yy0{{destructor}/the/yy(1)}" ); @=}
| "%printer" {@> TeX_( "/yy0{{printer}/the/yy(1)}" ); @=}
;
@g

@ @=
@[TeX_( "/getfirst{/yy(1)}/to/toksa" );@]@; /* name of the property */
@[TeX_( "/getfirst{/yy(2)}/to/toksb" );@]@; /* contents of the braced code */
@[TeX_( "/getsecond{/yy(2)}/to/toksc" );@]@; /* braced code format pointer */
@[TeX_( "/getthird{/yy(2)}/to/toksd" );@]@; /* braced code stash pointer */
@[TeX_( "/getsecond{/yy(1)}/to/tokse" );@]@; /* code format pointer */
@[TeX_( "/getthird{/yy(1)}/to/toksf" );@]@; /* code stash pointer */
@[TeX_( "/yy0{/nx/codepropstype{/the/toksa}{/the/toksb}{/the/yy(3)}{/the/toksc}{/the/toksd}{/the/tokse}{/the/toksf}}" );@]@;

@ @=
@G
%token PERCENT_UNION "%union";
@g

@ @=
@t}\vb{\inline\flatten}{@>
@G
union_name:
  {@> TeX_( "/yy0{}" ); @=}
| ID {@> @ @=}
;
grammar_declaration:
  "%union" union_name "{...}" {@> @ @=}
;
symbol_declaration:
  "%type" TAG symbols.1 {@> @ @=}
;
@g
@t}\vb{\resetf\flatten}{@>
@G
precedence_declaration:
  precedence_declarator tag.opt symbols.prec {@> @ @=}
;
precedence_declarator:
  "%left" {@> TeX_( "/yy0{/nx/preckind{left}/the/yy(1)}" ); @=}
| "%right" {@> TeX_( "/yy0{/nx/preckind{right}/the/yy(1)}" ); @=}
| "%nonassoc" {@> TeX_( "/yy0{/nx/preckind{nonassoc}/the/yy(1)}" ); @=}
| "%precedence" {@> TeX_( "/yy0{/nx/preckind{precedence}/the/yy(1)}" ); @=}
;
@g
@t}\vb{\inline}{@>
@G
tag.opt:
  {@> TeX_( "/yy0{}" ); @=}
| TAG {@> @ @=}
;
@g

@ @=
@[TeX_( "/yy0{/nx/codeassoc{union}{/the/yy(2)}/the/yy(3)/the/yy(1)}" );@]@;

@ @=
@[TeX_( "/yy0{/nx/typedecls{/the/yy(2)}{/the/yy(3)}/the/yy(1)}" );@]@;

@ @=
@[TeX_( "/getthird{/yy(1)}/to/toksa" );@]@; /* format pointer */
@[TeX_( "/getfourth{/yy(1)}/to/toksb" );@]@; /* stash pointer */
@[TeX_( "/getsecond{/yy(1)}/to/toksc" );@]@; /* kind of precedence */
@[TeX_( "/yy0{/nx/precdecls{/the/toksc}{/the/yy(2)}{/the/yy(3)}{/the/toksa}{/the/toksb}}" );@]@;

@ The bootstrap grammar forms the smallest subset of the full grammar.
@=
@@;

@ These are the two most important rules for the bootstrap parser.
@=
@t}\vb{\flatten}{@>
@G
symbol_declaration:
  "%nterm" {} symbol_defs.1 {@> TeX_( "/yy0{/nx/ntermdecls{/the/yy(3)}/the/yy(1)}" ); @=}
@g
@t}\vb{\fold\flatten}{@>
@G
| "%token" {} symbol_defs.1 {@> TeX_( "/yy0{/nx/tokendecls{/the/yy(3)}/the/yy(1)}" ); @=}
;
@g

@ {\it Just like \prodstyle{symbols.1} but accept \prodstyle{INT} for the sake of \POSIX}. Perhaps the only point worth mentioning here is the inserted separator (\.{\\hspace}). Like any other separator, it takes two parameters, the stream pointers. In this case, however, both pointers are null since there seems to be no other meaningful assignment. If any formatting or stash information is needed, it can be extracted by the symbols themselves.
@=
@G
symbols.prec:
  symbol.prec {@> @ @=}
| symbols.prec symbol.prec {@> TeX_( "/yy0{/the/yy(1)/nx/hspace{0}{0}/the/yy(2)}" ); @=}
;
symbol.prec:
  symbol {@> TeX_( "/yy0{/nx/symbolprec{/the/yy(1)}{}}" ); @=}
| symbol INT {@> TeX_( "/yy0{/nx/symbolprec{/the/yy(1)}{/the/yy(2)}}" ); @=}
;
@g

@ {\it One or more symbols to be \prodstyle{\%type}'d}.
@=
@@;

@ @=
@G
symbols.1:
  symbol {@> @ @=}
| symbols.1 symbol {@> TeX_( "/yy0{/the/yy(1)/nx/hspace{0}{0}/the/yy(2)}" ); @=}
;
@g

@ @=
@G
generic_symlist:
  generic_symlist_item {@> @ @=}
| generic_symlist generic_symlist_item {@> TeX_( "/yy0{/the/yy(1)/nx/hspace{0}{0}/the/yy(2)}" ); @=}
;
@g
@t}\vb{\flatten\inline}{@>
@G
generic_symlist_item:
  symbol {@> @ @=}
| tag {@> @ @=}
;
tag:
  TAG {@> @ @=}
| "<*>" {@> @ @=}
| "<>" {@> @ @=}
;
@g

@ {\it One token definition}.
@=
@G
symbol_def:
  TAG {@> @ @=}
@g
@t}\vb{\flatten}{@>
@G
| id {@> TeX_( "/yy0{/nx/onesymbol{/the/yy(1)}{}{}}" ); @=}
| id INT {@> TeX_( "/yy0{/nx/onesymbol{/the/yy(1)}{/the/yy(2)}{}}" ); @=}
| id string_as_id {@> TeX_( "/yy0{/nx/onesymbol{/the/yy(1)}{}{/the/yy(2)}}" ); @=}
| id INT string_as_id {@> TeX_( "/yy0{/nx/onesymbol{/the/yy(1)}{/the/yy(2)}{/the/yy(3)}}" ); @=}
;
@g

@ {\it One or more symbol definitions}.
@=
@G
symbol_defs.1:
  symbol_def {@> @ @=}
| symbol_defs.1 symbol_def {@> @ @=}
;
@g

@ @=
@[TeX_( "/getsecond{/yy(2)}/to/toksa" );@]@; /* the identifier */
@[TeX_( "/getfourth{/toksa}/to/toksb" );@]@; /* the format pointer */
@[TeX_( "/getfifth{/toksa}/to/toksc" );@]@; /* the stash pointer */
@[TeX_( "/yy0{/the/yy(1)/nx/hspace{/the/toksb}{/the/toksc}/the/yy(2)}" );@]@;

@ {\it The grammar section: between the two \prodstyle{\%\%}'s}. Finally, the following few short sections define the syntax of \bison's rules.
@=
@G
grammar:
  rules_or_grammar_declaration {@> @ @=}
| grammar rules_or_grammar_declaration {@> @ @=}
;
@g

@ {\it As a \bison\ extension, one can use the grammar declarations in the body of the grammar}. What follows is the syntax of the right hand side of a grammar rule.
@=
@G
rules_or_grammar_declaration:
  rules {@> @ @=}
| grammar_declaration ";" {@> @ @=}
| error ";" {@> TeX_( "/errmessage{parsing error!}" ); @=}
;
@g
@t}\vb{\flatten\inline}{@>
@G
rules:
  id_colon named_ref.opt {@> TeX_( "/relax" ); @=} rhses.1 {@> @ @=}
;
@g
@t}\vb{\resetf}{@>
@G
rhses.1[o]:
  rhs {@> @ @=}
| rhses.1[a] "|"[b] {@> @ @=}[c] rhs[d] {@> @ @=}
| rhses.1 ";" {@> @ @=}
;
@g

@ The next few actions describe what happens when a left hand side is attached to a rule.
@= @[TeX_( "/getfirst{/yy(1)}/to/toksa" );@]@; @[TeX_( "/yy0{/nx/grammar{/the/yy(1)}{/the/toksa}}" );@]@; @ @= @[TeX_( "/getthird{/yy(1)}/to/toksa" );@]@; /* type of the last rule */ @[TeX_( "/getsecond{/yy(1)}/to/toksc" );@]@; /* accumulated rules */ @[TeX_( "/getfirst{/yy(2)}/to/toksb" );@]@; /* type of the new rule */ @[TeX_( "/let/default/positionswitchdefault" );@]@; @[TeX_( "/switchon{/the/toksb}/in/positionswitch" );@]@; /* determine the position of the first token in the group */ @[TeX_( "/edef/next{/the/toksa}" );@]@; @[TeX_( "/edef/default{/the/toksb}" );@]@; /* reuse \.{\\default} */ @[TeX_( "/ifx/next/default" );@]@; @[TeX_( " /let/default/separatorswitchdefaulteq" );@]@; @[TeX_( " /switchon{/the/toksa}/in/separatorswitcheq" );@]@; @[TeX_( "/else" );@]@; @[TeX_( " /concat/toksa/toksb" );@]@; @[TeX_( " /let/default/separatorswitchdefaultneq" );@]@; @[TeX_( " /switchon{/the/toksa}/in/separatorswitchneq" );@]@; @[TeX_( "/fi" );@]@; @[TeX_( "/yy0{/nx/grammar{/the/toksc/the/postoks/the/toksd/the/yy(2)}{/the/toksb}}" );@]@; @ @= @[TeX_( "/getsecond{/yy(1)}/to/toksa" );@]@; /* \.{\\prodheader} */ @[TeX_( "/getsecond{/toksa}/to/toksb" );@]@; /* \.{\\idit} */ @[TeX_( "/getfourth{/toksb}/to/toksc" );@]@; /* format stream pointer */ @[TeX_( "/getfifth{/toksb}/to/toksd" );@]@; /* stash stream pointer */ @[TeX_( "/getthird{/yy(1)}/to/toksb" );@]@; /* \.{\\rules} */ @[TeX_( "/yy0{/nx/oneproduction{/the/toksa/the/toksb}{/the/toksc}{/the/toksd}}" );@]@; @ @= @[TeX_( "/getfourth{/yy(1)}/to/toksa" );@]@; /* format stream pointer */ @[TeX_( "/getfifth{/yy(1)}/to/toksb" );@]@; /* stash stream pointer */ @[TeX_( "/yy0{/nx/pcluster{/nx/prodheader{/the/yy(1)}{/the/yy(2)}" );@]@; @[TeX_( " {/the/toksa}{/the/toksb}}{/the/yy(4)}}" );@]@; @ It is important to format the right hand side properly, since we would like to indicate that an action is inlined by an indentation. The `format' of the \.{\\rhs} `structure' includes the stash pointers and a `boolean' to indicate whether the right hand side ends with an action. Since the action can be implicit, this decision has to be postponed until, say, a semicolon is seen. No formatting or stash pointers are added for such implicit action. @= @[TeX_( "/rhsbool{/yy(1)}/to/toksa /the/toksa" );@]@; @[TeX_( "/getthird{/yy(1)}/to/toksb" );@]@; /* the format pointer */ @[TeX_( "/getfourth{/yy(1)}/to/toksc" );@]@; /* the stash pointer */ @[TeX_( "/ifrhsfull" );@]@; @[TeX_( " /yy0{/nx/rules{/the/yy(1)}{/the/toksb}{/the/toksc}}" );@]@; @[TeX_( "/else" );@]@; /* it does not end with an action, fake one */ @[TeX_( " /rhscont{/yy(1)}/to/toksa" );@]@; /* rules */ @[TeX_( " /edef/next{/the/toksa}" );@]@; @[TeX_( " /ifx/next/empty" );@]@; @[TeX_( " /toksa{/emptyterm}" );@]@; @[TeX_( " /fi" );@]@; @[TeX_( " /yy0{/nx/rules{/nx/rhs{/the/toksa/nx/rarhssep{0}{0}" );@]@; @[TeX_( " /nx/actbraces{}{}{0}{0}/nx/bdend}{}{/nx/rhsfulltrue}}{/the/toksb}{/the/toksc}}" );@]@; @[TeX_( "/fi" );@]@; @ @= @[TeX_( "/rhscont{/yy(1)}/to{/yy(0)}" );@]@; @[TeX_( "/yy0{/the/yy(0)/nx/midf/the/yy(2)}" );@]@; @ No pointers are provided for an {\it implicit\/} action. 
@= @[TeX_( "/rhsbool{/yy(4)}/to/toksa /the/toksa" );@]@; @[TeX_( "/ifrhsfull" );@]@; @[TeX_( " /yy0{/nx/rules{/the/yy(3)/nx/rrhssep/the/yy(2)/the/yy(4)}/the/yy(2)}" );@]@; @[TeX_( "/else" );@]@; @[TeX_( " /rhscont{/yy(4)}/to/toksa" );@]@; @[TeX_( " /edef/next{/the/toksa}" );@]@; @[TeX_( " /ifx/next/empty" );@]@; @[TeX_( " /toksa{/emptyterm}" );@]@; @[TeX_( " /fi" );@]@; @[TeX_( " /yy0{/nx/rules{/the/yy(3)/nx/rrhssep/the/yy(2)" );@]@; @[TeX_( " /nx/rhs{/the/toksa/nx/rarhssep{0}{0}" );@]@; /* streams have already been grabbed */ @[TeX_( " /nx/actbraces{}{}{0}{0}/nx/bdend}{}{/nx/rhsfulltrue}}/the/yy(2)}" );@]@; @[TeX_( "/fi" );@]@; @ @= @@; @ @= @G %token PERCENT_EMPTY "%empty"; @g @ The centerpiece of the grammar is the syntax of the right hand side of a production. Various `precedence hints' must be attached to an appropriate portion of the rule, just before an action (which can be inline, implicit or both in this case). @= @G rhs: {@> @ @=} | rhs symbol named_ref.opt {@> @ @=} | rhs "{...}" named_ref.opt {@> @ @=} | rhs "%?{...}" {@> @ @=} | rhs "%empty" {@> @ @=} | rhs "%prec" symbol {@> @ @=} | rhs "%dprec" INT {@> @ @=} | rhs "%merge" TAG {@> @ @=} ; named_ref.opt: {@> @ @=} | BRACKETED_ID {@> @ @=} ; @g @ @= @[TeX_( "/yy0{/nx/rhs{}{}{/nx/rhsfullfalse}}" );@]@; @ @= @[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@; @[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@; @[TeX_( "/edef/next{/the/toksb}" );@]@; @[TeX_( "/ifx/next/empty" );@]@; @[TeX_( "/else" );@]@; @[TeX_( " /getfourth{/yy(2)}/to/toksc" );@]@; @[TeX_( " /getfifth{/yy(2)}/to/toksd" );@]@; @[TeX_( " /appendr/toksb{{/the/toksc}{/the/toksd}}" );@]@; @[TeX_( "/fi" );@]@; @[TeX_( "/yy0{/nx/rhs{/the/toksa/the/toksb" );@]@; @[TeX_( " /nx/termname{/the/yy(2)}{/the/yy(3)}}{/nx/hspace}{/nx/rhsfullfalse}}" );@]@; @ @= @[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@; @[TeX_( "/rhsbool{/yy(1)}/to/toksb /the/toksb" );@]@; @[TeX_( "/ifrhsfull" );@]@; /* the first half ends with an action */ @[TeX_( " /appendr/toksa{/nx/arhssep{0}{0}/nx/emptyterm}" );@]@; /* no pointers to streams */ @[TeX_( "/fi" );@]@; @[TeX_( "/edef/next{/the/toksa}" );@]@; @[TeX_( "/ifx/next/empty" );@]@; @[TeX_( " /toksa{/emptyterm}" );@]@; @[TeX_( "/fi" );@]@; @[TeX_( "/getfirst{/yy(2)}/to/toksb" );@]@; /* the contents of the braced code */ @[TeX_( "/getsecond{/yy(2)}/to/toksc" );@]@; /* the format stream pointer */ @[TeX_( "/getthird{/yy(2)}/to/toksd" );@]@; /* the stash stream pointer */ @[TeX_( "/yy0{/nx/rhs{/the/toksa/nx/rarhssep{/the/toksc}{/the/toksd}" );@]@; @[TeX_( " /nx/actbraces{/the/toksb}{/the/yy(3)}{/the/toksc}{/the/toksd}/nx/bdend}" );@]@; @[TeX_( " {/nx/arhssep}{/nx/rhsfulltrue}}" );@]@; @ @= @[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@; @[TeX_( "/rhsbool{/yy(1)}/to/toksb /the/toksb" );@]@; @[TeX_( "/ifrhsfull" );@]@; /* the first half ends with an action */ @[TeX_( " /appendr/toksa{/nx/arhssep{0}{0}/nx/emptyterm}" );@]@; /* no pointers to streams */ @[TeX_( "/fi" );@]@; @[TeX_( "/edef/next{/the/toksa}" );@]@; @[TeX_( "/ifx/next/empty" );@]@; @[TeX_( " /toksa{/emptyterm}" );@]@; @[TeX_( "/fi" );@]@; @[TeX_( "/getfirst{/yy(2)}/to/toksb" );@]@; /* the contents of the braced code */ @[TeX_( "/getsecond{/yy(2)}/to/toksc" );@]@; /* the format stream pointer */ @[TeX_( "/getthird{/yy(2)}/to/toksd" );@]@; /* the stash stream pointer */ @[TeX_( "/yy0{/nx/rhs{/the/toksa/nx/rarhssep{/the/toksc}{/the/toksd}" );@]@; @[TeX_( " /nx/bpredicate{/the/toksb}{}{/the/toksc}{/the/toksd}/nx/bdend}" );@]@; @[TeX_( " {/nx/arhssep}{/nx/rhsfulltrue}}" );@]@; @ @= @[TeX_( "/rhscont{/yy(1)}/to/toksa" 
);@]@; @[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@; @[TeX_( "/edef/next{/the/toksb}" );@]@; @[TeX_( "/ifx/next/empty" );@]@; @[TeX_( "/else" );@]@; @[TeX_( " /getfourth{/yy(2)}/to/toksc" );@]@; @[TeX_( " /getfifth{/yy(2)}/to/toksd" );@]@; @[TeX_( " /appendr/toksb{{/the/toksc}{/the/toksd}}" );@]@; @[TeX_( "/fi" );@]@; @[TeX_( "/yy0{/nx/rhs{/the/toksa/the/toksb" );@]@; @[TeX_( " /nx/emptyterm}{/nx/hspace}{/nx/rhsfullfalse}}" );@]@; @ @= @[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@; @[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@; @[TeX_( "/rhsbool{/yy(1)}/to/toksc /the/toksc" );@]@; @[TeX_( "/ifrhsfull" );@]@; @[TeX_( " /yy0{/nx/sprecop{/the/yy(3)}/the/yy(2)}" );@]@; /* reuse \.{\\yyval} */ @[TeX_( " /supplybdirective/toksa/yyval" );@]@; /* the directive is `absorbed' by the action */ @[TeX_( " /yy0{/nx/rhs{/the/toksa}{/the/toksb}{/nx/rhsfulltrue}}" );@]@; @[TeX_( "/else" );@]@; @[TeX_( " /yy0{/nx/rhs{/the/toksa" );@]@; @[TeX_( " /nx/sprecop{/the/yy(3)}/the/yy(2)}{/the/toksb}{/nx/rhsfullfalse}}" );@]@; @[TeX_( "/fi" );@]@; @ @= @[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@; @[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@; @[TeX_( "/rhsbool{/yy(1)}/to/toksc /the/toksc" );@]@; @[TeX_( "/ifrhsfull" );@]@; @[TeX_( " /yy0{/nx/dprecop{/the/yy(3)}/the/yy(2)}" );@]@; /* reuse \.{\\yyval} */ @[TeX_( " /supplybdirective/toksa/yyval" );@]@; /* the directive is `absorbed' by the action */ @[TeX_( " /yy0{/nx/rhs{/the/toksa}{/the/toksb}{/nx/rhsfulltrue}}" );@]@; @[TeX_( "/else" );@]@; @[TeX_( " /yy0{/nx/rhs{/the/toksa" );@]@; @[TeX_( " /nx/dprecop{/the/yy(3)}/the/yy(2)}{/the/toksb}{/nx/rhsfullfalse}}" );@]@; @[TeX_( "/fi" );@]@; @ @= @[TeX_( "/rhscont{/yy(1)}/to/toksa" );@]@; @[TeX_( "/rhscnct{/yy(1)}/to/toksb" );@]@; @[TeX_( "/rhsbool{/yy(1)}/to/toksc /the/toksc" );@]@; @[TeX_( "/ifrhsfull" );@]@; @[TeX_( " /yy0{/nx/mergeop{/the/yy(3)}/the/yy(2)}" );@]@; /* reuse \.{\\yyval} */ @[TeX_( " /supplybdirective/toksa/yyval" );@]@; /* the directive is `absorbed' by the action */ @[TeX_( " /yy0{/nx/rhs{/the/toksa}{/the/toksb}{/nx/rhsfulltrue}}" );@]@; @[TeX_( "/else" );@]@; @[TeX_( " /yy0{/nx/rhs{/the/toksa" );@]@; @[TeX_( " /nx/mergeop{/the/yy(3)}/the/yy(2)}{/the/toksb}{/nx/rhsfullfalse}}" );@]@; @[TeX_( "/fi" );@]@; @ @= @[TeX_( "/yy0{}" );@]@; @ @= @@; @ Identifiers. {\it Identifiers are returned as |uniqstr| values by the scanner. Depending on their use, we may need to make them genuine symbols}. We, on the other hand simply copy the values returned by the scanner. @= @G id: ID {@> @ @=} | CHAR {@> @ @=} ; @g @ @= @@; @ @= @G symbol: id {@> @ @=} | string_as_id {@> @ @=} ; @g @ @= @t}\vb{\inline}{@> @G id_colon: ID_COLON {@> @ @=} ; @g @ A string used as an \prodstyle{ID}. @= @t}\vb{\inline}{@> @G string_as_id: STRING {@> @ @=} ; @g @ The remainder of the action code is trivial but we reserved the placeholders for the appropriate actions in case the parser gains some sophistication in processing low level types (or starts expecting different types from the scanner). @= @@; @ @= @@; @ @= @@; @ @= @@; @ @= @@; @ @= @@; @ {\it Variable and value. The \prodstyle{STRING} form of variable is deprecated and is not \.{M4}-friendly. For example, \.{M4} fails for \.{\%define "[" "value"}.} @= @t}\vb{\flatten\inline}{@> @G variable: ID {@> @ @=} | STRING {@> @ @=} ; value: {@> TeX_( "/yy0{}" ); @=} | ID {@> @ @=} | STRING {@> @ @=} | "{...}" {@> TeX_( "/yy0{/nx/bracedvalue/the/yy(1)}" ); @=} ; @g @ @= @t}\vb{\flatten\inline}{@> @G epilogue.opt: {@> TeX_( "/yy0{}" ); @=} | "%%" EPILOGUE {} ; @g @ \Cee\ preamble for the grammar parser. 
In this case, there are no `real' actions that our grammar performs, only \TeX\ output, so this section is empty.
@=

@ \Cee\ postamble for the grammar parser. It is tricky to insert function definitions that use \bison's internal types, as they have to be inserted in a place that is aware of the internal definitions but before said definitions are used.
@=
#define YYPRINT(file, type, value) yyprint (file, type, value)
static void yyprint (FILE *file, int type, YYSTYPE value){}

@ @=
@@;
@@;

@ @=
void bootstrap_tokens( char *bootstrap_token_format )
{
#define _register_token_d(name) fprintf( tables_out, bootstrap_token_format, #name, name, #name );
  @@;
#undef _register_token_d
}

@ \namedspot{bootstraptokens}Here is the minimal list of tokens needed to make the lexer operational just enough to extract the rest of the token information from the grammar.
@=
  _register_token_d(INT)@;
  _register_token_d(ID)@;
  _register_token_d(CHAR)@;
  _register_token_d(STRING)@;
  _register_token_d(TAG)@;
  _register_token_d(SEMICOLON)@;
  _register_token_d(PERCENT_TOKEN)@;
  _register_token_d(PERCENT_NTERM)@;
  _register_token_d(FLEX_STATE_X)@;
  _register_token_d(FLEX_STATE_S)@;

@ Union of types.
@=