@q Copyright 2012-2014, Alexander Shibakov@> @q This file is part of SPLinT@> @q SPLinT is free software: you can redistribute it and/or modify@> @q it under the terms of the GNU General Public License as published by@> @q the Free Software Foundation, either version 3 of the License, or@> @q (at your option) any later version.@> @q SPLinT is distributed in the hope that it will be useful,@> @q but WITHOUT ANY WARRANTY; without even the implied warranty of@> @q MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the@> @q GNU General Public License for more details.@> @q You should have received a copy of the GNU General Public License@> @q along with SPLinT. If not, see .@> @** Forcing \bison\ and \flex\ to output \TeX. Instead of implementing a \bison\ (or \flex) `plugin' for outputting \TeX\ parser, the code that follows produces a separate executable that outputs all the required tables after the inclusion of an ordinary \Cee\ parser produced by \bison\ (or a scanner produced by \flex). The actions in both \bison\ parser and \flex\ scanner are assumed to be merely |printf()| statements that output the `real' \TeX\ actions. The code below simply cycles through all such actions to output an `action switch' appropriate for use with \TeX. In every other respect, the included parser or scanner can use any features allowed in `real' parsers and scanners. @*1 Common routines. The `top' level of the scanner and parser `drivers' is very similar, and is therefore separated into a few sections that are common to both drivers. The layout is fairly typical and follows a standard `initialize-input-process-output-clean up' scheme. The logic behind each section of the program will be explained in detail below. The section below is called |@<\Cee\ postamble@>| because the output of the tables can happen only after the \bison\ (or \flex) generated \.{.c} file is included and all the data structures are known. The actual `assembly' of each driver has to be done separately due to some `singularities' of the \CWEB\ system and the design of this software. All the essential routines are presented in the sections below, though. @<\Cee\ postamble@>= @; @@; @@; @@; int main( int argc, char **argv ) { @@; @@; @@; @@; switch( mode ) { @@; default: break; } if ( tables_out ) { @@; @@; } else { fprintf( stderr, "No output, exiting\n" ); exit(0); } @@; return 0; } @ Not all the code can be supplied at this stage (most of the routines here are at the `top' level so the specifics have to be `filled-in' by each driver), so many of the sections above are placeholders for the code provided by a specific driver. However, we still need to supply a trivial definition here to placate \CWEAVE\ whenever this portion of the code is used isolated in documentation. @= @ Standard library declarations for memory management routines, some syntactic sugar, command line processing, and variadic functions are all that is needed. @= #include #include #include #include #include @ This code snippet is a payment for some poor (in my view) philosophy on the part of the \bison\ and \flex\ developers. There used to be an option in \bison\ to output just the tables and the action code but it had never worked correctly and it was simply dropped in the latest version. Instead, one can only get access to \bison's goodies as part of a tangled mess of |@[#define@]|'s and error processing code. Had the tables and the parser function itself been considered separate, well isolated sections of \bison's output, there would simply be no reason for dirty tricks like the one below, one would be able to write custom error processing functions, unicorns would roam the Earth and pixies would hand open sourced tablets to everyone. At a minimum, it would have been a much cleaner, modular approach. There is also strange reluctance on the part of the \gcc\ team to output any intermediate code other than the results of preprocessing and assembly. I have seen an argument that involves some sort of appeal to making the code difficult to close source but the logic of it escaped me completely. Ideally, there should be no such thing as a parser generator, or a compiler, for that matter: all of these are just basic table driven rewriting routines. Tables are hard but table driven code should not be. If one had access to the tables themselves, and some canonical examples of code driven by such tables, like |yyparse()| and |yylex()|, the flexibility of these tools would improve tremendously. Barring that, this is what we have to do {\it now}. There are several ways to gain write access to the data declared |const| in \Cee, like passing its address to a function with no prototype. All these methods have one drawback: loopholes that make them possible have been steadily getting on the chopping block of the \Cee\ standards committee. Indeed, |const| data should be constant. Even if one succeeds in getting access, there is no reason to believe that the data is not allocated in a write-only region of the memory. The cleanest way to get write access then is to eliminate |const| altogether. The code should have the same semantics after that, and the trick is only marginally bad. The last two definitions are less innocent (and, at least the second one, are prohibited by the ISO standard (clause 6.10.8(2), see~\cite[ISO/C11])) but \gcc\ does not seem to mind, and it gets rid of warnings about dropping a |const| qualifier whenever an |assert| is encountered. Since the macro is not recursively expanded, this will only work if $\ldots$|FUNCTION__| is treated as a pseudo-variable, as it is in \gcc, not a macro. @d const @d __PRETTY_FUNCTION__ @[(char *)__PRETTY_FUNCTION__@] @d __FUNCTION__ @[(char *)__FUNCTION__@] @ The output file has to be known to both parts of the code, so it is declared at the very beginning of the program. We also add some syntactic sugar for loops. @q The line below is simply irresistible, one should put it on a T-shirt@> @s FOREVER TeX @d FOREVER for(;;) @= #include FILE *tables_out; @ The clean-up portion of the code can be left empty, as all it does is close the output file, which can be left to the operating system but we take care of it ourselves to keep out code `clean'\footnote{In case the reader has not noticed yet, this is a weak attempt at humor to break the monotony of going through the lines of \CTANGLE'd code}. @= fclose( tables_out ); @ There is a descriptor controlling the output of the program as a whole. The code below is an example of a literate programming technique that will be used repeatedly to maintain large structures that can grow during the course of the program design. Note that the name of each table is only mentioned once, the rest of the code is generic. Technically speaking, all of this can be done with \Cee\ preprocessor macros of moderate complexity, taking advantage of its expansion rules but it is not nearly as transparent as the \CWEB\ approach. @= struct output_d { @@; }; struct output_d output_desc = { @ }; @ To declare each table field in the global output descriptor, all one has to do is to provide a general pattern. @= #define _register_table_d(name) @[bool output_##name:1;@] @@; #undef _register_table_d @ Same for assigning default values to each field. @= #define _register_table_d(name) @[.output_##name = 0,@] /* do not output any tables by default */ @
@; #undef _register_table_d @ Each descriptor is populated using the same approach. @= #define _register_table_d(name) @[struct table_d name##_desc = {0};@] @
@; #undef _register_table_d @ The reason to implement the table output routine as a macro is to avoid writing separate functions for tables of different types of data (stings as well as integers). The output is controlled by each table's {\it descriptor\/} defined below. A more sophisticated approach is possible but this code is merely a `patch' so we are not after full generality\footnote{A somewhat cleaner way to achieve the same effect is to use the \.{\_Generic} facility of \Cee11.}. @d output_table(table_desc, table_name, stream) if ( output_desc.output_##table_name ) { int i, j = 0; fprintf( stream, table_desc.preamble, table_desc.name ); for( i = 0; i < sizeof(table_name)/sizeof(table_name[0]) - 1; i++) { if ( table_desc.formatter ) { j += table_desc.formatter( stream, i ); } else { if ( table_name[i] ) { j += fprintf( stream, table_desc.separator, table_name[i] ); } else { j += fprintf( stream, "%s", table_desc.null ); } } if ( j > MAX_PRETTY_LINE && table_desc.prettify ) { fprintf( stream, "\n" ); j = 0; } } if ( table_desc.formatter ) { table_desc.formatter( stream, -i ); } else { if ( table_name[i] ) { fprintf( stream, table_desc.postamble, table_name[i] ); } else { fprintf( stream, "%s", table_desc.null_postamble ); } } if ( table_desc.cleanup ) { table_desc.cleanup( &table_desc ); } } @= struct table_d { @@; }; @ @= char *name; char *preamble; char *separator; char *postamble; char *null_postamble; char *null; bool prettify; int (*formatter)( FILE *, int ); void (*cleanup)( struct table_d * ); @ Tables are output first. The action output code must come last since it changes the values of the tables to achieve its goals. Again, a different approach is possible, that saves the data first but simplicity was deemed more important than total generality at this point. @= @@; @ One more application of `gather the names first then process' technique. @= #define _register_table_d(name) @[output_table(name##_desc, name, tables_out);@] @
@; #undef _register_table_d @ Tables will be output by each driver. Placeholder here, for \CWEAVE's piece of mind. @
= @ Action output invokes a totally new level of dirty code. If tables, constants, and tokens are just data structures, actions are actually code. We can only hope to cycle through all the actions which is enough to use \bison\ or \flex\ successfully with \TeX. The |switch| statement containing the actions is embedded in the parser code so to get access to each action we have to coerce |yyparse()| to jump to each case. This is where we need the table manipulation. This code is highly specific to the program used (since \bison\ and \flex\ code have to be `reverse engineered' to make the parser and scanner functions do what we want), here we only declare the options controlling the level of detail and the type of actions output. @= static int bare_actions = 0; /* (|static| for local variables) and |int| to pacify the compiler (for a constant initializer and compatible type) */ static int optimize_actions = 0; @ The first of the following options allows one to output an action switch without the actions themselves. It is useful when one needs to output a \TeX\ parser for a grammar file that is written in \Cee. In this case it will be impossible to cycle through actions (as no setup code has been executed), so the parser invocation is omitted. The second option splits the action switch into several macros to speed up the processing of the action code. The last argument of the `flexible' macro below is supposed to be an extended description of each option which can be later utilized by a |usage()| function. @= _register_option("bare-actions", no_argument, &bare_actions, 1, "")@; _register_option("optimize-actions", no_argument, &optimize_actions, 1, "")@; @ The rest of the action output code mimics that for table output, starting with the descriptor. To make the output format more flexible, this descriptor should probably be turned into a specialized routine. @= struct action_d { char *preamble; char *act_setup; char *act_suffix; char *action1; char *actionn; char *postamble; void (*print_rule)( int ); void (*cleanup)( struct action_d * ); }; @ @= bool output_actions:1; @ Nothing is output by default, including actions. @= @[@].output_actions = 0,@[@]@; @ @= struct action_d action_desc = {0}; @ The function below outputs the \TeX\ code of each action when the appropriate action is `run' by the action output switch. The main concern in designing these functions is to make the code easier to look at. Further explanation is given in the grammar file. If the parser is doing its job, this is the only place where one would actually see these as functions (or, rather, macros). In compliance with paragraph 6.10.8(2)\footnote{[$\ldots$] {\it Any other predefined macro names shall begin with a leading underscore followed by an uppercase letter or a second underscore.}} of the \ISO\ \Cee11 standard the names of these macros do not start with an underscore, since the first letter of \.{TeX} is uppercase\footnote{One might wonder why one of these functions is defined as a \CWEB\ macro while the other is put into the preamble `by hand'. It really makes no difference, however, the reason the second macro is defined explicitly is \CWEB's lack of awareness of `variadic' macros which produces undesirable typesetting artefacts.}. \def\TeXx{\hbox{\.{TeX\_}}} \def\TeXxx{\hbox{\.{TeX\_\_}}} @s TeX__ TeX @d TeX_( string ) fprintf( tables_out, " %s%%\n", string ) @q \CWEB\ is not aware of variadic macros, so this has to be done the old way@> @<\Cee\ preamble@>= #define TeX__( string, ... ) @[fprintf( tables_out, " " string "%s\n", __VA_ARGS__, "%" )@] @*2 \TeX\ tables. We begin with a few macros to facilitate the output of tables in the format that \TeX\ can understand. There is really no good way to represent an array in \TeX\ so a rather weak compromise was chosen. Further explanation of this choice is given in the \TeX\ file that implements the \TeX\ parser for the \bison\ input grammar. Some tables require name adjustments due to \TeX's reluctance to treat digits as part of a name. @d tex_table_generic(table_name) table_name##_desc.preamble = "\\newtable{%s}{%%\n"; table_name##_desc.separator = "%d\\or "; table_name##_desc.postamble = "%d}%%\n"; table_name##_desc.null_postamble = "0}%%\n"; table_name##_desc.null = "0\\or "; table_name##_desc.prettify = true; table_name##_desc.formatter = NULL; table_name##_desc.cleanup = NULL; output_desc.output_##table_name = 1; @d tex_table(table_name) tex_table_generic(table_name); table_name##_desc.name = #table_name; @ Outputting constants. An approach paralleling the table output scheme is taken with constants. Since constants are \Cee\ {\it macros\/} one has to be careful to avoid the temptation of using constant {\it names\/} directly as names for fields in structures. They will simply be replaced by the constants' values. When the names are concatenated with other tokens, however, the \Cee\ preprocessor postpones the macro expansion until the concatenation is complete (see clauses 6.10.3.1, 6.10.3.2, and 6.10.3.3 of the \ISO\ \Cee\ Standard, \cite[ISO/C11]). Unless the result of the concatenation is still expandable, the expansion will halt. @= struct const_d { char *format; char *name; }; @ @= #define _register_const_d(c_name) @[struct const_d c_name##_desc;@] @@; #undef _register_const_d @ @= #define _register_const_d(c_name) @[bool output_##c_name:1;@] @@; #undef _register_const_d @ @= #define _register_const_d(c_name) @[@[@].output_##c_name = 0,@[@]@] @@; #undef _register_const_d @ @= fprintf( tables_out, "%%\n%% constant definitions\n%%\n" ); @@; @ @= { int any_constants = 0; #define _register_const_d(c_name) \ \ if ( output_desc.output_##c_name ) { \ const_out( tables_out, c_name##_desc, c_name)@; \ any_constants = 1; \ } @@; #undef _register_const_d if ( any_constants ); /* this is merely a placeholder statement */ } @ Constants are very driver specific, so to make \CWEAVE\ happy $\ldots$ @= @ A macro to help with constant output. @d const_out(stream, c_desc, c_name) fprintf(stream, c_desc.format, c_desc.name, c_name); @ Action switch output routines modify the automata tables and therefore have to be output last. Since action output is highly automaton specific, we leave this section blank here, to pacify \CWEAVE\ in case this file is typeset by itself. @= @*2 Error codes. @= enum err_codes{ @@, LAST_ERROR }; @ @= NO_MEMORY, BAD_STRING, BAD_MIX_FORMAT,@[@] @ A lot more care is necessary to output the token table. A number of precautions are taken to ensure that a maximum possible range of names can be passed safely to \TeX. This involves some manipulation of \.{\\catcode}'s and control characters. The complicated part is left to \TeX\ so the output code can be kept simple. The helper function below is used to `combine' two strings. @d MAX_PRETTY_LINE 100 @= char *mix_string( char *format, ... ); @ @= char *mix_string( char *format, ... ) { char *buffer; size_t size = 0; int length = 0; int written = 0; char *formatp = format; va_list ap, ap_save; va_start( ap, format ); va_copy( ap_save, ap ); size = strnlen( format, MAX_PRETTY_LINE * 5 ); if ( size >= MAX_PRETTY_LINE * 5 ) { fprintf( stderr, "%s: runaway string?\n", __func__ ); exit(BAD_STRING); } while ( (formatp = strstr( formatp, "%" )) ) { switch( formatp[1] ) { case 's':@; length = strnlen( va_arg( ap, char * ), MAX_PRETTY_LINE * 5 ); if ( length >= MAX_PRETTY_LINE * 5 ) { fprintf( stderr, "%s: runaway string?\n", __func__ ); exit(BAD_STRING); } size += length; size -=2; formatp++; break; case '%':@; size--; formatp +=2; default: printf( "%s: cannot handle %%%c in mix string format\n", __func__, formatp[1] ); exit( BAD_MIX_FORMAT ); } } buffer = (char *)malloc( sizeof(char) * size + 1 ); if ( buffer ) { written = vsnprintf( buffer, size + 1, format, ap_save ); if ( written < 0 || written > size ) { fprintf( stderr, "%s: runaway string?\n", __func__ ); exit(BAD_STRING); } } else { fprintf( stderr, "%s: failed to allocate memory for the output string\n", __func__ ); exit(NO_MEMORY); } va_end( ap ); va_end( ap_save ); return buffer; } @*2Initial setup. Depending on the output mode (right now only \TeX\ and `tokens only' (in the \bison\ `driver') are supported) the format of each table, action field and token has to be set up. @= enum output_mode {@@, LAST_OUT}; @ And to calm down \CWEAVE\ $\ldots$ @= @ \TeX\ is the main output mode. @= enum output_mode mode = TEX_OUT; @*2Command line processing. This program uses a standard way of parsing the command line, based on |getopt_long|. At the heart of the setup are the array below with a couple of supporting variables. @= #include #include #include @ @= const char *usage = "%s [options] output_file\n"; @ @= int c, option_index = 0;@# enum higher_options{NON_OPTION = 0xFF, @@, LAST_HIGHER_OPTION}; static struct option long_options[] = { @/@[ @@;@/ {0, 0, 0, 0} @] };@# @ The main loop of the command line option processing follows. This can be used as a template for setting up the option processing. The specific cases are added to in the course of adding new features. @= opterr = 0; /* we do our own error reporting */ FOREVER { c = getopt_long (argc, argv, ":" @, long_options, &option_index); if (c == -1) break; switch (c) { case 0:@; /* it is a flag, the name is kept in |long_options[option_index].name|, and the value can be found in |long_options[option_index].val| */ break; @t}\4{@>@; @t}\4{@>@; case '?':@; fprintf (stderr, "Unknown option: `%s', see `Usage' below\n\n", argv[optind - 1]); fprintf(stderr, usage, argv[0]); exit(1); break; case ':':@; fprintf (stderr, "Missing argument for `%s'\n\n", argv[optind - 1]); fprintf(stderr, usage, argv[0]); exit(1); break; default:@; printf ("warning: feature `%c' is not yet implemented\n", c); } } if (optind >= argc) { fprintf( stderr, "No output file specified!\n" ); } else { tables_out = fopen( argv[optind++], "w" ); } if (optind < argc) { printf ("script files to be loaded: "); while (optind < argc) printf ("%s ", argv[optind++]); putchar ('\n'); } @ @= #define _register_option(name, arg_flag, loc, val, exp) @[{name, arg_flag, loc, val},@[@]@] @@; #undef _register_option @ In addition to spelling out the full command line option name (such as \.{--help}) |getopt_long| gives the user a choice of using a shortcut (say, \.{-h}). As individual options are treated in drivers themselves, there are no shortcuts to supply at this point. We leave this section (and a number of others) empty to be filled in with the driver specific code to pacify \CWEAVE. @= @ Some options have one-letter `shortcuts', whereas others only exist in `fully spelled-out' form. To easily keep track of the latter, a special enumerated list is declared. To add to this list, simply add to the \CWEB\ section below. @= @ @= @ @=