Text Mode Formula Formatting

Purpose:

Enhance mmj2 by improving the rendering of Metamath formulas when using plain ASCII text.

Although the Metamath program, metamath.exe, provides a mechanism to generate LaTeX and either .gif or Unicode HTML output files that display the modern math equivalents of the user-defined ASCII symbols, this typesetting mechanism is not available within mmj2 and its Proof Assistant. It would be very helpful if the mmj2 Proof Assistant displayed ASCII text Metamath formulas using improvements in formula layout similar to those shown here: mdsymlem5Example.html

Constraints:

A rendered formula's sequence of Metamath Constants and Variables must be unchanged, and be valid for input to Metamath; therefore, the rendering options available for use consist of ways to employ whitespace and newline tokens to improve readability.
The mmj2 code, including the mmj2 Proof Assistant, uses a single fixed-width font. Alternative designs are feasible, including use of variable fonts and size/emphasis/italic/color options, as well as .gif images, MathML output, etc. Those alternatives are out of the scope of this enhancement by definition (but saved for a subsequent project!)
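
The first constraint can be stated as a simple check: because Metamath tokens are separated by whitespace, a rendering is acceptable only if splitting it on whitespace gives back the original token sequence. The following is a minimal sketch of that check; the function name is hypothetical and is not part of mmj2.

```python
def same_token_sequence(original: str, rendered: str) -> bool:
    """Return True if both strings hold the same tokens in the same order.

    str.split() with no argument splits on any run of spaces, tabs or
    newlines, so only whitespace differences are ignored.
    """
    return original.split() == rendered.split()


original = "|- ( ph -> ( ps -> ph ) )"
rendered = "|- ( ph ->\n     ( ps -> ph ) )"   # same tokens, different layout
reordered = "|- ( ps -> ( ph -> ph ) )"        # tokens changed: not allowed

print(same_token_sequence(original, rendered))   # True
print(same_token_sequence(original, reordered))  # False
```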

Background Items:

A Metamath ".mm" file is coded in 7-bit ASCII text according to a notation scheme invented by the author(s) of the file. For example, logical implication may be represented by the (constant) symbol string "->" .
Syntax (builder) Axioms define the syntax rules of a file and allow complex formulas to be built up by recursive application of the rules. For example "wi" is the label (name) of a Syntax Axiom in set.mm that defines the syntax rule for logical implication: "( ph -> ps )" is defined as a wff, given any two variables, "ph" and "ps" that are themselves wffs.
"Formula" in this document (and mmj2) refers to a complete Metamath formula consisting of a constant followed by an "Expression" consisting of zero or more constants and variables. A "Sub-Expression" is a substring within an Expression, generally used as a substitution for a variable within a formula. For example: "|- ( ph -> ( ps -> ph ) )" is a Formula, "( ph -> ( ps -> ph ) )" is an Expression, and "( ps -> ph )" is a Sub-Expression (of the previous Expression).
"RPN" means Reverse Polish Notation. In Metamath/mmj2 a list or array of statement labels in RPN format can be used to construct an expression or to specify a proof. A valid theorem's formula has both a proof RPN and a parse RPN; they are equivalent ways to generate an Expression, with proofs using logical axioms and assertions, and parses using Syntax Axioms and variable hypotheses. Note that RPNs are convertible to trees, and vice versa. For example, the parse tree for Expression "( ph -> ( ps -> ph ) )" can be converted to RPN as "ph ps ph wi wi" (using the set.mm file's definitions), and the proof of a theorem "|- ( ph -> ( ps -> ph ) )" could be written in RPN as "ph ps ax-1" (which is a proof tree converted to RPN format).
The mmj2 Grammar and Proof Assistant packages validate the syntax of every input formula and expression using the grammar specified by the .mm file's author (via Syntax Axioms, Variable Hypotheses, etc.). For this reason we can rely on the syntactic correctness of every formula and expression that will be typeset. In fact, we will be typesetting the parse trees, as these contain the syntactic structure of expressions and therefore provide us with helpful typesetting clues.
There are Four basic types of Syntax Axioms that can be created using the Metamath .mm file format: 1) Type Conversions -- example, formula "class x", specifying that every set is a class; 2) Named Typed Constants -- example formula "class 1", specifying that the constant "1" can be substituted for any class variable ( i.e. "1" signifies a class object, although Metamath would theoretically allow an eccentric author to subsequently use "1" as punctuation); 3) Nulls Permitted -- example formula "class" specifying that a null string -- length zero string -- can be substituted for any class variable; and 4) Notation Axioms -- example formula "wff ( ph -> ps )", which has already been discussed.
There are Three basic types of notation schemes: 1) Infix (example expression "3 * 2 + 1"); 2) Prefix (example expression "+ * 3 2 1"); and 3) Postfix (example expression "3 2 * 1 +"). Combinations of these schemes are possible, for example prefix and infix might be used in a single grammar without ambiguity. Also, notation schemes involving optional parentheses and operator precedence rules are commonly used in modern programming languages (these seem to require the use of Type Conversion Syntax Axioms, as described on Page 7 in Chapter 1 of David A. Schmidt's "Denotational Semantics").
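
The RPN/tree equivalence described in the Background Items can be illustrated with a small stack machine. This is only a sketch under stated assumptions: the SYNTAX_AXIOMS table is hand-written and models just "wi", the function names are hypothetical, and variables are written by symbol ("ph") as in the example above rather than by their variable hypothesis labels.

```python
# Hypothetical, hand-written table: label -> (number of variables, template).
# Only the "wi" Syntax Axiom from set.mm is modelled here.
SYNTAX_AXIOMS = {"wi": (2, "( {} -> {} )")}


def rpn_to_tree(rpn):
    """Build a (label, children) parse tree from a list of RPN labels."""
    stack = []
    for label in rpn:
        if label in SYNTAX_AXIOMS:
            arity, _ = SYNTAX_AXIOMS[label]
            if len(stack) < arity:
                raise ValueError("RPN underflow at " + label)
            split = len(stack) - arity
            children = stack[split:]
            del stack[split:]
            stack.append((label, children))
        else:
            stack.append((label, []))      # a variable: leaf node
    if len(stack) != 1:
        raise ValueError("RPN does not reduce to a single tree")
    return stack[0]


def tree_to_rpn(node):
    """Post-order walk: children first, then the node's own label."""
    label, children = node
    rpn = []
    for child in children:
        rpn.extend(tree_to_rpn(child))
    rpn.append(label)
    return rpn


def tree_to_expr(node):
    """Rebuild the Expression text by filling in each axiom's template."""
    label, children = node
    if not children:
        return label
    _, template = SYNTAX_AXIOMS[label]
    return template.format(*(tree_to_expr(child) for child in children))


tree = rpn_to_tree("ph ps ph wi wi".split())
print(tree_to_expr(tree))           # ( ph -> ( ps -> ph ) )
print(" ".join(tree_to_rpn(tree)))  # ph ps ph wi wi
```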

Note: "Pretty Printing" Metamath formulas is not completely dissimilar to the problem faced by computer programmers wishing to format complicated "if" statements. Computer programmers often use conventions such as indenting nested "if"s and aligning and grouping various symbols in order to facilitate comprehension by readers; instead of writing one long, continuous stream of program code, programmers often try to break down a statement into manageable chunks that exhibit the logical structure of the code. We can attempt the same thing with Metamath formulas, though with variations that take into account the differences between a Metamath formula and say, Java code.

The Problem: user difficulties in reading and comprehending Metamath formulas --

1. User unfamiliarity with a .mm file's Syntax Axioms and symbol schemes.
2. Metamath formulas are linear, one-dimensional objects -- contrary to standard mathematical typesetting, which employs two-dimensional shapes (e.g. powers and roots), as well as varying font sizes, font families and text styles.
3. Absence of "dummy variables" (or named sub-expressions) to represent repeated sub-expressions within a formula or array of formulas; these dummy variables are common in texts to manage notational complexity, and even though the metamath.exe program employs them in its Proof Assistant, once a proof is completed the dummy variables used during proof creation disappear.
4. A profusion of parentheses.
5. Lengthy, complex sub-expressions that often mask the hierarchical syntactical structure of a formula.
6. Unintelligent line breaks and inter-symbol spacing (note: an author of a .mm file may write the "source" file however she pleases, but the author's line breaks and spacing are discarded during output generation -- this holds also for proof step formulas, which lose all formatting during the round trip into Metamath RPN -- Reverse Polish Notation -- and back.)

Difficulties 1 through 3 are beyond the scope of this enhancement. Although much can be said about the first 3 difficulties listed above (and the discussion is interesting), we drop those items from further discussion in this document.

Difficulties 4 through 6 are within the scope of this enhancement, and certain minor improvements in the user experience will result from the new code. Again, please have a look at: mdsymlem5Example.html

The Solutions:

No Free Lunch

The "No Free Lunch" saying (or "TANSTAAFL") holds here too. The only possible way to perfectly render every possible notation scheme for every person reading formulas is to allow customization of formatting at the level of the individual syntax axiom by the user -- and provide other parameters and program "smarts" to handle all of the special cases. But perfection in this matter is likely unattainable because the average user is unwilling to spend the hours needed to customize processing at the level of detail that will provide a perfect personal viewing experience. It just won't happen. So what we need in code is something that is just good enough without lots of tweaking and customization -- something that isn't terrible -- but that provides for the possibility of some customization and extension by .mm file authors and users.

Here is a basic outline of the solution:

Initially, at least, formatting specifications are not customizable at the level of the individual syntax axiom.
Instead, a single Text Mode Formula Formatting Method ("TMFF Method") is in use at any one time.
A TMFF Method is customized via mmj2 RunParm parameters/options, and together, the specifically chosen parameter/option values with the specific TMFF Method make up a "TMFF Scheme". (A subsequent enhancement might allow a TMFF Scheme to be specified for use with a list of Syntax Axioms, thus overriding the default Scheme -- this would allow, for example, a "prefix" Scheme for certain Syntax Axioms to be used alongside a default "infix" style Scheme for the rest of the syntax.)
A "TMFF Format" is assigned a TMFF Scheme. Up to 10 TMFF Formats can be defined: Format 0, Format 1, Format 2, ..., Format 9. However, an unlimited number of TMFF Schemes can be defined, and reassigned during processing via RunParms. By design, "Format 0" is reserved and signifies "TMFF Disabled" or "Unformatted" -- rendering is performed using the old algorithm (the old, unformatted output is used if any errors are encountered by TMFF when rendering a formula.)
Help with the problem of user difficulties reading nested parentheses is provided by a "maxDepth" parameter at the TMFF Scheme level. This specifies the maximum depth of a parse sub-tree, not counting leaf nodes -- Variable Hypotheses -- or Nulls Permitted, Type Conversion or Named Typed Constants. Thus, "( ph -> ( ps -> ph ) )" is defined as having depth = 2. If the depth of an expression's parse tree exceeds maxDepth then the expression is split across multiple lines and broken down according to the TMFF Scheme's specifications instead of being output on a single line.
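
The depth rule in the last item can be sketched as a short recursive function. The tree representation, helper names and the TYPE_CONVERSIONS set are assumptions made for illustration ("cv" is assumed to be the set.mm label of the "class x" Type Conversion); this is not mmj2's own code.

```python
# Parse-tree node: (label, children). Leaf nodes (Variable Hypotheses,
# Named Typed Constants, Nulls Permitted) have no children.
TYPE_CONVERSIONS = {"cv"}   # assumed label set; these nodes add no depth


def var(symbol):
    return (symbol, [])


def wi(antecedent, consequent):
    return ("wi", [antecedent, consequent])


def depth(node):
    """Depth of a parse sub-tree, counting only Notation Axiom nodes."""
    label, children = node
    if not children:
        return 0                       # leaves are not counted
    deepest = max(depth(child) for child in children)
    if label in TYPE_CONVERSIONS:
        return deepest                 # bypassed: contributes nothing
    return 1 + deepest


def needs_breaking(node, max_depth):
    """True if the sub-expression must be split across lines."""
    return depth(node) > max_depth


example = wi(var("ph"), wi(var("ps"), var("ph")))   # ( ph -> ( ps -> ph ) )
print(depth(example))               # 2
print(needs_breaking(example, 1))   # True
print(needs_breaking(example, 2))   # False
```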

Sample RunParms for TMFF processing:

1. No change to existing processing -- i.e. no special formula formatting -- using defaults:

* (other RunParms omitted) - final RunParm to run the Proof Assistant GUI:

RunProofAsstGUI

2. Default TMFF parameters coded explicitly to specify the default values -- except that by default TMFF is Off/Disabled (see RunParm "TMFFUseFormat"):

* Define TMFF Schemes: 

* Method names are hardcoded and fixed: "AlignColumn" and "Flat";

* "Unformatted" is RESERVED for internal use.

* Scheme Names are assigned by the user; must be unique and non-blank.

* RunParmName,SchemeName,MethodName,MaxDepth,AlignByValue,AlignAtNbr,AlignAtValue

TMFFDefineScheme,AlignVarDepth1,AlignColumn,1,Var,1,Var

TMFFDefineScheme,AlignVarDepth2,AlignColumn,2,Var,1,Var

TMFFDefineScheme,AlignVarDepth3,AlignColumn,3,Var,1,Var

TMFFDefineScheme,AlignVarDepth4,AlignColumn,4,Var,1,Var

TMFFDefineScheme,AlignVarDepth5,AlignColumn,5,Var,1,Var

TMFFDefineScheme,AlignVarDepth99,AlignColumn,99,Var,1,Var

TMFFDefineScheme,Flat,Flat

TMFFDefineScheme,PrefixDepth3,AlignColumn,3,Sym,2,Sym

TMFFDefineScheme,PostfixDepth3,AlignColumn,3,Sym,1,Sym

TMFFDefineScheme,TwoColumnAlignmentDepth1,TwoColumnAlignment,1

TMFFDefineScheme,TwoColumnAlignmentDepth2,TwoColumnAlignment,2

TMFFDefineScheme,TwoColumnAlignmentDepth3,TwoColumnAlignment,3

TMFFDefineScheme,TwoColumnAlignmentDepth4,TwoColumnAlignment,4

TMFFDefineScheme,TwoColumnAlignmentDepth5,TwoColumnAlignment,5

TMFFDefineScheme,TwoColumnAlignmentDepth99,TwoColumnAlignment,99


* Define TMFF Formats:

* "Format 0" is RESERVED for internal use - uses Method "Unformatted".

* RunParmName,FormatNbr,SchemeName

TMFFDefineFormat,1,AlignVarDepth1

TMFFDefineFormat,2,AlignVarDepth2

TMFFDefineFormat,3,AlignVarDepth3

TMFFDefineFormat,4,AlignVarDepth4

TMFFDefineFormat,5,AlignVarDepth5

TMFFDefineFormat,6,AlignVarDepth99

TMFFDefineFormat,7,Flat

TMFFDefineFormat,8,PrefixDepth3

TMFFDefineFormat,9,PostfixDepth3

TMFFDefineFormat,10,TwoColumnAlignmentDepth99

TMFFDefineFormat,11,TwoColumnAlignmentDepth1

TMFFDefineFormat,12,TwoColumnAlignmentDepth2

TMFFDefineFormat,13,TwoColumnAlignmentDepth3

TMFFDefineFormat,14,TwoColumnAlignmentDepth4

TMFFDefineFormat,15,TwoColumnAlignmentDepth5

      

* To turn on/enable TMFF, use Format Nbr >= 1 (off/disabled = Format 0):

* RunParmName,FormatNbr

TMFFUseFormat,3

      

* Proof Assistant RunParms that affect TMFF formatting:

ProofAsstFormulaLeftCol,20

ProofAsstFormulaRightCol,79

ProofAsstTextColumns,80


* NOTE: formulas are now output using TMFF because TMFF is enabled!

*

* NOTE: TMFF relies upon grammatical parsing of .mm statements! Input

*       the "LoadFile" and "Parse" RunParms before the TMFF RunParms

*       for best results.

PrintSyntaxDetails,aceq1

      

* ... doit!
RunProofAsstGUI
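
The indirection in the RunParms above (TMFFUseFormat selects a Format number, the Format names a Scheme, the Scheme names a Method plus its parameters) can be sketched as follows. This is an illustrative reader written for this document, not mmj2's RunParm parser; the function name and the returned dictionary shape are assumptions.

```python
def resolve_tmff(runparm_text):
    """Return the Scheme selected by TMFFUseFormat as a small dict."""
    schemes = {}     # scheme name -> {"method": ..., "params": [...]}
    formats = {}     # format number -> scheme name
    use_format = 0   # Format 0 means TMFF is off/disabled
    for raw in runparm_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):
            continue                       # blank line or comment
        name, *values = [field.strip() for field in line.split(",")]
        if name == "TMFFDefineScheme":
            schemes[values[0]] = {"method": values[1], "params": values[2:]}
        elif name == "TMFFDefineFormat":
            formats[int(values[0])] = values[1]
        elif name == "TMFFUseFormat":
            use_format = int(values[0])
    if use_format == 0:
        return {"method": "Unformatted", "params": []}
    return schemes[formats[use_format]]


SAMPLE = """\
* RunParmName,SchemeName,MethodName,...
TMFFDefineScheme,AlignVarDepth3,AlignColumn,3,Var,1,Var
TMFFDefineScheme,Flat,Flat
TMFFDefineFormat,3,AlignVarDepth3
TMFFDefineFormat,7,Flat
TMFFUseFormat,3
"""

print(resolve_tmff(SAMPLE))
# {'method': 'AlignColumn', 'params': ['3', 'Var', '1', 'Var']}
```

Without a TMFFUseFormat line the sketch falls back to "Unformatted", mirroring the statement that TMFF is off by default.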

TMFF Method Processing Overview:

The basic idea in TMFF Method processing is to use a formula's parse tree to determine where to insert line breaks. Take, for example, a formula, "X" = "|- ( ( ph -> ( ps -> ch ) ) -> ( ( ph -> ps ) -> ( ph -> ch ) ) )".

The parse tree for X looks like this (in set.mm notation):

                       
                       wi
            +----------+-----------+
            wi                     wi
       +----+----+           +-----+------+
       ph        wi          wi           wi
              +--+--+     +--+--+      +--+--+
              ps    ch    ph    ps     ph    ch

( ( ph -> ( ps -> ch ) ) -> ( ( ph -> ps ) -> ( ph -> ch ) ) )

Conceptually, the TMFF Method begins at the root node of formula X's parse tree. But a formula's parse tree does not provide the first symbol of the formula, which is always a constant (e.g. "|-" or "wff" or another user-defined constant symbol), so the first symbol of the formula must be input with the parse tree, and is output as given (updating current row/column in the output area).

Each node of the parse tree, starting with the root node is formatted in turn as a sub-expression using a function we will call "formatSubExpr". The current node's sub-expression is checked against the input Method option/parameters, such as maxDepth, current column number, etc. If the sub-expression's attributes exceed any of the given values then the sub-expression must be broken up using a line break and space characters. In this case, the constants and the variables of the sub-expression are formatted and output in turn, from left to right; each variable corresponds to a child node of the current node, and is formatted and output using a recursive call to formatSubExpr.

The main difference between TMFF Methods (as initially foreseen) is the way that processing deals with line breaks inside sub-expressions. The AlignColumn method arranges the variable or constant portions of a sub-expression in a single column based on the input option/parameters.
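
The recursive processing described above can be sketched in a few dozen lines. This is a simplified illustration, not mmj2's implementation: it handles only the "wi" notation, aligns on the first variable of a sub-expression (the "Var,1,Var" setting), and omits the fallback to unformatted output when a line runs out of room. The names TEMPLATES, Out, format_sub_expr and format_formula are hypothetical.

```python
# Hypothetical notation table: label -> template. Strings are constants,
# integers index the node's children (its variables).
TEMPLATES = {"wi": ["(", 0, "->", 1, ")"]}


def var(symbol):
    return (symbol, [])


def wi(a, b):
    return ("wi", [a, b])


def depth(node):
    _, kids = node
    return 1 + max(depth(k) for k in kids) if kids else 0


def flat(node):
    """The sub-expression on one line, tokens separated by one space."""
    label, kids = node
    if not kids:
        return label
    return " ".join(p if isinstance(p, str) else flat(kids[p])
                    for p in TEMPLATES[label])


class Out:
    """Output area that tracks the current line and column (1-based)."""

    def __init__(self, left_col):
        self.lines = []
        self.line = " " * (left_col - 1)

    def next_col(self):
        # Column where the next token would start on the current line.
        return len(self.line) + (2 if self.line.strip() else 1)

    def emit(self, text):
        self.line += (" " if self.line.strip() else "") + text

    def newline(self, col):
        self.lines.append(self.line)
        self.line = " " * (col - 1)

    def text(self):
        return "\n".join(self.lines + [self.line])


def format_sub_expr(node, out, max_depth, right_col):
    label, kids = node
    one_line = flat(node)
    fits = out.next_col() + len(one_line) - 1 <= right_col
    if not kids or (depth(node) <= max_depth and fits):
        out.emit(one_line)
        return
    align_col = None
    for part in TEMPLATES[label]:
        if isinstance(part, str):
            out.emit(part)                    # constant: output in sequence
        else:
            if align_col is None:
                align_col = out.next_col()    # first variable sets the column
            else:
                out.newline(align_col)        # later variables line up under it
            format_sub_expr(kids[part], out, max_depth, right_col)


def format_formula(type_code, tree, max_depth, left_col=20, right_col=79):
    out = Out(left_col)
    out.emit(type_code)     # first symbol is supplied alongside the tree
    format_sub_expr(tree, out, max_depth, right_col)
    return out.text()


ph, ps, ch = var("ph"), var("ps"), var("ch")
X = wi(wi(ph, wi(ps, ch)), wi(wi(ph, ps), wi(ph, ch)))
print(format_formula("|-", X, max_depth=3))
print(format_formula("|-", X, max_depth=1))
```

Because the sketch treats "ph", "ps" and "ch" as true leaves, sub-expressions such as "( ps -> ch )" have depth 1 and stay on one line when max_depth is 1. Only whitespace is ever inserted, so the token sequence of the formula is unchanged.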

Examples:

Here is how formula X would be formatted with Format 3 above (using Scheme "AlignVarDepth3" with maxDepth=3):

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890

                   |- ( ( ph -> ( ps -> ch ) ) ->
                        ( ( ph -> ps ) -> ( ph -> ch ) ) )

Here is X again but with Format 1 above (using Scheme "AlignVarDepth1", which uses maxDepth=1):

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890

                   |- ( ( ph ->
                          ( ps ->
                            ch ) ) ->
                        ( ( ph ->
                            ps ) ->
                          ( ph ->
                            ch ) ) )

A few tricky points to mention about Method processing and the above examples:

Column-ization begins in the output line at the "At" position -- and tokens to the left of the "At" position are output sequentially, without line breaks.
In the examples above the variables such as "ph", "ps" and "ch" should be understood to represent sub-expressions that might themselves require formatting.
In the event that the Method formatting logic runs out of room on a line and cannot break down a sub-expression further (for example, room remains for only 1 output character and the symbol to be printed is "ph") -- or if any other error is encountered, then the entire formula is output unformatted. For this reason, not to mention screen/page width, it is expected that "AtNbr" will never exceed 2, or 3 at the very most.
Type Conversion Syntax Axioms are ignored -- bypassed. The logic ignores these nodes and goes down the parse tree to the Type Conversion's child, which may be a Variable Hypothesis node, a Named Typed Constant Syntax Axiom Node, or a Nulls Permitted Syntax Axiom node.
Named Typed Constants and Nulls Permitted Syntax Axioms are treated as variables -- specific instances of variables -- for the purposes of formatting a sub-expression. That means that they are not sub-formatted (broken down), and "At=Cnst" will not capture (trigger on) them. Thus, if you have, for example, a prefix notation such as "operatorVar op1 op2" (see peano.mm at metamath.org), alignment on the actual operator variables should be coded with option/parameters "alignAtValue = Sym, alignAtNbr = 2, alignByValue = Sym", which will result in the following output:

          operatorVar op1
                      op2