GrammarRuleTreeNotes

GrammarRuleTreeNotes.html

Red-Black Tree sample java code, public domain:
douglea\RBTreeDocIndex.html
douglea\RBCell.java
douglea\RBTree.java

An examination of RBCell.java reveals that it can be customized and specialized to replace C:\astevel3\java\src\mmj\lang\SynNode.java.

Each Grammar Rule will be identified by a unique sequence of Cnst objects -- an Expression with Variables converted into their Type codes. For example, "-. ph" is converted to "-." followed by "wff", a sequence of two Cnst objects.
All Grammar Rules that begin with a certain Cnst will form a Red-Black binary tree whose root is stored in the Cnst object corresponding to that starting Cnst symbol. (The root node will have null Left and Right pointers.)
Each node of the Rule trees will contain Up, Down and GrammarRule instance variables, where Down points to the next Cnst in the Expression and Up points to the previous Cnst in the Expression. In the example above with Constant sequence "-.", "wff", the Constant object for "-." will have a reference (pointer) to the root of a tree whose key is "-.", and that root down will have a Down reference/pointer whose key is "wff" -- assuming that this is the initial insert.
If the Grammar Rule instance in a node is null that means that no Grammar Rule that begins with the original Constant terminates with the node's Constant.
Let's make this more specific with an example. Suppose we have a sequence "A B C D". That sequence will correspond to 4 trees, with "A" pointing to a sub-tree with keys derived from the 2nd Constant symbol in expressions that begin with "A". And likewise, "B" -- in the 2nd level of trees points to a subtree derived from the 3rd Constant symbol in expressions that begin with "A B", and so on. Finally, the "D" node in the 4th level tree will contain a reference/pointer to the Grammar Rule corresponding to the sequence of Constants, "A B C D".
Thus, the set of all Grammar Rules will be represented by a forest of trees, one top level tree for each unique starting Constant (the most popular initial Constant in Syntax Axioms is, no doubt, "("...which will have its own tree pointed to by the "(" Cnst object.

Using Red-Black trees achieves well-balanced binary trees at minimal insert cost. In the initial coding effort for MMJ, SynNode.java was used in a similar scenario except that Cnst did not point to the root of the Cnst's syntax sub-tree -- instead, a SynTree.java was used, meaning that a lookup was needed to find the node for the initial Constant of an expression. Not only can that initial lookup be eliminated, but the SynTree insert process involved a bulk add of sorted objects that was designed to achieve a balanced tree -- and also precluded dynamic adds later! So, with Red-Black binary trees we can avoid sorting the Grammar Rules and we can load them incrementally, if desired.

Another important change will be made involving the Grammar Rule trees: no duplicates will be allowed! The original coding effort allowed duplicate nodes pointing to Stmt objects with duplicate Expressions. In the new scheme of things, duplicate Grammar Rules will be purposely and skillfully dropped -- with our new theoretical understanding of Unique Readability and Ambiguity, the elimination of duplicates is actually a key objective!

As mentioned RBCell.java will be "stolen" from the public domain. All of the unnecessary functions and code will be stripped out, including code involved with deletions (not allowed at this time.) The RBCell code was developed by Doug Lea as a pre-Sun java Collections framework, and it is very sophisticated. In principle, the Sun Collections framework could be used, but instead of building hierarchies of trees using only RBCell type nodes, each level of Trees would be an explicitly separate Collection, with its own root, and so on; and a reference to a "payload" element would be stored in each node, meaning at least one extra layer of abstration/inefficiency at every level, not counting use of Comparators. With this "off the shelf" approach, the backing representation wouldn't necessarily have to be a tree -- a hashmap might work just as well. There are benefits to using the off the shelf approach, especially in an online update environment with concurrent processings underway. But for now, because I am stubborn and a jerk, I will go for the custom approach now, striving for speed (the idea of just putting all Expression sequences into a big Collection would be feasible, except for the fact that we need to be able to search for overlaps in initial segments of sequences and a tree structure is ideal for that purpose -- i.e. we could have "A E G H I", "A E G" and "A E" as (dubious) Rules and searching the "A" tree followed by a search of the "E" subtree seems simpler and faster.