nLab
formal grammar

Contents

Contents

 Idea

Formal grammars are sets of rules which describe which strings of words are considered grammatical. They can be used to characterize the syntax of both natural language and programming languages.

Unrestricted grammar

Fix a finite set VV called the vocabulary, also called the terminal symbols. A language is a subset of the free monoid V *V^*, i.e. strings of words. Let V +V^+ denote the free semigroup, i.e. non-empty strings. X+YX + Y and X×YX \times Y denote the disjoint union and the Cartesian product respectively.

An unrestricted grammar is given by a tuple G=(V,X,R,s)G = (V, X, R, s) where XX is a finite set of nonterminal symbols with a start symbol sXs \in X and R(V+X) +×(V+X) *R \subseteq (V + X)^+ \times (V + X)^* is finite a set of rewrite rules. The language L(G)L(G) generated by the grammar is usually defined as follows:

  1. Define the “rewrite step” relation ( R)(V+X) *×(V+X) *(\to_R) \subseteq (V + X)^* \times (V + X)^* where (αβγ) R(αδγ)(\alpha \beta \gamma) \to_R (\alpha \delta \gamma) for all (β,δ)R(\beta, \delta) \in R and α,γ(V+X) *\alpha, \gamma \in (V + X)^*.
  2. Take its reflexive transitive closure ( R *)(\to_R^*), i.e. s R *ts \to_R^* t iff there is some sequence of rewrite steps s Rs 1 Rs 2 R... Rts \to_R s_1 \to_R s_2 \to_R ... \to_R t. This sequence is called a grammatical derivation.
  3. Let L(G)={w 1...w nV *|s R *w 1...w n}L(G) = \{ w_1 ... w_n \in V^* | s \to_R^* w_1 ... w_n\} for sXs \in X the start symbol.

The construction of R *\to_R^* can be seen as the free preordered monoid generated by the rewrite rules RR. Grammatical derivations can be seen as string diagrams in the free monoidal category with the symbols V+XV + X as generating objects and the rewrite rules RR as generating arrows.

Other frameworks

Context free grammars are defined in the same way as unrestricted grammars, but they only allow rules with a single non-terminal symbol as domain. In that particular case, the grammatical derivations are trees with the start symbol as root and the words as leaves.

In categorial grammars, monoidal categories are replaced by biclosed monoidal category. The derivations are trees with two kinds of rules: 1) a finite language-dependent dictionary of rules wtw \to t with wVw \in V and tF(X)t \in F(X), for F(X)F(X) the free residuated monoid (see Lambek (1958)) and 2) an infinite number of language-agnostic rules, for the application of each higher order type.

Similarly, pregroup grammars are constructed from a dictionary with types coming from a free pregroup P(X)P(X). The rewrite rules are the counit maps x *x1x^* \otimes x \to 1 and x *x1x \otimes {}^*x \to 1, canceling a non-terminal symbol xP(X)x \in P(X) with its left and right adjoints x *x^* and *x{}^*x. The derivations are planar string diagrams in a rigid monoidal category, the counit maps are drawn as cup-shaped wires connecting the words in a sentence.

History

The origins of grammar go back to 4th century BC India: the Sanskrit philologist Pānini showed ancient mastery of context-sensitive grammars to analyse the syntax of Sanskrit.

Unrestricted grammars were introduced by Noam Chomsky as the level zero of his hierarchy, they are equivalent to Turing machines in expressive power.

The similar concept of semi-Thue system, also known as a string rewriting system, was introduced thirty years before Chomsky’s work on grammar, see Thue (1914). It makes no distinction between terminal and non-terminal symbols, and allows rules with empty left-hand sides.

A third, equivalent, formulation was proved to be undecidable independently by Emil Post? and Andrey Markov: the word problem for semigroups.

References

  • Saroja Bhate and Subhash Kak. Pānini’s Grammar and Computer Science. Annals of the Bhandarkar Oriental Research Institute (1993)
  • Axel Thue. Probleme Uber Veränderungen von Zeichenreihen Nach Gegebenen Regeln. (1914)
  • Emil L. Post. Recursive Unsolvability of a problem of Thue. Journal of Symbolic Logic (1947)
  • A Markov. On certain insoluble problems concerning matrices. In Doklady Akad (1947)
  • J. Lambek. The mathematics of sentence structure. American Mathematics Monthly (1958)
  • J. Lambek. Type grammar revisited. Logical Aspects of Computational Linguistics (1999)

Last revised on November 24, 2020 at 14:09:21. See the history of this page for a list of all contributions to it.