
Lexical Analysis and Parsing

By combining a Unicode-aware lexer that tokenizes raw text with a parser that applies a formal grammar, custom parsing libraries and application languages can be built in C without relying on a fixed, general-purpose implementation.

The provided JSON, XML, and calculator examples show how the same approach can be used for data formats, markup languages, and Unicode-aware application syntax. Each example can be modified, extended, and optimized to match the application’s data model, memory strategy, validation rules, and performance requirements.

Lexical Scanner Utility

Designed for building Unicode-aware scanners in C, this tool generates scanners that convert raw text into token streams for higher-level parsers. It is compatible with Flex syntax and adds direct Unicode character-range support, so identifiers, operators, symbols, and text classes can be defined in terms of the characters that appear in the input.
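To give a rough idea of what matching by Unicode character range involves, here is a small hand-written C sketch. It is not output from the scanner utility, and it does not use the utility's syntax. The names utf8_decode, ident_start, and scan_identifier, and the particular ranges chosen, are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one UTF-8 sequence from s (n bytes available). Stores the code
   point in *cp and returns the bytes consumed, or 0 if the input is empty,
   truncated, overlong, a surrogate, or otherwise malformed. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    unsigned char b = s[0];
    size_t len;
    uint32_t v, min;
    if (b < 0x80) { *cp = b; return 1; }
    else if ((b & 0xE0) == 0xC0) { len = 2; v = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { len = 3; v = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { len = 4; v = b & 0x07; min = 0x10000; }
    else return 0;
    if (n < len) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) return 0;
    *cp = v;
    return len;
}

typedef struct { uint32_t lo, hi; } cp_range;

/* Illustrative identifier-start class (an assumption, not a standard). */
static const cp_range ident_start[] = {
    { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
    { 0x00C0, 0x00D6 }, { 0x00D8, 0x00F6 }, { 0x00F8, 0x00FF }, /* Latin-1 letters */
    { 0x0391, 0x03A1 }, { 0x03A3, 0x03A9 }, { 0x03B1, 0x03C9 }, /* Greek letters */
    { 0x0410, 0x044F },                                         /* basic Cyrillic */
    { 0x4E00, 0x9FFF },                                         /* CJK ideographs */
};

static int is_ident_start(uint32_t cp)
{
    for (size_t i = 0; i < sizeof ident_start / sizeof ident_start[0]; i++)
        if (cp >= ident_start[i].lo && cp <= ident_start[i].hi) return 1;
    return 0;
}

static int is_ident_part(uint32_t cp)
{
    return is_ident_start(cp) || (cp >= '0' && cp <= '9');
}

/* Returns the byte length of the identifier at the start of text, or 0. */
size_t scan_identifier(const char *text)
{
    const unsigned char *s = (const unsigned char *)text;
    size_t n = strlen(text), pos, len;
    uint32_t cp;
    len = utf8_decode(s, n, &cp);
    if (len == 0 || !is_ident_start(cp)) return 0;
    pos = len;
    while ((len = utf8_decode(s + pos, n - pos, &cp)) != 0 && is_ident_part(cp))
        pos += len;
    return pos;
}
```

Called on the UTF-8 text "αβγ1 + x", scan_identifier reports a 7-byte identifier (three two-byte Greek letters plus one digit). A generated scanner would typically compile such ranges into tables rather than search them linearly, so treat this only as a picture of the idea.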

The scanner also supports lexer start conditions, making it useful for custom languages, data-format parsers, configuration files, markup processing, command interpreters, formula languages, source-code tools, and other internationalized text-processing applications.
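In Flex, start conditions are declared with %s or %x and switched with BEGIN. The hand-written C sketch below imitates that behavior with an explicit state field, so the same characters are tokenized differently inside and outside a tag. It is a simplified illustration, not generated code. The token names, the scanner struct, and count_tokens are hypothetical.

```c
#include <stddef.h>

enum state { CONTENT, TAG };   /* stand-ins for two start conditions */
enum token { T_EOF, T_TEXT, T_OPEN, T_CLOSE, T_NAME, T_ERROR };

typedef struct {
    const char *p;       /* current position in a NUL-terminated input */
    enum state st;       /* active "start condition" */
    const char *start;   /* start of the most recent token */
    size_t len;          /* byte length of the most recent token */
} scanner;

void scanner_init(scanner *sc, const char *input)
{
    sc->p = input;
    sc->st = CONTENT;
    sc->start = input;
    sc->len = 0;
}

enum token next_token(scanner *sc)
{
    enum token t;
    if (sc->st == TAG)
        while (*sc->p == ' ') sc->p++;      /* spaces are skipped only in tags */
    sc->start = sc->p;
    if (*sc->p == '\0') {
        t = T_EOF;
    } else if (sc->st == CONTENT) {
        if (*sc->p == '<') {                /* like BEGIN(TAG) */
            sc->p++; sc->st = TAG; t = T_OPEN;
        } else {                            /* run of text up to the next '<' */
            while (*sc->p && *sc->p != '<') sc->p++;
            t = T_TEXT;
        }
    } else {
        if (*sc->p == '>') {                /* like BEGIN(INITIAL) */
            sc->p++; sc->st = CONTENT; t = T_CLOSE;
        } else if (*sc->p == '<') {         /* '<' is not valid inside a tag */
            sc->p++; t = T_ERROR;
        } else {                            /* name: up to space, '>' or '<' */
            while (*sc->p && *sc->p != ' ' && *sc->p != '>' && *sc->p != '<')
                sc->p++;
            t = T_NAME;
        }
    }
    sc->len = (size_t)(sc->p - sc->start);
    return t;
}

/* Helper for experiments: how many tokens of one kind does input produce? */
size_t count_tokens(const char *input, enum token kind)
{
    scanner sc;
    size_t n = 0;
    enum token t;
    scanner_init(&sc, input);
    while ((t = next_token(&sc)) != T_EOF)
        if (t == kind) n++;
    return n;
}
```

With this sketch, "a b" in content is a single T_TEXT token, while "<a b>" yields two T_NAME tokens. Bytes above 0x7F pass through unchanged, so UTF-8 text and names are carried along without being split.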

Parser Generator Utility

Designed for building Unicode-aware parsers in C, this tool generates parsing code from formal grammar specifications and works with tokens produced by a Unicode-aware lexical scanner.

Identifiers, operators, symbols, and language constructs can be defined using Unicode character ranges, and parsed input can then be transformed into data structures defined by the application. The tool is compatible with Bison syntax and supports maintainable grammars, semantic actions, validation rules, error handling, and parse-tree construction.
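The relationship between grammar rules, semantic actions, and parse-tree construction can be sketched in plain C. The example below is a hand-written recursive-descent parser, used here as a stand-in for generated code (a Bison-style generator would normally emit a table-driven parser instead). Each function mirrors one grammar rule, and each call to mk plays the role of a semantic action. The node layout and all function names are assumptions for illustration.

```c
#include <ctype.h>
#include <stdlib.h>

typedef struct node {
    char op;                  /* '+', '*', or 'n' for a number leaf */
    long value;               /* meaningful only when op == 'n' */
    struct node *lhs, *rhs;   /* children; NULL for leaves */
} node;

static node *mk(char op, long value, node *lhs, node *rhs)
{
    node *n = malloc(sizeof *n);
    if (!n) abort();          /* keep the sketch short: no recovery */
    n->op = op; n->value = value; n->lhs = lhs; n->rhs = rhs;
    return n;
}

void tree_free(node *n)
{
    if (!n) return;
    tree_free(n->lhs);
    tree_free(n->rhs);
    free(n);
}

static void skip_ws(const char **p) { while (**p == ' ') (*p)++; }

static node *parse_expr(const char **p);

/* factor : NUMBER | '(' expr ')' */
static node *parse_factor(const char **p)
{
    skip_ws(p);
    if (**p == '(') {
        (*p)++;
        node *e = parse_expr(p);
        skip_ws(p);
        if (e && **p == ')') { (*p)++; return e; }
        tree_free(e);
        return NULL;
    }
    if (!isdigit((unsigned char)**p)) return NULL;
    char *end;
    long v = strtol(*p, &end, 10);   /* overflow is not checked here */
    *p = end;
    return mk('n', v, NULL, NULL);
}

/* term : factor | term '*' factor */
static node *parse_term(const char **p)
{
    node *lhs = parse_factor(p);
    while (lhs) {
        skip_ws(p);
        if (**p != '*') break;
        (*p)++;
        node *rhs = parse_factor(p);
        if (!rhs) { tree_free(lhs); return NULL; }
        lhs = mk('*', 0, lhs, rhs);  /* action: build a '*' node */
    }
    return lhs;
}

/* expr : term | expr '+' term */
static node *parse_expr(const char **p)
{
    node *lhs = parse_term(p);
    while (lhs) {
        skip_ws(p);
        if (**p != '+') break;
        (*p)++;
        node *rhs = parse_term(p);
        if (!rhs) { tree_free(lhs); return NULL; }
        lhs = mk('+', 0, lhs, rhs);  /* action: build a '+' node */
    }
    return lhs;
}

/* Parses the whole string; returns NULL on a syntax error. */
node *parse(const char *text)
{
    const char *p = text;
    node *n = parse_expr(&p);
    skip_ws(&p);
    if (n && *p != '\0') { tree_free(n); return NULL; }
    return n;
}

long tree_eval(const node *n)
{
    if (n->op == 'n') return n->value;
    long a = tree_eval(n->lhs), b = tree_eval(n->rhs);
    return n->op == '+' ? a + b : a * b;
}
```

Because the tree is an ordinary application-defined struct, the same parse can feed evaluation (tree_eval), validation, or any other pass. The caller owns the tree and releases it with tree_free.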

JSON Library

This JSON and JSON5 library is built from a formal grammar and generated parsing code, showing how to implement a customizable parser in C. It is fully functional as a standalone library and can be adapted for different memory models, ownership rules, validation strategies, and application-specific extensions.

XML Library

This XML library is built from a formal grammar, lexer start conditions, and generated parsing code, showing how to implement a customizable XML parser in C. It provides a compact foundation that can be extended or specialized for application-specific document models, memory ownership rules, and performance requirements.

Unicode Example

This calculator example is built from a Unicode-aware scanner, formal grammar, and generated parsing code, making it a compact teaching example for implementing application languages in C. It demonstrates how international identifiers and Unicode operators can be recognized directly from character ranges and handled through ordinary parser actions.
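For readers who want a feel for the idea before opening the example, the following is a much smaller, hand-written C sketch of a calculator that accepts Unicode operators. It is not the calculator example itself and shares no code with it. The function names, the choice of operators (× U+00D7, ÷ U+00F7, − U+2212), and the single predefined identifier π are assumptions for illustration. Operators are matched as UTF-8 byte sequences rather than through generated tables.

```c
#include <ctype.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef struct { const char *p; int ok; } parser;

static void skip_ws(parser *ps) { while (*ps->p == ' ') ps->p++; }

/* If the input continues with the UTF-8 string tok, consume it. */
static int match_tok(parser *ps, const char *tok)
{
    size_t n = strlen(tok);
    skip_ws(ps);
    if (strncmp(ps->p, tok, n) != 0) return 0;
    ps->p += n;
    return 1;
}

static double expr(parser *ps);

/* factor : NUMBER | "π" | '(' expr ')' | ('-' | "−") factor */
static double factor(parser *ps)
{
    if (match_tok(ps, "(")) {
        double v = expr(ps);
        if (!match_tok(ps, ")")) ps->ok = 0;
        return v;
    }
    if (match_tok(ps, "-") || match_tok(ps, "\xE2\x88\x92"))   /* − U+2212 */
        return -factor(ps);
    if (match_tok(ps, "\xCF\x80"))                             /* π U+03C0 */
        return 3.14159265358979323846;
    if (!isdigit((unsigned char)*ps->p)) { ps->ok = 0; return 0.0; }
    char *end;
    double v = strtod(ps->p, &end);
    ps->p = end;
    return v;
}

/* term : factor (('*' | "×" | '/' | "÷") factor)* */
static double term(parser *ps)
{
    double v = factor(ps);
    for (;;) {
        if (match_tok(ps, "*") || match_tok(ps, "\xC3\x97"))        /* × */
            v *= factor(ps);
        else if (match_tok(ps, "/") || match_tok(ps, "\xC3\xB7"))   /* ÷ */
            v /= factor(ps);
        else
            return v;
    }
}

/* expr : term (('+' | '-' | "−") term)* */
static double expr(parser *ps)
{
    double v = term(ps);
    for (;;) {
        if (match_tok(ps, "+"))
            v += term(ps);
        else if (match_tok(ps, "-") || match_tok(ps, "\xE2\x88\x92"))
            v -= term(ps);
        else
            return v;
    }
}

/* Evaluates a UTF-8 expression; returns NAN on a syntax error. */
double calc(const char *text)
{
    parser ps = { text, 1 };
    double v = expr(&ps);
    skip_ws(&ps);
    if (!ps.ok || *ps.p != '\0') return NAN;
    return v;
}
```

Under these assumptions, calc applied to "6 × 7" evaluates to 42 and "8 ÷ 2 − 1" to 3, while an incomplete input such as "6 ×" yields NAN. Division by zero follows ordinary floating-point rules and is not reported as an error in this sketch.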

The same pattern can be used as a starting point for other lexer and parser applications, including configuration languages, expression evaluators, command interpreters, formula languages, domain-specific languages, and tools that need to process international text.
