Move time parsing from lexer to parser

Volker Krause requested to merge work/new-time-parsing into master

This lets us handle, for example, spaces inside times, since the parser has enough context to distinguish a time from other places where a time-like token sequence can occur. It also adds support for a few Unicode separators found in use.
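As a rough illustration of the kind of input this covers, here is a minimal Python sketch, not the code from this merge request. The function name `parse_time` and the separator set (ASCII colon, U+FF1A FULLWIDTH COLON, U+2236 RATIO) are assumptions made for the example; the separators actually supported may differ. A standalone pattern like this is only safe once the surrounding context is known to be a time, which is what the parser provides and the lexer does not.

```python
import re

# Assumed separator set: ASCII colon plus two Unicode look-alikes
# (U+FF1A FULLWIDTH COLON, U+2236 RATIO). Hypothetical, for illustration only.
TIME_RE = re.compile(r"(\d{1,2})\s*[:\uFF1A\u2236]\s*(\d{2})")


def parse_time(text):
    """Return (hour, minute) for a time-like string, or None if it does not match."""
    m = TIME_RE.fullmatch(text.strip())
    if m is None:
        return None
    hour, minute = int(m.group(1)), int(m.group(2))
    # Allow 24:00 as an end-of-day marker; reject anything beyond that.
    if hour > 24 or minute > 59:
        return None
    return hour, minute
```

With this sketch, `parse_time("10 : 30")` and `parse_time("10：30")` both yield the same hour/minute pair, while an out-of-range value such as `"10:75"` is rejected.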

The am/pm handling is a bit messy so that we retain the ability to distinguish between "10a" (an am time) and "10 a 12" (a range, with 'a' being an ASCII-fied version of 'à', used as a French alternative range separator).
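The ambiguity can be sketched with a toy tokenizer and a one-token look-ahead check. This is a simplified Python illustration under assumed names (`tokenize`, `classify_a`), not the grammar used in this merge request, which relies on a generated parser instead of hand-written look-ahead.

```python
import re

# Hypothetical token pattern: digit runs, letter runs, and a colon.
TOKEN_RE = re.compile(r"\d+|[A-Za-z]+|:")


def tokenize(text):
    """Split the input into tokens, dropping whitespace."""
    return TOKEN_RE.findall(text)


def classify_a(tokens, i):
    """Classify the 'a' token at index i as an am marker or a range separator."""
    # If a number follows, read 'a' as the range separator ("10 a 12").
    if i + 1 < len(tokens) and tokens[i + 1].isdigit():
        return "range"
    # Otherwise read it as the am suffix ("10a").
    return "am"
```

Here `classify_a(tokenize("10a"), 1)` gives `"am"` and `classify_a(tokenize("10 a 12"), 1)` gives `"range"`. Real inputs need more context than this single check, which is part of why the handling is messy.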

However, I failed to do that in LALR(1) parsing mode, as the correct interpretation of the colon and number tokens in particular depends on more look-ahead. Switching to GLR mode helps with that, but it is a very heavy tool due to its (in theory) exponential complexity. We seem to have sufficiently few alternative parsing paths to check, and typically very short inputs, so the added cost over the entire 493k-expression OSM opening hours corpus is only about 900 ms, which is a little less than doubling the previous parsing cost.
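To put the corpus-wide figure in per-expression terms, a back-of-the-envelope calculation using the rounded numbers quoted above (493k expressions, about 900 ms of added cost) looks like this. The variable names are illustrative only.

```python
# Rounded figures taken from the description above.
corpus_size = 493_000   # expressions in the OSM opening hours corpus (approx.)
added_cost_ms = 900     # additional parsing time over the whole corpus (approx.)

# Convert milliseconds to microseconds, then divide by the number of expressions.
per_expr_us = added_cost_ms * 1000 / corpus_size
print(f"added cost per expression: {per_expr_us:.2f} microseconds")
```

That works out to a little under 2 microseconds of added parsing time per expression on average.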

With this, another ~850 expressions in the full corpus are now accepted.