Stuck no more

I found me a solution to yesterday’s parsing problem.

The problem: parsing input of a the street address field to check the street against a database.

The input has to start with the streetnumber (unless it’s a rural route address) which may consist of several tokens, ranges and the numbers may contain several letters. The number is followed by the streetname which can be split into an optional prefix containing up to two directionals, the actual name and a suffix containing an optional street descriptor and up to two directionals. The name may be followed by the ‘unit’ which consists either of a number or of a unit descriptor and a number. The street name may contain numbers and may be equivalent to a directional or a street descriptor. The user may use abbreviations for directionals, street descriptors and unit descriptors.

It turned out pretty simple. My compiler construction classes and exam have been too long ago and I do not know wether I would have found a solution easier back then.

I got a couple of helpful hints from Uli

die Grammatik kannst Du (einfach) mit rekursiven Abstieg implementieren, wenn sie SLL(1) ist. SLL(1) ist die dann, wenn die First(..)-Mengen paarweise disjunkt sind. Jetzt frag mich allerdings nicht nach der exakten Definition der First Mengen. [Ulrich Mohr]

and David

Die quick-and-dirty Antwort ist: mach es einfach, es wird funktionieren. Spätestens, wenn Du rekursiven Abstieg mit Zurücksetzen nimmst, funktioniert es garantiert (kostet dann evtl. allerdings etwas Laufzeit wenn die Grammatik ungünstig ist). Falls Laufzeit kritisch ist, must Du halt noch lokale Vorschau dazubasteln. Wenn Du das ganze Ausprogrammieren willst, kannst Du aber auf jeden Fall damit durchkommen. Spannend ist das ganze nur, wenn Du Generatoren einsetzten willst. Falls Du JavaCC/JJTree einsetzen willst, kann ich Dir ein Beispiel aus meiner SA/DA schicken (wenn auch JavaCC/JJTree vom softwaretechnischen Standpunkt noch Mängel hat, besser als handprogrammiert ist es auf jeden Fall).

[…]

Ok, das Hauptproblem scheint ja zu sein, das es keine eindeutigen Trennzeichen gibt weil Leerzeichen auch innerhalb von Strasse und Ortsname vorkommen. Wenn Du durch genauere Analyse Strassenprefix und -suffix erkennen kannst(und letztere nicht als Wohneinheit gültig wären), sollte es eindeutig machbar sein. [David Metzler]

They helped in a way. Somewhere David reminded me of what I really need to do. I’ll try to write it up.

One of the real problems was that tokens have semantics, which isn’t quite that unusual. But it took me three days to figure out what that meant for the problem at hand.

1) First things first: I’ll be constructing some kind of lexer that splits the input into tokens and classifies them. I will simply split by spaces for that and match the tokens against some regular expressions and hashmaps. That is completely sufficient in this case.

2) Then I need a ‘parser’ – big surprise. Since the grammer is a bit simple, the parser will be as well. The bigger problem is the data structure, the syntax tree, that the parser will create, and that structure was the true problem. In my first approach I used a class that would build up the syntax tree from a point of view that implied the semantics problem was already solved at that point which it surely was not. I am not sure yet but I’ll probably be using a ‘real’ tree instead of something homebrewed that looked like a shortcut but wasn’t.

3) Semantic analysis. This is the really fun part. No actually I know pretty well what’s going to happen because I know the semantics of the input by now. I have had enough bugs in the current code to know all kinds of possible problems that might occur. This is where the actual ‘sorting’ will happen, the class I originally used as a syntax tree might come in handy at this point. Too bad I never quite grasped all the tree grammars needed for efficient semantic analysis 🙁

4) Code generation in this case means: writing the obtained data into the classes I use for the rest of the program. Sounds easy – may not be.

Now my real problem wasn’t the algorithms, those are rather straightforward once you have the data structures needed and those are rather complex despite the seeming simplicity of the problem. Todays design goal must therefore be the data structures.

What I need to think about now is a rather ‘lenient’ error handling because the whole problem is about transparency for the customer.