I found me a solution to yesterday’s parsing problem.
The problem: parsing input of a the street address field to check the street against a database.
It turned out pretty simple. My compiler construction classes and exam have been too long ago and I do not know wether I would have found a solution easier back then.
I got a couple of helpful hints from Uli
and David
[…]
Ok, das Hauptproblem scheint ja zu sein, das es keine eindeutigen Trennzeichen gibt weil Leerzeichen auch innerhalb von Strasse und Ortsname vorkommen. Wenn Du durch genauere Analyse Strassenprefix und -suffix erkennen kannst(und letztere nicht als Wohneinheit gültig wären), sollte es eindeutig machbar sein. [David Metzler]
They helped in a way. Somewhere David reminded me of what I really need to do. I’ll try to write it up.
One of the real problems was that tokens have semantics, which isn’t quite that unusual. But it took me three days to figure out what that meant for the problem at hand.
1) First things first: I’ll be constructing some kind of lexer that splits the input into tokens and classifies them. I will simply split by spaces for that and match the tokens against some regular expressions and hashmaps. That is completely sufficient in this case.
2) Then I need a ‘parser’ – big surprise. Since the grammer is a bit simple, the parser will be as well. The bigger problem is the data structure, the syntax tree, that the parser will create, and that structure was the true problem. In my first approach I used a class that would build up the syntax tree from a point of view that implied the semantics problem was already solved at that point which it surely was not. I am not sure yet but I’ll probably be using a ‘real’ tree instead of something homebrewed that looked like a shortcut but wasn’t.
3) Semantic analysis. This is the really fun part. No actually I know pretty well what’s going to happen because I know the semantics of the input by now. I have had enough bugs in the current code to know all kinds of possible problems that might occur. This is where the actual ‘sorting’ will happen, the class I originally used as a syntax tree might come in handy at this point. Too bad I never quite grasped all the tree grammars needed for efficient semantic analysis 🙁
4) Code generation in this case means: writing the obtained data into the classes I use for the rest of the program. Sounds easy – may not be.
Now my real problem wasn’t the algorithms, those are rather straightforward once you have the data structures needed and those are rather complex despite the seeming simplicity of the problem. Todays design goal must therefore be the data structures.
What I need to think about now is a rather ‘lenient’ error handling because the whole problem is about transparency for the customer.