Building an Easy-to-Use Parser/Parsing Framework (Part I)

Title: Building an Easy-to-Use Parser/Parsing Framework (Part I)

Question: How do you create a simple parsing framework that can parse any kind of data?

Answer:

A second article with a more detailed example was released on 29.01.2002: Building an Easy-to-Use Parser/Parsing Framework (Part II)

Today we want to talk about how to create a simple parser framework in Delphi. Our goal is a set of classes that helps us parse any kind of data and store all relevant information in an easy-to-access object model. By the end of this article we will have developed a small utility that builds a simple object model from a .dtd file and outputs its XML counterpart from a given starting node. In other words, we use the parsing framework to create a parser that reads a .dtd file, extracts all necessary tags, stores them in the object model and generates the XML output.

Note: This utility uses a simple DTD parser model that does not include routines for every kind of DTD data; it is up to you to add those features.

Our requirements for the framework and object model are:
- easy to use.
- save/loadable object trees.
- integrated error reporting.
- expandable.

Okay, now let's start developing the main parsing engine. Delphi comes with a unit called CopyPrsr, which includes the simple stream parser object TCopyParser. Take a look at that file to understand how it works; it is located under $(DELPHI)\Source\Internet\CopyPrsr.pas.
Our framework parser is derived from that idea, but it works on a simple string instead of a stream and includes some additional functions:

The boiler plate:
=========================================================
unit StringParser;

interface

uses
  Classes; // also declares the standard parser tokens toEOF and toSymbol

const
  { Additional Parser special tokens }
  toEOL = Char(6);
  toBOF = Char(7);

type
  { TSysCharSet }
  TSysCharSet = set of Char;

  { TStringParser }
  TStringParser = class
  private
    { Private declarations }
    FParseString: string;
    FLineTokens: Integer;
    FSourceLine: Integer;
    FSourcePos: Integer;
    FTokenPos: Integer;
    FToken: Char;
    procedure SkipBlanks;
    function GetParseString: string;
    function GetSourcePos: Integer;
    function GetTokenString: string;
  protected
    { Protected declarations }
  public
    { Public declarations }
    constructor Create;
    function LoadFromFile(const FileName: string): Boolean;
    function LoadFromStream(const Stream: TStream): Boolean;
    function SkipToEOF: string;
    function SkipToEOL: string;
    function SkipToken: Char;
    function SkipTokenString: string;
    function SkipToToken(const AToken: Char): string; overload;
    function SkipToToken(const AToken: TSysCharSet): string; overload;
    function SkipToTokenString(const ATokenString: string): string;
    property ParseString: string read GetParseString;
    property SourceLine: Integer read FSourceLine;
    property SourcePos: Integer read GetSourcePos;
    property Token: Char read FToken;
    property TokenString: string read GetTokenString;
  end;

As you can see, there are many public helper functions that you can use to parse the data. The main functions are LoadFromFile and LoadFromStream, which take the name of the file or the stream to be parsed as their only parameter.
Both functions load the content into the string FParseString, which can be accessed through the ParseString property:

LoadFromFile/LoadFromStream:
=========================================================
{ FileExists is declared in SysUtils, which has to be listed in the
  uses clause of the implementation section. }

function TStringParser.LoadFromFile(const FileName: string): Boolean;
var
  Stream: TMemoryStream;
begin
  Result := False;
  if not FileExists(FileName) then
    Exit;
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile(FileName);
    Result := LoadFromStream(Stream);
  finally
    Stream.Free;
  end;
end;

function TStringParser.LoadFromStream(const Stream: TStream): Boolean;
var
  MemStream: TMemoryStream;
begin
  Result := False;
  if not Assigned(Stream) then
    Exit;
  MemStream := TMemoryStream.Create;
  try
    Stream.Seek(0, soFromBeginning);
    MemStream.CopyFrom(Stream, Stream.Size);
    // Copy exactly MemStream.Size bytes into the string. This assumes a
    // single-byte string type, as in the Delphi versions this article targets.
    SetString(FParseString, PChar(MemStream.Memory), MemStream.Size);
    // Terminate the text with the EOF token and start at "begin of file".
    FParseString := Concat(FParseString, toEOF);
    FToken := toBOF;
    Result := True;
  finally
    MemStream.Free;
  end;
end;

The main functionality of the parsing engine is the extraction of so-called tokens. A token can be a separator (like CR, LF or EOF) or a symbol, which can be a keyword if you plan to parse a programming language.
The following functions skip blank characters (which only separate symbols and are not relevant themselves) and extract or skip the next token:

Token related functions (excerpt only):
=========================================================
procedure TStringParser.SkipBlanks;
begin
  while True do
  begin
    FToken := FParseString[FTokenPos];
    case FToken of
      #10:
        begin
          // New line: update the line counter and remember where it starts.
          Inc(FSourceLine);
          FLineTokens := FTokenPos;
        end;
      toEOF, #33..#255:
        Exit; // end of text or first non-blank character
    end;
    Inc(FTokenPos);
  end;
end;

function TStringParser.SkipToken: Char;
const
  KeySet = ['A'..'Z', 'a'..'z', '0'..'9', '_'];
begin
  SkipBlanks;
  FSourcePos := FTokenPos;
  if FParseString[FTokenPos] = toEOF then
    FToken := toEOF
  else if FParseString[FTokenPos] in KeySet then
  begin
    // A run of identifier characters forms one symbol.
    while FParseString[FTokenPos] in KeySet do
      Inc(FTokenPos);
    FToken := toSymbol;
  end
  else
  begin
    // Any other character is a token of its own.
    FToken := FParseString[FTokenPos];
    Inc(FTokenPos);
  end;
  Result := FToken;
end;

function TStringParser.SkipToToken(const AToken: TSysCharSet): string;
begin
  FSourcePos := FTokenPos;
  while not ((FParseString[FTokenPos] = toEOF) or
    (FParseString[FTokenPos] in AToken)) do
  begin
    if FParseString[FTokenPos] = #10 then
    begin
      Inc(FSourceLine);
      FLineTokens := FTokenPos;
    end;
    Inc(FTokenPos);
  end;
  if FParseString[FTokenPos] = toEOF then
    FToken := toEOF
  else
    FToken := FParseString[FTokenPos];
  Result := GetTokenString;
  if FToken <> toEOF then
    SkipToken;
end;

The functions not shown here offer alternative ways to extract or skip tokens, such as SkipToTokenString or SkipToEOF.

The next step is to create the object model that holds all the parsed information. As mentioned earlier, the goal of this article is to create a simple DTD parser, so we'll create an object model to store DTD information. A DTD file is used to describe the syntax rules of an XML file, like:

DTD example:
=========================================================
key CDATA #REQUIRED
value CDATA #REQUIRED

This example demonstrates the simplest form of a DTD structure.
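To see how the token rules above treat a line such as the first attribute line of the DTD example, here is a minimal, self-contained sketch. It does not use TStringParser itself; the procedure Tokenize and the constant KeyChars are hypothetical names introduced only for this illustration, and the program assumes Delphi or Free Pascal in Delphi mode.

```pascal
program TokenSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

const
  { Same character class that SkipToken uses for symbols. }
  KeyChars = ['A'..'Z', 'a'..'z', '0'..'9', '_'];

{ Hypothetical helper: splits S into tokens. A run of KeyChars becomes one
  symbol, every other non-blank character becomes a token of its own. }
procedure Tokenize(const S: string; Tokens: TStrings);
var
  P, Start: Integer;
begin
  P := 1;
  while P <= Length(S) do
  begin
    if S[P] <= ' ' then
      Inc(P) // blanks and control characters only separate tokens
    else if S[P] in KeyChars then
    begin
      Start := P;
      while (P <= Length(S)) and (S[P] in KeyChars) do
        Inc(P);
      Tokens.Add(Copy(S, Start, P - Start));
    end
    else
    begin
      Tokens.Add(S[P]);
      Inc(P);
    end;
  end;
end;

var
  Tokens: TStringList;
  i: Integer;
begin
  Tokens := TStringList.Create;
  try
    Tokenize('key CDATA #REQUIRED', Tokens);
    // Prints four lines: key, CDATA, #, REQUIRED
    for i := 0 to Tokens.Count - 1 do
      WriteLn(Tokens[i]);
  finally
    Tokens.Free;
  end;
end.
```

Note that '#' is reported as a separate one-character token, which is why a parser built on these rules has to combine it with the following symbol when it looks for #REQUIRED.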
It is not the purpose of this article to develop a highly flexible DTD parser that recognizes every DTD grammar, so we only include the most important ones. The root of each object model is the document, which holds all other objects as collections:

The Root Document:
=========================================================
{ TDTDDocument }
TDTDDocument = class(TPersistent)
private
  { Private declarations }
  FEntities: TDTDEntities;
  FElements: TDTDElements;
  procedure SetEntities(Value: TDTDEntities);
  procedure SetElements(Value: TDTDElements);
public
  { Public declarations }
  constructor Create;
  destructor Destroy; override;
  procedure Assign(Source: TPersistent); override;
published
  { Published declarations }
  property Entities: TDTDEntities read FEntities write SetEntities;
  property Elements: TDTDElements read FElements write SetElements;
end;

As you can see, our document gives us access to two other types of data: Entities and Elements. Entities are very hard to parse, so adding that feature is a good exercise for you. Parsing elements is considerably easier, so this type of data is better suited for an explanation here. Look at the DTD example a few lines above to see how a DTD element is described. Our object model needs some extra fields to store such element information. If you are not familiar with DTD or XML, look at W3Schools; it is a good starting point to learn more about the topic. So, take a look at the following object structure:

TDTDDocument
 |
 o--TDTDEntities
 |
 o--TDTDElements
     |
     o--TDTDElementTyp
     |
     o--TDTDAttributes
     |   |
     |   o--TDTDAttributeTyp
     |   o--TDTDAttributeStatus
     |   o--Default: string
     |   o--TDTDEnums
     o--TDTDChild
         |
         o--TDTDTyp
         o--TDTDChilds

I've tried to "pack" the DTD grammar into an easy object model, as you can see above: Each document contains a collection of elements. Each element is described by an element type and in turn contains a collection of attributes and children.
Each attribute in turn is described by an attribute type and contains a collection of enum(erations), and so forth. What follows is an excerpt from the element implementation; it would not be sensible to show the whole code here, as it is quite long and a little more complex:

TDTDElement:
=========================================================
unit DTD_Document;

interface

uses
  Classes;

type
  ...

  { TDTDElementTyp }
  TDTDElementTyp = (etAny, etEmpty, etData, etContainer);

  { TDTDElementStatus }
  TDTDElementStatus = (esRequired, esRequiredSeq, esOptional, esOptionalSeq);

  { TDTDItem }
  TDTDItem = class(TCollectionItem)
  private
    { Private declarations }
    FName: string;
  public
    { Public declarations }
    procedure Assign(Source: TPersistent); override;
  published
    { Published declarations }
    property Name: string read FName write FName;
  end;

  { TDTDItems }
  TDTDItems = class(TCollection)
  private
    { Private declarations }
    function GetItem(Index: Integer): TDTDItem;
    procedure SetItem(Index: Integer; Value: TDTDItem);
  public
    { Public declarations }
    function Add: TDTDItem;
    function Find(const Name: string): TDTDItem;
    property Items[Index: Integer]: TDTDItem read GetItem write SetItem; default;
  end;

  ...
  { TDTDElement }
  TDTDElement = class(TDTDProperty)
  private
    { Private declarations }
    FTyp: TDTDElementTyp;
    FAttributes: TDTDAttributes;
    FChilds: TDTDChilds;
    procedure SetAttributes(Value: TDTDAttributes);
    procedure SetChilds(Value: TDTDChilds);
  public
    { Public declarations }
    constructor Create(Collection: TCollection); override;
    destructor Destroy; override;
    procedure Assign(Source: TPersistent); override;
  published
    { Published declarations }
    property Typ: TDTDElementTyp read FTyp write FTyp;
    property Attributes: TDTDAttributes read FAttributes write SetAttributes;
    property Childs: TDTDChilds read FChilds write SetChilds;
  end;

  { TDTDElements }
  TDTDElements = class(TDTDProperties)
  private
    { Private declarations }
    function GetItem(Index: Integer): TDTDElement;
    procedure SetItem(Index: Integer; Value: TDTDElement);
  public
    { Public declarations }
    function Add: TDTDElement;
    function Find(const Name: string): TDTDElement;
    property Items[Index: Integer]: TDTDElement read GetItem write SetItem; default;
  end;

  ...

implementation

...

{ TDTDItem }

procedure TDTDItem.Assign(Source: TPersistent);
begin
  if Source is TDTDItem then
  begin
    Name := TDTDItem(Source).Name;
    Exit;
  end;
  inherited Assign(Source);
end;

{ TDTDItems }

function TDTDItems.Add: TDTDItem;
begin
  Result := TDTDItem(inherited Add);
end;

function TDTDItems.Find(const Name: string): TDTDItem;
var
  i: Integer;
begin
  // Case-sensitive linear search; returns nil if no item matches.
  Result := nil;
  for i := 0 to Count - 1 do
    if CompareStr(Items[i].Name, Name) = 0 then
    begin
      Result := Items[i];
      Break;
    end;
end;

function TDTDItems.GetItem(Index: Integer): TDTDItem;
begin
  Result := TDTDItem(inherited GetItem(Index));
end;

procedure TDTDItems.SetItem(Index: Integer; Value: TDTDItem);
begin
  inherited SetItem(Index, Value);
end;

...
{ TDTDElement }

constructor TDTDElement.Create(Collection: TCollection);
begin
  inherited Create(Collection);
  FAttributes := TDTDAttributes.Create(TDTDAttribute);
  FChilds := TDTDChilds.Create(TDTDChild);
end;

destructor TDTDElement.Destroy;
begin
  FAttributes.Free;
  FChilds.Free;
  inherited Destroy;
end;

procedure TDTDElement.Assign(Source: TPersistent);
begin
  if Source is TDTDElement then
  begin
    Typ := TDTDElement(Source).Typ;
    Attributes.Assign(TDTDElement(Source).Attributes);
    Childs.Assign(TDTDElement(Source).Childs);
  end;
  // No Exit here: the ancestor copies the inherited fields.
  inherited Assign(Source);
end;

procedure TDTDElement.SetAttributes(Value: TDTDAttributes);
begin
  FAttributes.Assign(Value);
end;

procedure TDTDElement.SetChilds(Value: TDTDChilds);
begin
  FChilds.Assign(Value);
end;

{ TDTDElements }

function TDTDElements.Add: TDTDElement;
begin
  Result := TDTDElement(inherited Add);
end;

function TDTDElements.Find(const Name: string): TDTDElement;
begin
  Result := TDTDElement(inherited Find(Name));
end;

function TDTDElements.GetItem(Index: Integer): TDTDElement;
begin
  Result := TDTDElement(inherited GetItem(Index));
end;

procedure TDTDElements.SetItem(Index: Integer; Value: TDTDElement);
begin
  inherited SetItem(Index, Value);
end;

...

The advantage of this object model is that you can easily add the document to a standard component and use Delphi's built-in streaming technology to load and save the object contents of a parsed file.

The next step is to develop the real DTD parser. Do you remember the TStringParser described at the top of this article? We'll use this class to build our parser. However, we don't want to create a parser from scratch each time we are about to parse a new kind of data; that is not the idea of a framework. So first, we'll develop a small parser class from which we will inherit our DTD parser.
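Before turning to that base class, the streaming advantage mentioned above can be illustrated with a short, self-contained sketch. It does not use the article's DTD classes: TNamedItem and TModelHost are hypothetical stand-ins for a collection item and for the "standard component" that publishes the collection. The sketch relies only on TStream.WriteComponent and TStream.ReadComponent from the Classes unit and assumes Delphi or Free Pascal in Delphi mode.

```pascal
program StreamSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

type
  { Hypothetical collection item with one published property. }
  TNamedItem = class(TCollectionItem)
  private
    FName: string;
  public
    procedure Assign(Source: TPersistent); override;
  published
    property Name: string read FName write FName;
  end;

  { Hypothetical host component that owns and publishes the collection. }
  TModelHost = class(TComponent)
  private
    FItems: TCollection;
    procedure SetItems(Value: TCollection);
  public
    constructor Create(AOwner: TComponent); override;
    destructor Destroy; override;
  published
    property Items: TCollection read FItems write SetItems;
  end;

procedure TNamedItem.Assign(Source: TPersistent);
begin
  if Source is TNamedItem then
    FName := TNamedItem(Source).Name
  else
    inherited Assign(Source);
end;

constructor TModelHost.Create(AOwner: TComponent);
begin
  inherited Create(AOwner);
  FItems := TCollection.Create(TNamedItem);
end;

destructor TModelHost.Destroy;
begin
  FItems.Free;
  inherited Destroy;
end;

procedure TModelHost.SetItems(Value: TCollection);
begin
  FItems.Assign(Value);
end;

var
  Source, Target: TModelHost;
  Stream: TMemoryStream;
begin
  Source := TModelHost.Create(nil);
  Target := TModelHost.Create(nil);
  Stream := TMemoryStream.Create;
  try
    TNamedItem(Source.Items.Add).Name := 'Entry';
    Stream.WriteComponent(Source); // save the published collection
    Stream.Position := 0;
    Stream.ReadComponent(Target);  // load it into a second instance
    WriteLn(TNamedItem(Target.Items.Items[0]).Name);
  finally
    Stream.Free;
    Target.Free;
    Source.Free;
  end;
end.
```

Only published properties take part in this kind of streaming, which is presumably why the article's model classes declare their data as published properties of TCollectionItem and TPersistent descendants.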
This parent class should include the error reporting and provide access to some additional information:

The Private Parser class:
=========================================================
unit PrivateParser;

interface

uses
  Classes, SysUtils, StringParser;

type
  { TParserError }
  TParserError = class(TCollectionItem)
  private
    { Private declarations }
    FFileName: string;
    FLine: Integer;
    FMessage: string;
    FPosition: Integer;
  public
    { Public declarations }
    procedure Assign(Source: TPersistent); override;
  published
    { Published declarations }
    property FileName: string read FFileName write FFileName;
    property Line: Integer read FLine write FLine;
    property Message: string read FMessage write FMessage;
    property Position: Integer read FPosition write FPosition;
  end;

  { TParserErrors }
  TParserErrors = class(TCollection)
  private
    { Private declarations }
    function GetItem(Index: Integer): TParserError;
    procedure SetItem(Index: Integer; Value: TParserError);
  public
    { Public declarations }
    function Add: TParserError;
    property Items[Index: Integer]: TParserError read GetItem write SetItem; default;
  end;

  { TValidationParser }
  TValidationParser = class
  private
    { Private declarations }
    FErrors: TParserErrors;
    procedure SetErrors(const Value: TParserErrors);
  public
    { Public declarations }
    constructor Create;
    destructor Destroy; override;
    procedure AddError(const AMessage: string; Parser: TStringParser;
      const AFileName: string = '');
    procedure AddErrorFmt(const AMessage: string; Params: array of const;
      Parser: TStringParser; const AFileName: string = '');
    property Errors: TParserErrors read FErrors write SetErrors;
  end;

implementation

{ TParserError }

procedure TParserError.Assign(Source: TPersistent);
begin
  if Source is TParserError then
  begin
    // Copy every published field, including the file name.
    FileName := TParserError(Source).FileName;
    Line := TParserError(Source).Line;
    Message := TParserError(Source).Message;
    Position := TParserError(Source).Position;
    Exit;
  end;
  inherited Assign(Source);
end;

{ TParserErrors }

function TParserErrors.Add: TParserError;
begin
  Result :=
    TParserError(inherited Add);
end;

function TParserErrors.GetItem(Index: Integer): TParserError;
begin
  Result := TParserError(inherited GetItem(Index));
end;

procedure TParserErrors.SetItem(Index: Integer; Value: TParserError);
begin
  inherited SetItem(Index, Value);
end;

{ TValidationParser }

constructor TValidationParser.Create;
begin
  inherited Create;
  FErrors := TParserErrors.Create(TParserError);
end;

destructor TValidationParser.Destroy;
begin
  FErrors.Free;
  inherited Destroy;
end;

procedure TValidationParser.SetErrors(const Value: TParserErrors);
begin
  FErrors.Assign(Value);
end;

procedure TValidationParser.AddErrorFmt(const AMessage: string;
  Params: array of const; Parser: TStringParser; const AFileName: string = '');
begin
  // Record the message together with the parser's current line and position.
  with FErrors.Add do
  begin
    FileName := AFileName;
    Line := Parser.SourceLine;
    Message := Format(AMessage, Params);
    Position := Parser.SourcePos;
  end;
end;

procedure TValidationParser.AddError(const AMessage: string;
  Parser: TStringParser; const AFileName: string = '');
begin
  AddErrorFmt(AMessage, [], Parser, AFileName);
end;

end.

Now we can start developing the real parser by inheriting it from TValidationParser. Again, I don't want to show the whole source code here, so I pick only the interesting parts. Our DTD parser is a so-called two-pass parser, i.e. it uses the first pass to parse the elements and the second pass to parse the attributes. This is useful because an attribute list can refer to an element that is declared below it; a single pass would report unnecessary errors in that case. The main method of our parser is Parse, which takes the name of the file to be parsed as the first parameter and a pre-initialized TDTDDocument as the second parameter. A sample call looks like this:

Sample Call:
=========================================================
// Create DTDDocument.
DTDDocument := TDTDDocument.Create;
try
  // Create DTDParser.
  DTDParser := TDTDParser.Create;
  try
    // Parse File.
    DTDParser.Parse(FileName, DTDDocument);
    // Display possible Errors.
    if DTDParser.Errors.Count > 0 then
    begin
      for i := 0 to DTDParser.Errors.Count - 1 do
        with DTDParser.Errors[i] do
          WriteLn(Format('Error in Line %d, Pos %d: %s...',
            [Line, Position, Message]));
      Exit;
    end;
    ...
  // Free DTDParser.
  finally
    DTDParser.Free;
  end;
// Free DTDDocument.
finally
  DTDDocument.Free;
end;

But now, let's take a look at some source code lines of the parser implementation. The first thing we have to do is to inherit our parser from the parent class:

Parser Implementation (Snippet):
=========================================================
type
  { EDTDParser }
  EDTDParser = class(Exception);

  { TDTDParser }
  TDTDParser = class(TValidationParser)
  private
    { Private declarations }
    procedure ParseElement(Parser: TStringParser; Document: TDTDDocument;
      const Pass: Integer);
    procedure ParseAttlist(Parser: TStringParser; Document: TDTDDocument);
    procedure ParseFile(const FileName: string; Document: TDTDDocument;
      const Pass: Integer = 0);
  public
    { Public declarations }
    procedure Parse(const FileName: string; var Document: TDTDDocument);
  end;

Afterwards we implement the Parse method, which in turn calls the internal method ParseFile:

Method "Parse":
=========================================================
procedure TDTDParser.Parse(const FileName: string; var Document: TDTDDocument);
var
  TmpDocument: TDTDDocument;
begin
  if not Assigned(Document) then
    raise EDTDParser.Create('Document not assigned!');
  TmpDocument := TDTDDocument.Create;
  try
    ParseFile(FileName, TmpDocument);
    // Hand the result over only if parsing succeeded without errors.
    if Errors.Count = 0 then
      Document.Assign(TmpDocument);
  finally
    TmpDocument.Free;
  end;
end;

As you can see, we create a temporary document to store the parsed objects in. I've done this because I don't want to return the document if it is full of errors; an exact copy of the objects is assigned only if no errors occurred. The method ParseFile makes the actual parsing calls to the StringParser and creates the real objects.
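The two-pass idea that ParseFile relies on can be sketched in isolation. The following self-contained program is not the article's parser: the constant arrays Keywords and Names and the procedure RunPass are hypothetical, and the declarations are assumed to be tokenized already. It only shows why collecting element names in a first pass avoids a false error for an attribute list that appears before its element. It assumes Delphi or Free Pascal in Delphi mode.

```pascal
program TwoPassSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

const
  { Hypothetical, already tokenized declarations in file order.
    The first ATTLIST refers to an element that is declared after it. }
  Keywords: array[0..2] of string = ('ATTLIST', 'ELEMENT', 'ATTLIST');
  Names: array[0..2] of string = ('Entry', 'Entry', 'Missing');

{ Pass 0 only collects element names; pass 1 only checks ATTLIST references. }
procedure RunPass(Pass: Integer; Elements, Errors: TStrings);
var
  i: Integer;
begin
  for i := Low(Keywords) to High(Keywords) do
    if (Keywords[i] = 'ELEMENT') and (Pass = 0) then
      Elements.Add(Names[i])
    else if (Keywords[i] = 'ATTLIST') and (Pass = 1) then
      if Elements.IndexOf(Names[i]) < 0 then
        Errors.Add('Unknown element "' + Names[i] + '"');
end;

var
  Elements, Errors: TStringList;
  i: Integer;
begin
  Elements := TStringList.Create;
  Errors := TStringList.Create;
  try
    RunPass(0, Elements, Errors);
    RunPass(1, Elements, Errors);
    // Prints: Unknown element "Missing"
    for i := 0 to Errors.Count - 1 do
      WriteLn(Errors[i]);
  finally
    Errors.Free;
    Elements.Free;
  end;
end.
```

A single pass over the same input would also have rejected the first ATTLIST, because Entry is not yet known at that point.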
What follows is a code snippet of the method body:

Method "ParseFile":
=========================================================
procedure TDTDParser.ParseFile(const FileName: string; Document: TDTDDocument;
  const Pass: Integer = 0);
var
  Parser: TStringParser;
begin
  Parser := TStringParser.Create;
  try
    if not Parser.LoadFromFile(FileName) then
    begin
      AddErrorFmt('File "%s" not found', [FileName], Parser);
      Exit;
    end;
    while True do
    begin
      // Skip everything up to the next '<' (start of a declaration) or EOF.
      while not (Parser.Token in [toEOF, '<']) do
        Parser.SkipToken;
      if Parser.Token = toEOF then
        Break;
      Parser.SkipToken;
      if Parser.Token <> '!' then
      begin
        if not (Parser.Token in ['?']) and (Pass = 1) then
          AddError('InvalidToken', Parser);
        Continue;
      end;
      if Parser.SkipToken <> toSymbol then
      begin
        if (Parser.Token <> '-') and (Pass = 1) then
          AddError('InvalidToken', Parser);
        Continue;
      end;
      // Entities are not handled by this simple parser.
      if UpperCase(Parser.TokenString) = 'ENTITY' then
        Continue;
      if UpperCase(Parser.TokenString) = 'ELEMENT' then
        ParseElement(Parser, Document, Pass)
      else if UpperCase(Parser.TokenString) = 'ATTLIST' then
      begin
        if Pass = 1 then
          ParseAttlist(Parser, Document);
      end
      else if Pass = 1 then
        AddErrorFmt('Invalid Symbol "%s"', [Parser.TokenString], Parser);
    end;
    // After the first pass, run the second pass over the same file.
    if Pass = 0 then
      ParseFile(FileName, Document, 1);
  finally
    Parser.Free;
  end;
end;

This method calls further methods (ParseElement and ParseAttlist), which parse the internal structure of an element or an attribute list. Look at the complete source code to understand them.

What's next? Well, this article has shown how easy it is to write a customizable parser that can parse any kind of data; how complex it becomes is up to you. The main benefit of this kind of parsing is that you don't need to familiarize yourself with complex systems like LexParser.

Continue reading my second article: Building an Easy-to-Use Parser/Parsing Framework (Part II)

Thank you very much for your attention.

M. Hoffmann