Mega Code Archive

 
Categories / Delphi / Forms
 

Extract HTML Tag information, Such as Links, Images

Title: Extract HTML Tag information, Such as Links, Images.. Question: How can i extract HTML's tag information such as Links/Images/Frames in a certain page? Answer: Extracting tag information from html was always a tricky thing to do, as you have to predict most errors/broken-text or mispelling of the coder. I always looked for a precise and accurate function to extract links from html page, but all the functions i found had their Disadvantages concerning broken lines or misplaced spaces or even slow routines. So, as most programmers say, after a long search i decidec to write MY OWN function to handle everything in the best way i know, and hopefully make an end to all those silly/bad-written functions out-there who claim to do the job the best way. Please mind that the function i came up with may NOT be the fastest out there, but it sure does the job on the best side and can be expanded to extract even more information in the future. Here is the function, i noted some comments, and will not elaborate on what the function acctually does as its written in a simple manner so even the beginner can understand and expand it for its own use: function ExtractHtmlTagValues(const HtmlText: string; TagName, AttribName: string; var Values: TStrings): integer; function FindFirstCharAfterSpace(const Line: string; StartPos: integer): Integer; var i: integer; begin Result := -1; for i := StartPos to Length(Line) do begin if (Line[i] ' ') then begin Result := i; exit; end; end; end; function FindFirstSpaceAfterChars(const Line: string; StartPos: integer): Integer; begin Result := PosEx(' ', Line, StartPos); end; function FindFirstSpaceBeforeChars(const Line: string; StartPos: integer): Integer; var i: integer; begin Result := 1; for i := StartPos downto 1 do begin if (Line[i] = ' ') then begin Result := i; exit; end; end; end; var InnerTag: string; LastPos, LastInnerPos: Integer; SPos, LPos, RPos: Integer; AttribValue: string; ClosingChar: char; TempAttribName: string; begin Result := 0; LastPos := 1; while (true) do begin // find outer tags '' LPos := PosEx(' if (LPos RPos := PosEx('', HtmlText, LPos+1); if (RPos LastPos := LPos + 1 else LastPos := RPos + 1; // get inner tag InnerTag := Copy(HtmlText, LPos+1, RPos-LPos-1); InnerTag := Trim(InnerTag); // remove spaces if (Length(InnerTag) // check tag name if (SameText(Copy(InnerTag, 1, Length(TagName)), TagName)) then begin // found tag AttribValue := ''; LastInnerPos := Length(TagName)+1; while (LastInnerPos begin // find first '=' after LastInnerPos RPos := PosEx('=', InnerTag, LastInnerPos); if (RPos // this way you can check for multiple attrib names and not a specific attrib SPos := FindFirstSpaceBeforeChars(InnerTag, RPos); TempAttribName := Trim(Copy(InnerTag, SPos, RPos-SPos)); if (true) then begin // found correct tag LPos := FindFirstCharAfterSpace(InnerTag, RPos+1); if (LPos begin LastInnerPos := RPos + 1; continue; end; LPos := FindFirstCharAfterSpace(InnerTag, LPos); // get to first char after '=' if (LPos if ((InnerTag[LPos] '"') and (InnerTag[LPos] '''')) then begin // AttribValue is not between '"' or ''' so get it RPos := FindFirstSpaceAfterChars(InnerTag, LPos+1); if (RPos AttribValue := Copy(InnerTag, LPos, Length(InnerTag)-LPos+1) else AttribValue := Copy(InnerTag, LPos, RPos-LPos+1); end else begin // get url between '"' or ''' ClosingChar := InnerTag[LPos]; RPos := PosEx(ClosingChar, InnerTag, LPos+1); if (RPos AttribValue := Copy(InnerTag, LPos+1, Length(InnerTag)-LPos-1) else AttribValue := Copy(InnerTag, LPos+1, RPos-LPos-1) end; if (SameText(TempAttribName, AttribName)) and (AttribValue '') then begin Values.Add(AttribValue); inc(Result); end; end; if (RPos LastInnerPos := Length(InnerTag) else LastInnerPos := RPos+1; end; end; end; end; For eg. you want to extract all links in a page, just do: var Links: TStrings; begin Links := TStrings.Create; try LinksFound := ExtractHtmlTagValues(HtmlText, 'A', 'HREF', Links); Showmessage(Links.Text); finally Links.Free; end; end; If you find any bugs or comments, or even improvements please drop an email. thanks Yaniv