Hyper Text Markup Language Signature Format: Specification & HTML Recovery Example
Hyper Text Markup Language (HTML) files should have a signature tag <html> near the beginning of the document. A second tag, <head>, is usually present as well; it holds the web page title, encoding, keywords, scripts and other metadata. An HTML file has no predefined internal structure or data fields, and in particular no field that records the file size. If the HTML document is properly formatted, the closing tag </html> is also present and marks the end of the file and its data.
Let's examine the example
When inspecting the text data of the example.html file in a hex viewer such as Active@ Disk Editor, we can see that it starts with the tag <html (hex: 3C, 68, 74, 6D, 6C). It is followed by the tag <head> (hex: 3C, 68, 65, 61, 64, 3E). The final signature </html> (hex: 3C, 2F, 68, 74, 6D, 6C, 3E) is found at offset 285. Since this footer is 7 bytes long, the total file size is 285 + 7 = 292 bytes, and reading 292 consecutive bytes starting at the position of the detected <html header gives us all of the HTML file data, provided that the file is not fragmented.
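The size calculation above can be sketched in a few lines of Python. This is only an illustration of the header/footer arithmetic, not the way Active@ File Recovery works internally: the function name carve_html, the in-memory byte buffer, and the reuse of 655360 as a search limit are assumptions made for the example, and the match is case-sensitive and assumes an unfragmented file.

```python
from typing import Optional

HTML_BEGIN = b"<html"     # hex: 3C 68 74 6D 6C
HTML_FOOTER = b"</html>"  # hex: 3C 2F 68 74 6D 6C 3E
MAX_SIZE = 655360         # assumed search limit, taken from the MAX_SIZE value in the signature definition


def carve_html(data: bytes, max_size: int = MAX_SIZE) -> Optional[bytes]:
    """Return the bytes from the first <html up to and including the first
    </html> that follows it, or None if either signature is missing.

    Hypothetical helper for illustration only.
    """
    start = data.find(HTML_BEGIN)
    if start < 0:
        return None
    # Look for the footer only inside the window [start, start + max_size).
    footer = data.find(HTML_FOOTER, start + len(HTML_BEGIN), start + max_size)
    if footer < 0:
        return None
    # File size = footer offset (relative to the header) + footer length.
    end = footer + len(HTML_FOOTER)
    return data[start:end]


# Build a sample whose footer sits at offset 285, as in the example above.
head = b"<html><head><title>t</title></head><body>"
sample = head + b"x" * (285 - len(head)) + HTML_FOOTER

# Surround it with unrelated bytes, as it might appear on a disk.
carved = carve_html(b"\x00" * 100 + sample + b"junk")
print(len(carved))  # 292
```

The slice ends at the footer offset plus the footer length, which is the same 285 + 7 = 292 calculation, shifted by wherever the <html header was found.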
Active@ File Recovery Custom Scripting Example
This example specifies only the HTML start and final signatures; no additional calculations are required. The syntax of the signature definition language is described separately in the product documentation.
[PRIMITIVE_HTML]
DESCRIPTION = Primitive HTML Signature
EXTENSION = html
BEGIN=HTML_BEGIN
FOOTER=HTML_FOOTER
MAX_SIZE = 655360

[HTML_BEGIN]
<html = 0 | 512
<head = 0 | 1024

[HTML_FOOTER]
</html> = 2