HTML Signature Format Specification & Recovery Example


Hyper Text Markup Language (HTML) files must contain the signature (tag) <html> somewhere near the beginning of the document. A second tag, <head>, is usually present as well; it contains the web page title, encoding, keywords, scripts and other metadata. An HTML file has no predefined internal structure or data fields, and in particular no field that stores the file size. If the HTML document is properly formatted, the closing tag </html> is also present and marks the end of the file and its data.

Let's examine the example

Inspecting the raw data of the example.html file in any hex viewer, such as Active@ Disk Editor, shows that the file starts with the tag <html (hex: 3C, 68, 74, 6D, 6C). It is followed by the tag <head> (hex: 3C, 68, 65, 61, 64, 3E). The final signature </html> (hex: 3C, 2F, 68, 74, 6D, 6C, 3E) is found at offset 285. Because this signature is 7 bytes long, the total file size is 285+7=292 bytes. Reading 292 consecutive bytes starting at the position of the detected <html header therefore yields all of the HTML file data, provided that the file is not fragmented.
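The size calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Active@ File Recovery is implemented: the function name carve_html and the synthetic sample buffer are assumptions made for this example, and the search is case-sensitive and assumes an unfragmented file.

```python
# Byte signatures quoted in the text above.
HTML_HEADER = bytes([0x3C, 0x68, 0x74, 0x6D, 0x6C])              # b"<html"
HEAD_TAG = bytes([0x3C, 0x68, 0x65, 0x61, 0x64, 0x3E])           # b"<head>"
HTML_FOOTER = bytes([0x3C, 0x2F, 0x68, 0x74, 0x6D, 0x6C, 0x3E])  # b"</html>"


def carve_html(data: bytes):
    """Return the bytes from the first <html header through the first
    </html> footer that follows it, or None if either one is missing.

    Hypothetical helper for illustration only.
    """
    start = data.find(HTML_HEADER)
    if start == -1:
        return None
    end = data.find(HTML_FOOTER, start)
    if end == -1:
        return None
    # File size = footer offset (relative to the header) + footer length.
    return data[start:end + len(HTML_FOOTER)]


# Synthetic stand-in for example.html (an assumption, not the real file):
# the body is padded so that </html> lands at offset 285, as in the text.
prefix = b"<html><head><title>t</title></head><body>"
suffix = b"</body>"
padding = b" " * (285 - len(prefix) - len(suffix))
sample = prefix + padding + suffix + HTML_FOOTER

# Trailing zero bytes mimic unrelated data that follows the file on disk.
carved = carve_html(sample + b"\x00" * 32)
footer_offset = sample.find(HTML_FOOTER)

print(footer_offset, len(carved))
```

With the footer at offset 285 and a 7-byte footer, the carved length comes out at 292 bytes, matching the calculation in the text.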

HTML Signature inspection

More info:

W3C HTML Specifications

Active@ File Recovery Custom Scripting Example

This example only specifies the HTML start and final signatures; no additional calculations are required.
The syntax of the signature definition language is documented separately.

[PRIMITIVE_HTML]
DESCRIPTION = Primitive HTML Signature
EXTENSION = html
BEGIN=HTML_BEGIN
FOOTER=HTML_FOOTER
MAX_SIZE = 655360


[HTML_BEGIN]
<html = 0 | 512
<head = 0 | 1024

[HTML_FOOTER]
</html> = 2
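
One possible reading of this definition can be sketched in Python. The interpretation is an assumption, not the documented behavior of Active@ File Recovery: the sketch treats "0 | 512" and "0 | 1024" as offset ranges in which a header signature may start, requires both header signatures, and stops the footer search at MAX_SIZE. The value 2 in [HTML_FOOTER] is not modeled. The function names matches_header and recovered_size are hypothetical.

```python
# Values copied from the [PRIMITIVE_HTML] definition above.
MAX_SIZE = 655360
HEADER_RULES = [(b"<html", 0, 512), (b"<head", 0, 1024)]
FOOTER = b"</html>"


def matches_header(data: bytes) -> bool:
    """ASSUMPTION: each header signature must start inside its offset range,
    measured from the beginning of the candidate file."""
    for signature, low, high in HEADER_RULES:
        # bytes.find only reports matches lying fully inside [low, end),
        # so extend the end by the signature length to allow a start at `high`.
        if data.find(signature, low, high + len(signature)) == -1:
            return False
    return True


def recovered_size(data: bytes):
    """Return the number of bytes to recover, or None if the rules fail.

    ASSUMPTION: the file ends right after the first footer found within
    the first MAX_SIZE bytes.
    """
    if not matches_header(data):
        return None
    end = data[:MAX_SIZE].find(FOOTER)
    if end == -1:
        return None
    return end + len(FOOTER)


page = b"<!DOCTYPE html>\n<html><head></head><body>hi</body></html>\n"
print(recovered_size(page + b"\x00" * 64))
```

Under these assumptions, a candidate whose <html tag starts beyond offset 512 is rejected, and a candidate with no </html> within MAX_SIZE bytes is not recovered by this sketch.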