HyperText Markup Language (HTML) files are expected to contain a signature, the <html> tag, near the beginning of the document. A <head> tag is usually present as well; it contains the web page title, encoding, keywords, scripts and other metadata. An HTML file has no predefined internal structure or data fields, and in particular no field that stores the file size. If the HTML document is properly formatted, the closing tag </html> is also present and marks the end of the file and its data.
Let's examine an example
When inspecting the text data of the example.html file with a hex viewer such as Active@ Disk Editor, we can see that it starts with the tag <html (hex: 3C, 68, 74, 6D, 6C). It is followed by the tag <head> (hex: 3C, 68, 65, 61, 64, 3E). The final signature </html> (hex: 3C, 2F, 68, 74, 6D, 6C, 3E) is found at offset 285. Because the closing tag is 7 bytes long, the total file size is 285+7=292 bytes, and reading 292 consecutive bytes starting from the position of the detected <html header gives us all of the HTML file data, provided that the file is not fragmented.
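The size calculation above can be sketched in a few lines of Python. This is only an illustration of the header/footer idea, not the way any particular recovery tool implements it: the function name carve_html, the case-insensitive matching and the choice to stop at the first </html> after the header are assumptions made for the example, and the default size limit simply reuses the MAX_SIZE value from the signature definition.

```python
from typing import Optional

HEADER = b"<html"    # hex: 3C 68 74 6D 6C
FOOTER = b"</html>"  # hex: 3C 2F 68 74 6D 6C 3E


def carve_html(data: bytes, max_size: int = 655360) -> Optional[bytes]:
    """Return the bytes from the first <html up to and including the first
    following </html>, or None if either signature is missing.

    Hypothetical helper: assumes the file is stored contiguously
    (not fragmented) inside `data`.
    """
    lowered = data.lower()  # assumption: match tags regardless of case
    start = lowered.find(HEADER)
    if start == -1:
        return None
    # The footer must lie entirely within max_size bytes of the header.
    end = lowered.find(FOOTER, start + len(HEADER), start + max_size)
    if end == -1:
        return None
    # File size = footer offset (relative to the header) + footer length.
    return data[start:end + len(FOOTER)]


# Usage: a synthetic document whose </html> sits at offset 285.
head = b"<html><head></head>"
sample = head + b" " * (285 - len(head)) + FOOTER
carved = carve_html(b"\x00\x00" + sample + b"trailing junk")
print(len(carved))  # 285 + 7 = 292
```

Stopping at the first footer keeps the sketch simple; a document that embeds the text </html> inside a script or comment would be cut short by this approach.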
This example specifies only the HTML start and final signatures; no additional calculations are required. The syntax of the signature definition language is described here.
[PRIMITIVE_HTML]
DESCRIPTION = Primitive HTML Signature
EXTENSION = html
BEGIN=HTML_BEGIN
FOOTER=HTML_FOOTER
MAX_SIZE = 655360

[HTML_BEGIN]
<html = 0 | 512
<head = 0 | 1024

[HTML_FOOTER]
</html> = 2