Web page transformer for structure information extraction
Topics: Data Mining, Indexing, LLMO / GEO, Probably in use
The Google patent describes a transformer-based model, termed “WebFormer,” that jointly encodes the structural layout (e.g., the HTML DOM) and the textual content of web pages in order to extract specified fields (such as title, price, or date) as text spans. The model introduces three token types—field tokens, structure tokens derived from the document markup, and text tokens—and applies four distinct attention patterns (structure-to-structure, structure-to-text, text-to-structure, and local text-to-text), which lets it capture both local syntax and global layout information efficiently. Fields are predicted as spans with contextual representations, enabling scalable, zero-shot extraction across diverse domains without per-domain extraction templates.
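The four attention patterns amount to a sparse attention mask over the token sequence. The sketch below illustrates one plausible construction of such a mask in NumPy; the token counts, the local window radius, and the choice to let field tokens attend globally are all illustrative assumptions, not details taken from the patent.

```python
import numpy as np

# Hypothetical token layout: field tokens, structure (markup) tokens,
# then text tokens. Sizes and the window radius are assumptions.
n_field, n_struct, n_text = 2, 4, 8
radius = 2  # local window for text-to-text attention

types = ["field"] * n_field + ["struct"] * n_struct + ["text"] * n_text
n = len(types)
mask = np.zeros((n, n), dtype=bool)  # mask[i, j]: may token i attend to token j?

for i, ti in enumerate(types):
    for j, tj in enumerate(types):
        if ti == "struct" and tj == "struct":
            mask[i, j] = True                   # structure-to-structure: global
        elif ti == "struct" and tj == "text":
            mask[i, j] = True                   # structure-to-text
        elif ti == "text" and tj == "struct":
            mask[i, j] = True                   # text-to-structure
        elif ti == "text" and tj == "text":
            mask[i, j] = abs(i - j) <= radius   # local text-to-text window
        elif ti == "field":
            mask[i, j] = True                   # assumption: field tokens attend globally

# In a full model, this mask would gate standard scaled dot-product
# attention, e.g. by setting disallowed logits to -inf before the softmax.
```

Restricting text-to-text attention to a local window is what keeps the cost manageable on long pages: text tokens still receive global layout signal indirectly through the structure tokens, which attend everywhere.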