Omni Text

Metadata Extraction

Understanding the rich metadata automatically extracted from your documents

Overview

The API extracts metadata from every document it processes. The metadata is returned with the processing result, so no additional API calls or configuration are required.

Types of Metadata Extracted

Document Properties

Standard document metadata fields extracted from all file types:

  • Title: Document title from metadata or content analysis
  • Author: Document creator information
  • Subject: Document subject or description
  • Keywords: Extracted keywords and tags
  • Creation Date: When the document was originally created
  • Modification Date: Last modification timestamp
  • Application: Software used to create the document
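
As an illustration of how these fields might be consumed, the sketch below maps a metadata dictionary onto a typed structure. The key names (`title`, `creation_date`, and so on) and the ISO 8601 date format are assumptions made for this example; they are not confirmed by this page, so check an actual API response before relying on them.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class DocumentProperties:
    """Hypothetical container for the document property fields listed above."""
    title: Optional[str] = None
    author: Optional[str] = None
    subject: Optional[str] = None
    keywords: list = field(default_factory=list)
    creation_date: Optional[datetime] = None
    modification_date: Optional[datetime] = None
    application: Optional[str] = None


def _parse_date(value: Optional[str]) -> Optional[datetime]:
    # Assumption: timestamps arrive as ISO 8601 strings; missing values stay None.
    return datetime.fromisoformat(value) if value else None


def parse_properties(metadata: dict) -> DocumentProperties:
    """Build DocumentProperties from a dict, tolerating absent fields."""
    return DocumentProperties(
        title=metadata.get("title"),
        author=metadata.get("author"),
        subject=metadata.get("subject"),
        keywords=list(metadata.get("keywords") or []),
        creation_date=_parse_date(metadata.get("creation_date")),
        modification_date=_parse_date(metadata.get("modification_date")),
        application=metadata.get("application"),
    )
```

Every field defaults to an empty value because documents often lack some properties, for example a PDF with no author set.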

Content Analysis

Properties derived from analyzing the document content:

  • Page Count: Total number of pages
  • Word Count: Estimated word count
  • Character Count: Total character count including spaces
  • Language Detection: Primary language(s) detected in the document
  • Content Type: Classification of document type (report, article, form, etc.)
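
To make the two count fields concrete, here is a minimal sketch of one common way to define them: words as whitespace-separated tokens, and characters as every character in the text, spaces included. This is an illustration only; the API's own counting method is not documented here and may differ, which is why the word count is described as an estimate.

```python
def content_counts(text: str) -> dict:
    """Count whitespace-separated words and all characters (spaces included)."""
    words = text.split()  # splits on any run of whitespace
    return {
        "word_count": len(words),
        "character_count": len(text),
    }
```

For example, `content_counts("Hello wide world")` gives a word count of 3 and a character count of 16, since the two spaces are counted.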

Technical Specifications

File format and technical details:

  • File Size: Document size in bytes
  • Format Version: File format and, where available, its version (e.g., PDF 1.7, DOCX)
  • Encoding: Text encoding used in the document
  • Compression: Compression methods applied
  • Security: Password protection or encryption status
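
For a sense of where values such as File Size and Format Version come from, the sketch below derives both locally for a PDF. It relies on the fact that a PDF file begins with a header of the form `%PDF-1.7`. The function names and the returned dictionary keys are hypothetical, and this is not a description of how the API computes these fields.

```python
import re
from pathlib import Path
from typing import Optional

# A PDF starts with a header such as "%PDF-1.7".
_PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_version(header: bytes) -> Optional[str]:
    """Return the version from a PDF header, or None if no header is found."""
    match = _PDF_HEADER.search(header[:1024])
    return match.group(1).decode("ascii") if match else None


def technical_summary(path: str) -> dict:
    """Collect file size in bytes and the PDF header version for a local file."""
    file = Path(path)
    with file.open("rb") as fh:
        header = fh.read(1024)
    return {
        "file_size": file.stat().st_size,        # size in bytes
        "format_version": pdf_version(header),   # None for non-PDF input
    }
```

Note that a PDF can also declare a newer version inside its document catalog, so the header value is a starting point and not always the final answer.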

Structure Information

Document layout and formatting analysis:

  • Headings: Hierarchy of document headings and sections
  • Tables: Number and structure of tables detected
  • Images: Count and basic properties of embedded images
  • Links: Internal and external links found in the document
  • Formatting: Text styling and layout information
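
The heading hierarchy is often easiest to work with as a nested outline. The sketch below assumes headings arrive as a flat, document-ordered list of `(level, title)` pairs with levels starting at 1; that input shape is an assumption for this example, not the documented response format.

```python
def build_outline(headings):
    """Nest a flat list of (level, title) pairs into a tree of dicts."""
    root = {"title": None, "level": 0, "children": []}
    stack = [root]  # path from the root to the most recent heading
    for level, title in headings:
        node = {"title": title, "level": level, "children": []}
        # Pop until the top of the stack is a shallower heading (the parent).
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]
```

Given `[(1, "Intro"), (2, "Scope"), (2, "Goals"), (1, "Methods")]`, the result has two top-level entries, with "Scope" and "Goals" nested under "Intro".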