Metadata Extraction
Understanding the rich metadata automatically extracted from your documents
Overview
Metadata is extracted automatically for every document processed through the API. No additional API calls or configuration are required.
Types of Metadata Extracted
Document Properties
Standard document metadata fields extracted from all file types:
- Title: Document title from metadata or content analysis
- Author: Document creator information
- Subject: Document subject or description
- Keywords: Extracted keywords and tags
- Creation Date: When the document was originally created
- Modification Date: Last modification timestamp
- Application: Software used to create the document
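As a rough illustration of consuming these properties, the sketch below normalizes a metadata payload so that empty or missing values become None and date strings become datetime objects. The key names (`title`, `creation_date`, and so on) and the ISO 8601 date format are assumptions made for this example, not the API's documented schema; see the API Reference for the actual field specifications.

```python
from datetime import datetime
from typing import Any, Dict, Optional

# Hypothetical payload: the key names and date format are assumptions,
# not the API's documented schema.
sample_properties = {
    "title": "Quarterly Report",
    "author": "Jane Doe",
    "subject": "",
    "keywords": ["finance", "q3"],
    "creation_date": "2024-01-15T09:30:00+00:00",
    "modification_date": None,
    "application": "Microsoft Word",
}


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; return None if missing or malformed."""
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


def normalize_properties(props: Dict[str, Any]) -> Dict[str, Any]:
    """Return document properties with empty values mapped to None."""
    return {
        "title": props.get("title") or None,
        "author": props.get("author") or None,
        "subject": props.get("subject") or None,
        "keywords": list(props.get("keywords") or []),
        "created": parse_date(props.get("creation_date")),
        "modified": parse_date(props.get("modification_date")),
        "application": props.get("application") or None,
    }


normalized = normalize_properties(sample_properties)
print(normalized["title"], normalized["created"])
```

Treating every property as optional is a deliberate choice here: many documents carry no author or subject at all, so downstream code should not assume any of these fields is populated.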
Content Analysis
Fields derived from analysis of the document content:
- Page Count: Total number of pages
- Word Count: Estimated word count
- Character Count: Total character count including spaces
- Language Detection: Primary language(s) detected in the document
- Content Type: Classification of document type (report, article, form, etc.)
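The content analysis fields lend themselves to simple derived checks. The following sketch computes an average words-per-page figure and flags documents with more than one detected language. The payload shape and key names (`page_count`, `word_count`, `languages`) are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Optional

# Hypothetical payload: key names and value types are assumptions.
sample_analysis = {
    "page_count": 12,
    "word_count": 5400,
    "character_count": 32100,
    "languages": ["en", "fr"],
    "content_type": "report",
}


def words_per_page(analysis: Dict[str, Any]) -> Optional[float]:
    """Average words per page, or None when the counts are unavailable."""
    pages = analysis.get("page_count") or 0
    words = analysis.get("word_count")
    if pages <= 0 or words is None:
        return None
    return words / pages


def is_multilingual(analysis: Dict[str, Any]) -> bool:
    """True when more than one language was detected."""
    return len(analysis.get("languages") or []) > 1


print(words_per_page(sample_analysis), is_multilingual(sample_analysis))
```

Because the word count is described as an estimate, a figure like this is better suited to rough triage (for example, spotting scanned pages with almost no text) than to exact reporting.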
Technical Specifications
File format and technical details:
- File Size: Document size in bytes
- Format Version: Specific format and version (e.g., PDF 1.7, DOCX)
- Encoding: Text encoding used in the document
- Compression: Compression methods applied
- Security: Password protection or encryption status
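Since file size is reported in bytes, a small helper is often useful for display, and the security status can serve as a gate before further processing. The sketch below shows both. The `file_size` and `security` keys, and the `encrypted` and `password_protected` flags inside `security`, are assumed names for illustration only.

```python
from typing import Any, Dict

# Hypothetical payload: key names and nesting are assumptions.
sample_specs = {
    "file_size": 1572864,
    "format_version": "PDF 1.7",
    "encoding": "UTF-8",
    "security": {"encrypted": False, "password_protected": False},
}


def format_file_size(size_bytes: int) -> str:
    """Render a byte count using binary units (KiB, MiB, GiB)."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    units = ("B", "KiB", "MiB", "GiB")
    size = float(size_bytes)
    index = 0
    # Step up one unit at a time until the value fits or units run out.
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    if index == 0:
        return f"{size_bytes} B"
    return f"{size:.1f} {units[index]}"


def is_protected(specs: Dict[str, Any]) -> bool:
    """True if the document is reported as encrypted or password protected."""
    security = specs.get("security") or {}
    return bool(security.get("encrypted") or security.get("password_protected"))


print(format_file_size(sample_specs["file_size"]), is_protected(sample_specs))
```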
Structure Information
Document layout and formatting analysis:
- Headings: Hierarchy of document headings and sections
- Tables: Number and structure of tables detected
- Images: Count and basic properties of embedded images
- Links: Internal and external links found in the document
- Formatting: Text styling and layout information
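One common use of the heading hierarchy is to rebuild a document outline. The sketch below assumes, purely for illustration, that headings arrive as an ordered list of objects with a numeric `level` and a `text` field; the actual structure may differ, so check the API Reference before relying on it.

```python
from typing import Any, Dict, List

# Hypothetical payload: the "level" and "text" keys are assumptions.
sample_headings = [
    {"level": 1, "text": "Introduction"},
    {"level": 2, "text": "Scope"},
    {"level": 2, "text": "Method"},
    {"level": 1, "text": "Results"},
]


def render_outline(headings: List[Dict[str, Any]], indent: str = "  ") -> str:
    """Render headings as an indented outline, one line per heading."""
    lines = []
    for heading in headings:
        # Clamp the level so malformed values never produce negative indents.
        level = max(1, int(heading.get("level", 1)))
        lines.append(f"{indent * (level - 1)}- {heading.get('text', '')}")
    return "\n".join(lines)


print(render_outline(sample_headings))
```

The same pattern extends to tables, images, and links: iterate over the reported items and project only the properties needed for the task at hand.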
Related Information
- File Formats - Supported document types
- Best Practices - Optimization guidelines
- API Reference - Technical metadata field specifications