Read Webpage Tool
Purpose
Fetch and convert web content to markdown format
Description
The Read Webpage tool extracts content from web pages and converts it to clean, readable markdown. It accepts either a direct URL or a search query; input that is not a URL automatically falls back to a Google search.
Key Features
- URL validation and processing
- Automatic search fallback for non-URLs
- Clean markdown conversion
- Jina.ai integration for reliable content extraction
- Google search integration for query handling
Parameters
- url (string): Web URL to fetch or search query
Functionality
Direct URL Processing
When provided with a valid URL:
- Validates the URL format
- Fetches content via Jina.ai service
- Converts HTML to clean markdown
- Returns formatted content
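The validation and endpoint steps above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the helper names `is_valid_url` and `build_reader_url` are hypothetical, and the sketch assumes the reader service is addressed by prefixing the target URL with https://r.jina.ai/.

```python
from urllib.parse import urlparse

# Endpoint named in this document; the prefixing convention is an assumption.
JINA_READER_PREFIX = "https://r.jina.ai/"


def is_valid_url(text: str) -> bool:
    """Return True if text parses as an absolute http(s) URL with a host."""
    parsed = urlparse(text.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def build_reader_url(url: str) -> str:
    """Validate the URL, then prefix it with the reader endpoint."""
    if not is_valid_url(url):
        raise ValueError(f"not a valid http(s) URL: {url!r}")
    return JINA_READER_PREFIX + url.strip()
```

Rejecting anything without an http(s) scheme and a host keeps the check strict, so bare words and phrases can be routed to the search path instead.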
Search Query Handling
When provided with a search query:
- Detects non-URL input
- Automatically creates Google search URL
- Fetches search results page
- Converts to readable format
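One way the fallback might look in code is shown below. The function name `resolve_target` is hypothetical, and the Google search URL pattern (`/search?q=`) is an assumption about how the search URL is built, not something this document specifies.

```python
from urllib.parse import quote_plus, urlparse


def resolve_target(text: str) -> str:
    """Return text unchanged if it is an http(s) URL, else a Google search URL."""
    candidate = text.strip()
    parsed = urlparse(candidate)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return candidate
    # Non-URL input: treat it as a search query and URL-encode it.
    return "https://www.google.com/search?q=" + quote_plus(candidate)
```

The resulting URL can then go through the same fetch-and-convert path as any direct URL.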
Content Processing
- HTML to Markdown: Clean conversion preserving structure
- Content Extraction: Removes ads, navigation, and clutter
- Text Formatting: Maintains headings, links, and formatting
- Return Format: Prefixed with source URL for reference
Technical Implementation
- Jina.ai API: https://r.jina.ai/ endpoint for content extraction
- Authentication: Bearer token authentication
- Response Format: Plain text markdown
- Error Handling: HTTP status errors and parsing failures are caught and reported
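A request along these lines could be built with the Python standard library. This is a sketch under assumptions: `build_request` and `fetch_markdown` are hypothetical names, the 30-second timeout is an arbitrary choice, and the API key is expected to come from the caller.

```python
import urllib.request


def build_request(target_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request to the reader endpoint with a Bearer token."""
    return urllib.request.Request(
        "https://r.jina.ai/" + target_url,
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_markdown(target_url: str, api_key: str, timeout: float = 30.0) -> str:
    """Send the request and return the plain-text markdown body."""
    request = build_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The response is described as plain-text markdown; decode leniently.
        return response.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from the network call makes the header and URL logic easy to check without an internet connection.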
Use Cases
- Content Research: Extract information from articles and web pages
- Documentation Gathering: Pull content for analysis or summarization
- News Monitoring: Convert news articles to readable format
- Academic Research: Extract content from research publications
- Competitive Analysis: Analyze competitor websites and content
- Data Collection: Gather web-based information for processing
Response Format
Returns a formatted string of the form:
Content fetched from [URL]:
[Markdown formatted content]
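As an illustration, the prefixing could be done by a small helper like the one below. `format_response` is a hypothetical name, and the single newline between the header line and the content is an assumption about the exact layout.

```python
def format_response(source_url: str, markdown: str) -> str:
    """Prefix the markdown body with a line naming the source URL."""
    return f"Content fetched from {source_url}:\n{markdown}"
```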
Limitations
- Requires active internet connection
- Subject to target website's robots.txt and rate limiting
- Some dynamic content may not be captured
- Content that requires authentication cannot be accessed
Error Handling
- HTTP status error reporting
- Network timeout management
- Invalid URL format detection
- Service availability checking
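The error categories above might map onto Python's standard exceptions roughly as follows. This is a hedged sketch rather than the tool's implementation: `safe_fetch` is a hypothetical wrapper, the message wording is invented for illustration, and `fetch` stands for any callable that retrieves a URL with `urllib`.

```python
import socket
import urllib.error
from typing import Callable


def safe_fetch(fetch: Callable[[str], str], url: str) -> str:
    """Call fetch(url) and turn common failures into readable messages."""
    try:
        return fetch(url)
    except urllib.error.HTTPError as exc:
        # HTTPError subclasses URLError, so it must be caught first.
        return f"HTTP error {exc.code} while fetching {url}"
    except (TimeoutError, socket.timeout):
        # Timeouts raised directly, e.g. while reading the response body.
        return f"Timed out while fetching {url}"
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError.reason.
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return f"Timed out while fetching {url}"
        return f"Network error while fetching {url}: {exc.reason}"
    except ValueError as exc:
        # urllib raises ValueError for malformed or unsupported URLs.
        return f"Invalid URL {url!r}: {exc}"
```

Returning a message instead of re-raising is one possible design; a caller that needs to distinguish failures programmatically may prefer custom exception types.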