Read Webpage Tool

Purpose

Fetch web content and convert it to Markdown.

Description

The Read Webpage tool extracts content from web pages and converts it to clean, readable Markdown. It accepts either a direct URL or a search query; input that is not a valid URL automatically falls back to a Google search.

Key Features

  • URL validation and processing
  • Automatic search fallback for non-URLs
  • Clean markdown conversion
  • Jina.ai integration for reliable content extraction
  • Google search integration for query handling

Parameters

  • url (string): The web URL to fetch, or a search query to use when the input is not a URL

Functionality

Direct URL Processing

When provided with a valid URL:

  1. Validates the URL format
  2. Fetches the page content via the Jina.ai service
  3. Converts HTML to clean markdown
  4. Returns formatted content
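The first two steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: the function names is_valid_url and build_reader_url are hypothetical, and the sketch assumes the target URL is simply appended to the https://r.jina.ai/ endpoint.

```python
from urllib.parse import urlparse

# Endpoint named in this document; appending the target URL to it is an assumption.
JINA_READER_BASE = "https://r.jina.ai/"


def is_valid_url(text: str) -> bool:
    """Return True if text looks like an absolute http(s) URL."""
    parsed = urlparse(text.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def build_reader_url(url: str) -> str:
    """Build the reader URL for a validated target (hypothetical helper)."""
    if not is_valid_url(url):
        raise ValueError(f"Not a valid URL: {url!r}")
    return JINA_READER_BASE + url.strip()
```

Validating before building the request lets malformed input fail early, without a network round trip.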

Search Query Handling

When provided with a search query:

  1. Detects non-URL input
  2. Automatically builds a Google search URL from the query
  3. Fetches the search results page
  4. Converts to readable format
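The fallback described above might look roughly like the sketch below. The function name to_fetch_target and the exact Google search URL pattern are assumptions made for illustration; the tool's real detection rules may differ.

```python
from urllib.parse import quote_plus, urlparse


def to_fetch_target(user_input: str) -> str:
    """Return the input unchanged if it is an http(s) URL, else a search URL.

    Hypothetical helper: the document does not specify the search URL format.
    """
    text = user_input.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return text
    # Non-URL input: URL-encode the query and hand it to Google search.
    return "https://www.google.com/search?q=" + quote_plus(text)
```

For example, to_fetch_target("python asyncio tutorial") yields a search URL ending in q=python+asyncio+tutorial, while a full URL passes through untouched.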

Content Processing

  • HTML to Markdown: Clean conversion preserving structure
  • Content Extraction: Removes ads, navigation, and clutter
  • Text Formatting: Maintains headings, links, and formatting
  • Return Format: Content is prefixed with the source URL for reference

Technical Implementation

  • Jina.ai API: https://r.jina.ai/ endpoint for content extraction
  • Authentication: Bearer token authentication
  • Response Format: Plain text markdown
  • Error Handling: Reports HTTP and parsing errors
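As a rough sketch of the request described above, the snippet below uses only the Python standard library. The names build_request and fetch_markdown, the api_key parameter, and the 30-second default timeout are assumptions; the document states only that the endpoint is https://r.jina.ai/ and that a Bearer token is used.

```python
import urllib.request

JINA_READER_BASE = "https://r.jina.ai/"  # endpoint named in this document


def build_request(target_url: str, api_key: str) -> urllib.request.Request:
    """Prepare a GET request carrying the Bearer token (hypothetical helper)."""
    return urllib.request.Request(
        JINA_READER_BASE + target_url,
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_markdown(target_url: str, api_key: str, timeout: float = 30.0) -> str:
    """Send the request and return the plain-text Markdown body."""
    request = build_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from the network call makes the authentication header easy to inspect without contacting the service.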

Use Cases

  • Content Research: Extract information from articles and web pages
  • Documentation Gathering: Pull content for analysis or summarization
  • News Monitoring: Convert news articles to readable format
  • Academic Research: Extract content from research publications
  • Competitive Analysis: Analyze competitor websites and content
  • Data Collection: Gather web-based information for processing

Response Format

Returns a formatted string of the form:

Content fetched from [URL]:

[Markdown formatted content]
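A minimal sketch of how such a string could be assembled is shown below. The function name format_response is hypothetical, and the blank line between the prefix and the content is an assumption based on the layout shown above.

```python
def format_response(url: str, markdown: str) -> str:
    """Prefix the converted content with its source URL (hypothetical helper)."""
    # Assumed layout: prefix line, one blank line, then the Markdown body.
    return f"Content fetched from {url}:\n\n{markdown}"
```

Because the prefix is always the first line, a caller can recover the source URL from the response without tracking it separately.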

Limitations

  • Requires active internet connection
  • Subject to the target website's robots.txt rules and rate limiting
  • Some dynamic content may not be captured
  • Authentication-required content cannot be accessed

Error Handling

  • HTTP status error reporting
  • Network timeout management
  • Invalid URL format detection
  • Service availability checking
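One way to map these failure categories onto readable messages is sketched below using standard-library exception types. The function name describe_error and the message wording are assumptions; the tool's actual messages are not given in this document.

```python
import socket
import urllib.error


def describe_error(exc: Exception) -> str:
    """Translate a fetch failure into a short message (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "Request timed out"
    if isinstance(exc, urllib.error.URLError):
        # urlopen can wrap a timeout inside URLError.reason.
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return "Request timed out"
        return f"Service unavailable: {exc.reason}"
    if isinstance(exc, ValueError):
        return f"Invalid URL: {exc}"
    return f"Unexpected error: {exc}"
```

A caller would typically wrap the fetch in a try/except block and pass the caught exception to this function, so the tool returns a message instead of raising.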