Core API Reference
This page documents the main API functions and classes provided by Webdown.
Main Functions
HTML to Markdown and Claude XML conversion functionality.
This module serves as the main entry point for the webdown package, providing the primary functions for converting web content to Markdown and Claude XML formats.
The conversion process involves multiple steps:
1. Fetch or read HTML content (from URL or local file)
2. Convert HTML to Markdown
3. Optionally convert Markdown to Claude XML format
Key functions:
- convert_url: Convert web content to Markdown or XML
- convert_file: Convert local HTML file to Markdown or XML
Functions
html_to_markdown(html: str, config: WebdownConfig) -> str
Convert HTML to Markdown with formatting options.
This function takes HTML content and converts it to Markdown format based on the provided configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| html | str | HTML content to convert | required |
| config | WebdownConfig | Configuration options for the conversion | required |
Returns:
| Type | Description |
|---|---|
| str | Converted Markdown content |
Examples:
>>> html = "<h1>Title</h1><p>Content with <a href='#'>link</a></p>"
>>> config = WebdownConfig()
>>> print(html_to_markdown(html, config))
# Title
Content with link
Source code in webdown/markdown_converter.py
markdown_to_claude_xml(markdown: str, source_url: Optional[str] = None, include_metadata: bool = True) -> str
Convert Markdown content to Claude XML format.
This function converts Markdown content to a structured XML format suitable for use with Claude AI models. It handles elements like headings, paragraphs, and code blocks, organizing them into a hierarchical XML document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| markdown | str | Markdown content to convert | required |
| source_url | Optional[str] | Source URL for the content (for metadata) | None |
| include_metadata | bool | Whether to include metadata section (title, source, date) | True |
Returns:
| Type | Description |
|---|---|
| str | Claude XML formatted content |
Source code in webdown/xml_converter.py
Configuration Classes
WebdownConfig
Configuration options for HTML to Markdown conversion.
This class centralizes all configuration options for the conversion process, focusing on the most useful options for LLM documentation processing.
Attributes:
| Name | Type | Description |
|---|---|---|
| url | Optional[str] | URL of the web page to convert |
| file_path | Optional[str] | Path to local HTML file to convert |
| include_links | bool | Whether to include hyperlinks (True) or plain text (False) |
| include_images | bool | Whether to include images (True) or exclude them (False) |
| css_selector | Optional[str] | CSS selector to extract specific content |
| show_progress | bool | Whether to display a progress bar during download |
| format | OutputFormat | Output format (Markdown or Claude XML) |
| document_options | DocumentOptions | Document structure configuration |
Source code in webdown/config.py
Attributes
css_selector: Optional[str] = None
document_options: DocumentOptions = field(default_factory=DocumentOptions)
file_path: Optional[str] = None
format: OutputFormat = OutputFormat.MARKDOWN
include_images: bool = True
include_links: bool = True
show_progress: bool = False
url: Optional[str] = None
Functions
__init__(url: Optional[str] = None, file_path: Optional[str] = None, show_progress: bool = False, include_links: bool = True, include_images: bool = True, css_selector: Optional[str] = None, format: OutputFormat = OutputFormat.MARKDOWN, document_options: DocumentOptions = DocumentOptions()) -> None
DocumentOptions
Configuration for document output structure.
This class contains settings that affect the structure of the generated document, independent of the output format.
Attributes:
| Name | Type | Description |
|---|---|---|
| include_toc | bool | Whether to generate a table of contents |
| compact_output | bool | Whether to remove excessive blank lines |
| body_width | int | Maximum line length for wrapping (0 for no wrapping) |
| include_metadata | bool | Whether to include a metadata section with title, source URL, and date (only applies to Claude XML format) |
Source code in webdown/config.py
Attributes
body_width: int = 0
compact_output: bool = False
include_metadata: bool = True
include_toc: bool = False
Functions
__init__(include_toc: bool = False, compact_output: bool = False, body_width: int = 0, include_metadata: bool = True) -> None
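The attribute listing for WebdownConfig declares document_options with field(default_factory=DocumentOptions). A minimal sketch of what that buys, using stand-in dataclasses that copy only some of the documented fields and defaults (these are illustrative copies, not webdown's source):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentOptions:
    """Stand-in mirroring the documented defaults."""
    include_toc: bool = False
    compact_output: bool = False
    body_width: int = 0
    include_metadata: bool = True


@dataclass
class WebdownConfig:
    """Stand-in with a subset of the documented fields."""
    url: Optional[str] = None
    include_links: bool = True
    # default_factory builds a fresh DocumentOptions for every instance,
    # so changing one config's options never leaks into another config.
    document_options: DocumentOptions = field(default_factory=DocumentOptions)


first = WebdownConfig(url="https://example.com")
second = WebdownConfig()
first.document_options.include_toc = True
```

After the last line, second.document_options.include_toc is still False because each config owns its own DocumentOptions object.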
HTML Parsing
HTML parsing and fetching functionality.
This module handles fetching web content and basic HTML parsing:
- URL validation and verification
- HTML fetching with proper error handling and progress tracking
- HTML file reading from local filesystem
- Content extraction with CSS selectors
- Streaming support for large web pages
The primary functions are fetch_url() for retrieving HTML content from the web, read_html_file() for reading HTML from local files, and extract_content_with_css() for selecting specific parts of HTML.
Functions
fetch_url(url: str, show_progress: bool = False) -> str
Fetch HTML content from URL with optional progress bar.
This is a simplified wrapper around fetch_url_with_progress with default parameters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | URL to fetch | required |
| show_progress | bool | Whether to display a progress bar during download | False |
Returns:
| Type | Description |
|---|---|
| str | HTML content as string |
Raises:
| Type | Description |
|---|---|
| WebdownError | If URL is invalid or content cannot be fetched |
Source code in webdown/html_parser.py
fetch_url_with_progress(url: str, show_progress: bool = False, chunk_size: int = 1024, timeout: int = 10) -> str
Fetch content from URL with streaming and optional progress bar.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | URL to fetch | required |
| show_progress | bool | Whether to display a progress bar during download | False |
| chunk_size | int | Size of chunks to read in bytes | 1024 |
| timeout | int | Request timeout in seconds | 10 |
Returns:
| Type | Description |
|---|---|
| str | Content as string |
Raises:
| Type | Description |
|---|---|
| WebdownError | If content cannot be fetched |
Source code in webdown/html_parser.py
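The chunk_size parameter suggests the body is read piece by piece rather than in one call. The sketch below illustrates that pattern against an in-memory stream; read_in_chunks and on_progress are hypothetical names, and no network access or webdown code is involved:

```python
import io
from typing import BinaryIO, Callable, List, Optional


def read_in_chunks(
    stream: BinaryIO,
    chunk_size: int = 1024,
    on_progress: Optional[Callable[[int], None]] = None,
) -> str:
    """Read a byte stream chunk by chunk and decode it as UTF-8."""
    parts: List[bytes] = []
    received = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        parts.append(chunk)
        received += len(chunk)
        if on_progress is not None:
            on_progress(received)  # report the cumulative byte count
    # Decode once at the end so multi-byte characters split across
    # chunk boundaries are handled correctly.
    return b"".join(parts).decode("utf-8")


seen: List[int] = []
body = read_in_chunks(io.BytesIO(b"x" * 2500), chunk_size=1024, on_progress=seen.append)
```

A progress bar would hook in where on_progress is called, since that is the only place the running byte count is known.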
extract_content_with_css(html: str, css_selector: str) -> str
Extract specific content from HTML using a CSS selector.
The CSS selector is assumed to have been validated before this function is called.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| html | str | HTML content | required |
| css_selector | str | CSS selector to extract content (pre-validated) | required |
Returns:
| Type | Description |
|---|---|
| str | HTML content of selected elements |
Raises:
| Type | Description |
|---|---|
| WebdownError | If there is an error applying the selector |
Source code in webdown/html_parser.py
Markdown Conversion
HTML to Markdown conversion functionality.
This module handles conversion of HTML content to Markdown with optional features:
- HTML to Markdown conversion using html2text
- Table of contents generation
- Content selection with CSS selectors
- Compact output mode
- Removal of invisible characters
The main function is html_to_markdown(), but this module also provides helper functions for each conversion step.
Functions
html_to_markdown(html: str, config: WebdownConfig) -> str
Convert HTML to Markdown with formatting options.
This function takes HTML content and converts it to Markdown format based on the provided configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| html | str | HTML content to convert | required |
| config | WebdownConfig | Configuration options for the conversion | required |
Returns:
| Type | Description |
|---|---|
| str | Converted Markdown content |
Examples:
>>> html = "<h1>Title</h1><p>Content with <a href='#'>link</a></p>"
>>> config = WebdownConfig()
>>> print(html_to_markdown(html, config))
# Title
Content with link
Source code in webdown/markdown_converter.py
XML Conversion
Markdown to Claude XML conversion functionality.
This module handles conversion of Markdown content to Claude XML format:
- Processes code blocks directly (no placeholders)
- Handles headings, sections, and paragraphs
- Generates metadata when requested
- Creates a structured XML document for use with Claude
The main function is markdown_to_claude_xml(), which converts Markdown content to a format suitable for Claude AI models.
Functions
markdown_to_claude_xml(markdown: str, source_url: Optional[str] = None, include_metadata: bool = True) -> str
Convert Markdown content to Claude XML format.
This function converts Markdown content to a structured XML format suitable for use with Claude AI models. It handles elements like headings, paragraphs, and code blocks, organizing them into a hierarchical XML document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| markdown | str | Markdown content to convert | required |
| source_url | Optional[str] | Source URL for the content (for metadata) | None |
| include_metadata | bool | Whether to include metadata section (title, source, date) | True |
Returns:
| Type | Description |
|---|---|
| str | Claude XML formatted content |
Source code in webdown/xml_converter.py
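To give a feel for the heading-to-section idea, here is a deliberately simplified sketch. The tag names (document, source, section, heading, text) and the function name are assumptions chosen for illustration, not the format webdown emits, and the sketch produces flat sections only: nesting by heading level, code blocks, and the title and date metadata are all left out.

```python
import re
from typing import List, Optional
from xml.sax.saxutils import escape


def markdown_to_xml_sketch(markdown: str, source_url: Optional[str] = None) -> str:
    """Wrap each '#'-style heading and the text under it in a <section>."""
    lines: List[str] = ["<document>"]
    if source_url is not None:
        lines.append(f"  <source>{escape(source_url)}</source>")
    in_section = False
    for raw in markdown.splitlines():
        heading = re.match(r"^(#{1,6})\s+(.*)$", raw)
        if heading:
            if in_section:
                lines.append("  </section>")  # close the previous section
            lines.append("  <section>")
            lines.append(f"    <heading>{escape(heading.group(2))}</heading>")
            in_section = True
        elif raw.strip():
            indent = "    " if in_section else "  "
            # escape() protects &, < and > so the output stays well-formed
            lines.append(f"{indent}<text>{escape(raw.strip())}</text>")
    if in_section:
        lines.append("  </section>")
    lines.append("</document>")
    return "\n".join(lines)


result_xml = markdown_to_xml_sketch("# Title\n\nA & B", source_url="https://example.com")
```

Escaping is the part worth keeping from this sketch: Markdown text routinely contains characters that would otherwise break the XML.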
process_section(match: Match[str], level: int) -> List[str]
Process a section (heading + content) into XML.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| match | Match[str] | Regex match containing heading and content | required |
| level | int | Indentation level | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List of XML strings for the section |
Source code in webdown/xml_converter.py
Crawler
Web crawler for converting multiple pages to Markdown or Claude XML.
This module provides the core crawling functionality for webdown, allowing users to crawl multiple pages from a website and convert them to Markdown or Claude XML format.
Classes
CrawlerConfig
dataclass
Configuration for web crawling.
Attributes:
| Name | Type | Description |
|---|---|---|
| seed_urls | list[str] | List of URLs to start crawling from. |
| output_dir | str | Directory to save converted files. |
| max_depth | int | Maximum link depth from seed URLs (default: 3). |
| delay_seconds | float | Delay between requests in seconds (default: 1.0). |
| scope | ScopeType | Type of scope filtering to apply (default: SAME_SUBDOMAIN). |
| path_prefix | str \| None | Optional path prefix for PATH_PREFIX scope. |
| conversion_config | WebdownConfig | Configuration for page conversion. |
| verbose | bool | Whether to print progress messages. |
| max_pages | int | Maximum number of pages to crawl (0 for unlimited). |
Source code in webdown/crawler.py
Functions
crawl(config: CrawlerConfig) -> CrawlResult
Execute a crawl operation starting from seed URLs.
Uses breadth-first search to discover and convert pages within the configured scope. Respects rate limiting with configurable delays between requests.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | CrawlerConfig | Configuration for the crawl operation. | required |
Returns:
| Type | Description |
|---|---|
| CrawlResult | CrawlResult containing metadata for all crawled pages. |
Raises:
| Type | Description |
|---|---|
| WebdownError | If the output directory cannot be created or accessed. |
Source code in webdown/crawler.py
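The breadth-first traversal with max_depth and max_pages limits can be sketched on an in-memory link graph. Everything here is illustrative: bfs_crawl_order is a hypothetical name, the dictionary stands in for fetching a page and extracting its links, and the per-request delay is only indicated by a comment.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple


def bfs_crawl_order(
    links: Dict[str, List[str]],
    seeds: List[str],
    max_depth: int = 3,
    max_pages: int = 0,
) -> List[Tuple[str, int]]:
    """Return (url, depth) pairs in the order a breadth-first crawl visits them."""
    queue: Deque[Tuple[str, int]] = deque((seed, 0) for seed in seeds)
    seen: Set[str] = set(seeds)
    visited: List[Tuple[str, int]] = []
    while queue:
        if max_pages and len(visited) >= max_pages:  # 0 means unlimited
            break
        url, depth = queue.popleft()
        visited.append((url, depth))
        # A real crawler would fetch and convert the page here, then
        # sleep for the configured delay before the next request.
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


site = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"], "/c": ["/d"]}
order = bfs_crawl_order(site, ["/"], max_depth=2)
```

With max_depth=2 the page "/d" is never visited, because it is only reachable at depth 3.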
crawl_from_sitemap(sitemap_url: str, config: CrawlerConfig) -> CrawlResult
Crawl pages listed in a sitemap.xml file.
Instead of following links, this function parses a sitemap and converts all URLs listed in it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sitemap_url | str | URL of the sitemap.xml file. | required |
| config | CrawlerConfig | Configuration for the crawl operation. | required |
Returns:
| Type | Description |
|---|---|
| CrawlResult | CrawlResult containing metadata for all crawled pages. |
Raises:
| Type | Description |
|---|---|
| WebdownError | If the sitemap cannot be fetched or parsed. |
Source code in webdown/crawler.py
Link Extraction
Link extraction and URL handling utilities for the crawler.
This module provides functions for extracting links from HTML content, normalizing URLs for deduplication, filtering links by scope, and parsing sitemap.xml files.
Classes
ScopeType
Functions
extract_links(html: str, base_url: str) -> list[str]
Extract and resolve all links from HTML content.
Finds all links in the HTML and resolves them to absolute URLs using the base URL.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| html | str | The HTML content to extract links from. | required |
| base_url | str | The base URL for resolving relative links. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | A list of absolute URLs found in the HTML. |
Source code in webdown/link_extractor.py
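Resolving relative links against a base URL can be sketched with the standard library alone. This is an illustration of the technique, not webdown's implementation; extract_links_sketch and the helper class are hypothetical names, and only href attributes on anchor tags are considered.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect the raw href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links_sketch(html: str, base_url: str) -> List[str]:
    collector = _AnchorCollector()
    collector.feed(html)
    # urljoin leaves absolute URLs alone and resolves relative ones
    return [urljoin(base_url, href) for href in collector.hrefs]


found = extract_links_sketch(
    '<a href="/docs/intro">Intro</a> <a href="guide.html">Guide</a>',
    "https://example.com/docs/index.html",
)
```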
normalize_url(url: str) -> str
Normalize a URL for deduplication.
Normalization includes:
- Lowercasing the scheme and domain
- Removing fragments (#anchor)
- Removing trailing slashes (except for root paths)
- Sorting and keeping query parameters
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL to normalize. | required |
Returns:
| Type | Description |
|---|---|
| str | The normalized URL string. |
Source code in webdown/link_extractor.py
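The four normalization rules listed for normalize_url map directly onto urllib.parse. The following is a sketch of those rules under a hypothetical name, not the library's code, and edge cases such as credentials in the host part are ignored:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url_sketch(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path
    if path != "/":
        path = path.rstrip("/")  # keep the slash only for the root path
    # Keep every query parameter but put them in a stable order
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Passing an empty fragment drops any #anchor
    return urlunsplit((scheme, netloc, path, query, ""))


normalized = normalize_url_sketch("HTTPS://Example.COM/Docs/?b=2&a=1#top")
```

Two URLs that differ only in case of the host, parameter order, a trailing slash, or a fragment then compare equal, which is what deduplication needs.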
filter_links_by_scope(links: list[str], seed_url: str, scope: ScopeType, path_prefix: str | None = None) -> list[str]
Filter links to only those within the configured scope.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| links | list[str] | List of URLs to filter. | required |
| seed_url | str | The original seed URL used to determine scope. | required |
| scope | ScopeType | The type of scope filtering to apply. | required |
| path_prefix | str \| None | Optional path prefix for PATH_PREFIX scope. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | Filtered list of URLs that match the scope criteria. |
Source code in webdown/link_extractor.py
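A rough sketch of scope filtering follows. The Scope enum here is a stand-in containing only the two scope names that appear in this reference (the real ScopeType may define others), and the rule that PATH_PREFIX also requires the same host is an assumption made for the example.

```python
from enum import Enum
from typing import List, Optional
from urllib.parse import urlsplit


class Scope(Enum):
    # Stand-in for ScopeType; member values are made up for the sketch.
    SAME_SUBDOMAIN = "same_subdomain"
    PATH_PREFIX = "path_prefix"


def filter_by_scope(
    links: List[str],
    seed_url: str,
    scope: Scope,
    path_prefix: Optional[str] = None,
) -> List[str]:
    seed = urlsplit(seed_url)
    kept: List[str] = []
    for link in links:
        target = urlsplit(link)
        if target.hostname != seed.hostname:
            continue  # different host: out of scope in both modes
        if scope is Scope.PATH_PREFIX and not target.path.startswith(path_prefix or "/"):
            continue  # same host but outside the requested path
        kept.append(link)
    return kept


candidates = [
    "https://docs.example.com/api/x",
    "https://docs.example.com/blog/y",
    "https://www.example.com/api/z",
]
same_host = filter_by_scope(candidates, "https://docs.example.com/", Scope.SAME_SUBDOMAIN)
api_only = filter_by_scope(candidates, "https://docs.example.com/", Scope.PATH_PREFIX, "/api/")
```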
parse_sitemap(sitemap_url: str, timeout: int = 30) -> list[str]
Parse a sitemap.xml file and return the list of URLs.
Supports standard sitemap.xml format with
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sitemap_url | str | URL of the sitemap.xml file. | required |
| timeout | int | Request timeout in seconds. | 30 |
Returns:
| Type | Description |
|---|---|
| list[str] | List of URLs found in the sitemap. |
Raises:
| Type | Description |
|---|---|
| WebdownError | If the sitemap cannot be fetched or parsed. |
Source code in webdown/link_extractor.py
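For orientation, this is roughly what extracting URLs from a sitemap looks like once the XML text is in hand. The sketch assumes the sitemaps.org layout (a urlset root containing url elements with loc children), works on a string rather than fetching anything, and uses a hypothetical function name; it says nothing about which variants webdown itself accepts.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by the sitemaps.org 0.9 schema
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap_xml(xml_text: str) -> List[str]:
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/docs/page </loc></url>
</urlset>"""
sitemap_urls = urls_from_sitemap_xml(sample)
```

Malformed XML makes ET.fromstring raise xml.etree.ElementTree.ParseError, which a caller could translate into its own error type.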
Output Management
Output file management for the crawler.
This module handles converting URLs to file paths, managing the output directory structure, and writing the crawl manifest (index.json).
Classes
CrawlResult
dataclass
Result of a crawl operation.
Attributes:
| Name | Type | Description |
|---|---|---|
| pages | list[CrawledPage] | List of all crawled pages with their metadata. |
| start_time | datetime | When the crawl started. |
| end_time | datetime | When the crawl completed. |
| seed_urls | list[str] | The original seed URLs used to start the crawl. |
| max_depth | int | The maximum depth setting used. |
| output_format | str | The output format used (markdown or claude_xml). |
Source code in webdown/output_manager.py
CrawledPage
dataclass
Metadata for a crawled page.
Attributes:
| Name | Type | Description |
|---|---|---|
| url | str | The original URL that was crawled. |
| output_path | str | Relative path to the output file from the output directory. |
| title | str \| None | The page title extracted from the content, if available. |
| crawled_at | datetime | Timestamp when the page was crawled. |
| depth | int | The crawl depth from the seed URL (0 for seed URLs). |
| status | str | The crawl status ("success", "error", or "skipped"). |
| error_message | str \| None | Error message if status is "error", None otherwise. |
Source code in webdown/output_manager.py
Functions
url_to_filepath(url: str, output_dir: str, output_format: OutputFormat = OutputFormat.MARKDOWN) -> str
Convert a URL to an output file path.
Creates a path structure that mirrors the URL structure:
- https://example.com/docs/page -> output_dir/example.com/docs/page.md
- https://example.com/docs/ -> output_dir/example.com/docs/index.md
- https://example.com/ -> output_dir/example.com/index.md
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL to convert to a file path. | required |
| output_dir | str | The base output directory. | required |
| output_format | OutputFormat | The output format (determines file extension). | MARKDOWN |
Returns:
| Type | Description |
|---|---|
| str | The full file path for the converted content. |
Source code in webdown/output_manager.py
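The three URL-to-path mappings documented for url_to_filepath can be reproduced with a few lines of standard-library code. This is a sketch under a hypothetical name: it takes the file extension as a plain string instead of an OutputFormat, and it uses posixpath so the result always uses forward slashes, whereas real code would more likely use os.path or pathlib.

```python
import posixpath
from urllib.parse import urlsplit


def url_to_path_sketch(url: str, output_dir: str, extension: str = ".md") -> str:
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index"  # directory-style URLs map to an index file
    return posixpath.join(output_dir, parts.netloc, path.lstrip("/") + extension)


page_path = url_to_path_sketch("https://example.com/docs/page", "out")
dir_path = url_to_path_sketch("https://example.com/docs/", "out")
root_path = url_to_path_sketch("https://example.com/", "out")
```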
write_manifest(result: CrawlResult, output_dir: str) -> str
Write the crawl manifest (index.json) to the output directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | CrawlResult | The crawl result containing all page metadata. | required |
| output_dir | str | The output directory to write the manifest to. | required |
Returns:
| Type | Description |
|---|---|
| str | The path to the written manifest file. |
Source code in webdown/output_manager.py
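Writing a manifest from dataclass records mostly comes down to serializing datetime values. The sketch below shows one way to do that; PageRecord is a trimmed stand-in for CrawledPage, and the JSON layout (a top-level "pages" list with ISO 8601 timestamps) is an assumption, since the actual index.json schema is not described here.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class PageRecord:
    """Trimmed stand-in for the documented CrawledPage fields."""
    url: str
    output_path: str
    title: Optional[str]
    crawled_at: datetime
    status: str


def write_manifest_sketch(pages: List[PageRecord], output_dir: str) -> str:
    manifest_path = os.path.join(output_dir, "index.json")
    payload = {"pages": [asdict(page) for page in pages]}
    with open(manifest_path, "w", encoding="utf-8") as handle:
        # datetime is not JSON-serializable, so emit ISO 8601 strings
        json.dump(payload, handle, indent=2, default=lambda value: value.isoformat())
    return manifest_path


with tempfile.TemporaryDirectory() as tmp:
    page = PageRecord(
        "https://example.com/",
        "example.com/index.md",
        "Home",
        datetime(2024, 1, 1, tzinfo=timezone.utc),
        "success",
    )
    written = write_manifest_sketch([page], tmp)
    with open(written, encoding="utf-8") as handle:
        manifest = json.load(handle)
```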
Error Handling
Error handling utilities for webdown.
This module provides centralized error handling utilities used throughout the webdown package.
Functions
handle_validation_error(message: str, code: str = ErrorCode.VALIDATION_ERROR) -> NoReturn
Handle a validation error and raise a WebdownError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| message | str | Error message | required |
| code | str | Error code | VALIDATION_ERROR |
Raises:
| Type | Description |
|---|---|
| WebdownError | Always raised with appropriate message |
Source code in webdown/error_utils.py
get_friendly_error_message(error: Exception) -> str
Get a user-friendly error message for an exception.
This function is intended for CLI and user-facing interfaces.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error | Exception | The exception to get a message for | required |
Returns:
| Type | Description |
|---|---|
| str | A user-friendly error message |
Source code in webdown/error_utils.py
format_error_for_cli(error: Exception) -> str
Format an error message for CLI output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error | Exception | The exception to format | required |
Returns:
| Type | Description |
|---|---|
| str | A formatted error message for CLI output |
Source code in webdown/error_utils.py
handle_request_exception(exception: Exception, url: str) -> NoReturn
Handle a request exception and raise a WebdownError with appropriate message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| exception | Exception | The exception to handle | required |
| url | str | The URL that was requested | required |
Raises:
| Type | Description |
|---|---|
| WebdownError | Always raised with appropriate message |
Source code in webdown/error_utils.py
Exceptions
WebdownError
Bases: Exception
Exception for webdown errors.
This exception class is used for all errors raised by the webdown package. The error type is indicated by a descriptive message and an error code, allowing programmatic error handling.
Error types include:
- URL format errors: When the URL doesn't follow standard format
- Network errors: Connection issues, timeouts, HTTP errors
- Parsing errors: Issues with processing the HTML content
- Validation errors: Invalid parameters or configuration
Attributes:
| Name | Type | Description |
|---|---|---|
| code | str | Error code for programmatic error handling |
Source code in webdown/config.py
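The pattern of one exception class carrying a code attribute can be sketched as follows. CodedError and ErrorCodeSketch are stand-ins written for this example; the code strings, the default code, and the constructor signature are assumptions and may not match WebdownError or ErrorCode.

```python
class ErrorCodeSketch:
    # String values are made up; only the VALIDATION_ERROR name
    # appears in this reference.
    VALIDATION_ERROR = "VALIDATION_ERROR"
    NETWORK_ERROR = "NETWORK_ERROR"  # hypothetical member


class CodedError(Exception):
    """Single exception type whose code attribute identifies the error kind."""

    def __init__(self, message: str, code: str = "UNEXPECTED_ERROR") -> None:
        super().__init__(message)
        self.code = code


def describe(error: Exception) -> str:
    # Branch on the code rather than parsing the message text
    if isinstance(error, CodedError) and error.code == ErrorCodeSketch.VALIDATION_ERROR:
        return f"Invalid input: {error}"
    return f"Error: {error}"


try:
    raise CodedError("body_width must be >= 0", code=ErrorCodeSketch.VALIDATION_ERROR)
except CodedError as exc:
    message = describe(exc)
```

Callers catch one type and dispatch on code, which keeps the except clauses short while still distinguishing validation failures from network failures.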
Validation
Validation utilities for webdown.
This module provides centralized validation functions for various inputs used throughout the webdown package.
Functions
validate_url(url: str) -> str
Validate a URL and return it if valid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL to validate | required |
Returns:
| Type | Description |
|---|---|
| str | The validated URL |
Raises:
| Type | Description |
|---|---|
| ValueError | If the URL is invalid |
Source code in webdown/validation.py
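A validate-and-return function of this shape could look like the sketch below. The acceptance rule (an http or https scheme plus a non-empty host) is an assumption for illustration; the checks webdown actually performs are not listed here.

```python
from urllib.parse import urlsplit


def validate_url_sketch(url: str) -> str:
    parts = urlsplit(url)
    # Assumed rule: require an http(s) scheme and a host component
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url  # returning the value lets callers validate inline


checked = validate_url_sketch("https://example.com/docs")
```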
validate_css_selector(selector: str) -> str
Validate a CSS selector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| selector | str | The CSS selector to validate | required |
Returns:
| Type | Description |
|---|---|
| str | The validated CSS selector |
Raises:
| Type | Description |
|---|---|
| ValueError | If the selector is invalid |
Source code in webdown/validation.py
validate_body_width(width: Optional[int]) -> Optional[int]
Validate body width parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| width | Optional[int] | The body width to validate, or None for no width limit | required |
Returns:
| Type | Description |
|---|---|
| Optional[int] | The validated body width or None |
Raises:
| Type | Description |
|---|---|
| ValueError | If the width is invalid |
Source code in webdown/validation.py
validate_numeric_parameter(name: str, value: Optional[int], min_value: Optional[int] = None, max_value: Optional[int] = None) -> Optional[int]
Validate a numeric parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the parameter (for error messages) | required |
| value | Optional[int] | The value to validate | required |
| min_value | Optional[int] | Optional minimum value | None |
| max_value | Optional[int] | Optional maximum value | None |
Returns:
| Type | Description |
|---|---|
| Optional[int] | The validated value |
Raises:
| Type | Description |
|---|---|
| ValueError | If the value is invalid |
Source code in webdown/validation.py
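The signature suggests a range check that lets None pass through. A sketch of that behavior, under a hypothetical name, is shown below; treating None as valid, rejecting bool values, and the wording of the messages are all assumptions.

```python
from typing import Optional


def validate_numeric_sketch(
    name: str,
    value: Optional[int],
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
) -> Optional[int]:
    if value is None:
        return None  # assumed: an unset parameter is acceptable
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{name} must be an integer")
    if min_value is not None and value < min_value:
        raise ValueError(f"{name} must be at least {min_value}")
    if max_value is not None and value > max_value:
        raise ValueError(f"{name} must be at most {max_value}")
    return value


width = validate_numeric_sketch("body_width", 80, min_value=0)
```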
validate_string_parameter(name: str, value: Optional[str], allowed_values: Optional[list] = None) -> Optional[str]
Validate a string parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the parameter (for error messages) | required |
| value | Optional[str] | The value to validate | required |
| allowed_values | Optional[list] | Optional list of allowed values | None |
Returns:
| Type | Description |
|---|---|
| Optional[str] | The validated value |
Raises:
| Type | Description |
|---|---|
| ValueError | If the value is invalid |
Source code in webdown/validation.py
validate_boolean_parameter(name: str, value: Optional[bool]) -> Optional[bool]
Validate a boolean parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the parameter (for error messages) | required |
| value | Optional[bool] | The value to validate | required |
Returns:
| Type | Description |
|---|---|
| Optional[bool] | The validated value |
Raises:
| Type | Description |
|---|---|
| ValueError | If the value is invalid |