Core API Reference
This page documents the main API functions and classes provided by Webdown.
Main Functions
HTML to Markdown and Claude XML conversion functionality.
This module serves as the main entry point for the webdown package, providing the primary functions for converting web content to Markdown and Claude XML formats.
The conversion process involves multiple steps:
1. Fetch or read HTML content (from URL or local file)
2. Convert HTML to Markdown
3. Optionally convert Markdown to Claude XML format
Key functions:
- convert_url: Convert web content to Markdown or XML
- convert_file: Convert local HTML file to Markdown or XML
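The three conversion steps can be sketched as a small pipeline. This is an illustration only: `run_pipeline` and its parameter names are hypothetical and not part of webdown. The steps are passed in as callables so the flow can be shown without assuming anything about webdown's internals.

```python
from typing import Callable, Optional


def run_pipeline(
    load_html: Callable[[], str],
    to_markdown: Callable[[str], str],
    to_xml: Optional[Callable[[str], str]] = None,
) -> str:
    """Hypothetical helper mirroring the three documented steps."""
    html = load_html()              # step 1: fetch or read HTML
    markdown = to_markdown(html)    # step 2: HTML -> Markdown
    if to_xml is None:              # step 3 is optional
        return markdown
    return to_xml(markdown)


# Stand-in callables so the sketch runs without network access or webdown.
result = run_pipeline(
    load_html=lambda: "<h1>Title</h1>",
    to_markdown=lambda html: "# Title",
    to_xml=None,
)
print(result)
```

With webdown installed, the callables could plausibly wrap `fetch_url`, `html_to_markdown`, and `markdown_to_claude_xml`, whose signatures are documented on this page.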
Functions
html_to_markdown(html: str, config: WebdownConfig) -> str
Convert HTML to Markdown with formatting options.
This function takes HTML content and converts it to Markdown format based on the provided configuration object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
html | str | HTML content to convert | required |
config | WebdownConfig | Configuration options for the conversion | required |
Returns:
Type | Description |
---|---|
str | Converted Markdown content |
Examples:
>>> html = "<h1>Title</h1><p>Content with <a href='#'>link</a></p>"
>>> config = WebdownConfig()
>>> print(html_to_markdown(html, config))
# Title
Content with link
Source code in webdown/markdown_converter.py
markdown_to_claude_xml(markdown: str, source_url: Optional[str] = None, include_metadata: bool = True) -> str
Convert Markdown content to Claude XML format.
This function converts Markdown content to a structured XML format suitable for use with Claude AI models. It handles elements like headings, paragraphs, and code blocks, organizing them into a hierarchical XML document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
markdown | str | Markdown content to convert | required |
source_url | Optional[str] | Source URL for the content (for metadata) | None |
include_metadata | bool | Whether to include metadata section (title, source, date) | True |
Returns:
Type | Description |
---|---|
str | Claude XML formatted content |
Source code in webdown/xml_converter.py
Configuration Classes
WebdownConfig
Configuration options for HTML to Markdown conversion.
This class centralizes all configuration options for the conversion process, focusing on the most useful options for LLM documentation processing.
Attributes:
Name | Type | Description |
---|---|---|
url | Optional[str] | URL of the web page to convert |
file_path | Optional[str] | Path to local HTML file to convert |
include_links | bool | Whether to include hyperlinks (True) or plain text (False) |
include_images | bool | Whether to include images (True) or exclude them (False) |
css_selector | Optional[str] | CSS selector to extract specific content |
show_progress | bool | Whether to display a progress bar during download |
format | OutputFormat | Output format (Markdown or Claude XML) |
document_options | DocumentOptions | Document structure configuration |
Source code in webdown/config.py
Attributes
css_selector: Optional[str] = None
document_options: DocumentOptions = field(default_factory=DocumentOptions)
file_path: Optional[str] = None
format: OutputFormat = OutputFormat.MARKDOWN
include_images: bool = True
include_links: bool = True
show_progress: bool = False
url: Optional[str] = None
Functions
__init__(url: Optional[str] = None, file_path: Optional[str] = None, show_progress: bool = False, include_links: bool = True, include_images: bool = True, css_selector: Optional[str] = None, format: OutputFormat = OutputFormat.MARKDOWN, document_options: DocumentOptions = DocumentOptions()) -> None
DocumentOptions
Configuration for document output structure.
This class contains settings that affect the structure of the generated document, independent of the output format.
Attributes:
Name | Type | Description |
---|---|---|
include_toc | bool | Whether to generate a table of contents |
compact_output | bool | Whether to remove excessive blank lines |
body_width | int | Maximum line length for wrapping (0 for no wrapping) |
include_metadata | bool | Whether to include a metadata section with title, source URL, and date (only applies to Claude XML format) |
Source code in webdown/config.py
Attributes
body_width: int = 0
compact_output: bool = False
include_metadata: bool = True
include_toc: bool = False
Functions
__init__(include_toc: bool = False, compact_output: bool = False, body_width: int = 0, include_metadata: bool = True) -> None
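The relationship between the two configuration classes can be illustrated with stand-in dataclasses. The sketch below is not webdown's source: `DocumentOptionsSketch` and `ConfigSketch` are hypothetical names that copy only some of the documented fields and defaults, to show why a nested options object is given a `field(default_factory=...)` default.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentOptionsSketch:
    """Stand-in with the documented DocumentOptions fields and defaults."""
    include_toc: bool = False
    compact_output: bool = False
    body_width: int = 0  # 0 means no wrapping
    include_metadata: bool = True


@dataclass
class ConfigSketch:
    """Stand-in copying a subset of the documented WebdownConfig fields."""
    url: Optional[str] = None
    include_links: bool = True
    css_selector: Optional[str] = None
    # default_factory gives every instance its own options object.
    document_options: DocumentOptionsSketch = field(
        default_factory=DocumentOptionsSketch
    )


a = ConfigSketch(url="https://example.com")
b = ConfigSketch(
    document_options=DocumentOptionsSketch(include_toc=True, body_width=80)
)
a.document_options.compact_output = True  # does not leak into b

print(a.document_options.compact_output, b.document_options.compact_output)
print(b.document_options.body_width)
```

Because each instance builds its own options object, changing one configuration's document options leaves the others untouched.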
HTML Parsing
HTML parsing and fetching functionality.
This module handles fetching web content and basic HTML parsing:
- URL validation and verification
- HTML fetching with proper error handling and progress tracking
- HTML file reading from local filesystem
- Content extraction with CSS selectors
- Streaming support for large web pages
The primary functions are fetch_url() for retrieving HTML content from the web, read_html_file() for reading HTML from local files, and extract_content_with_css() for selecting specific parts of HTML.
Functions
fetch_url(url: str, show_progress: bool = False) -> str
Fetch HTML content from URL with optional progress bar.
This is a simplified wrapper around fetch_url_with_progress with default parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL to fetch | required |
show_progress | bool | Whether to display a progress bar during download | False |
Returns:
Type | Description |
---|---|
str | HTML content as string |
Raises:
Type | Description |
---|---|
WebdownError | If URL is invalid or content cannot be fetched |
Source code in webdown/html_parser.py
fetch_url_with_progress(url: str, show_progress: bool = False, chunk_size: int = 1024, timeout: int = 10) -> str
Fetch content from URL with streaming and optional progress bar.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL to fetch | required |
show_progress | bool | Whether to display a progress bar during download | False |
chunk_size | int | Size of chunks to read in bytes | 1024 |
timeout | int | Request timeout in seconds | 10 |
Returns:
Type | Description |
---|---|
str | Content as string |
Raises:
Type | Description |
---|---|
WebdownError | If content cannot be fetched |
Source code in webdown/html_parser.py
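The `chunk_size` parameter suggests that the response body is read in fixed-size pieces. The following sketch shows that idea with only the standard library. `read_in_chunks` is a hypothetical helper, an in-memory `io.BytesIO` stands in for the network stream, and the UTF-8 decoding is an assumption rather than documented behavior.

```python
import io
from typing import BinaryIO, Callable, List, Optional


def read_in_chunks(
    stream: BinaryIO,
    chunk_size: int = 1024,
    on_progress: Optional[Callable[[int], None]] = None,
) -> str:
    """Read a binary stream chunk by chunk and return the decoded text."""
    parts: List[bytes] = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        parts.append(chunk)
        total += len(chunk)
        if on_progress is not None:
            on_progress(total)  # e.g. update a progress bar
    # Decode once at the end so multi-byte characters split across
    # chunk boundaries are handled correctly.
    return b"".join(parts).decode("utf-8")


seen: List[int] = []
body = read_in_chunks(
    io.BytesIO(b"<p>hello</p>" * 300),
    chunk_size=1024,
    on_progress=seen.append,
)
print(len(body), seen)
```

Joining the raw bytes before decoding keeps the sketch correct even when a chunk boundary falls in the middle of a multi-byte character.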
extract_content_with_css(html: str, css_selector: str) -> str
Extract specific content from HTML using a CSS selector.
The CSS selector is assumed to have been validated before this function is called.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
html | str | HTML content | required |
css_selector | str | CSS selector to extract content (pre-validated) | required |
Returns:
Type | Description |
---|---|
str | HTML content of selected elements |
Raises:
Type | Description |
---|---|
WebdownError | If there is an error applying the selector |
Source code in webdown/html_parser.py
Markdown Conversion
HTML to Markdown conversion functionality.
This module handles conversion of HTML content to Markdown with optional features:
- HTML to Markdown conversion using html2text
- Table of contents generation
- Content selection with CSS selectors
- Compact output mode
- Removal of invisible characters
The main function is html_to_markdown(), but this module also provides helper functions for each conversion step.
Functions
html_to_markdown(html: str, config: WebdownConfig) -> str
Convert HTML to Markdown with formatting options.
This function takes HTML content and converts it to Markdown format based on the provided configuration object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
html | str | HTML content to convert | required |
config | WebdownConfig | Configuration options for the conversion | required |
Returns:
Type | Description |
---|---|
str | Converted Markdown content |
Examples:
>>> html = "<h1>Title</h1><p>Content with <a href='#'>link</a></p>"
>>> config = WebdownConfig()
>>> print(html_to_markdown(html, config))
# Title
Content with link
Source code in webdown/markdown_converter.py
XML Conversion
Markdown to Claude XML conversion functionality.
This module handles conversion of Markdown content to Claude XML format:
- Processes code blocks directly (no placeholders)
- Handles headings, sections, and paragraphs
- Generates metadata when requested
- Creates a structured XML document for use with Claude
The main function is markdown_to_claude_xml(), which converts Markdown content to a format suitable for Claude AI models.
Functions
markdown_to_claude_xml(markdown: str, source_url: Optional[str] = None, include_metadata: bool = True) -> str
Convert Markdown content to Claude XML format.
This function converts Markdown content to a structured XML format suitable for use with Claude AI models. It handles elements like headings, paragraphs, and code blocks, organizing them into a hierarchical XML document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
markdown | str | Markdown content to convert | required |
source_url | Optional[str] | Source URL for the content (for metadata) | None |
include_metadata | bool | Whether to include metadata section (title, source, date) | True |
Returns:
Type | Description |
---|---|
str | Claude XML formatted content |
Source code in webdown/xml_converter.py
process_section(match: Match[str], level: int) -> List[str]
Process a section (heading + content) into XML.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
match | Match[str] | Regex match containing heading and content | required |
level | int | Indentation level | required |
Returns:
Type | Description |
---|---|
List[str] | List of XML strings for the section |
Source code in webdown/xml_converter.py
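To make the idea of a regex match that carries a heading and its content more concrete, here is a rough sketch. It is not webdown's implementation: the pattern `SECTION_RE`, the helper `section_to_xml`, and the tag names `section`, `heading`, and `content` are all invented for illustration, and the XML that webdown actually emits may look different.

```python
import re
from typing import List
from xml.sax.saxutils import escape

# Assumed pattern: an ATX heading line, then everything up to the next heading.
SECTION_RE = re.compile(
    r"^(#{1,6})[ \t]+(.+?)[ \t]*\n(.*?)(?=^#{1,6}[ \t]|\Z)",
    re.MULTILINE | re.DOTALL,
)


def section_to_xml(match: "re.Match[str]", level: int) -> List[str]:
    """Turn one heading+content match into indented XML lines (made-up tags)."""
    indent = "  " * level
    heading = escape(match.group(2).strip())
    content = escape(match.group(3).strip())
    return [
        f"{indent}<section>",
        f"{indent}  <heading>{heading}</heading>",
        f"{indent}  <content>{content}</content>",
        f"{indent}</section>",
    ]


markdown = "# Intro\nHello & welcome.\n\n## Usage\nRun it.\n"
lines = [
    line
    for m in SECTION_RE.finditer(markdown)
    for line in section_to_xml(m, level=1)
]
print("\n".join(lines))
```

The `level` argument only controls indentation here, matching its documented description. Text is escaped with `xml.sax.saxutils.escape` so characters such as `&` do not break the XML.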
Error Handling
Error handling utilities for webdown.
This module provides centralized error handling utilities used throughout the webdown package.
Functions
handle_validation_error(message: str, code: str = ErrorCode.VALIDATION_ERROR) -> NoReturn
Handle a validation error and raise a WebdownError.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | str | Error message | required |
code | str | Error code | VALIDATION_ERROR |
Raises:
Type | Description |
---|---|
WebdownError | Always raised with appropriate message |
Source code in webdown/error_utils.py
get_friendly_error_message(error: Exception) -> str
Get a user-friendly error message for an exception.
This function is intended for CLI and user-facing interfaces.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
error | Exception | The exception to get a message for | required |
Returns:
Type | Description |
---|---|
str | A user-friendly error message |
Source code in webdown/error_utils.py
format_error_for_cli(error: Exception) -> str
Format an error message for CLI output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
error | Exception | The exception to format | required |
Returns:
Type | Description |
---|---|
str | A formatted error message for CLI output |
Source code in webdown/error_utils.py
handle_request_exception(exception: Exception, url: str) -> NoReturn
Handle a request exception and raise a WebdownError with appropriate message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
exception | Exception | The exception to handle | required |
url | str | The URL that was requested | required |
Raises:
Type | Description |
---|---|
WebdownError | Always raised with appropriate message |
Source code in webdown/error_utils.py
Exceptions
WebdownError
Bases: Exception
Exception for webdown errors.
This exception class is used for all errors raised by the webdown package. The error type is indicated by a descriptive message and an error code, allowing programmatic error handling.
Error types include:
- URL format errors: When the URL doesn't follow standard format
- Network errors: Connection issues, timeouts, HTTP errors
- Parsing errors: Issues with processing the HTML content
- Validation errors: Invalid parameters or configuration
Attributes:
Name | Type | Description |
---|---|---|
code | str | Error code for programmatic error handling |
Source code in webdown/config.py
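Because every failure is reported through one exception type, callers would typically branch on the `code` attribute rather than on the exception class. The sketch below uses stand-in definitions so it runs without webdown. `WebdownErrorSketch`, the code strings, and the retry policy are assumptions for illustration; only the existence of a string `code` attribute and of a validation error code comes from this page.

```python
from typing import Callable


class WebdownErrorSketch(Exception):
    """Stand-in for an exception that carries a machine-readable code."""

    def __init__(self, message: str, code: str) -> None:
        super().__init__(message)
        self.code = code


# Hypothetical code values; the real strings are not given on this page.
VALIDATION_ERROR = "VALIDATION_ERROR"
NETWORK_ERROR = "NETWORK_ERROR"


def should_retry(error: WebdownErrorSketch) -> bool:
    """Retry only failures that might succeed on a second attempt."""
    return error.code == NETWORK_ERROR


def describe(action: Callable[[], None]) -> str:
    """Run an action and summarize the outcome based on the error code."""
    try:
        action()
    except WebdownErrorSketch as exc:
        verdict = "retry" if should_retry(exc) else "give up"
        return f"{exc.code}: {exc} ({verdict})"
    return "ok"


def bad_input() -> None:
    raise WebdownErrorSketch("body_width must be non-negative", VALIDATION_ERROR)


def flaky_network() -> None:
    raise WebdownErrorSketch("connection timed out", NETWORK_ERROR)


print(describe(bad_input))
print(describe(flaky_network))
```

Keeping the decision in one small function such as `should_retry` means the policy can change without touching every `except` block.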
Validation
Validation utilities for webdown.
This module provides centralized validation functions for various inputs used throughout the webdown package.
Functions
validate_url(url: str) -> str
Validate a URL and return it if valid.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL to validate | required |
Returns:
Type | Description |
---|---|
str | The validated URL |
Raises:
Type | Description |
---|---|
ValueError | If the URL is invalid |
Source code in webdown/validation.py
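A minimal sketch of what such a check might look like, using only `urllib.parse` from the standard library. The rule applied here (an `http` or `https` scheme plus a non-empty host) is an assumption: this page does not state which rules webdown's `validate_url` enforces, and `validate_url_sketch` is a hypothetical name.

```python
from urllib.parse import urlparse


def validate_url_sketch(url: str) -> str:
    """Return the URL unchanged if it looks like an absolute HTTP(S) URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url


print(validate_url_sketch("https://example.com/docs"))

try:
    validate_url_sketch("example.com/docs")  # no scheme, so no host is parsed
except ValueError as exc:
    print(exc)
```

Returning the validated value, as the documented signature does, lets a caller validate and assign in a single expression.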
validate_css_selector(selector: str) -> str
Validate a CSS selector.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
selector | str | The CSS selector to validate | required |
Returns:
Type | Description |
---|---|
str | The validated CSS selector |
Raises:
Type | Description |
---|---|
ValueError | If the selector is invalid |
Source code in webdown/validation.py
validate_body_width(width: Optional[int]) -> Optional[int]
Validate body width parameter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
width | Optional[int] | The body width to validate, or None for no width limit | required |
Returns:
Type | Description |
---|---|
Optional[int] | The validated body width or None |
Raises:
Type | Description |
---|---|
ValueError | If the width is invalid |
Source code in webdown/validation.py
validate_numeric_parameter(name: str, value: Optional[int], min_value: Optional[int] = None, max_value: Optional[int] = None) -> Optional[int]
Validate a numeric parameter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the parameter (for error messages) | required |
value | Optional[int] | The value to validate | required |
min_value | Optional[int] | Optional minimum value | None |
max_value | Optional[int] | Optional maximum value | None |
Returns:
Type | Description |
---|---|
Optional[int] | The validated value |
Raises:
Type | Description |
---|---|
ValueError | If the value is invalid |
Source code in webdown/validation.py
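The documented signature implies an optional value checked against optional bounds. A sketch of one way to do that follows. The function name `validate_numeric_sketch`, the inclusive bounds, the pass-through of `None`, and the error message wording are assumptions, not webdown's confirmed behavior.

```python
from typing import Optional


def validate_numeric_sketch(
    name: str,
    value: Optional[int],
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
) -> Optional[int]:
    """Check an optional integer against optional inclusive bounds."""
    if value is None:
        return None  # assumption: None means "not set" and passes through
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{name} must be an integer, got {type(value).__name__}")
    if min_value is not None and value < min_value:
        raise ValueError(f"{name} must be at least {min_value}, got {value}")
    if max_value is not None and value > max_value:
        raise ValueError(f"{name} must be at most {max_value}, got {value}")
    return value


print(validate_numeric_sketch("body_width", 80, min_value=0))
print(validate_numeric_sketch("body_width", None, min_value=0))

try:
    validate_numeric_sketch("body_width", -1, min_value=0)
except ValueError as exc:
    print(exc)
```

Passing the parameter name in, as the documented signature does, keeps error messages specific without a separate validator per option.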
validate_string_parameter(name: str, value: Optional[str], allowed_values: Optional[list] = None) -> Optional[str]
Validate a string parameter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the parameter (for error messages) | required |
value | Optional[str] | The value to validate | required |
allowed_values | Optional[list] | Optional list of allowed values | None |
Returns:
Type | Description |
---|---|
Optional[str] | The validated value |
Raises:
Type | Description |
---|---|
ValueError | If the value is invalid |
Source code in webdown/validation.py
validate_boolean_parameter(name: str, value: Optional[bool]) -> Optional[bool]
Validate a boolean parameter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the parameter (for error messages) | required |
value | Optional[bool] | The value to validate | required |
Returns:
Type | Description |
---|---|
Optional[bool] | The validated value |
Raises:
Type | Description |
---|---|
ValueError | If the value is invalid |