Core API Reference

This page documents the main API functions and classes provided by Webdown.

Main Functions

HTML to Markdown and Claude XML conversion functionality.

This module serves as the main entry point for the webdown package, providing the primary functions for converting web content to Markdown and Claude XML formats.

The conversion process involves multiple steps:

1. Fetch or read HTML content (from URL or local file)
2. Convert HTML to Markdown
3. Optionally convert Markdown to Claude XML format

Key functions:

- convert_url: Convert web content to Markdown or XML
- convert_file: Convert local HTML file to Markdown or XML

Functions

html_to_markdown(html: str, config: WebdownConfig) -> str

Convert HTML to Markdown with formatting options.

This function takes HTML content and converts it to Markdown format based on the provided configuration object.

Parameters:

Name    Type           Description                               Default
html    str            HTML content to convert                   required
config  WebdownConfig  Configuration options for the conversion  required

Returns:

Type  Description
str   Converted Markdown content

Examples:

>>> html = "<h1>Title</h1><p>Content with <a href='#'>link</a></p>"
>>> config = WebdownConfig()
>>> print(html_to_markdown(html, config))
# Title

Content with [link](#)

>>> config = WebdownConfig(include_links=False)
>>> print(html_to_markdown(html, config))
# Title

Content with link

Source code in webdown/markdown_converter.py
def html_to_markdown(
    html: str,
    config: WebdownConfig,
) -> str:
    """Convert HTML to Markdown with formatting options.

    This function takes HTML content and converts it to Markdown format
    based on the provided configuration object.

    Args:
        html: HTML content to convert
        config: Configuration options for the conversion

    Returns:
        Converted Markdown content

    Examples:
        >>> html = "<h1>Title</h1><p>Content with <a href='#'>link</a></p>"
        >>> config = WebdownConfig()
        >>> print(html_to_markdown(html, config))
        # Title

        Content with [link](#)

        >>> config = WebdownConfig(include_links=False)
        >>> print(html_to_markdown(html, config))
        # Title

        Content with link
    """
    # Validate all configuration parameters
    _validate_config(config)

    # Extract specific content by CSS selector if provided
    if config.css_selector:
        html = extract_content_with_css(html, config.css_selector)

    # Configure and run html2text
    converter = _configure_html2text(config)
    markdown = converter.handle(html)

    # Clean up the markdown
    markdown = clean_markdown(markdown, config.document_options.compact_output)

    # Add table of contents if requested
    if config.document_options.include_toc:
        markdown = generate_table_of_contents(markdown)

    return str(markdown)

markdown_to_claude_xml(markdown: str, source_url: Optional[str] = None, include_metadata: bool = True) -> str

Convert Markdown content to Claude XML format.

This function converts Markdown content to a structured XML format suitable for use with Claude AI models. It handles elements like headings, paragraphs, and code blocks, organizing them into a hierarchical XML document.

Parameters:

Name              Type           Description                                                Default
markdown          str            Markdown content to convert                                required
source_url        Optional[str]  Source URL for the content (for metadata)                  None
include_metadata  bool           Whether to include metadata section (title, source, date)  True

Returns:

Type  Description
str   Claude XML formatted content

Source code in webdown/xml_converter.py
def markdown_to_claude_xml(
    markdown: str,
    source_url: Optional[str] = None,
    include_metadata: bool = True,
) -> str:
    """Convert Markdown content to Claude XML format.

    This function converts Markdown content to a structured XML format
    suitable for use with Claude AI models. It handles elements like
    headings, paragraphs, and code blocks, organizing them into a
    hierarchical XML document.

    Args:
        markdown: Markdown content to convert
        source_url: Source URL for the content (for metadata)
        include_metadata: Whether to include metadata section (title, source, date)

    Returns:
        Claude XML formatted content
    """
    xml_parts = []

    # Use a fixed document tag - simplifying configuration
    doc_tag = "claude_documentation"

    # Root element
    xml_parts.append(f"<{doc_tag}>")

    # Extract title
    title = extract_markdown_title(markdown)

    # Add metadata if requested
    if include_metadata:
        xml_parts.extend(generate_metadata_xml(title, source_url))

    # Begin content section
    xml_parts.append(indent_xml("<content>", 1))

    # Process all content by section

    # Extract all section headings
    section_matches = list(re.finditer(r"^(#+\s+)(.+?)$", markdown, re.MULTILINE))

    if section_matches:
        # Process each section including content following the heading
        for i, match in enumerate(section_matches):
            heading_start = match.start()
            heading = match.group(0)
            # If this is the last heading, content goes to the end
            if i == len(section_matches) - 1:
                content = markdown[heading_start + len(heading) :].strip()
            else:
                # Otherwise content goes until the next heading
                next_heading_start = section_matches[i + 1].start()
                content = markdown[
                    heading_start + len(heading) : next_heading_start
                ].strip()

            # Create section with heading and content
            section_xml = []
            section_xml.append(indent_xml("<section>", 2))
            section_xml.append(
                indent_xml(
                    f"<heading>{escape_xml(match.group(2).strip())}</heading>", 3
                )
            )

            # Process content inside this section
            if content:
                section_xml.extend(_process_paragraphs(content, 3))

            section_xml.append(indent_xml("</section>", 2))
            xml_parts.extend(section_xml)

        # Process content before the first heading (if any)
        if section_matches[0].start() > 0:
            pre_content = markdown[: section_matches[0].start()].strip()
            if pre_content:
                # Add pre-heading content at the beginning
                pre_parts = _process_paragraphs(pre_content, 2)

                xml_parts = xml_parts[:2] + pre_parts + xml_parts[2:]
    else:
        # No headings - just process all content
        xml_parts.extend(_process_paragraphs(markdown, 2))

    # Close content and root
    xml_parts.append(indent_xml("</content>", 1))
    xml_parts.append(f"</{doc_tag}>")

    return "\n".join(xml_parts)
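
The heading-based slicing in the source above can be tried in isolation. The following is a minimal sketch using only the standard library; `split_sections` is a hypothetical helper name, not part of webdown, and it returns plain tuples instead of XML.

```python
import re
from typing import List, Tuple

# Same heading pattern as in markdown_to_claude_xml above.
HEADING_RE = re.compile(r"^(#+\s+)(.+?)$", re.MULTILINE)


def split_sections(markdown: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (text before the first heading, [(heading, body), ...]).

    Hypothetical helper for illustration; not part of webdown.
    """
    matches = list(HEADING_RE.finditer(markdown))
    if not matches:
        return markdown.strip(), []

    preamble = markdown[: matches[0].start()].strip()
    sections = []
    for i, match in enumerate(matches):
        # A section body runs from the end of its heading line to the
        # start of the next heading, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[match.end() : end].strip()
        sections.append((match.group(2).strip(), body))
    return preamble, sections


doc = "Intro line\n\n# Title\n\nFirst paragraph.\n\n## Details\n\nMore text."
preamble, sections = split_sections(doc)
print(preamble)
print(sections)
```

Each section body is simply the text between one heading and the next, which is why content that precedes the first heading needs separate handling.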

Configuration Classes

WebdownConfig

Configuration options for HTML to Markdown conversion.

This class centralizes all configuration options for the conversion process, focusing on the most useful options for LLM documentation processing.

Attributes:

Name              Type             Description
url               Optional[str]    URL of the web page to convert
file_path         Optional[str]    Path to local HTML file to convert
include_links     bool             Whether to include hyperlinks (True) or plain text (False)
include_images    bool             Whether to include images (True) or exclude them
css_selector      Optional[str]    CSS selector to extract specific content
show_progress     bool             Whether to display a progress bar during download
format            OutputFormat     Output format (Markdown or Claude XML)
document_options  DocumentOptions  Document structure configuration

Source code in webdown/config.py
@dataclass
class WebdownConfig:
    """Configuration options for HTML to Markdown conversion.

    This class centralizes all configuration options for the conversion process,
    focusing on the most useful options for LLM documentation processing.

    Attributes:
        url (Optional[str]): URL of the web page to convert
        file_path (Optional[str]): Path to local HTML file to convert
        include_links (bool): Whether to include hyperlinks (True) or plain text (False)
        include_images (bool): Whether to include images (True) or exclude them
        css_selector (Optional[str]): CSS selector to extract specific content
        show_progress (bool): Whether to display a progress bar during download
        format (OutputFormat): Output format (Markdown or Claude XML)
        document_options (DocumentOptions): Document structure configuration
    """

    # Source options
    url: Optional[str] = None
    file_path: Optional[str] = None
    show_progress: bool = False

    # Content options
    include_links: bool = True
    include_images: bool = True
    css_selector: Optional[str] = None

    # Output options
    format: OutputFormat = OutputFormat.MARKDOWN

    # We need to use field with default_factory to avoid mutable default value
    document_options: DocumentOptions = field(default_factory=DocumentOptions)
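
The `default_factory` comment above reflects a general dataclass rule: a mutable object should not be used directly as a default, because the intent is for each instance to get its own copy. A minimal standalone sketch, using simplified stand-in classes rather than webdown's own:

```python
from dataclasses import dataclass, field


@dataclass
class Options:
    """Simplified stand-in for DocumentOptions (illustration only)."""

    include_toc: bool = False


@dataclass
class Config:
    """Simplified stand-in for WebdownConfig (illustration only)."""

    # default_factory calls Options() once per Config instance, so two
    # configs never share the same Options object.
    options: Options = field(default_factory=Options)


first = Config()
second = Config()
first.options.include_toc = True

print(first.options.include_toc)
print(second.options.include_toc)
```

Changing the options on one config leaves the other untouched, which is the behavior the factory is there to guarantee.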

Attributes

css_selector: Optional[str] = None class-attribute instance-attribute

document_options: DocumentOptions = field(default_factory=DocumentOptions) class-attribute instance-attribute

file_path: Optional[str] = None class-attribute instance-attribute

format: OutputFormat = OutputFormat.MARKDOWN class-attribute instance-attribute

include_images: bool = True class-attribute instance-attribute

include_links: bool = True class-attribute instance-attribute

show_progress: bool = False class-attribute instance-attribute

url: Optional[str] = None class-attribute instance-attribute

Functions

__init__(url: Optional[str] = None, file_path: Optional[str] = None, show_progress: bool = False, include_links: bool = True, include_images: bool = True, css_selector: Optional[str] = None, format: OutputFormat = OutputFormat.MARKDOWN, document_options: DocumentOptions = DocumentOptions()) -> None

DocumentOptions

Configuration for document output structure.

This class contains settings that affect the structure of the generated document, independent of the output format.

Attributes:

Name              Type  Description
include_toc       bool  Whether to generate a table of contents
compact_output    bool  Whether to remove excessive blank lines
body_width        int   Maximum line length for wrapping (0 for no wrapping)
include_metadata  bool  Include metadata section with title, source URL, date (only applies to Claude XML format)

Source code in webdown/config.py
@dataclass
class DocumentOptions:
    """Configuration for document output structure.

    This class contains settings that affect the structure of the generated document,
    independent of the output format.

    Attributes:
        include_toc (bool): Whether to generate a table of contents
        compact_output (bool): Whether to remove excessive blank lines
        body_width (int): Maximum line length for wrapping (0 for no wrapping)
        include_metadata (bool): Include metadata section with title, source URL, date
            (only applies to Claude XML format)
    """

    include_toc: bool = False
    compact_output: bool = False
    body_width: int = 0
    include_metadata: bool = True

Attributes

body_width: int = 0 class-attribute instance-attribute

compact_output: bool = False class-attribute instance-attribute

include_metadata: bool = True class-attribute instance-attribute

include_toc: bool = False class-attribute instance-attribute

Functions

__init__(include_toc: bool = False, compact_output: bool = False, body_width: int = 0, include_metadata: bool = True) -> None

HTML Parsing

HTML parsing and fetching functionality.

This module handles fetching web content and basic HTML parsing:

- URL validation and verification
- HTML fetching with proper error handling and progress tracking
- HTML file reading from local filesystem
- Content extraction with CSS selectors
- Streaming support for large web pages

The primary functions are fetch_url() for retrieving HTML content from the web, read_html_file() for reading HTML from local files, and extract_content_with_css() for selecting specific parts of the HTML.

Functions

fetch_url(url: str, show_progress: bool = False) -> str

Fetch HTML content from URL with optional progress bar.

This is a simplified wrapper around fetch_url_with_progress with default parameters.

Parameters:

Name           Type  Description                                        Default
url            str   URL to fetch                                       required
show_progress  bool  Whether to display a progress bar during download  False

Returns:

Type  Description
str   HTML content as string

Raises:

Type          Description
WebdownError  If URL is invalid or content cannot be fetched

Source code in webdown/html_parser.py
def fetch_url(url: str, show_progress: bool = False) -> str:
    """Fetch HTML content from URL with optional progress bar.

    This is a simplified wrapper around fetch_url_with_progress with default parameters.

    Args:
        url: URL to fetch
        show_progress: Whether to display a progress bar during download

    Returns:
        HTML content as string

    Raises:
        WebdownError: If URL is invalid or content cannot be fetched
    """
    # Validate URL for backward compatibility with tests
    # In normal usage, URL is already validated by _get_normalized_config
    try:
        validate_url(url)
    except ValueError as e:
        raise WebdownError(str(e), code=ErrorCode.URL_INVALID)

    return fetch_url_with_progress(url, show_progress, chunk_size=1024, timeout=10)

fetch_url_with_progress(url: str, show_progress: bool = False, chunk_size: int = 1024, timeout: int = 10) -> str

Fetch content from URL with streaming and optional progress bar.

Parameters:

Name           Type  Description                                        Default
url            str   URL to fetch                                       required
show_progress  bool  Whether to display a progress bar during download  False
chunk_size     int   Size of chunks to read in bytes                    1024
timeout        int   Request timeout in seconds                         10

Returns:

Type  Description
str   Content as string

Raises:

Type          Description
WebdownError  If content cannot be fetched

Source code in webdown/html_parser.py
def fetch_url_with_progress(
    url: str, show_progress: bool = False, chunk_size: int = 1024, timeout: int = 10
) -> str:
    """Fetch content from URL with streaming and optional progress bar.

    Args:
        url: URL to fetch
        show_progress: Whether to display a progress bar during download
        chunk_size: Size of chunks to read in bytes
        timeout: Request timeout in seconds

    Returns:
        Content as string

    Raises:
        WebdownError: If content cannot be fetched
    """
    # Note: URL validation is now centralized in _get_normalized_config
    # We assume URL is already validated when this function is called

    try:
        # Make a GET request with stream=True for both cases
        response = requests.get(url, timeout=timeout, stream=True)
        response.raise_for_status()

        # Try to handle small responses without streaming for performance
        small_response = _handle_small_response(response, show_progress)
        if small_response is not None:
            return small_response

        # For larger responses or when progress is requested, use streaming
        total_size = int(response.headers.get("content-length", 0))
        with _create_progress_bar(url, total_size, show_progress) as progress_bar:
            return _process_response_chunks(response, progress_bar, chunk_size)

    except (
        requests.exceptions.Timeout,
        requests.exceptions.ConnectionError,
        requests.exceptions.HTTPError,
        requests.exceptions.RequestException,
    ) as e:
        # This function raises a WebdownError with appropriate message
        handle_request_exception(e, url)
        # The line below is never reached but needed for type checking
        raise RuntimeError("This should never be reached")  # pragma: no cover
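
The helper `_process_response_chunks` is not shown on this page. As a rough illustration of chunked reading, the sketch below assembles text from an iterable of byte chunks (a stand-in for what a streamed HTTP response yields) and decodes incrementally, so a multi-byte character split across two chunks is not corrupted. The function name, the UTF-8 default, and the callback parameter are all hypothetical.

```python
import codecs
from typing import Callable, Iterable, Optional


def decode_chunks(
    chunks: Iterable[bytes],
    on_progress: Optional[Callable[[int], None]] = None,
    encoding: str = "utf-8",
) -> str:
    """Join byte chunks into text, reporting bytes read after each chunk.

    Hypothetical helper for illustration; not part of webdown.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = []
    for chunk in chunks:
        if not chunk:  # skip empty chunks
            continue
        parts.append(decoder.decode(chunk))
        if on_progress is not None:
            on_progress(len(chunk))
    # Flush any bytes still buffered in the decoder.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# "é" is two bytes in UTF-8; here it is split across two chunks.
received = []
text = decode_chunks([b"h\xc3", b"\xa9llo"], on_progress=received.append)
print(text)
print(sum(received))
```

A progress bar fits naturally into the `on_progress` slot, since it only needs to know how many bytes each chunk contributed.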

extract_content_with_css(html: str, css_selector: str) -> str

Extract specific content from HTML using a CSS selector.

CSS selector is assumed to be already validated before this function is called.

Parameters:

Name          Type  Description                                      Default
html          str   HTML content                                     required
css_selector  str   CSS selector to extract content (pre-validated)  required

Returns:

Type  Description
str   HTML content of selected elements

Raises:

Type          Description
WebdownError  If there is an error applying the selector

Source code in webdown/html_parser.py
def extract_content_with_css(html: str, css_selector: str) -> str:
    """Extract specific content from HTML using a CSS selector.

    CSS selector is assumed to be already validated before this function is called.

    Args:
        html: HTML content
        css_selector: CSS selector to extract content (pre-validated)

    Returns:
        HTML content of selected elements

    Raises:
        WebdownError: If there is an error applying the selector
    """
    import warnings

    # Note: No validation here - validation is now centralized in html_to_markdown

    try:
        soup = BeautifulSoup(html, "html.parser")
        selected = soup.select(css_selector)
        if selected:
            return "".join(str(element) for element in selected)
        else:
            # Warning - no elements matched
            warnings.warn(f"CSS selector '{css_selector}' did not match any elements")
            return html
    except Exception as e:
        raise WebdownError(
            f"Error applying CSS selector '{css_selector}': {str(e)}",
            code=ErrorCode.CSS_SELECTOR_INVALID,
        )
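
When the selector matches nothing, the function above warns and returns the full HTML rather than failing. A caller who prefers a hard failure can promote that warning to an exception with the standard `warnings` module. The sketch below uses a hypothetical stand-in, `select_or_warn`, that only mimics the warn-and-fall-back path; it does no real CSS matching.

```python
import warnings


def select_or_warn(html: str, css_selector: str) -> str:
    """Stand-in that always takes the 'no match' path (illustration only)."""
    warnings.warn(f"CSS selector '{css_selector}' did not match any elements")
    return html


html = "<p>Hello</p>"

# Lenient handling: record the warning and keep the fallback result.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = select_or_warn(html, "main.content")
print(result == html, len(caught))

# Strict handling: turn warnings into exceptions inside this block.
strict_failed = False
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        select_or_warn(html, "main.content")
    except UserWarning:
        strict_failed = True
print(strict_failed)
```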

Markdown Conversion

HTML to Markdown conversion functionality.

This module handles conversion of HTML content to Markdown with optional features:

- HTML to Markdown conversion using html2text
- Table of contents generation
- Content selection with CSS selectors
- Compact output mode
- Removal of invisible characters

The main function is html_to_markdown(), but this module also provides helper functions for each conversion step.

Functions

html_to_markdown(html: str, config: WebdownConfig) -> str

Documented in full under Main Functions above, including parameters, examples, and source code.

XML Conversion

Markdown to Claude XML conversion functionality.

This module handles conversion of Markdown content to Claude XML format:

- Processes code blocks directly (no placeholders)
- Handles headings, sections, and paragraphs
- Generates metadata when requested
- Creates a structured XML document for use with Claude

The main function is markdown_to_claude_xml(), which converts Markdown content to a format suitable for Claude AI models.

Functions

markdown_to_claude_xml(markdown: str, source_url: Optional[str] = None, include_metadata: bool = True) -> str

Documented in full under Main Functions above, including parameters and source code.

process_section(match: Match[str], level: int) -> List[str]

Process a section (heading + content) into XML.

Parameters:

Name   Type        Description                                 Default
match  Match[str]  Regex match containing heading and content  required
level  int         Indentation level                           required

Returns:

Type       Description
List[str]  List of XML strings for the section

Source code in webdown/xml_converter.py
def process_section(match: Match[str], level: int) -> List[str]:
    """Process a section (heading + content) into XML.

    Args:
        match: Regex match containing heading and content
        level: Indentation level

    Returns:
        List of XML strings for the section
    """
    heading_text = match.group(2).strip()
    content = match.group(3).strip() if match.group(3) else ""

    result = []

    # Open section
    result.append(indent_xml("<section>", level))

    # Add heading
    result.append(
        indent_xml(f"<heading>{escape_xml(heading_text)}</heading>", level + 1)
    )

    # Process content
    if content:
        result.extend(_process_paragraphs(content, level + 1))

    # Close section
    result.append(indent_xml("</section>", level))

    return result
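
process_section reads the heading from group 2 and the section body from group 3, so it expects matches from a pattern with three capture groups. The pattern that produces those matches is not shown on this page; the one below is an assumption that satisfies the same group layout.

```python
import re

# Assumed pattern (not taken from webdown):
#   group 1: the '#' markers plus following spaces or tabs
#   group 2: the heading text (rest of the line)
#   group 3: everything up to the next heading or the end of the text
SECTION_RE = re.compile(
    r"^(#+[ \t]+)([^\n]+)\n?(.*?)(?=^#+[ \t]+|\Z)",
    re.MULTILINE | re.DOTALL,
)

markdown = "# Title\n\nIntro text.\n\n## Usage\n\nRun the tool."

for match in SECTION_RE.finditer(markdown):
    heading_text = match.group(2).strip()
    content = match.group(3).strip() if match.group(3) else ""
    print(repr(heading_text), repr(content))
```

Note that the pattern used inside markdown_to_claude_xml has only two groups, so its matches cannot be passed to process_section as they are.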

Error Handling

Error handling utilities for webdown.

This module provides centralized error handling utilities used throughout the webdown package.

Functions

handle_validation_error(message: str, code: str = ErrorCode.VALIDATION_ERROR) -> NoReturn

Handle a validation error and raise a WebdownError.

Parameters:

Name     Type  Description    Default
message  str   Error message  required
code     str   Error code     VALIDATION_ERROR

Raises:

Type          Description
WebdownError  Always raised with appropriate message

Source code in webdown/error_utils.py
def handle_validation_error(
    message: str, code: str = ErrorCode.VALIDATION_ERROR
) -> NoReturn:
    """Handle a validation error and raise a WebdownError.

    Args:
        message: Error message
        code: Error code

    Raises:
        WebdownError: Always raised with appropriate message
    """
    raise WebdownError(message, code=code)

get_friendly_error_message(error: Exception) -> str

Get a user-friendly error message for an exception.

This function is intended for CLI and user-facing interfaces.

Parameters:

Name   Type       Description                         Default
error  Exception  The exception to get a message for  required

Returns:

Type  Description
str   A user-friendly error message

Source code in webdown/error_utils.py
def get_friendly_error_message(error: Exception) -> str:
    """Get a user-friendly error message for an exception.

    This function is intended for CLI and user-facing interfaces.

    Args:
        error: The exception to get a message for

    Returns:
        A user-friendly error message
    """
    # For WebdownError, we already have a good message
    if isinstance(error, WebdownError):
        # Handle URL validation errors specially for better UX
        message = str(error)
        if hasattr(error, "code") and error.code == ErrorCode.URL_INVALID:
            message += (
                "\nPlease make sure the URL includes a valid protocol "
                "and domain (like https://example.com)."
            )
        return message

    # For other exceptions, provide a generic message
    return f"An unexpected error occurred: {str(error)}"

format_error_for_cli(error: Exception) -> str

Format an error message for CLI output.

Parameters:

Name   Type       Description              Default
error  Exception  The exception to format  required

Returns:

Type  Description
str   A formatted error message for CLI output

Source code in webdown/error_utils.py
def format_error_for_cli(error: Exception) -> str:
    """Format an error message for CLI output.

    Args:
        error: The exception to format

    Returns:
        A formatted error message for CLI output
    """
    friendly_message = get_friendly_error_message(error)

    # For CLI, prefix with "Error: " and format nicely
    lines = friendly_message.split("\n")
    if len(lines) == 1:
        return f"Error: {friendly_message}"

    # For multi-line messages, format with indentation
    result = ["Error:"]
    for line in lines:
        result.append(f"  {line}")

    return "\n".join(result)

handle_request_exception(exception: Exception, url: str) -> NoReturn

Handle a request exception and raise a WebdownError with appropriate message.

Parameters:

Name       Type       Description                 Default
exception  Exception  The exception to handle     required
url        str        The URL that was requested  required

Raises:

Type          Description
WebdownError  Always raised with appropriate message

Source code in webdown/error_utils.py
def handle_request_exception(exception: Exception, url: str) -> NoReturn:
    """Handle a request exception and raise a WebdownError with appropriate message.

    Args:
        exception: The exception to handle
        url: The URL that was requested

    Raises:
        WebdownError: Always raised with appropriate message
    """
    if isinstance(exception, requests.exceptions.Timeout):
        raise WebdownError(
            f"Timeout error fetching {url}. The server took too long to respond.",
            code=ErrorCode.NETWORK_TIMEOUT,
        )
    elif isinstance(exception, requests.exceptions.ConnectionError):
        raise WebdownError(
            f"Connection error fetching {url}. Please check your internet connection.",
            code=ErrorCode.NETWORK_CONNECTION,
        )
    elif isinstance(exception, requests.exceptions.HTTPError):
        # Extract status code if available
        status_code = None
        if hasattr(exception, "response") and hasattr(
            exception.response, "status_code"
        ):
            status_code = exception.response.status_code

        status_msg = f" (Status code: {status_code})" if status_code else ""
        raise WebdownError(
            f"HTTP error fetching {url}{status_msg}. The server returned an error.",
            code=ErrorCode.HTTP_ERROR,
        )
    else:
        # Generic RequestException or any other exception
        raise WebdownError(
            f"Error fetching {url}: {str(exception)}",
            code=ErrorCode.REQUEST_ERROR,
        )
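
The order of the isinstance checks above matters when an exception belongs to more than one category. In requests, ConnectTimeout inherits from both ConnectionError and Timeout, so checking Timeout first reports it as a timeout. The sketch below reproduces that with local stand-in classes so it runs without requests installed; the class and function names are illustrative only, and the HTTPError branch is left out for brevity.

```python
class RequestException(Exception):
    """Stand-in for requests.exceptions.RequestException."""


class Timeout(RequestException):
    """Stand-in for requests.exceptions.Timeout."""


class ConnectionError_(RequestException):
    """Stand-in for requests.exceptions.ConnectionError."""


class ConnectTimeout(ConnectionError_, Timeout):
    """Stand-in: a connection attempt that timed out."""


def classify(exception: Exception) -> str:
    """Return an error code name using the same check order as above."""
    if isinstance(exception, Timeout):
        return "NETWORK_TIMEOUT"
    if isinstance(exception, ConnectionError_):
        return "NETWORK_CONNECTION"
    return "REQUEST_ERROR"


print(classify(ConnectTimeout()))
print(classify(ConnectionError_()))
print(classify(ValueError("other")))
```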

Exceptions

WebdownError

Bases: Exception

Exception for webdown errors.

This exception class is used for all errors raised by the webdown package. The error type is indicated by a descriptive message and an error code, allowing programmatic error handling.

Error types include:

- URL format errors: When the URL doesn't follow standard format
- Network errors: Connection issues, timeouts, HTTP errors
- Parsing errors: Issues with processing the HTML content
- Validation errors: Invalid parameters or configuration

Attributes:

Name  Type  Description
code  str   Error code for programmatic error handling

Source code in webdown/config.py
class WebdownError(Exception):
    """Exception for webdown errors.

    This exception class is used for all errors raised by the webdown package.
    The error type is indicated by a descriptive message and an error code,
    allowing programmatic error handling.

    Error types include:
        URL format errors: When the URL doesn't follow standard format
        Network errors: Connection issues, timeouts, HTTP errors
        Parsing errors: Issues with processing the HTML content
        Validation errors: Invalid parameters or configuration

    Attributes:
        code (str): Error code for programmatic error handling
    """

    def __init__(self, message: str, code: str = "UNEXPECTED_ERROR"):
        """Initialize a WebdownError.

        Args:
            message: Error message
            code: Error code for programmatic error handling
        """
        super().__init__(message)
        self.code = code

Functions

__init__(message: str, code: str = 'UNEXPECTED_ERROR')

Initialize a WebdownError.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| message | str | Error message | required |
| code | str | Error code for programmatic error handling | 'UNEXPECTED_ERROR' |

Source code in webdown/config.py
def __init__(self, message: str, code: str = "UNEXPECTED_ERROR"):
    """Initialize a WebdownError.

    Args:
        message: Error message
        code: Error code for programmatic error handling
    """
    super().__init__(message)
    self.code = code
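
Because every failure surfaces as a single exception type, callers can branch on `code` rather than parsing the message. The sketch below uses a minimal stand-in for `WebdownError` so that it runs on its own. The `describe` helper is hypothetical, and the string `"HTTP_ERROR"` is an assumed value for `ErrorCode.HTTP_ERROR`, whose definition is not shown on this page.

```python
class WebdownError(Exception):
    """Minimal stand-in mirroring the class shown above."""

    def __init__(self, message: str, code: str = "UNEXPECTED_ERROR"):
        super().__init__(message)
        self.code = code


def describe(error: WebdownError) -> str:
    """Map an error code to a short, user-facing hint (hypothetical helper)."""
    # "HTTP_ERROR" is an assumed string value; the real constants live in ErrorCode.
    hints = {
        "HTTP_ERROR": "The server rejected the request.",
        "UNEXPECTED_ERROR": "Something unexpected went wrong.",
    }
    return hints.get(error.code, f"Unhandled error code: {error.code}")


try:
    raise WebdownError("HTTP error fetching https://example.com", code="HTTP_ERROR")
except WebdownError as exc:
    print(f"{exc} -> {describe(exc)}")
```

In real code, comparing against the `ErrorCode` constants instead of string literals would avoid depending on their exact values.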

Validation

Validation utilities for webdown.

This module provides centralized validation functions for various inputs used throughout the webdown package.

Functions

validate_url(url: str) -> str

Validate a URL and return it if valid.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| url | str | The URL to validate | required |

Returns:

| Type | Description |
| --- | --- |
| str | The validated URL |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the URL is invalid |

Source code in webdown/validation.py
def validate_url(url: str) -> str:
    """Validate a URL and return it if valid.

    Args:
        url: The URL to validate

    Returns:
        The validated URL

    Raises:
        ValueError: If the URL is invalid
    """
    if not url:
        raise ValueError("URL cannot be empty")

    parsed = urllib.parse.urlparse(url)

    # Check if URL has a scheme and netloc
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(
            f"Invalid URL: {url}. URL must include scheme "
            f"(http:// or https://) and domain."
        )

    # Ensure scheme is http or https
    if parsed.scheme not in ["http", "https"]:
        raise ValueError(
            f"Invalid URL scheme: {parsed.scheme}. Only http and https are supported."
        )

    return url
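
The two checks in `validate_url` depend on how `urllib.parse.urlparse` splits its input. The short sketch below prints the `scheme` and `netloc` for three sample URLs, which shows why a URL without a scheme fails the first check and an `ftp://` URL fails the second. The helper `scheme_and_netloc` and the sample URLs are illustrative only.

```python
from typing import Tuple
from urllib.parse import urlparse


def scheme_and_netloc(url: str) -> Tuple[str, str]:
    """Return the two URL components that validate_url inspects."""
    parsed = urlparse(url)
    return parsed.scheme, parsed.netloc


for candidate in (
    "https://example.com/page",  # passes both checks
    "example.com/page",          # no scheme and no netloc: fails the first check
    "ftp://example.com/file",    # scheme is not http/https: fails the second check
):
    scheme, netloc = scheme_and_netloc(candidate)
    print(f"{candidate!r}: scheme={scheme!r}, netloc={netloc!r}")
```

Without a `//` prefix, `urlparse` treats the whole of `example.com/page` as a path, so both `scheme` and `netloc` come back empty.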

validate_css_selector(selector: str) -> str

Validate a CSS selector.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| selector | str | The CSS selector to validate | required |

Returns:

| Type | Description |
| --- | --- |
| str | The validated CSS selector |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the selector is invalid |

Source code in webdown/validation.py
def validate_css_selector(selector: str) -> str:
    """Validate a CSS selector.

    Args:
        selector: The CSS selector to validate

    Returns:
        The validated CSS selector

    Raises:
        ValueError: If the selector is invalid
    """
    if not selector:
        raise ValueError("CSS selector cannot be empty")

    # Simple validation: build a minimal soup and try the selector against it.
    # Any exception raised by select() is re-raised below as a ValueError.
    try:
        soup = BeautifulSoup("<html></html>", "html.parser")
        soup.select(selector)
        return selector
    except Exception as e:
        raise ValueError(f"Invalid CSS selector: {selector}. Error: {e}") from e

validate_body_width(width: Optional[int]) -> Optional[int]

Validate body width parameter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| width | Optional[int] | The body width to validate, or None for no width limit | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[int] | The validated body width or None |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the width is invalid |

Source code in webdown/validation.py
def validate_body_width(width: Optional[int]) -> Optional[int]:
    """Validate body width parameter.

    Args:
        width: The body width to validate, or None for no width limit

    Returns:
        The validated body width or None

    Raises:
        ValueError: If the width is invalid
    """
    if width is None:
        return None

    # Ensure width is an integer and within reasonable range
    if not isinstance(width, int):
        raise ValueError(f"Body width must be an integer, got {type(width).__name__}")

    if width < 0:
        raise ValueError(f"Body width cannot be negative, got {width}")

    # Upper limit of 2000 is arbitrary but reasonable
    if width > 2000:
        raise ValueError(f"Body width too large, maximum is 2000, got {width}")

    return width
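
One subtlety of the `isinstance(width, int)` check: `bool` is a subclass of `int` in Python, so `True` and `False` pass it and, being within the 0 to 2000 range, are returned unchanged. If that is not desired, a stricter test could look like the sketch below. `is_strict_int` is a hypothetical helper and not part of webdown.

```python
def is_strict_int(value: object) -> bool:
    """Return True for real integers and False for booleans (hypothetical helper)."""
    return isinstance(value, int) and not isinstance(value, bool)


print(isinstance(True, int))  # True, because bool is a subclass of int
print(is_strict_int(True))    # False
print(is_strict_int(80))      # True
print(is_strict_int(80.0))    # False
```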

validate_numeric_parameter(name: str, value: Optional[int], min_value: Optional[int] = None, max_value: Optional[int] = None) -> Optional[int]

Validate a numeric parameter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | The name of the parameter (for error messages) | required |
| value | Optional[int] | The value to validate | required |
| min_value | Optional[int] | Optional minimum value | None |
| max_value | Optional[int] | Optional maximum value | None |

Returns:

| Type | Description |
| --- | --- |
| Optional[int] | The validated value |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the value is invalid |

Source code in webdown/validation.py
def validate_numeric_parameter(
    name: str,
    value: Optional[int],
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
) -> Optional[int]:
    """Validate a numeric parameter.

    Args:
        name: The name of the parameter (for error messages)
        value: The value to validate
        min_value: Optional minimum value
        max_value: Optional maximum value

    Returns:
        The validated value

    Raises:
        ValueError: If the value is invalid
    """
    if value is None:
        return None

    if not isinstance(value, int):
        raise ValueError(f"{name} must be an integer, got {type(value).__name__}")

    if min_value is not None and value < min_value:
        raise ValueError(f"{name} must be at least {min_value}, got {value}")

    if max_value is not None and value > max_value:
        raise ValueError(f"{name} must be at most {max_value}, got {value}")

    return value
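
A short usage sketch follows. So that it runs on its own, it includes a condensed copy of the function above. The parameter name `timeout` and its 1 to 60 range are made-up values chosen only for illustration.

```python
from typing import Optional


def validate_numeric_parameter(
    name: str,
    value: Optional[int],
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
) -> Optional[int]:
    """Condensed copy of the function above so this snippet runs on its own."""
    if value is None:
        return None
    if not isinstance(value, int):
        raise ValueError(f"{name} must be an integer, got {type(value).__name__}")
    if min_value is not None and value < min_value:
        raise ValueError(f"{name} must be at least {min_value}, got {value}")
    if max_value is not None and value > max_value:
        raise ValueError(f"{name} must be at most {max_value}, got {value}")
    return value


# A value inside the range is returned unchanged.
print(validate_numeric_parameter("timeout", 30, min_value=1, max_value=60))  # 30

# None means "not set" and skips every check.
print(validate_numeric_parameter("timeout", None, min_value=1))  # None

# A value below the minimum raises ValueError with the parameter name in the message.
try:
    validate_numeric_parameter("timeout", 0, min_value=1, max_value=60)
except ValueError as exc:
    print(exc)  # timeout must be at least 1, got 0
```

Both bounds are inclusive, and either one can be omitted to leave that side unbounded.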

validate_string_parameter(name: str, value: Optional[str], allowed_values: Optional[list] = None) -> Optional[str]

Validate a string parameter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | The name of the parameter (for error messages) | required |
| value | Optional[str] | The value to validate | required |
| allowed_values | Optional[list] | Optional list of allowed values | None |

Returns:

| Type | Description |
| --- | --- |
| Optional[str] | The validated value |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the value is invalid |

Source code in webdown/validation.py
def validate_string_parameter(
    name: str, value: Optional[str], allowed_values: Optional[list] = None
) -> Optional[str]:
    """Validate a string parameter.

    Args:
        name: The name of the parameter (for error messages)
        value: The value to validate
        allowed_values: Optional list of allowed values

    Returns:
        The validated value

    Raises:
        ValueError: If the value is invalid
    """
    if value is None:
        return None

    if not isinstance(value, str):
        raise ValueError(f"{name} must be a string, got {type(value).__name__}")

    if allowed_values is not None and value not in allowed_values:
        allowed_str = ", ".join(allowed_values)
        raise ValueError(f"{name} must be one of: {allowed_str}, got {value}")

    return value
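
Note that `allowed_values` is typed as a plain `list`, and the error message is built with `", ".join(allowed_values)`. If the list ever held non-string items, that join would raise `TypeError` before the intended `ValueError` could be raised. The sketch below shows one possible guard that converts each item to `str` first. `format_allowed` is a hypothetical helper, and the sample values are made up.

```python
from typing import Iterable


def format_allowed(allowed_values: Iterable[object]) -> str:
    """Join allowed values for an error message, converting each to str first."""
    return ", ".join(str(item) for item in allowed_values)


# A plain join rejects non-string items.
try:
    ", ".join([1, 2, 3])
except TypeError as exc:
    print(f"TypeError: {exc}")

# Converting each item first works for any element type.
print(format_allowed([1, 2, 3]))           # 1, 2, 3
print(format_allowed(["atx", "setext"]))   # atx, setext
```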

validate_boolean_parameter(name: str, value: Optional[bool]) -> Optional[bool]

Validate a boolean parameter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | The name of the parameter (for error messages) | required |
| value | Optional[bool] | The value to validate | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[bool] | The validated value |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the value is invalid |

Source code in webdown/validation.py
def validate_boolean_parameter(name: str, value: Optional[bool]) -> Optional[bool]:
    """Validate a boolean parameter.

    Args:
        name: The name of the parameter (for error messages)
        value: The value to validate

    Returns:
        The validated value

    Raises:
        ValueError: If the value is invalid
    """
    if value is None:
        return None

    if not isinstance(value, bool):
        raise ValueError(f"{name} must be a boolean, got {type(value).__name__}")

    return value