Core API Reference

This page documents the main API functions and classes provided by Webdown.

Main Functions

HTML to Markdown and Claude XML conversion functionality.

This module serves as the main entry point for the webdown package, providing the primary functions for converting web content to Markdown and Claude XML formats.

The conversion process involves multiple steps:

1. Fetch or read HTML content (from URL or local file)
2. Convert HTML to Markdown
3. Optionally convert Markdown to Claude XML format

Key functions:

- `convert_url`: Convert web content to Markdown or XML
- `convert_file`: Convert local HTML file to Markdown or XML

Functions

`html_to_markdown(html: str, config: WebdownConfig) -> str`

Convert HTML to Markdown with formatting options.

This function takes HTML content and converts it to Markdown format based on the provided configuration object.

Parameters:

Name	Type	Description	Default
`html`	`str`	HTML content to convert	required
`config`	`WebdownConfig`	Configuration options for the conversion	required

Returns:

Type	Description
`str`	Converted Markdown content

Examples:

>>> html = "<h1>Title</h1><p>Content with <a href='#'>link</a></p>"
>>> config = WebdownConfig()
>>> print(html_to_markdown(html, config))
# Title

Content with [link](#)

>>> config = WebdownConfig(include_links=False)
>>> print(html_to_markdown(html, config))
# Title

Content with link

Source code in webdown/markdown_converter.py

def html_to_markdown(
    html: str,
    config: WebdownConfig,
) -> str:
    """Convert HTML to Markdown with formatting options.

    This function takes HTML content and converts it to Markdown format
    based on the provided configuration object.

    Args:
        html: HTML content to convert
        config: Configuration options for the conversion

    Returns:
        Converted Markdown content

    Examples:
        >>> html = "<h1>Title</h1><p>Content with <a href='#'>link</a></p>"
        >>> config = WebdownConfig()
        >>> print(html_to_markdown(html, config))
        # Title

        Content with [link](#)

        >>> config = WebdownConfig(include_links=False)
        >>> print(html_to_markdown(html, config))
        # Title

        Content with link
    """
    # Validate all configuration parameters
    _validate_config(config)

    # Extract specific content by CSS selector if provided
    if config.css_selector:
        html = extract_content_with_css(html, config.css_selector)

    # Configure and run html2text
    converter = _configure_html2text(config)
    markdown = converter.handle(html)

    # Clean up the markdown
    markdown = clean_markdown(markdown, config.document_options.compact_output)

    # Add table of contents if requested
    if config.document_options.include_toc:
        markdown = generate_table_of_contents(markdown)

    return str(markdown)

`markdown_to_claude_xml(markdown: str, source_url: Optional[str] = None, include_metadata: bool = True) -> str`

Convert Markdown content to Claude XML format.

This function converts Markdown content to a structured XML format suitable for use with Claude AI models. It handles elements like headings, paragraphs, and code blocks, organizing them into a hierarchical XML document.

Parameters:

Name	Type	Description	Default
`markdown`	`str`	Markdown content to convert	required
`source_url`	`Optional[str]`	Source URL for the content (for metadata)	`None`
`include_metadata`	`bool`	Whether to include metadata section (title, source, date)	`True`

Returns:

Type	Description
`str`	Claude XML formatted content

Source code in webdown/xml_converter.py

def markdown_to_claude_xml(
    markdown: str,
    source_url: Optional[str] = None,
    include_metadata: bool = True,
) -> str:
    """Convert Markdown content to Claude XML format.

    This function converts Markdown content to a structured XML format
    suitable for use with Claude AI models. It handles elements like
    headings, paragraphs, and code blocks, organizing them into a
    hierarchical XML document.

    Args:
        markdown: Markdown content to convert
        source_url: Source URL for the content (for metadata)
        include_metadata: Whether to include metadata section (title, source, date)

    Returns:
        Claude XML formatted content
    """
    xml_parts = []

    # Use a fixed document tag - simplifying configuration
    doc_tag = "claude_documentation"

    # Root element
    xml_parts.append(f"<{doc_tag}>")

    # Extract title
    title = extract_markdown_title(markdown)

    # Add metadata if requested
    if include_metadata:
        xml_parts.extend(generate_metadata_xml(title, source_url))

    # Begin content section
    xml_parts.append(indent_xml("<content>", 1))

    # Process all content by section

    # Extract all section headings
    section_matches = list(re.finditer(r"^(#+\s+)(.+?)$", markdown, re.MULTILINE))

    if section_matches:
        # Process each section including content following the heading
        for i, match in enumerate(section_matches):
            heading_start = match.start()
            heading = match.group(0)
            # If this is the last heading, content goes to the end
            if i == len(section_matches) - 1:
                content = markdown[heading_start + len(heading) :].strip()
            else:
                # Otherwise content goes until the next heading
                next_heading_start = section_matches[i + 1].start()
                content = markdown[
                    heading_start + len(heading) : next_heading_start
                ].strip()

            # Create section with heading and content
            section_xml = []
            section_xml.append(indent_xml("<section>", 2))
            section_xml.append(
                indent_xml(
                    f"<heading>{escape_xml(match.group(2).strip())}</heading>", 3
                )
            )

            # Process content inside this section
            if content:
                section_xml.extend(_process_paragraphs(content, 3))

            section_xml.append(indent_xml("</section>", 2))
            xml_parts.extend(section_xml)

        # Process content before the first heading (if any)
        if section_matches[0].start() > 0:
            pre_content = markdown[: section_matches[0].start()].strip()
            if pre_content:
                # Add pre-heading content at the beginning
                pre_parts = _process_paragraphs(pre_content, 2)

                xml_parts = xml_parts[:2] + pre_parts + xml_parts[2:]
    else:
        # No headings - just process all content
        xml_parts.extend(_process_paragraphs(markdown, 2))

    # Close content and root
    xml_parts.append(indent_xml("</content>", 1))
    xml_parts.append(f"</{doc_tag}>")

    return "\n".join(xml_parts)
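
The section-splitting step in the source above can be tried in isolation. The following is a minimal sketch that reuses the same heading pattern; `HEADING_RE` and `split_sections` are hypothetical names chosen for illustration and are not part of webdown.

```python
import re

# Same heading pattern as in the source above: '#' markers, whitespace, text.
HEADING_RE = re.compile(r"^(#+\s+)(.+?)$", re.MULTILINE)


def split_sections(markdown: str) -> tuple[str, list[tuple[str, str]]]:
    """Return (pre-heading text, [(heading text, body text), ...])."""
    matches = list(HEADING_RE.finditer(markdown))
    if not matches:
        return markdown.strip(), []

    sections = []
    for i, match in enumerate(matches):
        # A body runs from the end of its heading line to the start of the
        # next heading, or to the end of the document for the last heading.
        if i + 1 < len(matches):
            body_end = matches[i + 1].start()
        else:
            body_end = len(markdown)
        body = markdown[match.end():body_end].strip()
        sections.append((match.group(2).strip(), body))

    # Anything before the first heading is returned separately.
    return markdown[: matches[0].start()].strip(), sections


sample = "Intro line\n\n# Title\n\nFirst paragraph.\n\n## Details\n\nMore text.\n"
preamble, sections = split_sections(sample)
print(preamble)
print(sections)
```

Because the pattern is purely line-based, a line that starts with `#` inside a fenced code block would also be treated as a heading by this sketch.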

Configuration Classes

`WebdownConfig`

Configuration options for HTML to Markdown conversion.

This class centralizes all configuration options for the conversion process, focusing on the most useful options for LLM documentation processing.

Attributes:

Name	Type	Description
`url`	`Optional[str]`	URL of the web page to convert
`file_path`	`Optional[str]`	Path to local HTML file to convert
`include_links`	`bool`	Whether to include hyperlinks (True) or plain text (False)
`include_images`	`bool`	Whether to include images (True) or exclude them
`css_selector`	`Optional[str]`	CSS selector to extract specific content
`show_progress`	`bool`	Whether to display a progress bar during download
`format`	`OutputFormat`	Output format (Markdown or Claude XML)
`document_options`	`DocumentOptions`	Document structure configuration

Source code in webdown/config.py

@dataclass
class WebdownConfig:
    """Configuration options for HTML to Markdown conversion.

    This class centralizes all configuration options for the conversion process,
    focusing on the most useful options for LLM documentation processing.

    Attributes:
        url (Optional[str]): URL of the web page to convert
        file_path (Optional[str]): Path to local HTML file to convert
        include_links (bool): Whether to include hyperlinks (True) or plain text (False)
        include_images (bool): Whether to include images (True) or exclude them
        css_selector (Optional[str]): CSS selector to extract specific content
        show_progress (bool): Whether to display a progress bar during download
        format (OutputFormat): Output format (Markdown or Claude XML)
        document_options (DocumentOptions): Document structure configuration
    """

    # Source options
    url: Optional[str] = None
    file_path: Optional[str] = None
    show_progress: bool = False

    # Content options
    include_links: bool = True
    include_images: bool = True
    css_selector: Optional[str] = None

    # Output options
    format: OutputFormat = OutputFormat.MARKDOWN

    # We need to use field with default_factory to avoid mutable default value
    document_options: DocumentOptions = field(default_factory=DocumentOptions)
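
The `default_factory` comment in the source above can be illustrated on its own. The sketch below uses simplified stand-in classes (`Options` and `Config` are hypothetical names, not webdown's classes) to show that each config instance receives its own nested options object.

```python
from dataclasses import dataclass, field


@dataclass
class Options:
    """Stand-in for a nested options object (hypothetical, not webdown's class)."""

    include_toc: bool = False


@dataclass
class Config:
    """Stand-in for a top-level config that holds nested options."""

    # default_factory builds a fresh Options for every Config instance, so
    # changing one config's options cannot leak into another config.
    options: Options = field(default_factory=Options)


first = Config()
second = Config()
first.options.include_toc = True

# The second instance keeps its own, unmodified options object.
print(first.options.include_toc, second.options.include_toc)
```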

Attributes

`css_selector: Optional[str] = None` `class-attribute` `instance-attribute`

`document_options: DocumentOptions = field(default_factory=DocumentOptions)` `class-attribute` `instance-attribute`

`file_path: Optional[str] = None` `class-attribute` `instance-attribute`

`format: OutputFormat = OutputFormat.MARKDOWN` `class-attribute` `instance-attribute`

`include_images: bool = True` `class-attribute` `instance-attribute`

`include_links: bool = True` `class-attribute` `instance-attribute`

`show_progress: bool = False` `class-attribute` `instance-attribute`

`url: Optional[str] = None` `class-attribute` `instance-attribute`

Functions

`__init__(url: Optional[str] = None, file_path: Optional[str] = None, show_progress: bool = False, include_links: bool = True, include_images: bool = True, css_selector: Optional[str] = None, format: OutputFormat = OutputFormat.MARKDOWN, document_options: DocumentOptions = DocumentOptions()) -> None`

`DocumentOptions`

Configuration for document output structure.

This class contains settings that affect the structure of the generated document, independent of the output format.

Attributes:

Name	Type	Description
`include_toc`	`bool`	Whether to generate a table of contents
`compact_output`	`bool`	Whether to remove excessive blank lines
`body_width`	`int`	Maximum line length for wrapping (0 for no wrapping)
`include_metadata`	`bool`	Include metadata section with title, source URL, date (only applies to Claude XML format)

Source code in webdown/config.py

@dataclass
class DocumentOptions:
    """Configuration for document output structure.

    This class contains settings that affect the structure of the generated document,
    independent of the output format.

    Attributes:
        include_toc (bool): Whether to generate a table of contents
        compact_output (bool): Whether to remove excessive blank lines
        body_width (int): Maximum line length for wrapping (0 for no wrapping)
        include_metadata (bool): Include metadata section with title, source URL, date
            (only applies to Claude XML format)
    """

    include_toc: bool = False
    compact_output: bool = False
    body_width: int = 0
    include_metadata: bool = True

Attributes

`body_width: int = 0` `class-attribute` `instance-attribute`

`compact_output: bool = False` `class-attribute` `instance-attribute`

`include_metadata: bool = True` `class-attribute` `instance-attribute`

`include_toc: bool = False` `class-attribute` `instance-attribute`

Functions

`__init__(include_toc: bool = False, compact_output: bool = False, body_width: int = 0, include_metadata: bool = True) -> None`

HTML Parsing

HTML parsing and fetching functionality.

This module handles fetching web content and basic HTML parsing:

- URL validation and verification
- HTML fetching with proper error handling and progress tracking
- HTML file reading from local filesystem
- Content extraction with CSS selectors
- Streaming support for large web pages

The primary functions are fetch_url() for retrieving HTML content from the web, read_html_file() for reading HTML from local files, and extract_content_with_css() for selecting specific parts of HTML.

Functions

`fetch_url(url: str, show_progress: bool = False) -> str`

Fetch HTML content from URL with optional progress bar.

This is a simplified wrapper around fetch_url_with_progress with default parameters.

Parameters:

Name	Type	Description	Default
`url`	`str`	URL to fetch	required
`show_progress`	`bool`	Whether to display a progress bar during download	`False`

Returns:

Type	Description
`str`	HTML content as string

Raises:

Type	Description
`WebdownError`	If URL is invalid or content cannot be fetched

Source code in webdown/html_parser.py

def fetch_url(url: str, show_progress: bool = False) -> str:
    """Fetch HTML content from URL with optional progress bar.

    This is a simplified wrapper around fetch_url_with_progress with default parameters.

    Args:
        url: URL to fetch
        show_progress: Whether to display a progress bar during download

    Returns:
        HTML content as string

    Raises:
        WebdownError: If URL is invalid or content cannot be fetched
    """
    # Validate URL for backward compatibility with tests
    # In normal usage, URL is already validated by _get_normalized_config
    try:
        validate_url(url)
    except ValueError as e:
        raise WebdownError(str(e), code=ErrorCode.URL_INVALID)

    return fetch_url_with_progress(url, show_progress, chunk_size=1024, timeout=10)
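
The validate-then-translate pattern in `fetch_url` can be sketched without the network call. In the example below, `FetchError`, `check_url`, and `checked_fetch_target` are hypothetical stand-ins, and the string `"URL_INVALID"` is an assumed value for the error code; none of these are taken from webdown itself.

```python
from urllib.parse import urlparse


class FetchError(Exception):
    """Stand-in for a package-level error carrying a machine-readable code."""

    def __init__(self, message: str, code: str = "UNEXPECTED_ERROR"):
        super().__init__(message)
        self.code = code


def check_url(url: str) -> str:
    """Raise ValueError unless the URL has both a scheme and a network location."""
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url


def checked_fetch_target(url: str) -> str:
    """Validate a URL, converting ValueError into the package-level error."""
    try:
        return check_url(url)
    except ValueError as exc:
        # 'from exc' keeps the original exception attached for debugging.
        raise FetchError(str(exc), code="URL_INVALID") from exc


try:
    checked_fetch_target("example.com/page")  # no scheme, so this is rejected
except FetchError as err:
    print(err.code, "-", err)
```

Callers then only need to catch one exception type and can branch on `code` rather than on the message text.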

`fetch_url_with_progress(url: str, show_progress: bool = False, chunk_size: int = 1024, timeout: int = 10) -> str`

Fetch content from URL with streaming and optional progress bar.

Parameters:

Name	Type	Description	Default
`url`	`str`	URL to fetch	required
`show_progress`	`bool`	Whether to display a progress bar during download	`False`
`chunk_size`	`int`	Size of chunks to read in bytes	`1024`
`timeout`	`int`	Request timeout in seconds	`10`

Returns:

Type	Description
`str`	Content as string

Raises:

Type	Description
`WebdownError`	If content cannot be fetched

Source code in webdown/html_parser.py

def fetch_url_with_progress(
    url: str, show_progress: bool = False, chunk_size: int = 1024, timeout: int = 10
) -> str:
    """Fetch content from URL with streaming and optional progress bar.

    Args:
        url: URL to fetch
        show_progress: Whether to display a progress bar during download
        chunk_size: Size of chunks to read in bytes
        timeout: Request timeout in seconds

    Returns:
        Content as string

    Raises:
        WebdownError: If content cannot be fetched
    """
    # Note: URL validation is now centralized in _get_normalized_config
    # We assume URL is already validated when this function is called

    try:
        # Make a GET request with stream=True for both cases
        response = requests.get(url, timeout=timeout, stream=True)
        response.raise_for_status()

        # Try to handle small responses without streaming for performance
        small_response = _handle_small_response(response, show_progress)
        if small_response is not None:
            return small_response

        # For larger responses or when progress is requested, use streaming
        total_size = int(response.headers.get("content-length", 0))
        with _create_progress_bar(url, total_size, show_progress) as progress_bar:
            return _process_response_chunks(response, progress_bar, chunk_size)

    except (
        requests.exceptions.Timeout,
        requests.exceptions.ConnectionError,
        requests.exceptions.HTTPError,
        requests.exceptions.RequestException,
    ) as e:
        # This function raises a WebdownError with appropriate message
        handle_request_exception(e, url)
        # The line below is never reached but needed for type checking
        raise RuntimeError("This should never be reached")  # pragma: no cover
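
The helper `_process_response_chunks` is not shown on this page, so the following is only a rough sketch of chunked reading with a progress callback. It reads from an in-memory `io.BytesIO` stream instead of an HTTP response, and `read_in_chunks` and `on_progress` are hypothetical names introduced for illustration.

```python
import io
from typing import BinaryIO, Callable, Optional


def read_in_chunks(
    stream: BinaryIO,
    chunk_size: int = 1024,
    on_progress: Optional[Callable[[int], None]] = None,
) -> str:
    """Read a binary stream chunk by chunk and decode the result as UTF-8."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
        if on_progress is not None:
            # Report the number of bytes just read, e.g. to advance a progress bar.
            on_progress(len(chunk))
    # Decode once at the end so multi-byte characters that straddle a chunk
    # boundary are still decoded correctly.
    return b"".join(chunks).decode("utf-8")


payload = ("<p>héllo</p>" * 300).encode("utf-8")
seen = []
html = read_in_chunks(io.BytesIO(payload), chunk_size=1024, on_progress=seen.append)
print(len(html), sum(seen), len(seen))
```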

`extract_content_with_css(html: str, css_selector: str) -> str`

Extract specific content from HTML using a CSS selector.

CSS selector is assumed to be already validated before this function is called.

Parameters:

Name	Type	Description	Default
`html`	`str`	HTML content	required
`css_selector`	`str`	CSS selector to extract content (pre-validated)	required

Returns:

Type	Description
`str`	HTML content of selected elements

Raises:

Type	Description
`WebdownError`	If there is an error applying the selector

Source code in webdown/html_parser.py

def extract_content_with_css(html: str, css_selector: str) -> str:
    """Extract specific content from HTML using a CSS selector.

    CSS selector is assumed to be already validated before this function is called.

    Args:
        html: HTML content
        css_selector: CSS selector to extract content (pre-validated)

    Returns:
        HTML content of selected elements

    Raises:
        WebdownError: If there is an error applying the selector
    """
    import warnings

    # Note: No validation here - validation is now centralized in html_to_markdown

    try:
        soup = BeautifulSoup(html, "html.parser")
        selected = soup.select(css_selector)
        if selected:
            return "".join(str(element) for element in selected)
        else:
            # Warning - no elements matched
            warnings.warn(f"CSS selector '{css_selector}' did not match any elements")
            return html
    except Exception as e:
        raise WebdownError(
            f"Error applying CSS selector '{css_selector}': {str(e)}",
            code=ErrorCode.CSS_SELECTOR_INVALID,
        )
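
Since a non-matching selector produces a warning and returns the original HTML rather than raising, a caller may want to detect that fallback. The sketch below shows one way to do so with the standard `warnings` module; `select_or_fallback` is a hypothetical stand-in that only mimics the warn-and-return behavior and does not parse HTML.

```python
import warnings


def select_or_fallback(html: str, matches: list[str], selector: str) -> str:
    """Stand-in mimicking the warn-and-return-original behavior shown above."""
    if matches:
        return "".join(matches)
    warnings.warn(f"CSS selector '{selector}' did not match any elements")
    return html


page = "<main><p>Body</p></main>"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record the warning even if seen before
    result = select_or_fallback(page, [], "article.content")

# True when the selector matched nothing and the full page was returned.
fell_back = any("did not match" in str(w.message) for w in caught)
print(fell_back, result == page)
```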

Markdown Conversion

HTML to Markdown conversion functionality.

This module handles conversion of HTML content to Markdown with optional features:

- HTML to Markdown conversion using html2text
- Table of contents generation
- Content selection with CSS selectors
- Compact output mode
- Removal of invisible characters

The main function is html_to_markdown(), but this module also provides helper functions for each conversion step.

Functions

`html_to_markdown(html: str, config: WebdownConfig) -> str`

See the full entry for this function under Main Functions above.

XML Conversion

Markdown to Claude XML conversion functionality.

This module handles conversion of Markdown content to Claude XML format:

- Processes code blocks directly (no placeholders)
- Handles headings, sections, and paragraphs
- Generates metadata when requested
- Creates a structured XML document for use with Claude

The main function is markdown_to_claude_xml(), which converts Markdown content to a format suitable for Claude AI models.

Functions

`markdown_to_claude_xml(markdown: str, source_url: Optional[str] = None, include_metadata: bool = True) -> str`

See the full entry for this function under Main Functions above.

`process_section(match: Match[str], level: int) -> List[str]`

Process a section (heading + content) into XML.

Parameters:

Name	Type	Description	Default
`match`	`Match[str]`	Regex match containing heading and content	required
`level`	`int`	Indentation level	required

Returns:

Type	Description
`List[str]`	List of XML strings for the section

Source code in webdown/xml_converter.py

def process_section(match: Match[str], level: int) -> List[str]:
    """Process a section (heading + content) into XML.

    Args:
        match: Regex match containing heading and content
        level: Indentation level

    Returns:
        List of XML strings for the section
    """
    heading_text = match.group(2).strip()
    content = match.group(3).strip() if match.group(3) else ""

    result = []

    # Open section
    result.append(indent_xml("<section>", level))

    # Add heading
    result.append(
        indent_xml(f"<heading>{escape_xml(heading_text)}</heading>", level + 1)
    )

    # Process content
    if content:
        result.extend(_process_paragraphs(content, level + 1))

    # Close section
    result.append(indent_xml("</section>", level))

    return result
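
`process_section` reads the heading from group 2 and the content from group 3 of the match, but the pattern that produces such matches is not shown on this page. The sketch below uses a hypothetical three-group pattern (an assumption, not webdown's actual regex) to show what those groups would contain.

```python
import re

# Hypothetical three-group pattern: (1) '#' markers, (2) heading text,
# (3) everything up to the next heading line or the end of the document.
SECTION_RE = re.compile(
    r"^(#+\s+)(.+?)\n(.*?)(?=^#+\s|\Z)",
    re.MULTILINE | re.DOTALL,
)

markdown = "# Intro\nWelcome.\n\n## Usage\nRun the tool.\n"

results = []
for match in SECTION_RE.finditer(markdown):
    # Same group handling as in process_section above.
    heading_text = match.group(2).strip()
    content = match.group(3).strip() if match.group(3) else ""
    results.append((heading_text, content))

print(results)
```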

Error Handling

Error handling utilities for webdown.

This module provides centralized error handling utilities used throughout the webdown package.

Functions

`handle_validation_error(message: str, code: str = ErrorCode.VALIDATION_ERROR) -> NoReturn`

Handle a validation error and raise a WebdownError.

Parameters:

Name	Type	Description	Default
`message`	`str`	Error message	required
`code`	`str`	Error code	`VALIDATION_ERROR`

Raises:

Type	Description
`WebdownError`	Always raised with appropriate message

Source code in webdown/error_utils.py

def handle_validation_error(
    message: str, code: str = ErrorCode.VALIDATION_ERROR
) -> NoReturn:
    """Handle a validation error and raise a WebdownError.

    Args:
        message: Error message
        code: Error code

    Raises:
        WebdownError: Always raised with appropriate message
    """
    raise WebdownError(message, code=code)

`get_friendly_error_message(error: Exception) -> str`

Get a user-friendly error message for an exception.

This function is intended for CLI and user-facing interfaces.

Parameters:

Name	Type	Description	Default
`error`	`Exception`	The exception to get a message for	required

Returns:

Type	Description
`str`	A user-friendly error message

Source code in webdown/error_utils.py

def get_friendly_error_message(error: Exception) -> str:
    """Get a user-friendly error message for an exception.

    This function is intended for CLI and user-facing interfaces.

    Args:
        error: The exception to get a message for

    Returns:
        A user-friendly error message
    """
    # For WebdownError, we already have a good message
    if isinstance(error, WebdownError):
        # Handle URL validation errors specially for better UX
        message = str(error)
        if hasattr(error, "code") and error.code == ErrorCode.URL_INVALID:
            message += (
                "\nPlease make sure the URL includes a valid protocol "
                "and domain (like https://example.com)."
            )
        return message

    # For other exceptions, provide a generic message
    return f"An unexpected error occurred: {str(error)}"

`format_error_for_cli(error: Exception) -> str`

Format an error message for CLI output.

Parameters:

Name	Type	Description	Default
`error`	`Exception`	The exception to format	required

Returns:

Type	Description
`str`	A formatted error message for CLI output

Source code in webdown/error_utils.py

def format_error_for_cli(error: Exception) -> str:
    """Format an error message for CLI output.

    Args:
        error: The exception to format

    Returns:
        A formatted error message for CLI output
    """
    friendly_message = get_friendly_error_message(error)

    # For CLI, prefix with "Error: " and format nicely
    lines = friendly_message.split("\n")
    if len(lines) == 1:
        return f"Error: {friendly_message}"

    # For multi-line messages, format with indentation
    result = ["Error:"]
    for line in lines:
        result.append(f"  {line}")

    return "\n".join(result)
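
To see the two output shapes side by side, the formatting rules above can be applied to plain strings. This is a minimal sketch: `format_for_cli` is a hypothetical helper that mirrors only the prefixing and indentation logic and skips the exception handling, and the messages are made up for illustration.

```python
def format_for_cli(message: str) -> str:
    """Mirror the formatting rules shown above for a plain message string."""
    lines = message.split("\n")
    if len(lines) == 1:
        return f"Error: {message}"
    # Multi-line messages get an "Error:" header and two-space indentation.
    return "\n".join(["Error:"] + [f"  {line}" for line in lines])


single = format_for_cli("HTTP error fetching https://example.com.")
multi = format_for_cli("Invalid URL: example\nPlease include a valid protocol.")

print(single)
print(multi)
```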

`handle_request_exception(exception: Exception, url: str) -> NoReturn`

Handle a request exception and raise a WebdownError with appropriate message.

Parameters:

Name	Type	Description	Default
`exception`	`Exception`	The exception to handle	required
`url`	`str`	The URL that was requested	required

Raises:

Type	Description
`WebdownError`	Always raised with appropriate message

Source code in webdown/error_utils.py

def handle_request_exception(exception: Exception, url: str) -> NoReturn:
    """Handle a request exception and raise a WebdownError with appropriate message.

    Args:
        exception: The exception to handle
        url: The URL that was requested

    Raises:
        WebdownError: Always raised with appropriate message
    """
    if isinstance(exception, requests.exceptions.Timeout):
        raise WebdownError(
            f"Timeout error fetching {url}. The server took too long to respond.",
            code=ErrorCode.NETWORK_TIMEOUT,
        )
    elif isinstance(exception, requests.exceptions.ConnectionError):
        raise WebdownError(
            f"Connection error fetching {url}. Please check your internet connection.",
            code=ErrorCode.NETWORK_CONNECTION,
        )
    elif isinstance(exception, requests.exceptions.HTTPError):
        # Extract status code if available
        status_code = None
        if hasattr(exception, "response") and hasattr(
            exception.response, "status_code"
        ):
            status_code = exception.response.status_code

        status_msg = f" (Status code: {status_code})" if status_code else ""
        raise WebdownError(
            f"HTTP error fetching {url}{status_msg}. The server returned an error.",
            code=ErrorCode.HTTP_ERROR,
        )
    else:
        # Generic RequestException or any other exception
        raise WebdownError(
            f"Error fetching {url}: {str(exception)}",
            code=ErrorCode.REQUEST_ERROR,
        )

Exceptions

`WebdownError`

Bases: Exception

Exception for webdown errors.

This exception class is used for all errors raised by the webdown package. The error type is indicated by a descriptive message and an error code, allowing programmatic error handling.

Error types include:

- URL format errors: When the URL doesn't follow standard format
- Network errors: Connection issues, timeouts, HTTP errors
- Parsing errors: Issues with processing the HTML content
- Validation errors: Invalid parameters or configuration

Attributes:

Name	Type	Description
`code`	`str`	Error code for programmatic error handling

Source code in webdown/config.py

class WebdownError(Exception):
    """Exception for webdown errors.

    This exception class is used for all errors raised by the webdown package.
    The error type is indicated by a descriptive message and an error code,
    allowing programmatic error handling.

    Error types include:
        URL format errors: When the URL doesn't follow standard format
        Network errors: Connection issues, timeouts, HTTP errors
        Parsing errors: Issues with processing the HTML content
        Validation errors: Invalid parameters or configuration

    Attributes:
        code (str): Error code for programmatic error handling
    """

    def __init__(self, message: str, code: str = "UNEXPECTED_ERROR"):
        """Initialize a WebdownError.

        Args:
            message: Error message
            code: Error code for programmatic error handling
        """
        super().__init__(message)
        self.code = code

Functions¶

`__init__(message: str, code: str = 'UNEXPECTED_ERROR')` ¶

Initialize a WebdownError.

Parameters:

Name	Type	Description	Default
`message`	`str`	Error message	required
`code`	`str`	Error code for programmatic error handling	`'UNEXPECTED_ERROR'`

Source code in webdown/config.py

def __init__(self, message: str, code: str = "UNEXPECTED_ERROR"):
    """Initialize a WebdownError.

    Args:
        message: Error message
        code: Error code for programmatic error handling
    """
    super().__init__(message)
    self.code = code
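
Because every failure carries a `code`, callers can branch on the code instead of parsing the message text. The sketch below is illustrative only: it uses a local stand-in class that mirrors the listing above so it runs without webdown installed, and both the `fetch` helper and the code string `"NETWORK_TIMEOUT"` are assumptions, since the actual `ErrorCode` values are not shown in this section.

```python
class WebdownError(Exception):
    """Local stand-in mirroring the class shown above."""

    def __init__(self, message: str, code: str = "UNEXPECTED_ERROR"):
        super().__init__(message)
        self.code = code


def fetch(url: str) -> str:
    """Hypothetical helper that always fails, to exercise the handler."""
    raise WebdownError(f"Connection timed out fetching {url}.", code="NETWORK_TIMEOUT")


def describe_failure(url: str) -> str:
    """Pick a response from the error code rather than the message."""
    try:
        return fetch(url)
    except WebdownError as error:
        if error.code == "NETWORK_TIMEOUT":  # assumed code string
            return "retry later"
        return f"failed: {error}"


print(describe_failure("https://example.com"))
print(WebdownError("something broke").code)  # falls back to the default code
```

In real code the comparison would more likely use the package's `ErrorCode` constants than a string literal.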

Validation¶

Validation utilities for webdown.

This module provides centralized validation functions for various inputs used throughout the webdown package.

Functions¶

`validate_url(url: str) -> str` ¶

Validate a URL and return it if valid.

Parameters:

Name	Type	Description	Default
`url`	`str`	The URL to validate	required

Returns:

Type	Description
`str`	The validated URL

Raises:

Type	Description
`ValueError`	If the URL is invalid

Source code in webdown/validation.py

def validate_url(url: str) -> str:
    """Validate a URL and return it if valid.

    Args:
        url: The URL to validate

    Returns:
        The validated URL

    Raises:
        ValueError: If the URL is invalid
    """
    if not url:
        raise ValueError("URL cannot be empty")

    parsed = urllib.parse.urlparse(url)

    # Check if URL has a scheme and netloc
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(
            f"Invalid URL: {url}. URL must include scheme "
            f"(http:// or https://) and domain."
        )

    # Ensure scheme is http or https
    if parsed.scheme not in ["http", "https"]:
        raise ValueError(
            f"Invalid URL scheme: {parsed.scheme}. Only http and https are supported."
        )

    return url
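
The scheme and netloc checks rely on how `urllib.parse.urlparse` splits its input. The sketch below shows which branch a few sample strings would hit. The `classify` helper is hypothetical and only mirrors the two checks in the listing above; it is not part of webdown.

```python
from urllib.parse import urlparse

samples = [
    "https://example.com/page",  # scheme and netloc both present
    "example.com/page",          # no scheme: the whole string lands in path
    "http:///page",              # scheme present, netloc empty
    "ftp://example.com/file",    # well-formed, but not http or https
]


def classify(url: str) -> str:
    """Hypothetical helper mirroring the checks in validate_url."""
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        return "missing scheme or domain"
    if parsed.scheme not in ("http", "https"):
        return "unsupported scheme"
    return "ok"


for url in samples:
    print(f"{url!r}: {classify(url)}")
```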

`validate_css_selector(selector: str) -> str` ¶

Validate a CSS selector.

Parameters:

Name	Type	Description	Default
`selector`	`str`	The CSS selector to validate	required

Returns:

Type	Description
`str`	The validated CSS selector

Raises:

Type	Description
`ValueError`	If the selector is invalid

Source code in webdown/validation.py

def validate_css_selector(selector: str) -> str:
    """Validate a CSS selector.

    Args:
        selector: The CSS selector to validate

    Returns:
        The validated CSS selector

    Raises:
        ValueError: If the selector is invalid
    """
    if not selector:
        raise ValueError("CSS selector cannot be empty")

    # Simple validation - just create a soup and try using the selector.
    # Any exception raised for an invalid selector is converted to ValueError below.
    try:
        soup = BeautifulSoup("<html></html>", "html.parser")
        soup.select(selector)
        return selector
    except Exception as e:
        raise ValueError(f"Invalid CSS selector: {selector}. Error: {str(e)}")

`validate_body_width(width: Optional[int]) -> Optional[int]` ¶

Validate body width parameter.

Parameters:

Name	Type	Description	Default
`width`	`Optional[int]`	The body width to validate, or None for no width limit	required

Returns:

Type	Description
`Optional[int]`	The validated body width or None

Raises:

Type	Description
`ValueError`	If the width is invalid

Source code in webdown/validation.py

def validate_body_width(width: Optional[int]) -> Optional[int]:
    """Validate body width parameter.

    Args:
        width: The body width to validate, or None for no width limit

    Returns:
        The validated body width or None

    Raises:
        ValueError: If the width is invalid
    """
    if width is None:
        return None

    # Ensure width is an integer and within reasonable range
    if not isinstance(width, int):
        raise ValueError(f"Body width must be an integer, got {type(width).__name__}")

    if width < 0:
        raise ValueError(f"Body width cannot be negative, got {width}")

    # Upper limit of 2000 is arbitrary but reasonable
    if width > 2000:
        raise ValueError(f"Body width too large, maximum is 2000, got {width}")

    return width
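
Per the listing above, widths from 0 through 2000 inclusive are accepted. One subtlety worth knowing: `bool` is a subclass of `int` in Python, so `isinstance(True, int)` is true and a value such as `True` would pass the integer check as written. The sketch below demonstrates this. The `is_strict_int` helper is hypothetical, shown only as one way a caller could exclude booleans before validating.

```python
def is_strict_int(value: object) -> bool:
    """Hypothetical helper: True for int values that are not bool."""
    return isinstance(value, int) and not isinstance(value, bool)


print(isinstance(True, int))  # bool is a subclass of int
print(is_strict_int(True))    # the stricter check rejects booleans
print(is_strict_int(80))      # ordinary integers still pass
```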

`validate_numeric_parameter(name: str, value: Optional[int], min_value: Optional[int] = None, max_value: Optional[int] = None) -> Optional[int]` ¶

Validate a numeric parameter.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the parameter (for error messages)	required
`value`	`Optional[int]`	The value to validate	required
`min_value`	`Optional[int]`	Optional minimum value	`None`
`max_value`	`Optional[int]`	Optional maximum value	`None`

Returns:

Type	Description
`Optional[int]`	The validated value

Raises:

Type	Description
`ValueError`	If the value is invalid

Source code in webdown/validation.py

def validate_numeric_parameter(
    name: str,
    value: Optional[int],
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
) -> Optional[int]:
    """Validate a numeric parameter.

    Args:
        name: The name of the parameter (for error messages)
        value: The value to validate
        min_value: Optional minimum value
        max_value: Optional maximum value

    Returns:
        The validated value

    Raises:
        ValueError: If the value is invalid
    """
    if value is None:
        return None

    if not isinstance(value, int):
        raise ValueError(f"{name} must be an integer, got {type(value).__name__}")

    if min_value is not None and value < min_value:
        raise ValueError(f"{name} must be at least {min_value}, got {value}")

    if max_value is not None and value > max_value:
        raise ValueError(f"{name} must be at most {max_value}, got {value}")

    return value
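
As an illustration of how the generic range check can be reused, the sketch below expresses the same 0 to 2000 range that `validate_body_width` enforces in terms of `validate_numeric_parameter`. It uses a local copy of the function so it runs without webdown installed. The `check_body_width` wrapper is hypothetical, and its error messages differ in wording from those of `validate_body_width`.

```python
from typing import Optional


def validate_numeric_parameter(name, value, min_value=None, max_value=None):
    """Local copy of the listing above, so the sketch runs without webdown."""
    if value is None:
        return None
    if not isinstance(value, int):
        raise ValueError(f"{name} must be an integer, got {type(value).__name__}")
    if min_value is not None and value < min_value:
        raise ValueError(f"{name} must be at least {min_value}, got {value}")
    if max_value is not None and value > max_value:
        raise ValueError(f"{name} must be at most {max_value}, got {value}")
    return value


def check_body_width(width: Optional[int]) -> Optional[int]:
    """Hypothetical wrapper applying the 0..2000 range used for body width."""
    return validate_numeric_parameter("Body width", width, min_value=0, max_value=2000)


print(check_body_width(80))
print(check_body_width(None))  # None passes through untouched
try:
    check_body_width(2001)
except ValueError as error:
    print(error)
```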

`validate_string_parameter(name: str, value: Optional[str], allowed_values: Optional[list] = None) -> Optional[str]` ¶

Validate a string parameter.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the parameter (for error messages)	required
`value`	`Optional[str]`	The value to validate	required
`allowed_values`	`Optional[list]`	Optional list of allowed values	`None`

Returns:

Type	Description
`Optional[str]`	The validated value

Raises:

Type	Description
`ValueError`	If the value is invalid

Source code in webdown/validation.py

def validate_string_parameter(
    name: str, value: Optional[str], allowed_values: Optional[list] = None
) -> Optional[str]:
    """Validate a string parameter.

    Args:
        name: The name of the parameter (for error messages)
        value: The value to validate
        allowed_values: Optional list of allowed values

    Returns:
        The validated value

    Raises:
        ValueError: If the value is invalid
    """
    if value is None:
        return None

    if not isinstance(value, str):
        raise ValueError(f"{name} must be a string, got {type(value).__name__}")

    if allowed_values is not None and value not in allowed_values:
        allowed_str = ", ".join(allowed_values)
        raise ValueError(f"{name} must be one of: {allowed_str}, got {value}")

    return value
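
The error message is built with `", ".join(allowed_values)`, which requires every entry to be a string. Since the parameter is typed as a plain `list`, a non-string entry would surface as a `TypeError` from `join` rather than the documented `ValueError`. The sketch below shows the behaviour and one possible workaround. The list contents are hypothetical.

```python
allowed_values = ["markdown", "claude_xml", 3]  # hypothetical list with a non-string entry

try:
    ", ".join(allowed_values)
except TypeError as error:
    print(f"join failed: {error}")

# Converting each entry to str first avoids the TypeError.
allowed_str = ", ".join(map(str, allowed_values))
print(allowed_str)
```

Passing only strings in `allowed_values` sidesteps the issue entirely.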

`validate_boolean_parameter(name: str, value: Optional[bool]) -> Optional[bool]` ¶

Validate a boolean parameter.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the parameter (for error messages)	required
`value`	`Optional[bool]`	The value to validate	required

Returns:

Type	Description
`Optional[bool]`	The validated value

Raises:

Type	Description
`ValueError`	If the value is invalid

Source code in webdown/validation.py

def validate_boolean_parameter(name: str, value: Optional[bool]) -> Optional[bool]:
    """Validate a boolean parameter.

    Args:
        name: The name of the parameter (for error messages)
        value: The value to validate

    Returns:
        The validated value

    Raises:
        ValueError: If the value is invalid
    """
    if value is None:
        return None

    if not isinstance(value, bool):
        raise ValueError(f"{name} must be a boolean, got {type(value).__name__}")

    return value
