Automatic Streaming for Large Web Pages¶
Webdown includes built-in support for handling large web pages through an automatic streaming mechanism. This guide explains how streaming works and when it's activated.
Understanding Streaming Mode¶
For most web pages, Webdown downloads the entire HTML content at once before processing it. However, for very large web pages, this approach could use excessive memory and potentially cause issues on systems with limited resources.
To address this, Webdown automatically switches to a streaming mode when handling large web pages.
The 10MB Threshold¶
Webdown uses a fixed 10MB threshold to determine when to activate streaming mode:
- When requesting a web page, Webdown first checks its size
- If the page size is less than 10MB, the entire page is downloaded at once
- If the page size is 10MB or larger, streaming mode is activated
How Streaming Works¶
When streaming mode is activated:
- Webdown downloads the page in chunks rather than all at once
- Each chunk is processed as it's received
- Progress is reported based on the amount of data downloaded
- Memory usage remains controlled even for extremely large pages
The streaming implementation uses Python's requests library with stream=True
and processes the response in iterations.
Progress Reporting in Streaming Mode¶
When streaming large pages, you'll notice the progress bar behaves slightly differently:
- For regular downloads, the progress bar shows both the download and processing phases
- In streaming mode, the progress bar primarily reflects the download progress
Example terminal output for a streaming download:
Downloading and processing https://example.com/largepage
[████████████████████] 100% | 15.8 MB | 00:04 remaining
Converting to Markdown... Done!
Technical Implementation¶
The streaming functionality is implemented through these key components:
- Size checking: Before downloading, Webdown sends a HEAD request to determine the content size
- Threshold decision: Based on the 10MB threshold, Webdown decides whether to stream
- Chunked download: For streaming, content is downloaded in reasonable chunks (8KB by default)
- Incremental processing: Each chunk is decoded and added to the processing buffer
This implementation balances memory efficiency with processing speed.
Common Questions¶
Why is 10MB the fixed threshold?¶
The 10MB threshold was chosen based on: - Testing with a wide range of web pages - Balancing memory usage and performance - Ensuring compatibility with most systems
This threshold works well for the vast majority of web pages while protecting against memory issues with extremely large pages.
Does streaming affect the output quality?¶
No. The streaming mode only changes how the content is downloaded and processed internally. The final Markdown or Claude XML output is identical regardless of whether streaming was used.
Is streaming slower than regular downloading?¶
Streaming can be slightly slower for pages just above the threshold because of the additional processing overhead. However, for very large pages, streaming is often more efficient and prevents potential out-of-memory errors.
Can I disable streaming?¶
No, the streaming feature is automatically activated based on the page size and cannot be disabled. This design decision ensures Webdown remains robust when handling pages of any size.
Use Cases for Large Page Handling¶
The automatic streaming capability makes Webdown suitable for:
- Documentation sites: Handling large technical documentation
- Academic papers: Processing lengthy scientific publications
- Legal documents: Converting extensive legal texts
- Data-heavy reports: Working with pages containing large tables or datasets
Limitations¶
While streaming handles large pages well, be aware of these limitations:
- Progress reporting is less precise in streaming mode
- For extremely large pages (100MB+), processing may still take significant time
- Some websites might not report their content size correctly, affecting the streaming decision