Refactoring

It may seem that the data is extracted and the crawler is done, but honestly, this is just the beginning. For the sake of brevity, we've completely omitted error handling, proxies, logging, architecture, tests, documentation, and other things that reliable software should have. The good news is that error handling is mostly taken care of by Crawlee itself, so there's nothing to worry about on that front unless you need some custom magic.
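
If you ever do need that custom magic, Crawlee lets you plug in your own handling for requests that keep failing. The snippet below is only a rough sketch, registered on the crawler instance (for example in main.py, after the crawler is created) and assuming a failed_request_handler decorator; check the API reference for the exact name and signature in your Crawlee version.

@crawler.failed_request_handler
async def handle_failed_request(context, error) -> None:
    # Called once a request has exhausted all of its retries.
    # The context gives access to the request and the logger, just like in route handlers.
    context.log.error(f'{context.request.url} failed permanently: {error}')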

Navigating automatic bot-protection avoidance

You might be wondering about the anti-blocking and bot-protection avoidance features and why we haven't highlighted them yet. The reason is straightforward: these features are part of the default configuration, so you get a smooth start without any manual adjustments.

To promote good coding practices, let's look at how you can use a Router class to better structure your crawler code.

Request routing

In the following code, we've made several changes:

  • Split the code into multiple files.
  • Added a custom instance of Router to make our routing cleaner, without if clauses.
  • Moved route definitions to a separate routes.py file.
  • Simplified the main.py file to focus on the general structure of the crawler.

Routes file

First, let's define our routes in a separate file:

src/routes.py
from crawlee.basic_crawler import Router
from crawlee.playwright_crawler import PlaywrightCrawlingContext

router = Router[PlaywrightCrawlingContext]()


@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # This is a fallback route which will handle the start URL.
    context.log.info(f'default_handler is processing {context.request.url}')

    await context.page.wait_for_selector('.collection-block-item')

    await context.enqueue_links(
        selector='.collection-block-item',
        label='CATEGORY',
    )


@router.handler('CATEGORY')
async def category_handler(context: PlaywrightCrawlingContext) -> None:
    # This replaces the context.request.label == CATEGORY branch of the if clause.
    context.log.info(f'category_handler is processing {context.request.url}')

    await context.page.wait_for_selector('.product-item > a')

    await context.enqueue_links(
        selector='.product-item > a',
        label='DETAIL',
    )

    next_button = await context.page.query_selector('a.pagination__next')

    if next_button:
        await context.enqueue_links(
            selector='a.pagination__next',
            label='CATEGORY',
        )


@router.handler('DETAIL')
async def detail_handler(context: PlaywrightCrawlingContext) -> None:
    # This replaces the context.request.label == DETAIL branch of the if clause.
    context.log.info(f'detail_handler is processing {context.request.url}')

    url_part = context.request.url.split('/').pop()
    manufacturer = url_part.split('-')[0]

    title = await context.page.locator('.product-meta h1').text_content()

    sku = await context.page.locator('span.product-meta__sku-number').text_content()

    price_element = context.page.locator('span.price', has_text='$').first
    current_price_string = await price_element.text_content() or ''
    raw_price = current_price_string.split('$')[1]
    price = float(raw_price.replace(',', ''))

    in_stock_element = context.page.locator(
        selector='span.product-form__inventory',
        has_text='In stock',
    ).first
    in_stock = await in_stock_element.count() > 0

    data = {
        'manufacturer': manufacturer,
        'title': title,
        'sku': sku,
        'price': price,
        'in_stock': in_stock,
    }

    await context.push_data(data)

Main file

Next, our main file becomes much simpler and cleaner:

src/main.py
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler

from .routes import router


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Let's limit our crawls to make our tests shorter and safer.
        max_requests_per_crawl=50,
        # Provide our router instance to the crawler.
        request_handler=router,
    )

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])


if __name__ == '__main__':
    asyncio.run(main())
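
One thing to keep in mind: main.py imports the router with a relative import (from .routes import router), so it has to run as part of a package. Depending on how your project is set up, that means starting the crawler with something like python -m src.main from the project root, or through whatever entry point your project template generated, rather than running the file directly.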

By structuring your code this way, you achieve better separation of concerns, making the code easier to read, manage and extend. The Router class keeps your routing logic clean and modular, replacing if clauses with function decorators.

Summary

Refactoring your crawler code with these practices enhances readability, maintainability, and scalability.

Splitting your code into multiple files

There's no reason not to split your code into multiple files and keep your logic separate. Less code in a single file means less complexity to handle at any time, which improves overall readability and maintainability. Consider further splitting the routes into separate files for even better organization.
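
For example, one way to take this further (the file names here are just illustrative) is to keep the shared router instance in its own module and give each group of handlers its own file:

src/router.py
from crawlee.basic_crawler import Router
from crawlee.playwright_crawler import PlaywrightCrawlingContext

# A single shared Router instance that every handler module imports.
router = Router[PlaywrightCrawlingContext]()

src/detail_routes.py
from crawlee.playwright_crawler import PlaywrightCrawlingContext

from .router import router


@router.handler('DETAIL')
async def detail_handler(context: PlaywrightCrawlingContext) -> None:
    # Same body as the detail_handler shown above.
    ...

The one thing to watch out for is that a handler is only registered when its module is imported, so main.py (or a package __init__.py) has to import every handler module before the crawler starts.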

Using a router to structure your crawling

Initially, using a simple if / else statement for selecting different logic based on the crawled pages might appear more readable. However, this approach can become cumbersome with more than two types of pages, especially when the logic for each page extends over dozens or even hundreds of lines of code.
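
For reference, the single-handler version that this refactoring replaces looks roughly like this (a condensed sketch; the branch bodies are the handlers shown above):

async def request_handler(context: PlaywrightCrawlingContext) -> None:
    if context.request.label == 'CATEGORY':
        ...  # category listing logic
    elif context.request.label == 'DETAIL':
        ...  # product detail extraction logic
    else:
        ...  # fallback logic for the start URL

With two or three page types this is still readable, but every new label adds another branch to the same ever-growing function.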

It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand-line request_handler(), where everything interacts with everything else and any variable can be used anywhere, is no fun and a pain to debug. That's why we prefer to separate routes into their own files.