📄️ Add data to dataset
This example demonstrates how to store extracted data into datasets using the context.push_data() helper function. If the specified dataset does not already exist, it will be created automatically. Additionally, you can save data to custom datasets by providing dataset_id or dataset_name parameters to the push_data method.
📄️ BeautifulSoup crawler
This example demonstrates how to use BeautifulSoupCrawler to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the BeautifulSoup library and extract some data from it - the page title and all `<h1>`, `<h2>` and `<h3>` tags. This setup is perfect for scraping specific elements from web pages. Thanks to the well-known BeautifulSoup, you can easily navigate the HTML structure and retrieve the data you need with minimal code.
📄️ Capture screenshots using Playwright
This example demonstrates how to capture screenshots of web pages using PlaywrightCrawler and store them in the key-value store.
📄️ Crawl all links on website
This example uses the enqueue_links() helper to add new links to the RequestQueue as the crawler navigates from page to page. By automatically discovering and enqueuing all links on a given page, the crawler can systematically scrape an entire website. This approach is ideal for web scraping tasks where you need to collect data from multiple interconnected pages.
📄️ Crawl multiple URLs
This example demonstrates how to crawl a specified list of URLs using different crawlers. You'll learn how to set up the crawler, define a request handler, and run the crawler with multiple URLs. This setup is useful for scraping data from multiple pages or websites concurrently.
📄️ Crawl specific links on website
This example demonstrates how to crawl a website while targeting specific patterns of links. By utilizing the enqueue_links() helper, you can pass include or exclude parameters to improve your crawling strategy. This approach ensures that only the links matching the specified patterns are added to the RequestQueue. Both include and exclude support lists of globs or regular expressions. This functionality is great for focusing on relevant sections of a website and avoiding scraping unnecessary or irrelevant content.
📄️ Crawl website with relative links
When crawling a website, you may encounter various types of links that you wish to include in your crawl. To facilitate this, we provide the enqueue_links() method on the crawler context, which will automatically find and add these links to the crawler's RequestQueue. This method simplifies the process of handling different types of links, including relative links, by automatically resolving them based on the page's context.
📄️ Export entire dataset to file
This example demonstrates how to use the export_data() method of the crawler to export the entire default dataset to a single file. This method supports exporting data in either CSV or JSON format.
📄️ Playwright crawler
This example demonstrates how to use PlaywrightCrawler to recursively scrape the Hacker News website using headless Chromium and Playwright.