Index

Main Classes

Helper Classes

Errors

Methods

Properties

Constants

Errors

ErrorHandler

ErrorHandler:

ROTATE_PROXY_ERRORS

ROTATE_PROXY_ERRORS:

Content of proxy errors that should trigger a retry, as the proxy is likely getting blocked / is malfunctioning.

Methods

callback

  • callback(): None
  • An empty callback to force typer into making a CLI with a single command.


    Returns None

compute_short_hash

  • compute_short_hash(data, *, length): str
  • Computes a hexadecimal SHA-256 hash of the provided data and returns a substring (prefix) of it.


    Parameters

    • data: bytes
    • length: int = 8 (keyword-only)

    Returns str
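
    The description above can be illustrated with the standard library alone. This is a minimal sketch, not the library's implementation; the name short_hash_sketch is hypothetical.

```python
import hashlib


def short_hash_sketch(data: bytes, *, length: int = 8) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `data`."""
    # hexdigest() yields 64 hex characters; slicing keeps only the prefix.
    return hashlib.sha256(data).hexdigest()[:length]
```

    A shorter prefix gives a more compact key at the cost of a higher collision probability.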

compute_unique_key

  • compute_unique_key(url, method, payload, *, keep_url_fragment, use_extended_unique_key): str
  • Computes a unique key for caching & deduplication of requests.

    This function computes a unique key by normalizing the provided URL and method. If 'use_extended_unique_key' is True and a payload is provided, the payload is hashed and included in the key. Otherwise, the unique key is just the normalized URL.


    Parameters

    • url: str
    • method: str = 'GET'
    • payload: bytes | None = None
    • keep_url_fragment: bool = False (keyword-only)
    • use_extended_unique_key: bool = False (keyword-only)

    Returns str

compute_weighted_avg

  • compute_weighted_avg(values, weights): float
  • Computes a weighted average of an array of numbers, complemented by an array of weights.


    Parameters

    • values: list[float]
    • weights: list[float]

    Returns float

    Weighted average.
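
    A weighted average is the sum of each value times its weight, divided by the sum of the weights. The sketch below shows that formula; the name weighted_avg_sketch and the error handling for mismatched lengths or a zero total weight are assumptions, not documented behaviour.

```python
def weighted_avg_sketch(values: list[float], weights: list[float]) -> float:
    """Compute sum(v * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        # Assumed guard: the source does not say how mismatched inputs are handled.
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumed guard: avoid dividing by zero.
        raise ValueError("total weight must be non-zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```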

create

  • async create(project_name, template): None
  • Bootstrap a new Crawlee project.


    Parameters

    • project_name: Annotated[ Union[str, None], typer.Argument( help='The name of the project and the directory that will be created to contain it. ' 'If none is given, you will be prompted.' ), ] = None
    • template: Annotated[ Union[str, None], typer.Option(help='The template to be used to create the project. If none is given, you will be prompted.'), ] = None

    Returns None

create_dataset_from_directory

  • create_dataset_from_directory(storage_directory, memory_storage_client, id, name): DatasetClient
  • Parameters

    • storage_directory: str
    • memory_storage_client: MemoryStorageClient
    • id: str | None = None
    • name: str | None = None

    Returns DatasetClient

create_kvs_from_directory

  • create_kvs_from_directory(storage_directory, memory_storage_client, id, name): KeyValueStoreClient
  • Parameters

    • storage_directory: str
    • memory_storage_client: MemoryStorageClient
    • id: str | None = None
    • name: str | None = None

    Returns KeyValueStoreClient

create_rq_from_directory

  • create_rq_from_directory(storage_directory, memory_storage_client, id, name): RequestQueueClient
  • Parameters

    • storage_directory: str
    • memory_storage_client: MemoryStorageClient
    • id: str | None = None
    • name: str | None = None

    Returns RequestQueueClient

crypto_random_object_id

  • crypto_random_object_id(length): str
  • Generates a random object ID.


    Parameters

    • length: int = 17

    Returns str
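
    One common way to generate such an ID is to draw characters from a cryptographically secure source. The sketch below assumes an alphanumeric alphabet, which the source does not specify; the names ALPHABET and random_object_id_sketch are hypothetical.

```python
import secrets
import string

# Assumed alphabet: ASCII letters and digits.
ALPHABET = string.ascii_letters + string.digits


def random_object_id_sketch(length: int = 17) -> str:
    """Build a random ID of `length` characters using the `secrets` module."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```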

determine_file_extension

  • determine_file_extension(content_type): str | None
  • Determine the file extension for a given MIME content type.


    Parameters

    • content_type: str

    Returns str | None
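
    A lookup of this kind can be approximated with the standard mimetypes module. This is a sketch under assumptions: the name file_extension_sketch is hypothetical, and whether the documented function returns the extension with or without a leading dot is not stated in the source (mimetypes includes the dot).

```python
from __future__ import annotations

import mimetypes


def file_extension_sketch(content_type: str) -> str | None:
    """Guess a file extension for a MIME type, or return None if unknown."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime_type)
```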

filter_out_none_values_recursively

  • filter_out_none_values_recursively(dictionary, *, remove_empty_dicts): dict | None
  • Recursively filters out None values from a dictionary.


    Parameters

    • dictionary: dict
    • remove_empty_dicts: bool = False (keyword-only)

    Returns dict | None
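
    The recursion can be sketched as follows. The name filter_none_sketch is hypothetical, and returning None for a dictionary that ends up empty when remove_empty_dicts is set is an assumption inferred from the dict | None return type.

```python
from __future__ import annotations


def filter_none_sketch(dictionary: dict, *, remove_empty_dicts: bool = False) -> dict | None:
    """Return a copy of `dictionary` without None values, descending into nested dicts."""
    result = {}
    for key, value in dictionary.items():
        if isinstance(value, dict):
            # Nested dicts are filtered first; they may collapse to None (assumed).
            value = filter_none_sketch(value, remove_empty_dicts=remove_empty_dicts)
        if value is not None:
            result[key] = value
    if remove_empty_dicts and not result:
        return None
    return result
```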

find_or_create_client_by_id_or_name_inner

  • find_or_create_client_by_id_or_name_inner(resource_client_class, memory_storage_client, id, name): TResourceClient | None
  • Locates or creates a new storage client based on the given ID or name.

    This method attempts to find a storage client in the memory cache first. If not found, it tries to locate a storage directory by name. If still not found, it searches through storage directories for a matching ID or name in their metadata. If none exists, and the specified ID is 'default', it checks for a default storage directory. If a storage client is found or created, it is added to the memory cache. If no storage client can be located or created, the method returns None.


    Parameters

    • resource_client_class: type[TResourceClient]
    • memory_storage_client: MemoryStorageClient
    • id: str | None = None
    • name: str | None = None

    Returns TResourceClient | None

force_remove

  • async force_remove(filename): None
  • Removes a file, suppressing the FileNotFoundError if it does not exist.

    JS-like rm(filename, { force: true }).


    Parameters

    • filename: str

    Returns None
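
    A minimal sketch of the same idea, assuming the blocking removal is delegated to a worker thread; the name force_remove_sketch is hypothetical and the real function may perform the I/O differently.

```python
import asyncio
import contextlib
import os


async def force_remove_sketch(filename: str) -> None:
    """Delete `filename`, doing nothing if it does not exist."""
    # suppress() swallows only FileNotFoundError; other errors still propagate.
    with contextlib.suppress(FileNotFoundError):
        await asyncio.to_thread(os.remove, filename)
```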

force_rename

  • async force_rename(src_dir, dst_dir): None
  • Renames a directory, ensuring that the destination directory is removed if it exists.


    Parameters

    • src_dir: str
    • dst_dir: str

    Returns None

get_cpu_info

  • get_cpu_info(): CpuInfo
  • Retrieves the current CPU usage.

    It utilizes the psutil library. Function psutil.cpu_percent() returns a float representing the current system-wide CPU utilization as a percentage.


    Returns CpuInfo

get_memory_info

  • get_memory_info(): MemoryInfo
  • Retrieves the current memory usage of the process and its children.

    It utilizes the psutil library.


    Returns MemoryInfo

get_or_create_inner

  • async get_or_create_inner(*, memory_storage_client, storage_client_cache, resource_client_class, name, id): TResourceClient
  • Retrieve a named storage, or create a new one when it doesn't exist.


    Parameters

    • memory_storage_client: MemoryStorageClient (keyword-only)
    • storage_client_cache: list[TResourceClient] (keyword-only)
    • resource_client_class: type[TResourceClient] (keyword-only)
    • name: str | None = None (keyword-only)
    • id: str | None = None (keyword-only)

    Returns TResourceClient

is_content_type

  • is_content_type(content_type_enum, content_type): bool
  • Check if the provided content type string matches the specified ContentType.


    Parameters

    • content_type_enum: ContentType
    • content_type: str

    Returns bool

is_file_or_bytes

  • is_file_or_bytes(value): bool
  • Determine if the input value is a file-like object or bytes.

    This function checks whether the provided value is an instance of bytes, bytearray, or io.IOBase (file-like). The method is simplified for common use cases and may not cover all edge cases.


    Parameters

    • value: Any

    Returns bool
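
    The check described above amounts to a single isinstance call. A sketch, with the hypothetical name is_file_or_bytes_sketch:

```python
import io
from typing import Any


def is_file_or_bytes_sketch(value: Any) -> bool:
    """Return True for bytes, bytearray, or io.IOBase instances."""
    return isinstance(value, (bytes, bytearray, io.IOBase))
```

    As the description notes, this misses file-like objects that do not inherit from io.IOBase.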

json_dumps

  • async json_dumps(obj): str
  • Serialize an object to a JSON-formatted string with specific settings.


    Parameters

    • obj: Any

    Returns str

maybe_extract_enum_member_value

  • maybe_extract_enum_member_value(maybe_enum_member): Any
  • Extract the value of an enumeration member if it is an Enum, otherwise return the original value.


    Parameters

    • maybe_enum_member: Any

    Returns Any
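
    A sketch of the behaviour described, using the hypothetical name extract_enum_value_sketch:

```python
from enum import Enum
from typing import Any


def extract_enum_value_sketch(maybe_enum_member: Any) -> Any:
    """Return `.value` for Enum members and the input unchanged otherwise."""
    if isinstance(maybe_enum_member, Enum):
        return maybe_enum_member.value
    return maybe_enum_member
```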

maybe_parse_body

  • maybe_parse_body(body, content_type): Any
  • Parse the response body based on the content type.


    Parameters

    • body: bytes
    • content_type: str

    Returns Any

measure_time

  • measure_time(): Iterator[TimerResult]
  • Measure the execution time (wall-clock and CPU) between the start and end of the with-block.


    Returns Iterator[TimerResult]
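
    A context manager of this kind can be sketched with contextlib. The fields of TimerResult are not listed in the source, so the TimerResultSketch dataclass and its wall and cpu attributes below are assumptions, as is the use of time.process_time for the CPU measurement.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TimerResultSketch:
    """Hypothetical result holder; populated when the with-block exits."""

    wall: Optional[float] = None
    cpu: Optional[float] = None


@contextmanager
def measure_time_sketch() -> Iterator[TimerResultSketch]:
    result = TimerResultSketch()
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    try:
        yield result
    finally:
        # Fill in the elapsed times even if the block raised.
        result.wall = time.monotonic() - wall_start
        result.cpu = time.process_time() - cpu_start
```

    The result object is yielded before the timings exist, so its values should be read only after the with-block has finished.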

normalize_url

  • normalize_url(url, *, keep_url_fragment): str
  • Normalizes a URL.

    This function cleans and standardizes a URL by removing leading and trailing whitespace, converting the scheme and netloc to lower case, stripping unwanted tracking parameters (specifically those beginning with 'utm_'), sorting the remaining query parameters alphabetically, and optionally retaining the URL fragment. The goal is to ensure that URLs that are functionally identical but differ in trivial ways (such as parameter order or casing) are treated as the same.


    Parameters

    • url: str
    • keep_url_fragment: bool = False (keyword-only)

    Returns str
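
    The normalization steps listed above can be sketched with urllib.parse. This is an illustration of those steps, not the library's code; the name normalize_url_sketch is hypothetical, and details such as the handling of blank parameter values are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def normalize_url_sketch(url: str, *, keep_url_fragment: bool = False) -> str:
    """Normalize a URL so that trivially different spellings compare equal."""
    parsed = urlparse(url.strip())
    # Drop tracking parameters whose names begin with "utm_".
    params = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if not key.startswith("utm_")
    ]
    # Sort the remaining parameters alphabetically and re-encode them.
    query = urlencode(sorted(params))
    fragment = parsed.fragment if keep_url_fragment else ""
    return urlunparse(
        (parsed.scheme.lower(), parsed.netloc.lower(), parsed.path, parsed.params, query, fragment)
    )
```

    With this sketch, "  HTTPS://Example.com/path?b=2&utm_source=x&a=1#frag " normalizes to "https://example.com/path?a=1&b=2".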

open_storage

  • async open_storage(*, storage_class, configuration, id, name): TResource
  • Open either a new storage or restore an existing one and return it.


    Parameters

    • storage_class: type[TResource] (keyword-only)
    • configuration: Configuration | None = None (keyword-only)
    • id: str | None = None (keyword-only)
    • name: str | None = None (keyword-only)

    Returns TResource

persist_metadata_if_enabled

  • async persist_metadata_if_enabled(*, data, entity_directory, write_metadata): None
  • Updates or writes metadata to a specified directory.

    The function writes a given metadata dictionary to a JSON file within a specified directory. The writing process is skipped if write_metadata is False. Before writing, it ensures that the target directory exists, creating it if necessary.


    Parameters

    • data: dict (keyword-only)
    • entity_directory: str (keyword-only)
    • write_metadata: bool (keyword-only)

    Returns None
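
    The flow described above (skip when disabled, ensure the directory exists, write JSON) might look like the sketch below. The file name in METADATA_FILENAME_SKETCH is a placeholder, not the value of the library's METADATA_FILENAME constant, and the JSON formatting options are assumptions.

```python
import asyncio
import json
import os

# Placeholder file name for illustration only.
METADATA_FILENAME_SKETCH = "metadata.json"


async def persist_metadata_sketch(*, data: dict, entity_directory: str, write_metadata: bool) -> None:
    """Write `data` as JSON into `entity_directory` unless writing is disabled."""
    if not write_metadata:
        return

    def _write() -> None:
        # Create the target directory if it is missing, then write the file.
        os.makedirs(entity_directory, exist_ok=True)
        path = os.path.join(entity_directory, METADATA_FILENAME_SKETCH)
        with open(path, "w", encoding="utf-8") as file:
            json.dump(data, file, ensure_ascii=False, indent=2)

    # Run the blocking file I/O in a worker thread.
    await asyncio.to_thread(_write)
```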

raise_on_duplicate_storage

  • raise_on_duplicate_storage(client_type, key_name, value): NoReturn
  • Raise an error indicating that a storage with the provided key name and value already exists.


    Parameters

    • client_type: StorageTypes
    • key_name: str
    • value: str

    Returns NoReturn

raise_on_non_existing_storage

  • raise_on_non_existing_storage(client_type, id): NoReturn
  • Raise an error indicating that a storage with the provided id does not exist.


    Parameters

    • client_type: StorageTypes
    • id: str | None

    Returns NoReturn

remove_storage_from_cache

  • remove_storage_from_cache(*, storage_class, id, name): None
  • Remove a storage from cache by ID or name.


    Parameters

    • storage_class: type (keyword-only)
    • id: str | None = None (keyword-only)
    • name: str | None = None (keyword-only)

    Returns None

run_async

  • run_async(func): Callable
  • Decorates a coroutine function so that it is run with asyncio.run.


    Parameters

    • func: Callable[..., Coroutine]

    Returns Callable
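
    A decorator with this behaviour can be sketched in a few lines. The name run_async_sketch is hypothetical.

```python
import asyncio
import functools
from typing import Any, Callable, Coroutine


def run_async_sketch(func: Callable[..., Coroutine[Any, Any, Any]]) -> Callable[..., Any]:
    """Wrap a coroutine function so that calling it runs it to completion."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
        return asyncio.run(func(*args, **kwargs))

    return wrapper
```

    Because asyncio.run cannot be called from a running event loop, a function wrapped this way is suitable for synchronous entry points such as CLI commands.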

unique_key_to_request_id

  • unique_key_to_request_id(unique_key, *, request_id_length): str
  • Generate a deterministic request ID based on a unique key.


    Parameters

    • unique_key: str
    • request_id_length: int = 15 (keyword-only)

    Returns str
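
    The source does not describe how the ID is derived. One plausible scheme, shown here purely as an assumption, hashes the key, Base64-encodes the digest, keeps only alphanumeric characters, and truncates the result. The name request_id_sketch is hypothetical.

```python
import base64
import hashlib
import re


def request_id_sketch(unique_key: str, *, request_id_length: int = 15) -> str:
    """Derive a deterministic, alphanumeric ID from `unique_key` (assumed scheme)."""
    digest = hashlib.sha256(unique_key.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    # Remove '+', '/', and '=' so that only letters and digits remain.
    alphanumeric = re.sub(r"[^A-Za-z0-9]", "", encoded)
    return alphanumeric[:request_id_length]
```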

wait_for

  • async wait_for(operation, *, timeout, timeout_message, max_retries, logger): T
  • Wait for an async operation to complete.

    If the wait times out, TimeoutError is raised and the future is cancelled. Optionally retry on error.


    Parameters

    • operation: Callable[[], Awaitable[T]]
    • timeout: timedelta (keyword-only)
    • timeout_message: str | None = None (keyword-only)
    • max_retries: int = 1 (keyword-only)
    • logger: Logger (keyword-only)

    Returns T
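
    The behaviour described above can be approximated with asyncio.wait_for. In this sketch, a timeout is raised immediately and only other exceptions are retried; that split, the default message, and the name wait_for_sketch are assumptions rather than documented behaviour.

```python
import asyncio
import logging
from datetime import timedelta
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def wait_for_sketch(
    operation: Callable[[], Awaitable[T]],
    *,
    timeout: timedelta,
    timeout_message: Optional[str] = None,
    max_retries: int = 1,
    logger: logging.Logger,
) -> T:
    """Await `operation()` with a timeout, retrying on non-timeout errors."""
    for attempt in range(1, max_retries + 1):
        try:
            # asyncio.wait_for cancels the awaited operation when the timeout expires.
            return await asyncio.wait_for(operation(), timeout=timeout.total_seconds())
        except asyncio.TimeoutError as exc:
            raise TimeoutError(timeout_message or "Operation timed out") from exc
        except Exception as exc:
            if attempt == max_retries:
                raise
            logger.warning("Attempt %d of %d failed: %r", attempt, max_retries, exc)
    raise ValueError("max_retries must be at least 1")
```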

wait_for_all_tasks_for_finish

  • async wait_for_all_tasks_for_finish(tasks, *, logger, timeout): None
  • Wait for all tasks to finish or until the timeout is reached.


    Parameters

    • tasks: Sequence[asyncio.Task]
    • logger: Logger (keyword-only)
    • timeout: timedelta | None = None (keyword-only)

    Returns None

Properties

__version__

__version__:

AsyncListener

AsyncListener:

cli

cli:

CLOUDFLARE_RETRY_CSS_SELECTORS

CLOUDFLARE_RETRY_CSS_SELECTORS:

CreateSessionFunctionType

CreateSessionFunctionType:

EventData

EventData:

FailedRequestHandler

FailedRequestHandler:

JSONSerializable

JSONSerializable:

Listener

Listener:

logger

logger:

METADATA_FILENAME

METADATA_FILENAME:

RequestHandler

RequestHandler:

ResourceClient

ResourceClient:

ResourceCollectionClient

ResourceCollectionClient:

RETRY_CSS_SELECTORS

RETRY_CSS_SELECTORS:

CSS selectors for elements that should trigger a retry, as the crawler is likely getting blocked.

Snapshot

Snapshot:

SyncListener

SyncListener:

T

T:

TCrawlingContext

TCrawlingContext:

TEMPLATE_LIST_URL

TEMPLATE_LIST_URL:

timedelta_ms

timedelta_ms:

TMiddlewareCrawlingContext

TMiddlewareCrawlingContext:

TResource

TResource:

TResourceClient

TResourceClient:

TStatisticsState

TStatisticsState:

ValueType

ValueType:

WrappedListener

WrappedListener: