Skip to main content

Environment Variables

The following is a list of the environment variables used by Crawlee that are available to the user. Crawlee is capable of running without any env vars present, but certain features will only become available after env vars are properly set.

Important env vars

The following environment variables have large impact on the way Crawlee works and its behavior can be changed significantly by setting or unsetting them.

CRAWLEE_STORAGE_DIR

Defines the path to a local directory where KeyValueStore, Dataset, and RequestQueue store their data. By default, it is set to ./crawlee_storage.

CRAWLEE_DEFAULT_DATASET_ID

The default dataset has ID default, unless we override it by setting the CRAWLEE_DEFAULT_DATASET_ID environment variable.

CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID

The default key-value store has ID default, unless we override it by setting the CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID environment variable.

CRAWLEE_DEFAULT_REQUEST_QUEUE_ID

The default request queue has ID default, unless we override it by setting the CRAWLEE_DEFAULT_REQUEST_QUEUE_ID environment variable.

CRAWLEE_PURGE_ON_START

If set to false - local storage directories would not be purged automatically at the start of the crawler run or before opening of some storage explicitly (e.g. via Dataset.open()). Useful if we're trying e.g. to add more items to dataset with each next run (and keep the previously saved/scraped items). Enabled by default.

Convenience env vars

The next group includes env vars that can help achieve certain goals without having to change our code, such as temporarily switching log level to DEBUG or enabling verbose logging for errors.

CRAWLEE_HEADLESS

If set to 1, web browsers launched by Crawlee will run in the headless mode. We can still override this setting in the code, e.g. by passing the headless: true option to the launchPuppeteer() function. By default, the browsers are launched in headful mode, i.e. with windows.

CRAWLEE_LOG_LEVEL

Specifies the minimum log level, which can be one of the following values (in order of severity): DEBUG, INFO, WARNING and ERROR. By default, the log level is set to INFO, which means that DEBUG messages are not printed to console. See the utils.log namespace for logging utilities.

CRAWLEE_VERBOSE_LOG

Enables verbose logging if set to true. If not explicitly set to true - for errors thrown from inside request handler a warning with only error message will be logged as long as we know the request will be retried. Same applies to some known errors (such as timeout errors). Disabled by default.

CRAWLEE_MEMORY_MBYTES

Sets the amount of system memory in megabytes to be used by the AutoscaledPool. It is used to limit the number of concurrently running tasks. By default, the max amount of memory to be used is set to one quarter of total system memory, i.e. on a system with 8192 MB of memory, the autoscaling feature will only use up to 2048 MB of memory.