Environment Variables
The following is a list of the environment variables used by Crawlee that are available to the user. Crawlee is capable of running without any env vars present, but certain features will only become available after env vars are properly set.
Important env vars
The following environment variables have a large impact on the way Crawlee works, and its behavior can be changed significantly by setting or unsetting them.
CRAWLEE_STORAGE_DIR
Defines the path to a local directory where KeyValueStore, Dataset, and RequestQueue store their data. By default, it is set to ./crawlee_storage.
CRAWLEE_DEFAULT_DATASET_ID
The default dataset has ID default, unless we override it by setting the CRAWLEE_DEFAULT_DATASET_ID environment variable.
CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID
The default key-value store has ID default, unless we override it by setting the CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID environment variable.
CRAWLEE_DEFAULT_REQUEST_QUEUE_ID
The default request queue has ID default, unless we override it by setting the CRAWLEE_DEFAULT_REQUEST_QUEUE_ID environment variable.
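As a rough illustration of how the storage directory and the three default IDs combine, the sketch below resolves one path per default storage from a plain map of environment variables. The helper name resolveDefaultStoragePaths and the subdirectory names (datasets, key_value_stores, request_queues) are assumptions made for this example, not something this page specifies, and paths are joined with a simple forward slash.

```typescript
// A plain map of environment variables; in Node.js this could be process.env.
type Env = Record<string, string | undefined>;

interface DefaultStoragePaths {
    dataset: string;
    keyValueStore: string;
    requestQueue: string;
}

// Hypothetical helper (not part of Crawlee): shows how the documented
// defaults apply when a variable is unset or empty.
function resolveDefaultStoragePaths(env: Env): DefaultStoragePaths {
    const storageDir = env['CRAWLEE_STORAGE_DIR'] || './crawlee_storage';
    const datasetId = env['CRAWLEE_DEFAULT_DATASET_ID'] || 'default';
    const keyValueStoreId = env['CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID'] || 'default';
    const requestQueueId = env['CRAWLEE_DEFAULT_REQUEST_QUEUE_ID'] || 'default';

    // Assumption: each storage type lives in its own subdirectory.
    return {
        dataset: `${storageDir}/datasets/${datasetId}`,
        keyValueStore: `${storageDir}/key_value_stores/${keyValueStoreId}`,
        requestQueue: `${storageDir}/request_queues/${requestQueueId}`,
    };
}
```

Calling the helper with an empty map yields the documented defaults, while setting, for example, only CRAWLEE_DEFAULT_DATASET_ID changes the dataset path and leaves the other two untouched.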
CRAWLEE_PURGE_ON_START
If set to false, local storage directories are not purged automatically at the start of the crawler run or before a storage is explicitly opened (e.g. via Dataset.open()). This is useful if we want, for example, to add more items to a dataset with each run while keeping the previously scraped items. Purging is enabled by default.
Convenience env vars
The next group includes env vars that can help achieve certain goals without having to change our code, such as temporarily switching log level to DEBUG or enabling verbose logging for errors.
CRAWLEE_HEADLESS
If set to 1, web browsers launched by Crawlee will run in headless mode. We can still override this setting in the code, e.g. by passing the headless: true option to the launchPuppeteer() function. By default, the browsers are launched in headful mode, i.e. with windows.
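The precedence described above (an explicit option in code first, then the environment variable, then the headful default) can be sketched as a small pure function. The name resolveHeadless is hypothetical and this is not Crawlee's actual implementation; it only mirrors the rules stated in this section.

```typescript
// A plain map of environment variables; in Node.js this could be process.env.
type Env = Record<string, string | undefined>;

// Hypothetical helper: decides whether a browser should run headless.
function resolveHeadless(env: Env, explicitOption?: boolean): boolean {
    // An option passed in code always wins over the environment.
    if (explicitOption !== undefined) {
        return explicitOption;
    }
    // Only the value "1" enables headless mode; anything else,
    // including an unset variable, falls back to headful.
    return env['CRAWLEE_HEADLESS'] === '1';
}
```

Keeping the decision in one function makes the override order easy to check: the environment variable only matters when the code does not specify a mode itself.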
CRAWLEE_LOG_LEVEL
Specifies the minimum log level, which can be one of the following values (in order of increasing severity): DEBUG, INFO, WARNING and ERROR. By default, the log level is set to INFO, which means that DEBUG messages are not printed to the console. See the utils.log namespace for logging utilities.
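To make the idea of a minimum log level concrete, here is a minimal sketch of the filtering rule: a message is printed only if its level is at least as severe as the configured minimum. The helpers resolveMinLevel and isPrinted are hypothetical names, and falling back to INFO for unrecognized values is an assumption of this example rather than documented behavior.

```typescript
// A plain map of environment variables; in Node.js this could be process.env.
type Env = Record<string, string | undefined>;

// Levels in order of increasing severity, as listed above.
const LEVELS = ['DEBUG', 'INFO', 'WARNING', 'ERROR'] as const;
type LogLevel = (typeof LEVELS)[number];

// Hypothetical helper: reads the minimum level, defaulting to INFO.
// Assumption: unset or unrecognized values fall back to INFO.
function resolveMinLevel(env: Env): LogLevel {
    const raw = env['CRAWLEE_LOG_LEVEL'];
    const index = LEVELS.indexOf(raw as LogLevel);
    return index === -1 ? 'INFO' : LEVELS[index];
}

// A message is printed when it is at least as severe as the minimum level.
function isPrinted(messageLevel: LogLevel, minLevel: LogLevel): boolean {
    return LEVELS.indexOf(messageLevel) >= LEVELS.indexOf(minLevel);
}
```

With no variable set, DEBUG messages are filtered out and INFO, WARNING and ERROR messages pass; setting CRAWLEE_LOG_LEVEL to DEBUG lets everything through.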
CRAWLEE_VERBOSE_LOG
Enables verbose logging if set to true. If not explicitly set to true, errors thrown from inside the request handler are logged as a warning containing only the error message, as long as we know the request will be retried. The same applies to some known errors (such as timeout errors). Disabled by default.
CRAWLEE_MEMORY_MBYTES
Sets the amount of system memory in megabytes to be used by the AutoscaledPool. It is used to limit the number of concurrently running tasks. By default, the maximum amount of memory to be used is set to one quarter of total system memory, i.e. on a system with 8192 MB of memory, the autoscaling feature will only use up to 2048 MB of memory.
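The default described above is simple arithmetic, and the following sketch spells it out: use the value of CRAWLEE_MEMORY_MBYTES when it is a positive integer, otherwise take one quarter of total system memory. The function name resolveMaxMemoryMbytes is hypothetical, and both the rounding (Math.ceil) and the handling of invalid values are assumptions of this example.

```typescript
// A plain map of environment variables; in Node.js this could be process.env.
type Env = Record<string, string | undefined>;

// Hypothetical helper: computes the memory limit in megabytes.
function resolveMaxMemoryMbytes(env: Env, totalSystemMemoryMbytes: number): number {
    const raw = env['CRAWLEE_MEMORY_MBYTES'];
    const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);

    // A valid, positive override takes precedence over the default.
    if (Number.isFinite(parsed) && parsed > 0) {
        return parsed;
    }

    // Default: one quarter of total system memory.
    // Assumption: fractional results are rounded up.
    return Math.ceil(totalSystemMemoryMbytes / 4);
}
```

For the 8192 MB system mentioned above, the function returns 2048 when the variable is unset. In Node.js, the total could be obtained from os.totalmem(), which reports bytes and would need converting to megabytes first.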