Apify Platform

Apify is a platform built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to compute instances (Actors), convenient request and result storages, proxies, scheduling, webhooks and more, accessible through a web interface or an API.

While we think that the Apify platform is super cool, and it's definitely worth signing up for a free account, Crawlee is and will always be open source, runnable locally or on any cloud infrastructure.

note

We do not test Crawlee in other cloud environments such as AWS Lambda or on specific architectures such as Raspberry Pi. We strive to make it work, but there are no guarantees.

Logging into Apify platform from Crawlee

To access our Apify account from Crawlee, we must provide credentials - our API token. We can do that either by using the Apify CLI or with environment variables.

Once we provide credentials to our scraper, we will be able to use all the Apify platform features, such as calling actors, saving to cloud storages, using Apify proxies, setting up webhooks and so on.

Log in with CLI

Apify CLI allows us to log in to our Apify account on our computer. If we then run our scraper using the CLI, our credentials will automatically be added.

npm install -g apify-cli
apify login -t OUR_API_TOKEN

Log in with environment variables

Alternatively, we can always provide credentials to our scraper by setting the APIFY_TOKEN environment variable to our API token.

There's also the APIFY_PROXY_PASSWORD environment variable. The Actor automatically infers it from our token, but setting it explicitly can be useful when we need to access proxies from a different account than the one our token represents.
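
For example, in a Unix-like shell we could set the variables for the current session like this (both values are placeholders):

export APIFY_TOKEN=our_api_token
export APIFY_PROXY_PASSWORD=our_proxy_password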

Log in with Configuration

Another option is to use the Configuration instance and set our API token there:

import { Actor } from 'apify';

// The token is passed through the SDK's Configuration options (placeholder value).
const sdk = new Actor({ token: 'our_api_token' });

What is an actor

When we deploy our script to the Apify platform, it becomes an actor. An actor is a serverless microservice that accepts an input and produces an output. It can run for a few seconds, hours or even infinitely. An actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset.
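
To make the input/output contract concrete, here is a minimal sketch of an actor that reads its input and stores a single output record; the { name } input shape is illustrative:

import { Actor } from 'apify';

await Actor.main(async () => {
    // Read the actor's input (an illustrative { name } object).
    const input = await Actor.getInput();

    // Produce an output record in the default Key-Value Store.
    await Actor.setValue('OUTPUT', { greeting: `Hello, ${input?.name ?? 'world'}!` });
});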

Actors can be shared in the Apify Store so that other people can use them. But don't worry, if we share our actor in the store and somebody uses it, it runs under their account, not ours.

Running an actor locally

First, let's create a boilerplate for the new actor. We can use the Apify CLI and just run:

apify create my-hello-world

The CLI will prompt us to select a project boilerplate template - let's pick "Hello world". The tool will create a directory called my-hello-world with the Node.js project files. We can then run the actor as follows:

cd my-hello-world
apify run

Running Crawlee code as an actor

To run the Crawlee code as an actor on the Apify platform, we should either use a combination of the Actor.init() and Actor.exit() functions, or wrap the code in the Actor.main() function, as the example below does (a sketch of the init/exit variant follows it).

Let's look at the CheerioCrawler example from the Quick Start guide:

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.main(async () => {
    const crawler = new CheerioCrawler({
        async requestHandler({ request, $, enqueueLinks }) {
            const { url } = request;

            // Extract HTML title of the page.
            const title = $('title').text();
            console.log(`Title of ${url}: ${title}`);

            // Add URLs that match the provided pattern.
            await enqueueLinks({
                globs: ['https://www.iana.org/*'],
            });

            // Save extracted data to dataset.
            await Actor.pushData({ url, title });
        },
    });

    // Enqueue the initial request and run the crawler
    await crawler.run(['https://www.iana.org/']);
});
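
Alternatively, a sketch of the same flow using the Actor.init() and Actor.exit() pair instead of Actor.main() could look like this:

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Store the page title, as in the example above.
        await Actor.pushData({ url: request.url, title: $('title').text() });
    },
});

await crawler.run(['https://www.iana.org/']);

// Actor.exit() gracefully shuts the actor down (flushes storages, exits the process).
await Actor.exit();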

Note that we can also run our actor (which uses Crawlee) locally with the Apify CLI, by starting it with the following command in the project folder (the -p flag purges the default storages before the run):

apify run -p

Deploying an actor to Apify platform

Now (assuming we are already logged in to our Apify account) we can easily deploy our code to the Apify platform by running:

apify push

Our script will be uploaded to and built on the Apify platform so that it can be run there. For more information, view the Apify Actor documentation.

Usage on Apify platform

We can also develop our actor in an online code editor directly on the platform (we'll need an Apify account). Let's go to the Actors page in the app, click Create new, then go to the Source tab and start writing our code, or paste one of the examples from the Examples section.

Storages

There are several things worth mentioning here.

  1. Compared to Crawlee, access to the default Key-Value Store and Dataset is simplified: we don't need the helper functions of the storage classes, but can instead call Actor.getValue() and Actor.setValue() for the default Key-Value Store and Actor.pushData() for the default Dataset directly.
  2. To open a storage, we shouldn't use the storage classes, but the Actor class instead. So instead of KeyValueStore.open(), Dataset.open() and RequestQueue.open(), we use Actor.openKeyValueStore(), Actor.openDataset() and Actor.openRequestQueue() respectively (see the sketch below this list). Each of these methods accepts OpenStorageOptions as a second argument, which has a single optional property, forceCloud: if set to true, cloud storage is used instead of the folder on the local disk.
  3. When a Dataset is stored on the Apify platform, we can export its data to the following formats: HTML, JSON, CSV, Excel, XML and RSS. Datasets are displayed on the actor run details page and in the Storage section of the Apify Console. The actual data is exported using the Get dataset items Apify API endpoint, which makes it easy to share the crawling results.
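
A minimal sketch of points 1 and 2; the storage names are illustrative:

import { Actor } from 'apify';

await Actor.main(async () => {
    // Default Key-Value Store and Dataset via Actor helpers (point 1).
    await Actor.setValue('LAST_RUN', { startedAt: new Date().toISOString() });
    const lastRun = await Actor.getValue('LAST_RUN');
    await Actor.pushData({ status: 'ok' });

    // Named storages opened through the Actor class (point 2).
    const store = await Actor.openKeyValueStore('my-named-store');
    const dataset = await Actor.openDataset('my-named-dataset');
    const queue = await Actor.openRequestQueue('my-named-queue');
});

For point 3, the Get dataset items endpoint accepts the format as a query parameter, e.g. https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv (DATASET_ID is a placeholder).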

Environment variables

The following are some additional environment variables specific to the Apify platform. More Crawlee-specific environment variables can be found in the Environment Variables guide.

note

Note that the CRAWLEE_-prefixed environment variables don't need to be replaced with the APIFY_-prefixed ones respected by the Apify platform. For example, if we have CRAWLEE_DEFAULT_DATASET_ID set in our project and then push our code to the Apify platform as an actor, this variable will still be respected by the actor/platform.

APIFY_TOKEN

The API token for our Apify account. It is used to access the Apify API, e.g. to access cloud storage or to run an actor on the Apify platform. We can find our API token on the Account - Integrations page.

Combinations of APIFY_TOKEN and CRAWLEE_STORAGE_DIR

The description of the CRAWLEE_STORAGE_DIR env variable can be found in the Environment Variables guide.

By combining the env vars in various ways, we can greatly influence the actor's behavior.

Env Vars                            | API | Storages
none OR CRAWLEE_STORAGE_DIR         | no  | local
APIFY_TOKEN                         | yes | Apify platform
APIFY_TOKEN AND CRAWLEE_STORAGE_DIR | yes | local + platform

When using both APIFY_TOKEN and CRAWLEE_STORAGE_DIR, we can use all the Apify platform features and our data will be stored locally by default. If we want to access platform storages, we can use the { forceCloud: true } option in their respective functions.

import { Actor } from 'apify';

// Opens a dataset in the local storage folder.
const localDataset = await Actor.openDataset('my-local-data');
// Opens a dataset on the Apify platform, even though data is stored locally by default.
const remoteDataset = await Actor.openDataset('my-remote-data', { forceCloud: true });

APIFY_PROXY_PASSWORD

Optional password for Apify Proxy, used for IP address rotation. Assuming an Apify account has already been created, we can find the password on the Proxy page in the Apify Console. The password is automatically inferred from the APIFY_TOKEN env var, so in most cases we don't need to touch it. We should set it when, for some reason, we need access to Apify Proxy but not to the Apify API, or when we need to access proxies from a different account than the one our token represents.
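
If we do need to set it in code rather than via the env var, a minimal sketch could pass the password explicitly; the value is a placeholder, and this assumes the password option of Actor.createProxyConfiguration():

import { Actor } from 'apify';

// Explicitly passing the proxy password instead of relying on the
// APIFY_PROXY_PASSWORD env var (placeholder value).
const proxyConfiguration = await Actor.createProxyConfiguration({
    password: 'our_proxy_password',
});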

Proxy management

In addition to our own proxy servers and proxies acquired from third-party providers, both of which can be used with Crawlee, we can also rely on Apify Proxy for our scraping needs.

If we are already subscribed to Apify Proxy, we can start using it immediately in only a few lines of code (for local usage, we should first be logged in to our Apify account):

import { Actor } from 'apify';

const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl();

Note that, unlike when using proxies in plain Crawlee, we shouldn't create the ProxyConfiguration instance with its constructor. To use Apify Proxy, we should create the instance with the Actor.createProxyConfiguration() function instead.

Apify Proxy vs. Own proxies

The ProxyConfiguration class covers both Apify Proxy and custom proxy URLs, so we can easily switch between proxy providers. However, some features of the class are available only to Apify Proxy users, mainly because Apify Proxy is what one would call a super-proxy: it's not a single proxy server, but an API endpoint that allows connections through millions of different IP addresses. The class essentially has two modes: Apify Proxy or own (third-party) proxies.

The difference is easy to remember.
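
To make the two modes concrete, here is a minimal sketch; the own-proxy URLs below are placeholders:

import { Actor } from 'apify';
import { ProxyConfiguration } from 'crawlee';

// Apify Proxy mode: the instance is created by the Actor helper.
const apifyProxy = await Actor.createProxyConfiguration();

// Own-proxy mode: the instance is created with the constructor and our own URLs.
const ownProxy = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
});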

Apify Proxy Configuration

With Apify Proxy, we can select specific proxy groups to use, or countries to connect from. This allows us to get better proxy performance after some initial research.

import { Actor } from 'apify';

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});
const proxyUrl = await proxyConfiguration.newUrl();

Now our crawlers will use only residential proxies from the US. Note that we must first get access to a proxy group before we are able to use it. We can check the proxy groups available to us in the proxy dashboard.
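
To wire such a configuration into a crawler, we can pass it via the crawler's proxyConfiguration option; a minimal sketch:

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.main(async () => {
    const proxyConfiguration = await Actor.createProxyConfiguration({
        groups: ['RESIDENTIAL'],
        countryCode: 'US',
    });

    const crawler = new CheerioCrawler({
        // Every request now goes out through a US residential proxy.
        proxyConfiguration,
        async requestHandler({ request, $ }) {
            console.log(`Fetched ${request.url}: ${$('title').text()}`);
        },
    });

    await crawler.run(['https://www.iana.org/']);
});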
