Session Management

SessionPool is a class that allows us to handle the rotation of proxy IP addresses along with cookies and other custom settings in Crawlee.

The main benefit of using the session pool is that we can filter out blocked or non-working proxies, so our crawler does not retry requests over proxies that are already known to be blocked or dead. Another benefit of using SessionPool is that we can store information tied tightly to an IP address, such as cookies, auth tokens, and particular headers. Using our cookies and other identifiers only with a specific IP reduces the chance of being blocked. Last but not least, SessionPool rotates IP addresses evenly - it picks sessions at random, which should prevent a small pool of available IPs from burning out.
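If you want to see how the pool behaves outside of a crawler, you can also open a SessionPool directly and drive the session lifecycle yourself. The sketch below is only illustrative - the option values (maxPoolSize, maxUsageCount) are arbitrary assumptions, not recommended settings:

import { SessionPool } from 'crawlee';

const sessionPool = await SessionPool.open({
    maxPoolSize: 25,
    sessionOptions: {
        // Retire a session automatically after this many uses (arbitrary value).
        maxUsageCount: 150,
    },
});

// Picks a random session from the pool.
const session = await sessionPool.getSession();

// After a successful request, reward the session.
session.markGood();
// After a suspicious response, penalize it instead:
// session.markBad();
// And once you are sure it is blocked, remove it from rotation:
// session.retire();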

Now let's take a look at an example of how to use the session pool in a crawler:

import { BasicCrawler, ProxyConfiguration } from 'crawlee';
import { gotScraping } from 'got-scraping';

const proxyConfiguration = new ProxyConfiguration({ /* opts */ });

const crawler = new BasicCrawler({
    // Activates the Session pool (default is true).
    useSessionPool: true,
    // Overrides the default Session pool configuration.
    sessionPoolOptions: { maxPoolSize: 100 },
    async requestHandler({ request, session }) {
        const { url } = request;
        const requestOptions = {
            url,
            // We use the session id in order to have the same proxyUrl
            // for all the requests using the same session.
            proxyUrl: await proxyConfiguration.newUrl(session.id),
            throwHttpErrors: false,
            headers: {
                // If you want to use the session's cookieJar,
                // this is how you get the Cookie header string from the session.
                Cookie: session.getCookieString(url),
            },
        };
        let response;

        try {
            response = await gotScraping(requestOptions);
        } catch (e) {
            // If a network error happens, such as a timeout or socket hangup,
            // there is usually a chance that it was just bad luck and the proxy
            // still works, so there is no need to throw the session away.
            // The list of error codes here is only illustrative.
            if (['ETIMEDOUT', 'ECONNRESET', 'ESOCKETTIMEDOUT'].includes(e.code)) {
                session.markBad();
            }
            throw e;
        }

        // Automatically retires the session based on the response HTTP status code.
        session.retireOnBlockedStatusCodes(response.statusCode);

        if (response.body.blocked) {
            // You are sure it is blocked.
            // This will throw away the session.
            session.retire();
        }

        // Everything is ok, you can get the data.
        // No need to call session.markGood() - BasicCrawler calls it for you.

        // If you want to use the session's CookieJar, you need to store
        // the cookies from the response in the session.
        session.setCookiesFromResponse(response);
    },
});
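
To run the crawler with this configuration, you would kick it off with a list of start URLs; the URL below is just a placeholder for your own targets.

// Placeholder start URL - replace it with the pages you actually want to crawl.
await crawler.run(['https://crawlee.dev']);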

These are the basics of configuring SessionPool. Bear in mind that a session pool needs time to find working IPs and build up the pool, so we will probably see a lot of errors until it stabilizes.