Components of Crawler

This section describes every component of the crawler role so that developers can implement the details of each step in the process. As noted above, the crawler role has several types of components:

  • HTTP sender

It is responsible for sending HTTP requests, including setting cookies, sending through a proxy, etc.

  • HTTP response parser

Parses the HTTP response to extract the target content data.

  • Data processing

Processes the parsed data.

  • Persistence

Persists the final data in a file format or in a database.

Please refer to the lanes pool diagram for the relationship between the components and the crawler role.
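As a rough illustration of how the four components hand data to one another, the following minimal sketch chains stand-in functions in the order listed above. Nothing in it is SmoothCrawler API: the names send, parse, process, persist, crawl and STORE, the canned response, and the example URL are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# In-memory stand-in for a file or database (hypothetical).
STORE: List[List[int]] = []


def send(url: str) -> Dict[str, Any]:
    """HTTP sender stand-in: returns a canned response, no network access."""
    return {"url": url, "status": 200, "body": "3,1,2"}


def parse(response: Dict[str, Any]) -> List[str]:
    """HTTP response parser stand-in: extract the target content."""
    return response["body"].split(",")


def process(parsed: List[str]) -> List[int]:
    """Data processing stand-in: convert the parsed values and sort them."""
    return sorted(int(value) for value in parsed)


def persist(data: List[int]) -> None:
    """Persistence stand-in: keep the final data in memory."""
    STORE.append(data)


def crawl(url: str) -> List[int]:
    """Run the four components in order and return the processed data."""
    response = send(url)
    parsed = parse(response)
    processed = process(parsed)
    persist(processed)
    return processed
```

Each stage only depends on the output of the previous one, which is why the components can be implemented and replaced independently.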

HTTP Sender

module smoothcrawler.components.httpio

What problems can arise when sending an HTTP request? The main ones are performance and the retry mechanism. For the moment, let’s consider only the retry mechanism. Sending an HTTP request can fail in 2 ways: an exception or error is raised, or the HTTP response comes back with a status code other than 200. The former can be handled by overriding 4 functions: before_request, request_done, request_fail and request_final. This is implemented with another Python package, MultiRunnable (multirunnable.api.retry). Refer to its API reference for more details on usage.

HTTP

class smoothcrawler.components.httpio.HTTP[source]
request(url: str, method: Union[str, HTTPMethod] = 'GET', timeout: int = 1, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request. For the retry mechanism, you can override the functions before_request, request_done, request_final and request_fail to customize the behavior if needed.

  • before_request

Runs before the HTTP request is sent.

  • request_done

Runs after the HTTP request is sent and the HTTP response is received successfully, without any exceptions.

  • request_final

Runs last after the HTTP request is sent, whether or not it succeeded.

  • request_fail

Runs if any exception is raised while sending the HTTP request.

Parameters
  • url – URL.

  • method – HTTP method.

  • timeout – The number of times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.
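The order in which the four hooks run around a retried request can be sketched as follows. This is a minimal stand-in, not the smoothcrawler implementation: the class SketchHTTP, the injected transport callable, the events list, and the helper make_flaky are all hypothetical. It follows the parameter description above in treating timeout as the maximum number of attempts, and it assumes the hooks run once per attempt.

```python
from typing import Any, Callable, List


class SketchHTTP:
    """Hypothetical stand-in that only mimics the hook order."""

    def __init__(self, transport: Callable[[str, str], Any]) -> None:
        self._transport = transport  # callable that performs the real request
        self.events: List[str] = []  # records which hooks ran, in order

    def request(self, url: str, method: str = "GET", timeout: int = 1) -> Any:
        # Assumption: `timeout` is the maximum number of attempts.
        for _ in range(timeout):
            self.before_request()
            try:
                response = self._transport(url, method)
            except Exception as error:
                self.request_fail(error)
            else:
                return self.request_done(response)
            finally:
                # Runs after every attempt, successful or not.
                self.request_final()
        return None  # every attempt failed

    def before_request(self) -> None:
        self.events.append("before_request")

    def request_done(self, result: Any) -> Any:
        self.events.append("request_done")
        return result

    def request_fail(self, error: Exception) -> None:
        self.events.append("request_fail")

    def request_final(self) -> None:
        self.events.append("request_final")


def make_flaky(failures: int) -> Callable[[str, str], str]:
    """Build a fake transport that raises `failures` times, then succeeds."""
    state = {"left": failures}

    def transport(url: str, method: str) -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("temporary failure")
        return f"{method} {url} -> 200"

    return transport
```

With a transport built by make_flaky(1) and timeout=2, the first attempt goes through before_request, request_fail and request_final, and the second goes through before_request, request_done and request_final.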

get(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the GET method.

Parameters

url – URL.

Returns

An HTTP response object.

post(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the POST method.

Parameters

url – URL.

Returns

An HTTP response object.

put(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the PUT method.

Parameters

url – URL.

Returns

An HTTP response object.

delete(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the DELETE method.

Parameters

url – URL.

Returns

An HTTP response object.

head(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the HEAD method.

Parameters

url – URL.

Returns

An HTTP response object.

option(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the OPTIONS method.

Parameters

url – URL.

Returns

An HTTP response object.

before_request(*args, **kwargs) None[source]

This function is called before the HTTP request is sent.

Returns

None

request_done(result) Any[source]

This function is called after the HTTP request is sent and completes without any exceptions.

Parameters

result – The result of sending the HTTP request. Generally, it is the HTTP response object.

Returns

The handled result.

request_fail(error: Exception) None[source]

This function is called if sending the HTTP request fails.

Parameters

error – The exception that was raised.

Returns

None

request_final() None[source]

This function is always called at the end, whether or not the HTTP request was sent successfully.

Returns

None

status_code()[source]

Get the HTTP status code of the response.

Returns

AsyncHTTP

class smoothcrawler.components.httpio.AsyncHTTP[source]
async request(url: str, method: Union[str, HTTPMethod] = 'GET', timeout: int = 1, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request. For the retry mechanism, you can override the functions before_request, request_done, request_final and request_fail to customize the behavior if needed.

  • before_request

Runs before the HTTP request is sent.

  • request_done

Runs after the HTTP request is sent and the HTTP response is received successfully, without any exceptions.

  • request_final

Runs last after the HTTP request is sent, whether or not it succeeded.

  • request_fail

Runs if any exception is raised while sending the HTTP request.

Parameters
  • url – URL.

  • method – HTTP method.

  • timeout – The number of times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.
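The asynchronous variant has the same shape, except that the request and every hook are coroutines and must be awaited. The sketch below is a stand-in built on asyncio only: SketchAsyncHTTP, fake_transport and crawl_all are hypothetical names, the hooks are left as no-ops, and treating timeout as the number of attempts is an assumption taken from the parameter description.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


class SketchAsyncHTTP:
    """Hypothetical stand-in that mimics the awaited hook order."""

    def __init__(self, transport: Callable[[str, str], Awaitable[Any]]) -> None:
        self._transport = transport

    async def request(self, url: str, method: str = "GET", timeout: int = 1) -> Any:
        # Assumption: `timeout` is the maximum number of attempts.
        for _ in range(timeout):
            await self.before_request()
            try:
                response = await self._transport(url, method)
            except Exception as error:
                await self.request_fail(error)
            else:
                return await self.request_done(response)
            finally:
                await self.request_final()
        return None

    async def before_request(self) -> None:
        pass

    async def request_done(self, result: Any) -> Any:
        return result

    async def request_fail(self, error: Exception) -> None:
        pass

    async def request_final(self) -> None:
        pass


async def fake_transport(url: str, method: str) -> Dict[str, Any]:
    """Fake coroutine transport: yields to the event loop, no network access."""
    await asyncio.sleep(0)
    return {"url": url, "method": method, "status": 200}


async def crawl_all(urls: List[str]) -> List[Any]:
    """Send all requests concurrently; results keep the order of `urls`."""
    client = SketchAsyncHTTP(fake_transport)
    return await asyncio.gather(*(client.request(url) for url in urls))
```

Because each request is a coroutine, several of them can be scheduled together with asyncio.gather, for example through asyncio.run(crawl_all([...])).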

async get(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the GET method.

Parameters

url – URL.

Returns

An HTTP response object.

async post(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the POST method.

Parameters

url – URL.

Returns

An HTTP response object.

async put(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the PUT method.

Parameters

url – URL.

Returns

An HTTP response object.

async delete(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the DELETE method.

Parameters

url – URL.

Returns

An HTTP response object.

async head(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the HEAD method.

Parameters

url – URL.

Returns

An HTTP response object.

async option(url: str, *args, **kwargs) Generic[HTTPResponse][source]

Send an HTTP request using the OPTIONS method.

Parameters

url – URL.

Returns

An HTTP response object.

async before_request(*args, **kwargs) None[source]

Asynchronous version of HTTP.before_request.

Returns

None

async request_done(result)[source]

Asynchronous version of HTTP.request_done.

Parameters

result – The result of sending the HTTP request. Generally, it is the HTTP response object.

Returns

The handled result.

async request_fail(error: Exception) None[source]

Asynchronous version of HTTP.request_fail.

Parameters

error – The exception that was raised.

Returns

None

async request_final() None[source]

Asynchronous version of HTTP.request_final.

Returns

None

status_code()[source]

Get the HTTP status code of the response.

Returns

HTTP Response Parser

module smoothcrawler.components.data

Parses the HTTP response object.
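The parser methods documented in this section suggest a dispatch on the status code: read the code, then hand the response to one of two handlers. The sketch below illustrates that idea under stated assumptions. SketchResponseParser and DictResponseParser are hypothetical classes, the response is modeled as a plain dict, and whether the real parse_content dispatches exactly this way is not confirmed here.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class SketchResponseParser(ABC):
    """Hypothetical stand-in mirroring the documented method names."""

    def parse_content(self, response: Any) -> Any:
        # Assumed dispatch: a 200 response goes to one handler,
        # every other status code goes to the other.
        if self.get_status_code(response) == 200:
            return self.handling_200_response(response)
        return self.handling_not_200_response(response)

    @abstractmethod
    def get_status_code(self, response: Any) -> int:
        """Return the HTTP status code of the response."""

    def handling_200_response(self, response: Any) -> Any:
        return response

    def handling_not_200_response(self, response: Any) -> Any:
        return response


class DictResponseParser(SketchResponseParser):
    """Parser for responses modeled as {'status': int, 'body': str}."""

    def get_status_code(self, response: Dict[str, Any]) -> int:
        return response["status"]

    def handling_200_response(self, response: Dict[str, Any]) -> str:
        # Extract the target content from a successful response.
        return response["body"].strip()

    def handling_not_200_response(self, response: Dict[str, Any]) -> Optional[str]:
        # Signal "no data" for unsuccessful responses.
        return None
```

This covers the second kind of failure described for the HTTP sender: a response that arrives without raising an exception but whose status code is not 200.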

BaseHTTPResponseParser

class smoothcrawler.components.data.BaseHTTPResponseParser[source]
parse_content(response) Generic[T][source]

Parse the HTTP response object.

Parameters

response – The HTTP response object.

Returns

The data which has been parsed or handled from HTTP response.

abstract get_status_code(response) int[source]

Get the HTTP status code from the HTTP response.

Parameters

response

Returns

handling_200_response(response) Generic[T][source]

Handle the HTTP response object if its HTTP status code is 200.

Parameters

response

Returns

handling_not_200_response(response) Generic[T][source]

Handle the HTTP response object if its HTTP status code isn’t 200.

Parameters

response

Returns

BaseAsyncHTTPResponseParser

class smoothcrawler.components.data.BaseAsyncHTTPResponseParser[source]
async parse_content(response) Generic[T][source]

The asynchronous version of BaseHTTPResponseParser.parse_content.

Parameters

response

Returns

abstract async get_status_code(response) int[source]

The asynchronous version of BaseHTTPResponseParser.get_status_code.

Parameters

response

Returns

async handling_200_response(response) Generic[T][source]

The asynchronous version of BaseHTTPResponseParser.handling_200_response.

Parameters

response

Returns

async handling_not_200_response(response) Generic[T][source]

The asynchronous version of BaseHTTPResponseParser.handling_not_200_response.

Parameters

response

Returns

Data Processing Handler

module smoothcrawler.components.data

Processes the data parsed from the HTTP response object.
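A data handler takes whatever the parser produced and turns it into the final shape. As a minimal sketch, assuming the parsed data is a list of dicts with string values, a handler might clean and convert each row. SketchDataHandler, PriceHandler and the field names name and price are hypothetical and only serve the illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchDataHandler(ABC):
    """Hypothetical stand-in with a single abstract process method."""

    @abstractmethod
    def process(self, result: Any) -> Any:
        """Transform the parsed data into its final form."""


class PriceHandler(SketchDataHandler):
    """Strip names, convert prices to float, and drop rows without a name."""

    def process(self, result: List[Dict[str, str]]) -> List[Dict[str, Any]]:
        cleaned: List[Dict[str, Any]] = []
        for row in result:
            name = row.get("name", "").strip()
            if not name:
                continue  # skip incomplete rows
            cleaned.append({"name": name, "price": float(row["price"])})
        return cleaned
```

Keeping this step separate from parsing means the same parser output can feed different handlers.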

BaseDataHandler

class smoothcrawler.components.data.BaseDataHandler[source]
abstract process(result) Generic[T][source]

The implementation of the data processing.

Parameters

result

Returns

BaseAsyncDataHandler

class smoothcrawler.components.data.BaseAsyncDataHandler[source]
abstract async process(result) Generic[T][source]

The asynchronous version of BaseDataHandler.process.

Parameters

result

Returns

Persistence

module smoothcrawler.components.persistence

Persists data in a specific file format or in a database.

PersistenceFacade

class smoothcrawler.components.persistence.PersistenceFacade[source]
abstract save(data: Union[Iterable, Any], *args, **kwargs) Generic[T][source]

Save the data, either as a specific file format or by inserting it into a database.

Parameters

data – The target data to save. Generally, it is an iterable object.

Returns

Generally, it returns nothing, but it may return a value if needed.
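As one possible illustration of the file-format case, the sketch below saves an iterable of rows as CSV using only the standard library. SketchPersistence and CsvPersistence are hypothetical classes that mirror the save signature above, and the constructor argument path is an assumption of this example.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Iterable, Sequence


class SketchPersistence(ABC):
    """Hypothetical stand-in with a single abstract save method."""

    @abstractmethod
    def save(self, data: Any, *args: Any, **kwargs: Any) -> Any:
        """Persist the data to a file or a database."""


class CsvPersistence(SketchPersistence):
    """Write each item of the iterable as one CSV row."""

    def __init__(self, path: Path) -> None:
        self._path = path  # target file, overwritten on every save

    def save(self, data: Iterable[Sequence[Any]], *args: Any, **kwargs: Any) -> None:
        # newline="" lets the csv module control line endings itself.
        with self._path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(data)
```

A database-backed variant would keep the same save signature and replace the file writing with insert statements.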