Components of Crawler
This section describes every component of the crawler role so that developers know what to implement at each step of the process. As noted above, the crawler role is made up of several types of components:
HTTP sender
Responsible for sending HTTP requests, including setting cookies, sending through a proxy, etc.
HTTP response parser
Parses the HTTP response to extract the target content data.
Data processing
Processes the parsed data.
Persistence
Persists the final data to a file in a specific format or to a database.
Please refer to the lanes pool diagram for the relationship between the components and the crawler role.
HTTP Sender
module smoothcrawler.components.httpio
What problems can come up when sending HTTP requests? The main ones are performance and the retry mechanism. For the moment, let’s consider only the retry mechanism. Sending an HTTP request can fail in 2 ways: an exception or error is raised, or an HTTP response comes back with a status code other than 200. The former can be handled by overriding 4 functions: before_request, request_done, request_fail and request_final. The mechanism is built on another Python package, MultiRunnable, through multirunnable.api.retry. Refer to its API reference for more detailed usage.
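The hook lifecycle described above can be sketched without the library. This is a minimal illustration, not the SmoothCrawler or MultiRunnable implementation: the class SketchHTTP, the `send` callable and the helper flaky_send are hypothetical stand-ins, and the order in which the hooks run (before, then done or fail, then final on every attempt) is an assumption based on the hook descriptions in this page. In real use you would override the same four functions on a subclass of smoothcrawler.components.httpio.HTTP.

```python
from typing import Any, Callable, List


class SketchHTTP:
    """Hypothetical stand-in for an HTTP sender with retry hooks."""

    def __init__(self) -> None:
        self.events: List[str] = []  # records the order in which hooks run

    def request(self, send: Callable[[], Any], timeout: int = 1) -> Any:
        # Assumption: `timeout` is treated as the maximum number of attempts.
        for _ in range(timeout):
            try:
                self.before_request()
                return self.request_done(send())
            except Exception as error:
                self.request_fail(error)
            finally:
                # Runs after every attempt, successful or not.
                self.request_final()
        return None

    def before_request(self) -> None:
        self.events.append("before")

    def request_done(self, result: Any) -> Any:
        self.events.append("done")
        return result

    def request_fail(self, error: Exception) -> None:
        self.events.append("fail")

    def request_final(self) -> None:
        self.events.append("final")


calls = {"count": 0}


def flaky_send() -> str:
    """Fake sender that fails twice, then succeeds (no network involved)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "response body"


http = SketchHTTP()
result = http.request(flaky_send, timeout=3)
print(result)
print(http.events)
```

Keeping the retry loop in request and the reactions in small hook functions means a subclass only has to override the hook it cares about, for example logging in request_fail, while the loop itself stays untouched.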
HTTP
- class smoothcrawler.components.httpio.HTTP[source]
- request(url: str, method: Union[str, HTTPMethod] = 'GET', timeout: int = 1, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request. For the retry mechanism, you can override the functions before_request, request_done, request_final and request_fail to customize the behavior if needed.
before_request
Runs before the HTTP request is sent.
request_done
Runs after the HTTP request has been sent and the HTTP response was received without any exception.
request_final
Runs last, after the HTTP request has been sent, whether it succeeded or not.
request_fail
Runs if any exception is raised while sending the HTTP request.
- Parameters
url – URL.
method – HTTP method.
timeout – How many times it retries sending the HTTP request if sending fails.
- Returns
An HTTP response object.
- get(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the GET method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- post(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the POST method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- put(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the PUT method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- delete(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the DELETE method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- head(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the HEAD method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- option(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the OPTIONS method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- before_request(*args, **kwargs) None[source]
Called before the HTTP request is sent.
- Returns
None
- request_done(result) Any[source]
Called after the HTTP request has been sent and completed without any exception.
- Parameters
result – The result of sending the HTTP request. Generally, it is the HTTP response object.
- Returns
The handled result.
- request_fail(error: Exception) None[source]
Called if sending the HTTP request fails.
- Parameters
error – The exception that was raised.
- Returns
None
AsyncHTTP
- class smoothcrawler.components.httpio.AsyncHTTP[source]
- async request(url: str, method: Union[str, HTTPMethod] = 'GET', timeout: int = 1, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request. For the retry mechanism, you can override the functions before_request, request_done, request_final and request_fail to customize the behavior if needed.
before_request
Runs before the HTTP request is sent.
request_done
Runs after the HTTP request has been sent and the HTTP response was received without any exception.
request_final
Runs last, after the HTTP request has been sent, whether it succeeded or not.
request_fail
Runs if any exception is raised while sending the HTTP request.
- Parameters
url – URL.
method – HTTP method.
timeout – How many times it retries sending the HTTP request if sending fails.
- Returns
An HTTP response object.
- async get(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the GET method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- async post(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the POST method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- async put(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the PUT method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- async delete(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the DELETE method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- async head(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the HEAD method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- async option(url: str, *args, **kwargs) Generic[HTTPResponse][source]
Send an HTTP request with the OPTIONS method.
- Parameters
url – URL.
- Returns
An HTTP response object.
- async before_request(*args, **kwargs) None[source]
Asynchronous version of HTTP.before_request.
- Returns
None
- async request_done(result)[source]
Asynchronous version of HTTP.request_done.
- Parameters
result – The result of sending the HTTP request. Generally, it is the HTTP response object.
- Returns
The handled result.
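As a rough illustration of the asynchronous flavor, the sketch below awaits the hooks inside a retry loop. It does not import the library: SketchAsyncHTTP, the `send` coroutine factory and flaky_async_send are hypothetical names, and treating `timeout` as the maximum number of attempts is an assumption taken from the parameter description. Only before_request and request_done are modeled as coroutines, matching the two asynchronous hooks listed above; failure handling is kept inline.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


class SketchAsyncHTTP:
    """Hypothetical stand-in for an asynchronous HTTP sender with hooks."""

    def __init__(self) -> None:
        self.events: List[str] = []  # records the order in which hooks run

    async def request(
        self, send: Callable[[], Awaitable[Any]], timeout: int = 1
    ) -> Any:
        # Assumption: `timeout` is treated as the maximum number of attempts.
        for _ in range(timeout):
            try:
                await self.before_request()
                return await self.request_done(await send())
            except Exception as error:
                # Inline failure handling; the next loop iteration retries.
                self.events.append(f"fail: {error}")
        return None

    async def before_request(self) -> None:
        self.events.append("before")

    async def request_done(self, result: Any) -> Any:
        self.events.append("done")
        return result


async_calls = {"count": 0}


async def flaky_async_send() -> str:
    """Fake coroutine that fails once, then succeeds (no network involved)."""
    async_calls["count"] += 1
    await asyncio.sleep(0)  # yield control, as a real I/O call would
    if async_calls["count"] < 2:
        raise ConnectionError("simulated failure")
    return "async response body"


async_http = SketchAsyncHTTP()
async_result = asyncio.run(async_http.request(flaky_async_send, timeout=2))
print(async_result)
print(async_http.events)
```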
HTTP Response Parser
module smoothcrawler.components.data
Parses the HTTP response object.
BaseHTTPResponseParser
- class smoothcrawler.components.data.BaseHTTPResponseParser[source]
- parse_content(response) Generic[T][source]
Parse the HTTP response object.
- Parameters
response – The HTTP response object.
- Returns
The data parsed or handled from the HTTP response.
- abstract get_status_code(response) int[source]
Get the HTTP status code from the HTTP response.
- Parameters
response – The HTTP response object.
- Returns
The HTTP status code as an integer.
BaseAsyncHTTPResponseParser
- class smoothcrawler.components.data.BaseAsyncHTTPResponseParser[source]
- async parse_content(response) Generic[T][source]
The asynchronous version of BaseHTTPResponseParser.parse_content.
- Parameters
response – The HTTP response object.
- Returns
The data parsed or handled from the HTTP response.
- abstract async get_status_code(response) int[source]
The asynchronous version of BaseHTTPResponseParser.get_status_code.
- Parameters
response – The HTTP response object.
- Returns
The HTTP status code as an integer.
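To make the parser's two responsibilities concrete, here is a small self-contained sketch. It is not the library's implementation: SketchParser, FakeResponse, handle_ok and JsonParser are hypothetical names, and the rule "parse only when the status code is 200" is an assumption that follows the failure case described in the HTTP Sender section. A real parser would subclass smoothcrawler.components.data.BaseHTTPResponseParser and implement get_status_code for the HTTP client in use.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeResponse:
    """Hypothetical response object standing in for a real HTTP client's."""
    status: int
    body: str


class SketchParser(ABC):
    """Stand-in parser: check the status code first, then parse the body."""

    def parse_content(self, response: Any) -> Optional[Any]:
        # Assumption: only a 200 response carries content worth parsing.
        if self.get_status_code(response) == 200:
            return self.handle_ok(response)
        return None

    @abstractmethod
    def get_status_code(self, response: Any) -> int:
        """Return the HTTP status code of the response."""

    @abstractmethod
    def handle_ok(self, response: Any) -> Any:
        """Turn a successful response into target data."""


class JsonParser(SketchParser):
    def get_status_code(self, response: FakeResponse) -> int:
        return response.status

    def handle_ok(self, response: FakeResponse) -> Any:
        return json.loads(response.body)


parser = JsonParser()
parsed_ok = parser.parse_content(FakeResponse(200, '{"title": "demo"}'))
parsed_missing = parser.parse_content(FakeResponse(404, "not found"))
print(parsed_ok)
print(parsed_missing)
```

Keeping get_status_code abstract lets the same parsing flow work with any HTTP client, since only that one function needs to know where the client stores the status code.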
Data Processing Handler
module smoothcrawler.components.data
Processes the data parsed from the HTTP response object.
BaseDataHandler
BaseAsyncDataHandler
Persistence
module smoothcrawler.components.persistence
Persists data to a file in a specific format or to a database.
PersistenceFacade
- class smoothcrawler.components.persistence.PersistenceFacade[source]
- abstract save(data: Union[Iterable, Any], *args, **kwargs) Generic[T][source]
Save the data, either to a file in a specific format or by inserting it into a database.
- Parameters
data – The target data to save. Generally, it is an iterable object.
- Returns
Generally nothing, but an implementation may return a value if it needs to.
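A save implementation could look like the sketch below, which appends records to a JSON Lines file. The names SketchPersistence and JsonLinesFile, the file name crawled.jsonl and the choice to return the number of saved rows are all illustrative assumptions; the library only specifies an abstract save function on smoothcrawler.components.persistence.PersistenceFacade.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Any


class SketchPersistence(ABC):
    """Hypothetical stand-in for a persistence facade with one entry point."""

    @abstractmethod
    def save(self, data: Any, *args, **kwargs) -> Any:
        """Persist the data to a file or a database."""


class JsonLinesFile(SketchPersistence):
    """Appends each record as one JSON document per line."""

    def __init__(self, path: str) -> None:
        self.path = path

    def save(self, data: Any, *args, **kwargs) -> int:
        # Accept either a list/tuple of records or a single record.
        rows = data if isinstance(data, (list, tuple)) else [data]
        with open(self.path, "a", encoding="utf-8") as file:
            for row in rows:
                file.write(json.dumps(row) + "\n")
        return len(rows)


out_path = os.path.join(tempfile.mkdtemp(), "crawled.jsonl")
saved = JsonLinesFile(out_path).save([{"id": 1}, {"id": 2}])
print(saved)
```

Because callers depend only on save, swapping the file target for a database target means writing another subclass and leaving the crawler code unchanged.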