Crawlers

module smoothcrawler.crawler

This module provides several Crawler roles for different scenarios. They are also the ‘final product’: each one combines the required components into a web spider and uses their features.

In short, components implement what happens at each step of the process, while a crawler role implements how those components work together.

Framework Modules

Base Crawler

class smoothcrawler.crawler.BaseCrawler(factory: Optional[BaseFactory] = None)

_initial_factory() → BaseFactory

Initialize the BaseFactory object. This function is called if the factory option of __init__ is None.

Returns: A CrawlerFactory instance.

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None

Register SmoothCrawler’s component(s) with the CrawlerFactory instance.

Parameters

http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.

Returns

None

crawl(method: str, url: str, retry: int = 1, *args, **kwargs) → Any

Run the crawl process: send the HTTP request, receive the HTTP response, and parse its content. It does only that and nothing else.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.
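The relationship between crawl and the two steps it combines can be sketched as follows. This is only an illustration under stated assumptions: SketchCrawler and UpperCaseCrawler are hypothetical stand-ins that do not import or reproduce BaseCrawler, and the fake response dictionary replaces real network I/O.

```python
from typing import Any


class SketchCrawler:
    """Hypothetical stand-in, not smoothcrawler.crawler.BaseCrawler."""

    def send_http_request(self, method: str, url: str, retry: int = 1) -> dict:
        # Pretend to send a request; a real sender would do network I/O here.
        return {"status": 200, "body": f"{method} {url}"}

    def parse_http_response(self, response: dict) -> Any:
        # Pull out the part of the response that the caller cares about.
        return response["body"]

    def crawl(self, method: str, url: str, retry: int = 1) -> Any:
        # Only send and parse; no data processing and no persistence.
        response = self.send_http_request(method, url, retry)
        return self.parse_http_response(response)


class UpperCaseCrawler(SketchCrawler):
    def parse_http_response(self, response: dict) -> Any:
        # Overriding one step changes the result without touching crawl().
        return response["body"].upper()


result = SketchCrawler().crawl("GET", "http://example.com")
overridden = UpperCaseCrawler().crawl("GET", "http://example.com")
```

Overriding a single step, as UpperCaseCrawler does, leaves the overall flow in crawl unchanged.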

send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]

Send the HTTP request. Override this function to implement your own logic for sending HTTP requests.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.
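A retry option like this is commonly implemented as a bounded loop around the send step. The sketch below assumes that retry counts the total number of attempts and that failures surface as ConnectionError; neither detail is confirmed by this page, and send_with_retry and flaky_send are hypothetical names.

```python
from typing import Callable, List


def send_with_retry(send: Callable[[], str], retry: int = 1) -> str:
    # Assumption: `retry` is the total number of attempts, at least one.
    last_error = None
    for _ in range(max(retry, 1)):
        try:
            return send()
        except ConnectionError as error:
            # Remember the failure and try again while attempts remain.
            last_error = error
    raise last_error


attempts: List[int] = []


def flaky_send() -> str:
    # Hypothetical sender that fails twice before it succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = send_with_retry(flaky_send, retry=3)
```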

parse_http_response(response: Generic[T]) → Generic[T]

Parse the HTTP response. Override this function to implement your own logic for parsing HTTP responses.

Parameters: response – The HTTP response.
Returns: The result parsed from the HTTP response. The data type is Generic[T].

data_process(parsed_response: Generic[T]) → Generic[T]

Process the data parsed from the HTTP response object. Override this function to implement your own data-processing logic.

Parameters: parsed_response – The data parsed from the HTTP response object.
Returns: The result of the data process. The data type is Generic[T].

persist(data: Any) → None

Persist the data. Override this function to implement your own logic for saving data.

Parameters: data – The target data to persist. In general, this is the data that has been parsed and handled.
Returns: None

MultiRunnable Crawler

class smoothcrawler.crawler.MultiRunnableCrawler(factory: Optional[BaseFactory] = None)

property persistence_factory: PersistenceFacade

Get the instance of persistence factory object.

Returns: A PersistenceFacade type object.

process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) → List[Any]

Run the crawler process with a list of URLs.

Parameters

method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-process results.

process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) → List[Any]

Run the crawler process with a Queue that holds URLs.

Parameters

method – HTTP method.
url – Queue of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-process results.

static _get_lock_feature(lock: bool = True, sema_value: int = 1) → Union[LockFactory, BoundedSemaphoreFactory]

Initialize a Lock or a Semaphore. This is needed because of the persistence process.

Parameters

lock – If True, a Lock is initialized; otherwise, a Semaphore is initialized.
sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

A LockFactory if the lock option is True; otherwise, a BoundedSemaphoreFactory.
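The choice between a lock and a bounded semaphore can be illustrated with the standard threading module. This is a sketch only: get_guard and persist are hypothetical names, and LockFactory and BoundedSemaphoreFactory are not used or reproduced here.

```python
import threading
from typing import List


def get_guard(lock: bool = True, sema_value: int = 1):
    # A Lock admits one holder; a BoundedSemaphore admits up to sema_value.
    if lock:
        return threading.Lock()
    return threading.BoundedSemaphore(sema_value)


saved: List[int] = []
guard = get_guard(lock=False, sema_value=2)


def persist(item: int) -> None:
    # At most two threads are inside this block at the same time.
    with guard:
        saved.append(item)


threads = [threading.Thread(target=persist, args=(n,)) for n in range(5)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

With lock=True the same code would let only one thread persist at a time.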

static _divide_urls(urls: List[str], executor_number: int) → List[List[str]]

Divide the list of URLs into a list of multiple lists.

Parameters

urls – A collection of URLs.
executor_number – The number of executors to activate.

Returns

A collection whose elements are themselves collections of URLs.
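One possible way to divide URLs among executors is a round-robin split. This page does not state which distribution _divide_urls actually uses, so the strategy below is an assumption, and divide_urls is a hypothetical helper rather than the library function.

```python
from typing import List


def divide_urls(urls: List[str], executor_number: int) -> List[List[str]]:
    # Assumed strategy: URL number i goes to group i % executor_number.
    groups: List[List[str]] = [[] for _ in range(executor_number)]
    for index, url in enumerate(urls):
        groups[index % executor_number].append(url)
    return groups


groups = divide_urls(["u1", "u2", "u3", "u4", "u5"], 2)
```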

Implementation Modules

Simple Crawler

class smoothcrawler.crawler.SimpleCrawler(factory: Optional[BaseFactory] = None)

run_and_save(method: str, url: Union[str, list]) → None

In addition to crawling and handling the data from the web, it persists the data.

Parameters

method – HTTP method.
url – One or more URLs (a collection of URLs).

Returns

None
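The flow described for run_and_save (crawl, handle, then persist, for one URL or several) might look like the sketch below. SketchSimpleCrawler is a hypothetical stand-in with fake steps; it does not import SimpleCrawler, and the way a single URL is normalized to a list is an assumption.

```python
from typing import Any, List, Union


class SketchSimpleCrawler:
    """Hypothetical stand-in, not smoothcrawler.crawler.SimpleCrawler."""

    def __init__(self) -> None:
        self.storage: List[Any] = []

    def crawl(self, method: str, url: str) -> str:
        # Fake crawl step: no network access, just echo the request.
        return f"{method} {url}"

    def data_process(self, parsed_response: str) -> str:
        # Fake handling step: normalize the parsed data.
        return parsed_response.lower()

    def persist(self, data: Any) -> None:
        # Fake persistence: keep the data in memory.
        self.storage.append(data)

    def run_and_save(self, method: str, url: Union[str, list]) -> None:
        # Accept one URL or a collection of URLs.
        urls = [url] if isinstance(url, str) else url
        for target in urls:
            self.persist(self.data_process(self.crawl(method, target)))


crawler = SketchSimpleCrawler()
crawler.run_and_save("GET", ["http://A.example", "http://B.example"])
```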

crawl(method: str, url: str, retry: int = 1, *args, **kwargs) → Any

Run the crawl process: send the HTTP request, receive the HTTP response, and parse its content. It does only that and nothing else.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.

data_process(parsed_response: Generic[T]) → Generic[T]

Process the data parsed from the HTTP response object. Override this function to implement your own data-processing logic.

Parameters: parsed_response – The data parsed from the HTTP response object.
Returns: The result of the data process. The data type is Generic[T].

parse_http_response(response: Generic[T]) → Generic[T]

Parse the HTTP response. Override this function to implement your own logic for parsing HTTP responses.

Parameters: response – The HTTP response.
Returns: The result parsed from the HTTP response. The data type is Generic[T].

persist(data: Any) → None

Persist the data. Override this function to implement your own logic for saving data.

Parameters: data – The target data to persist. In general, this is the data that has been parsed and handled.
Returns: None

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None

Register SmoothCrawler’s component(s) with the CrawlerFactory instance.

Parameters

http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.

Returns

None

send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]

Send the HTTP request. Override this function to implement your own logic for sending HTTP requests.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.

Asynchronous Simple Crawler

class smoothcrawler.crawler.AsyncSimpleCrawler(executors: int, factory: Optional[AsyncCrawlerFactory] = None)

async crawl(url: str, method: str, retry: int = 1, *args, **kwargs) → Any

Run the crawl process: send the HTTP request, receive the HTTP response, and parse its content. It does only that and nothing else.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.

async send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]

The asynchronous version of BaseCrawler.send_http_request.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.

async parse_http_response(response: Generic[T]) → Generic[T]

The asynchronous version of BaseCrawler.parse_http_response.

Parameters: response – The HTTP response.
Returns: The result parsed from the HTTP response. The data type is Generic[T].

async data_process(parsed_response: Generic[T]) → Generic[T]

The asynchronous version of BaseCrawler.data_process.

Parameters: parsed_response – The data parsed from the HTTP response object.
Returns: The result of the data process. The data type is Generic[T].

async persist(data: Any) → None

The asynchronous version of BaseCrawler.persist.

Parameters: data – The target data to persist. In general, this is the data that has been parsed and handled.
Returns: None

async process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) → Any

The asynchronous version of MultiRunnableCrawler.process_with_list.

Parameters

method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-process results.
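An asynchronous list-processing step of this kind can be pictured with plain asyncio. The sketch below is an assumption about the general shape, not the library's implementation: crawl, process_with_list, and run_all are hypothetical module-level coroutines, and asyncio.sleep(0) stands in for an awaitable HTTP call.

```python
import asyncio
from typing import List


async def crawl(method: str, url: str) -> str:
    # Stand-in for an awaitable HTTP call; yields control to the event loop.
    await asyncio.sleep(0)
    return f"{method} {url}"


async def process_with_list(method: str, urls: List[str]) -> List[str]:
    # One executor (coroutine) walks through its own share of the URLs.
    return [await crawl(method, url) for url in urls]


async def run_all(method: str, groups: List[List[str]]) -> List[List[str]]:
    # Run one coroutine per group concurrently; results keep the group order.
    return await asyncio.gather(
        *(process_with_list(method, group) for group in groups)
    )


outputs = asyncio.run(run_all("GET", [["u1", "u2"], ["u3"]]))
```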

async process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) → Any

The asynchronous version of MultiRunnableCrawler.process_with_queue.

Parameters

method – HTTP method.
url – Queue of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-process results.

map(method: str, url: List[str], retry: int = 1, lock: bool = True, sema_value: int = 1) → Optional

The asynchronous version of ExecutorCrawler.map.

Parameters

method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.
lock – If True, a Lock is initialized; otherwise, a Semaphore is initialized.
sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

The result of the data process from the parsed HTTP response object.

run(method: str, url: Union[List[str], Queue], retry: int = 1, lock: bool = True, sema_value: int = 1) → Optional

The asynchronous version of ExecutorCrawler.run.

Parameters

method – HTTP method.
url – A list or Queue of URLs.
retry – How many times to retry sending the HTTP request if it fails.
lock – If True, a Lock is initialized; otherwise, a Semaphore is initialized.
sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

The result of the data process from the parsed HTTP response object.

property persistence_factory: PersistenceFacade

Get the instance of persistence factory object.

Returns: A PersistenceFacade type object.

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None

Register SmoothCrawler’s component(s) with the CrawlerFactory instance.

Parameters

http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.

Returns

None

Executor Crawler

class smoothcrawler.crawler.ExecutorCrawler(mode: RunningMode, executors: int, factory: CrawlerFactory)

run(method: str, url: Union[List[str], Queue], retry: int = 1, lock: bool = True, sema_value: int = 1) → Optional

Run the crawl process directly with multiple executors. The behavior differs slightly depending on the url option. Consider the following scenarios:

Option url is a list type value:
- If the size of the value is bigger than the executor number: separate the collection of URLs and activate the number of executors.
- If the size of the value is smaller than the executor number: activate the executors as the function map does.
Option url is a Queue type value:
- Run the executors with the Queue object.

Parameters

method – HTTP method.
url – A list or Queue of URLs.
retry – How many times to retry sending the HTTP request if it fails.
lock – If True, a Lock is initialized; otherwise, a Semaphore is initialized.
sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

The result of the data process from the parsed HTTP response object.
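The scenarios listed for run amount to a small dispatch on the type and size of url. The sketch below only names the branch that would be taken: choose_strategy is a hypothetical helper, and the case where the list size equals the executor number is not covered by the description above, so grouping it with the map branch is an assumption.

```python
from queue import Queue
from typing import List, Union


def choose_strategy(url: Union[List[str], Queue], executors: int) -> str:
    if isinstance(url, Queue):
        # Queue input: the executors consume the Queue object.
        return "queue"
    if len(url) > executors:
        # More URLs than executors: divide the URLs among the executors.
        return "divide"
    # Fewer URLs than executors (and, by assumption, an equal number):
    # behave like map.
    return "map"


many = choose_strategy(["u1", "u2", "u3"], executors=2)
few = choose_strategy(["u1"], executors=2)
queued = choose_strategy(Queue(), executors=2)
```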

map(method: str, url: List[str], retry: int = 1, lock: bool = True, sema_value: int = 1) → Optional

The crawler version of the builtin function map. It activates as many executors as there are URLs in the collection.

Parameters

method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.
lock – If True, a Lock is initialized; otherwise, a Semaphore is initialized.
sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

The result of the data process from the parsed HTTP response object.

crawl(method: str, url: str, retry: int = 1, *args, **kwargs) → Any

Run the crawl process: send the HTTP request, receive the HTTP response, and parse its content. It does only that and nothing else.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.

data_process(parsed_response: Generic[T]) → Generic[T]

Process the data parsed from the HTTP response object. Override this function to implement your own data-processing logic.

Parameters: parsed_response – The data parsed from the HTTP response object.
Returns: The result of the data process. The data type is Generic[T].

parse_http_response(response: Generic[T]) → Generic[T]

Parse the HTTP response. Override this function to implement your own logic for parsing HTTP responses.

Parameters: response – The HTTP response.
Returns: The result parsed from the HTTP response. The data type is Generic[T].

persist(data: Any) → None

Persist the data. Override this function to implement your own logic for saving data.

Parameters: data – The target data to persist. In general, this is the data that has been parsed and handled.
Returns: None

property persistence_factory: PersistenceFacade

Get the instance of persistence factory object.

Returns: A PersistenceFacade type object.

process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) → List[Any]

Run the crawler process with a list of URLs.

Parameters

method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-process results.

process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) → List[Any]

Run the crawler process with a Queue that holds URLs.

Parameters

method – HTTP method.
url – Queue of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-process results.

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None

Register SmoothCrawler’s component(s) with the CrawlerFactory instance.

Parameters

http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.

Returns

None

send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]

Send the HTTP request. Override this function to implement your own logic for sending HTTP requests.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.

Pool Crawler

class smoothcrawler.crawler.PoolCrawler(mode: RunningMode, pool_size: int, factory: CrawlerFactory)

init(lock: bool = True, sema_value: int = 1) → None

Initialize whatever is needed before the Pool object is instantiated.

Parameters

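lock – If True, a Lock is initialized; otherwise, a Semaphore is initialized.
sema_value – The value of the Semaphore. This argument only takes effect when lock is False.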

Returns

apply(method: str, urls: List[str], retry: int = 1) → Optional

Run the crawl process with multiple executors of the Pool.

Parameters

method – HTTP method.
urls – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

async_apply(method: str, urls: List[str], retry: int = 1, callbacks: Optional[Union[Callable, List[Callable]]] = None, error_callbacks: Optional[Union[Callable, List[Callable]]] = None) → Optional

Asynchronous version of PoolCrawler.apply.

Parameters

method – HTTP method.
urls – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.
callbacks – A Callable object (or a list of them) that runs after the task is done.
error_callbacks – A Callable object (or a list of them) that runs if any exception is raised while running.

Returns

map(method: str, urls: List[str], retry: int = 1) → Optional

The Pool version of ExecutorCrawler.map.

Parameters

method – HTTP method.
urls – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

async_map(method: str, urls: List[str], retry: int = 1, callbacks: Optional[Union[Callable, List[Callable]]] = None, error_callbacks: Optional[Union[Callable, List[Callable]]] = None) → Optional

Asynchronous version of PoolCrawler.map.

Parameters

method – HTTP method.
urls – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.
callbacks – A Callable object (or a list of them) that runs after the task is done.
error_callbacks – A Callable object (or a list of them) that runs if any exception is raised while running.

Returns

crawl(method: str, url: str, retry: int = 1, *args, **kwargs) → Any

Run the crawl process: send the HTTP request, receive the HTTP response, and parse its content. It does only that and nothing else.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.

data_process(parsed_response: Generic[T]) → Generic[T]

Process the data parsed from the HTTP response object. Override this function to implement your own data-processing logic.

Parameters: parsed_response – The data parsed from the HTTP response object.
Returns: The result of the data process. The data type is Generic[T].

parse_http_response(response: Generic[T]) → Generic[T]

Parse the HTTP response. Override this function to implement your own logic for parsing HTTP responses.

Parameters: response – The HTTP response.
Returns: The result parsed from the HTTP response. The data type is Generic[T].

persist(data: Any) → None

Persist the data. Override this function to implement your own logic for saving data.

Parameters: data – The target data to persist. In general, this is the data that has been parsed and handled.
Returns: None

property persistence_factory: PersistenceFacade

Get the instance of persistence factory object.

Returns: A PersistenceFacade type object.

process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) → List[Any]

Run the crawler process with a list of URLs.

Parameters

method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-process results.

process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) → List[Any]

Run the crawler process with a Queue that holds URLs.

Parameters

method – HTTP method.
url – Queue of URLs.
retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-process results.

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None

Register SmoothCrawler’s component(s) with the CrawlerFactory instance.

Parameters

http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.

Returns

None

send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]

Send the HTTP request. Override this function to implement your own logic for sending HTTP requests.

Parameters

method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.

terminal() → None

Terminate the running Pool.

Returns: None

close() → None

Close the resources of the Pool.

Returns: None
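The callbacks and error_callbacks options of async_apply and async_map follow a pattern that can be illustrated with the standard concurrent.futures module. This is a sketch under stated assumptions, not PoolCrawler's implementation: fetch and this module-level async_apply are hypothetical, a ThreadPoolExecutor stands in for the Pool, and leaving the with-block plays the role of closing the pool.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List


def fetch(url: str) -> str:
    # Hypothetical task: fail for one URL so that both callback paths run.
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"fetched {url}"


def async_apply(
    pool: ThreadPoolExecutor,
    urls: List[str],
    callback: Callable[[str], None],
    error_callback: Callable[[BaseException], None],
) -> List[Future]:
    def on_done(future: Future) -> None:
        # Route the outcome to exactly one of the two callbacks.
        error = future.exception()
        if error is not None:
            error_callback(error)
        else:
            callback(future.result())

    futures = []
    for url in urls:
        future = pool.submit(fetch, url)
        future.add_done_callback(on_done)
        futures.append(future)
    return futures


results: List[str] = []
errors: List[BaseException] = []

# Leaving the with-block waits for all submitted tasks to finish.
with ThreadPoolExecutor(max_workers=2) as pool:
    async_apply(
        pool,
        ["http://a.example", "http://bad.example"],
        results.append,
        errors.append,
    )
```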