Crawlers

module smoothcrawler.crawler

This module provides several crawler roles for different scenarios. Each role is the ‘final product’: it combines the required components into a web spider and uses their features.

Components implement what is done at each step of the process; a crawler role implements how those components work together.

Framework Modules

Base Crawler

class smoothcrawler.crawler.BaseCrawler(factory: Optional[BaseFactory] = None)
_initial_factory() -> BaseFactory

Initializes a BaseFactory object. This method is called if the factory argument of __init__ is None.

Returns

A CrawlerFactory instance.

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) -> None

Registers SmoothCrawler component(s) with the CrawlerFactory instance.

Parameters
  • http_req_sender – The Sender component, which sends HTTP requests.

  • http_resp_parser – The Parser component, which handles HTTP responses.

  • data_process – The Handler component, which processes the data generated from the HTTP response.

  • persistence – The Persistence component, which is responsible for saving data.

Returns

None

crawl(method: str, url: str, retry: int = 1, *args, **kwargs) -> Any

Runs the crawl process: sends the HTTP request, receives the HTTP response and parses its content. It does only that and nothing else.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.
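
The send-then-parse sequence described for crawl can be pictured with a small, library-independent sketch. SketchCrawler, its canned response and its parsing rule are hypothetical stand-ins and not SmoothCrawler's implementation; only the method names mirror the ones documented here.

```python
from typing import Any, Dict


class SketchCrawler:
    """Hypothetical stand-in showing how crawl() chains its two steps."""

    def send_http_request(self, method: str, url: str, retry: int = 1) -> Dict[str, Any]:
        # A real crawler would perform network I/O here; this returns canned data.
        return {"method": method, "url": url, "status": 200, "body": "<h1>Hello</h1>"}

    def parse_http_response(self, response: Dict[str, Any]) -> str:
        # Strip the surrounding tags from the canned body.
        return response["body"].replace("<h1>", "").replace("</h1>", "")

    def crawl(self, method: str, url: str, retry: int = 1) -> Any:
        # Send, then parse. No data processing or persistence happens here.
        response = self.send_http_request(method, url, retry)
        return self.parse_http_response(response)


parsed = SketchCrawler().crawl("GET", "http://www.example.com")
```

In this sketch, data processing and persistence are deliberately left out of crawl, matching the description above.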

send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) -> Generic[T]

Sends the HTTP request. Override this method to implement your own logic for sending HTTP requests.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.
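
One way to picture the retry parameter is a plain retry loop. The sketch below is an assumption-laden illustration: send_with_retry and flaky_send are hypothetical helpers, retry is treated here as the maximum number of attempts, and the library may count attempts or select exceptions differently.

```python
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")


def send_with_retry(send: Callable[[], T], retry: int = 1) -> T:
    """Call send() up to `retry` times (assumed meaning) until it succeeds."""
    attempts = max(1, retry)
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return send()
        except ConnectionError as error:
            # Remember the failure and try again if attempts remain.
            last_error = error
    assert last_error is not None
    raise last_error


calls: Dict[str, int] = {"count": 0}


def flaky_send() -> str:
    # Hypothetical sender that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = send_with_retry(flaky_send, retry=3)
```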

parse_http_response(response: Generic[T]) -> Generic[T]

Parses the HTTP response. Override this method to implement your own logic for parsing HTTP responses.

Parameters

response – The HTTP response.

Returns

The result parsed from the HTTP response. The data type is Generic[T].

data_process(parsed_response: Generic[T]) -> Generic[T]

Processes the data parsed from the HTTP response object. Override this method to implement your own data-processing logic.

Parameters

parsed_response – The data parsed from the HTTP response object.

Returns

The result of the data processing. The data type is Generic[T].

persist(data: Any) -> None

Persists the data. Override this method to implement your own logic for saving data.

Parameters

data – The target data to persist. Generally, this is the data that has been parsed and processed.

Returns

None

MultiRunnable Crawler

class smoothcrawler.crawler.MultiRunnableCrawler(factory: Optional[BaseFactory] = None)
property persistence_factory: PersistenceFacade

Get the instance of persistence factory object.

Returns

A PersistenceFacade type object.

process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) -> List[Any]

Handles the crawl process for a list of URLs.

Parameters
  • method – HTTP method.

  • url – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-processing results.

process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) -> List[Any]

Handles the crawl process for a Queue that holds the URLs.

Parameters
  • method – HTTP method.

  • url – A Queue of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-processing results.
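
The difference between the list and Queue variants can be sketched with standard-library types only. The helper names and the crawl_one callable below are hypothetical; this illustrates the two input shapes and is not the library's code.

```python
from queue import Empty, Queue
from typing import Any, Callable, List


def process_list_sketch(crawl_one: Callable[[str], Any], urls: List[str]) -> List[Any]:
    """Crawl every URL in the list and collect the results in order."""
    return [crawl_one(url) for url in urls]


def process_queue_sketch(crawl_one: Callable[[str], Any], urls: "Queue[str]") -> List[Any]:
    """Drain the queue, crawling each URL until the queue is empty."""
    results: List[Any] = []
    while True:
        try:
            # get_nowait() avoids blocking if another consumer emptied the queue.
            url = urls.get_nowait()
        except Empty:
            break
        results.append(crawl_one(url))
    return results


url_queue: "Queue[str]" = Queue()
for item in ["http://a.example", "http://b.example"]:
    url_queue.put(item)

from_list = process_list_sketch(str.upper, ["http://a.example", "http://b.example"])
from_queue = process_queue_sketch(str.upper, url_queue)
```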

static _get_lock_feature(lock: bool = True, sema_value: int = 1) -> Union[LockFactory, BoundedSemaphoreFactory]

Initializes a Lock or a Semaphore, which is needed to guard the persistence process.

Parameters
  • lock – If True, a Lock is initialized; otherwise a Semaphore is initialized.

  • sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

A LockFactory if lock is True; otherwise a BoundedSemaphoreFactory.
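
The choice between a Lock and a bounded Semaphore can be illustrated with the standard threading module. This is a sketch under assumptions: get_lock_feature_sketch and save_all are hypothetical helpers that return and use plain threading primitives, whereas the documented method returns LockFactory or BoundedSemaphoreFactory objects.

```python
import threading
from typing import List


def get_lock_feature_sketch(lock: bool = True, sema_value: int = 1):
    """Return a Lock when `lock` is True, otherwise a BoundedSemaphore."""
    if lock:
        return threading.Lock()
    return threading.BoundedSemaphore(value=sema_value)


def save_all(records: List[str], guard) -> List[str]:
    """Persist each record from its own thread, with writes limited by `guard`."""
    storage: List[str] = []

    def save(record: str) -> None:
        with guard:  # Only holders of the guard may write.
            storage.append(record)

    threads = [threading.Thread(target=save, args=(record,)) for record in records]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return storage


with_lock = save_all(["a", "b", "c"], get_lock_feature_sketch(lock=True))
with_semaphore = save_all(["a", "b", "c"], get_lock_feature_sketch(lock=False, sema_value=2))
```

A Lock admits one writer at a time, while a Semaphore with value N admits up to N concurrent writers.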

static _divide_urls(urls: List[str], executor_number: int) -> List[List[str]]

Divides the list of URLs into a list of multiple sublists.

Parameters
  • urls – A collection of URLs.

  • executor_number – The number of executors to run.

Returns

A collection whose elements are themselves collections of URLs.
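
As an illustration of splitting URLs across executors, the sketch below deals them out round-robin. divide_urls_sketch is a hypothetical helper, and the round-robin order is an assumption; the library may group the URLs differently (for example in contiguous chunks).

```python
from typing import List


def divide_urls_sketch(urls: List[str], executor_number: int) -> List[List[str]]:
    """Deal the URLs round-robin into `executor_number` sublists."""
    if executor_number < 1:
        raise ValueError("executor_number must be at least 1")
    # Sublist i takes items i, i + n, i + 2n, ... where n is executor_number.
    return [urls[index::executor_number] for index in range(executor_number)]


groups = divide_urls_sketch(["u1", "u2", "u3", "u4", "u5"], 2)
```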

Implementation Modules

Simple Crawler

class smoothcrawler.crawler.SimpleCrawler(factory: Optional[BaseFactory] = None)
run_and_save(method: str, url: Union[str, list]) -> None

In addition to crawling and processing the data from the web, it persists the data.

Parameters
  • method – HTTP method.

  • url – One or more URLs (a collection of URLs).

Returns

None
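
The crawl, process and persist flow of run_and_save, including its acceptance of either one URL or several, can be sketched as follows. SketchSimpleCrawler and its in-memory storage are hypothetical stand-ins and not the library's implementation.

```python
from typing import Any, List, Union


class SketchSimpleCrawler:
    """Hypothetical stand-in; each hook could be overridden in a subclass."""

    def __init__(self) -> None:
        self.saved: List[Any] = []

    def crawl(self, method: str, url: str) -> str:
        # Stand-in for send + parse: pretend the parsed content is the URL.
        return url

    def data_process(self, parsed_response: str) -> str:
        return parsed_response.upper()

    def persist(self, data: Any) -> None:
        # Stand-in for saving to a file or database.
        self.saved.append(data)

    def run_and_save(self, method: str, url: Union[str, list]) -> None:
        # Normalise a single URL into a one-element list.
        urls = [url] if isinstance(url, str) else list(url)
        for one_url in urls:
            parsed = self.crawl(method, one_url)
            self.persist(self.data_process(parsed))


crawler = SketchSimpleCrawler()
crawler.run_and_save("GET", "http://a.example")
crawler.run_and_save("GET", ["http://b.example", "http://c.example"])
```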

crawl(method: str, url: str, retry: int = 1, *args, **kwargs) -> Any

Runs the crawl process: sends the HTTP request, receives the HTTP response and parses its content. It does only that and nothing else.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.

data_process(parsed_response: Generic[T]) -> Generic[T]

Processes the data parsed from the HTTP response object. Override this method to implement your own data-processing logic.

Parameters

parsed_response – The data parsed from the HTTP response object.

Returns

The result of the data processing. The data type is Generic[T].

parse_http_response(response: Generic[T]) -> Generic[T]

Parses the HTTP response. Override this method to implement your own logic for parsing HTTP responses.

Parameters

response – The HTTP response.

Returns

The result parsed from the HTTP response. The data type is Generic[T].

persist(data: Any) -> None

Persists the data. Override this method to implement your own logic for saving data.

Parameters

data – The target data to persist. Generally, this is the data that has been parsed and processed.

Returns

None

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) -> None

Registers SmoothCrawler component(s) with the CrawlerFactory instance.

Parameters
  • http_req_sender – The Sender component, which sends HTTP requests.

  • http_resp_parser – The Parser component, which handles HTTP responses.

  • data_process – The Handler component, which processes the data generated from the HTTP response.

  • persistence – The Persistence component, which is responsible for saving data.

Returns

None

send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) -> Generic[T]

Sends the HTTP request. Override this method to implement your own logic for sending HTTP requests.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.

Asynchronous Simple Crawler

class smoothcrawler.crawler.AsyncSimpleCrawler(executors: int, factory: Optional[AsyncCrawlerFactory] = None)
async crawl(url: str, method: str, retry: int = 1, *args, **kwargs) -> Any

Runs the crawl process: sends the HTTP request, receives the HTTP response and parses its content. It does only that and nothing else.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.

async send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) -> Generic[T]

The asynchronous version of BaseCrawler.send_http_request.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.

async parse_http_response(response: Generic[T]) -> Generic[T]

The asynchronous version of BaseCrawler.parse_http_response.

Parameters

response – The HTTP response.

Returns

The result parsed from the HTTP response. The data type is Generic[T].

async data_process(parsed_response: Generic[T]) -> Generic[T]

The asynchronous version of BaseCrawler.data_process.

Parameters

parsed_response – The data parsed from the HTTP response object.

Returns

The result of the data processing. The data type is Generic[T].

async persist(data: Any) -> None

The asynchronous version of BaseCrawler.persist.

Parameters

data – The target data to persist. Generally, this is the data that has been parsed and processed.

Returns

None

async process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) -> Any

The asynchronous version of MultiRunnableCrawler.process_with_list.

Parameters
  • method – HTTP method.

  • url – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-processing results.
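
An asynchronous list-processing flow can be sketched with asyncio alone. The coroutines below are hypothetical stand-ins, and using a semaphore to cap concurrency at the executors value is an assumption about how such a limit might be applied, not a description of the library's internals.

```python
import asyncio
from typing import List


async def crawl_sketch(url: str) -> str:
    # Stand-in for awaiting real network I/O.
    await asyncio.sleep(0)
    return f"parsed:{url}"


async def process_list_async_sketch(urls: List[str], executors: int) -> List[str]:
    """Crawl all URLs concurrently, with at most `executors` in flight."""
    limit = asyncio.Semaphore(executors)

    async def bounded(url: str) -> str:
        async with limit:
            return await crawl_sketch(url)

    # gather() returns the results in the same order as the inputs.
    return await asyncio.gather(*(bounded(url) for url in urls))


async_results = asyncio.run(process_list_async_sketch(["u1", "u2", "u3"], executors=2))
```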

async process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) -> Any

The asynchronous version of MultiRunnableCrawler.process_with_queue.

Parameters
  • method – HTTP method.

  • url – A Queue of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-processing results.

map(method: str, url: List[str], retry: int = 1, lock: bool = True, sema_value: int = 1) -> Optional

The asynchronous version of ExecutorCrawler.map.

Parameters
  • method – HTTP method.

  • url – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

  • lock – If True, a Lock is initialized; otherwise a Semaphore is initialized.

  • sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

The data-processing result for the parsed HTTP response object.

run(method: str, url: Union[List[str], Queue], retry: int = 1, lock: bool = True, sema_value: int = 1) -> Optional

The asynchronous version of ExecutorCrawler.run.

Parameters
  • method – HTTP method.

  • url – A list or Queue of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

  • lock – If True, a Lock is initialized; otherwise a Semaphore is initialized.

  • sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

The data-processing result for the parsed HTTP response object.

property persistence_factory: PersistenceFacade

Get the instance of persistence factory object.

Returns

A PersistenceFacade type object.

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) -> None

Registers SmoothCrawler component(s) with the CrawlerFactory instance.

Parameters
  • http_req_sender – The Sender component, which sends HTTP requests.

  • http_resp_parser – The Parser component, which handles HTTP responses.

  • data_process – The Handler component, which processes the data generated from the HTTP response.

  • persistence – The Persistence component, which is responsible for saving data.

Returns

None

Executor Crawler

class smoothcrawler.crawler.ExecutorCrawler(mode: RunningMode, executors: int, factory: CrawlerFactory)
run(method: str, url: Union[List[str], Queue], retry: int = 1, lock: bool = True, sema_value: int = 1) -> Optional

Runs the crawl process directly with multiple executors. The behavior differs slightly depending on the url argument. Consider the following scenarios:

  • url is a list:

    • If the list is larger than the number of executors, the collection of URLs is divided and that number of executors is activated.

    • If the list is smaller than the number of executors, the executors are activated in the same way as the map method.

  • url is a Queue:

    • The executors run with the Queue object.

Parameters
  • method – HTTP method.

  • url – A list or Queue of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

  • lock – If True, a Lock is initialized; otherwise a Semaphore is initialized.

  • sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

The data-processing result for the parsed HTTP response object.
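
The scenarios listed for run amount to a small dispatch decision, which might be sketched as below. choose_strategy_sketch and its string labels are hypothetical, and the handling of a list whose size equals the executor number is an assumption because the text above does not specify it.

```python
from queue import Queue
from typing import List, Union


def choose_strategy_sketch(url: Union[List[str], Queue], executors: int) -> str:
    """Name the strategy run() would pick for the given input (illustrative)."""
    if isinstance(url, Queue):
        # Executors pull URLs from the shared queue.
        return "queue"
    if len(url) > executors:
        # More URLs than executors: divide the URLs among the executors.
        return "divide"
    # Fewer URLs than executors (equal size is assumed to behave the same).
    return "map"


big_list = choose_strategy_sketch(["u1", "u2", "u3"], executors=2)
small_list = choose_strategy_sketch(["u1"], executors=2)
queued = choose_strategy_sketch(Queue(), executors=2)
```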

map(method: str, url: List[str], retry: int = 1, lock: bool = True, sema_value: int = 1) -> Optional

The crawler version of the built-in function map. It activates as many executors as there are URLs in the collection.

Parameters
  • method – HTTP method.

  • url – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

  • lock – If True, a Lock is initialized; otherwise a Semaphore is initialized.

  • sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

The data-processing result for the parsed HTTP response object.
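
The one-executor-per-URL idea behind map can be illustrated with plain threads. map_urls_sketch is a hypothetical helper; the actual executor type depends on the RunningMode chosen, which this sketch does not model.

```python
import threading
from typing import Any, Callable, Dict, List


def map_urls_sketch(crawl_one: Callable[[str], Any], urls: List[str]) -> List[Any]:
    """Start one thread per URL and return the results in input order."""
    results: Dict[int, Any] = {}

    def work(index: int, url: str) -> None:
        # Each thread writes to its own key, so no lock is needed here.
        results[index] = crawl_one(url)

    threads = [
        threading.Thread(target=work, args=(index, url))
        for index, url in enumerate(urls)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return [results[index] for index in range(len(urls))]


mapped = map_urls_sketch(str.upper, ["http://a.example", "http://b.example"])
```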

crawl(method: str, url: str, retry: int = 1, *args, **kwargs) -> Any

Runs the crawl process: sends the HTTP request, receives the HTTP response and parses its content. It does only that and nothing else.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.

data_process(parsed_response: Generic[T]) -> Generic[T]

Processes the data parsed from the HTTP response object. Override this method to implement your own data-processing logic.

Parameters

parsed_response – The data parsed from the HTTP response object.

Returns

The result of the data processing. The data type is Generic[T].

parse_http_response(response: Generic[T]) -> Generic[T]

Parses the HTTP response. Override this method to implement your own logic for parsing HTTP responses.

Parameters

response – The HTTP response.

Returns

The result parsed from the HTTP response. The data type is Generic[T].

persist(data: Any) -> None

Persists the data. Override this method to implement your own logic for saving data.

Parameters

data – The target data to persist. Generally, this is the data that has been parsed and processed.

Returns

None

property persistence_factory: PersistenceFacade

Get the instance of persistence factory object.

Returns

A PersistenceFacade type object.

process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) -> List[Any]

Handles the crawl process for a list of URLs.

Parameters
  • method – HTTP method.

  • url – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-processing results.

process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) -> List[Any]

Handles the crawl process for a Queue that holds the URLs.

Parameters
  • method – HTTP method.

  • url – A Queue of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-processing results.

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) -> None

Registers SmoothCrawler component(s) with the CrawlerFactory instance.

Parameters
  • http_req_sender – The Sender component, which sends HTTP requests.

  • http_resp_parser – The Parser component, which handles HTTP responses.

  • data_process – The Handler component, which processes the data generated from the HTTP response.

  • persistence – The Persistence component, which is responsible for saving data.

Returns

None

send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) -> Generic[T]

Sends the HTTP request. Override this method to implement your own logic for sending HTTP requests.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.

Pool Crawler

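class smoothcrawler.crawler.PoolCrawler(mode: RunningMode, pool_size: int, factory: CrawlerFactory)
init(lock: bool = True, sema_value: int = 1) -> None

Initializes whatever is needed before the Pool object is instantiated.

Parameters
  • lock – If True, a Lock is initialized; otherwise a Semaphore is initialized.

  • sema_value – The value of the Semaphore. This argument only takes effect when lock is False.

Returns

None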

apply(method: str, urls: List[str], retry: int = 1) -> Optional

Runs the crawl process with multiple executors from the Pool.

Parameters
  • method – HTTP method.

  • urls – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

async_apply(method: str, urls: List[str], retry: int = 1, callbacks: Optional[Union[Callable, List[Callable]]] = None, error_callbacks: Optional[Union[Callable, List[Callable]]] = None) -> Optional

The asynchronous version of PoolCrawler.apply.

Parameters
  • method – HTTP method.

  • urls – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

  • callbacks – A Callable, or a list of Callables, that runs after the task is done.

  • error_callbacks – A Callable, or a list of Callables, that runs if any exception is raised while the task is running.

Returns

map(method: str, urls: List[str], retry: int = 1) -> Optional

The Pool version of ExecutorCrawler.map.

Parameters
  • method – HTTP method.

  • urls – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

async_map(method: str, urls: List[str], retry: int = 1, callbacks: Optional[Union[Callable, List[Callable]]] = None, error_callbacks: Optional[Union[Callable, List[Callable]]] = None) -> Optional

The asynchronous version of PoolCrawler.map.

Parameters
  • method – HTTP method.

  • urls – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

  • callbacks – A Callable, or a list of Callables, that runs after the task is done.

  • error_callbacks – A Callable, or a list of Callables, that runs if any exception is raised while the task is running.

Returns

crawl(method: str, url: str, retry: int = 1, *args, **kwargs) -> Any

Runs the crawl process: sends the HTTP request, receives the HTTP response and parses its content. It does only that and nothing else.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

The result parsed from the HTTP response. The data type is Any.

data_process(parsed_response: Generic[T]) -> Generic[T]

Processes the data parsed from the HTTP response object. Override this method to implement your own data-processing logic.

Parameters

parsed_response – The data parsed from the HTTP response object.

Returns

The result of the data processing. The data type is Generic[T].

parse_http_response(response: Generic[T]) -> Generic[T]

Parses the HTTP response. Override this method to implement your own logic for parsing HTTP responses.

Parameters

response – The HTTP response.

Returns

The result parsed from the HTTP response. The data type is Generic[T].

persist(data: Any) -> None

Persists the data. Override this method to implement your own logic for saving data.

Parameters

data – The target data to persist. Generally, this is the data that has been parsed and processed.

Returns

None

property persistence_factory: PersistenceFacade

Get the instance of persistence factory object.

Returns

A PersistenceFacade type object.

process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) -> List[Any]

Handles the crawl process for a list of URLs.

Parameters
  • method – HTTP method.

  • url – A collection of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-processing results.

process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) -> List[Any]

Handles the crawl process for a Queue that holds the URLs.

Parameters
  • method – HTTP method.

  • url – A Queue of URLs.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

A list of data-processing results.

register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) -> None

Registers SmoothCrawler component(s) with the CrawlerFactory instance.

Parameters
  • http_req_sender – The Sender component, which sends HTTP requests.

  • http_resp_parser – The Parser component, which handles HTTP responses.

  • data_process – The Handler component, which processes the data generated from the HTTP response.

  • persistence – The Persistence component, which is responsible for saving data.

Returns

None

send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) -> Generic[T]

Sends the HTTP request. Override this method to implement your own logic for sending HTTP requests.

Parameters
  • method – HTTP method.

  • url – URL.

  • retry – How many times to retry sending the HTTP request if it fails.

Returns

An HTTP response object.

terminal() -> None

Terminates the running Pool.

Returns

None

close() -> None

Closes the resources of the Pool.

Returns

None
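
The callback parameters of async_apply and the closing step of a pool can be pictured with the standard library's ThreadPool. This is only an analogy under assumptions: crawl_sketch and the two result lists are hypothetical, and PoolCrawler's own methods and argument names differ from the ThreadPool API used here.

```python
from multiprocessing.pool import ThreadPool
from typing import List


def crawl_sketch(url: str) -> str:
    # Hypothetical crawl step that fails for URLs containing "bad".
    if "bad" in url:
        raise ValueError(f"cannot crawl {url}")
    return f"parsed:{url}"


successes: List[str] = []
failures: List[str] = []

pool = ThreadPool(processes=2)
for target in ["http://ok.example", "http://bad.example"]:
    pool.apply_async(
        crawl_sketch,
        args=(target,),
        callback=successes.append,  # Runs with the result on success.
        error_callback=lambda error: failures.append(str(error)),  # Runs on failure.
    )
pool.close()  # No more tasks will be submitted.
pool.join()   # Wait for the submitted tasks and their callbacks to finish.
```

Closing lets the submitted work finish, whereas terminating a pool stops it without waiting for outstanding tasks.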