Crawlers
module smoothcrawler.crawler
This module provides several Crawler roles for different scenarios. They are also the ‘final product’: each one combines the components it needs into a web spider and uses their features.
In short, components implement what happens at each step of the process, while a crawler role implements how the process runs with its components.
Framework Modules
Base Crawler
- class smoothcrawler.crawler.BaseCrawler(factory: Optional[BaseFactory] = None)
- _initial_factory() → BaseFactory
Initialize the BaseFactory object. This function is called if the value of option factory of __init__ is None.
- Returns
A CrawlerFactory instance.
- register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None
Register SmoothCrawler’s component(s) to the CrawlerFactory instance.
- Parameters
http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.
- Returns
None
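As a rough illustration of registering components selectively, the sketch below uses a stand-in container rather than SmoothCrawler's real CrawlerFactory. The class `SketchFactory`, its attribute names, and the rule that an argument left as None keeps the existing component are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchFactory:
    """Hypothetical container for the four components (attribute names are assumptions)."""
    http_factory: Optional[Any] = None
    parser_factory: Optional[Any] = None
    data_handling_factory: Optional[Any] = None
    persistence_factory: Optional[Any] = None


def register_factory(
    factory: SketchFactory,
    http_req_sender: Optional[Any] = None,
    http_resp_parser: Optional[Any] = None,
    data_process: Optional[Any] = None,
    persistence: Optional[Any] = None,
) -> None:
    # Assumption: an argument left as None keeps whatever is already registered.
    if http_req_sender is not None:
        factory.http_factory = http_req_sender
    if http_resp_parser is not None:
        factory.parser_factory = http_resp_parser
    if data_process is not None:
        factory.data_handling_factory = data_process
    if persistence is not None:
        factory.persistence_factory = persistence


# Plain strings stand in for real component objects.
factory = SketchFactory(http_factory="default-sender")
register_factory(factory, http_resp_parser="my-parser", persistence="my-persistence")
```

Only the parser and persistence slots change here; the sender that was already present stays in place.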
- crawl(method: str, url: str, retry: int = 1, *args, **kwargs) → Any
Crawl web data. It sends the HTTP request, receives the HTTP response and parses the content. It does ONLY that and nothing else.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
The result parsed from the HTTP response. The data type is Any.
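The send-then-parse flow can be sketched without the library. Everything below is a stand-in: `SketchCrawler`, `flaky_sender` and the reading of retry as "extra attempts after the first failure" are assumptions for illustration, not SmoothCrawler's actual behavior.

```python
from typing import Any, Callable, List


class SketchCrawler:
    """Hypothetical stand-in that mimics the send -> parse flow of crawl()."""

    def __init__(self, sender: Callable[[str, str], Any], parser: Callable[[Any], Any]) -> None:
        self._sender = sender  # plays the role of the Sender component
        self._parser = parser  # plays the role of the Parser component

    def send_http_request(self, method: str, url: str, retry: int = 1) -> Any:
        # Assumption: 'retry' counts the extra attempts made after a failure.
        for attempt in range(retry + 1):
            try:
                return self._sender(method, url)
            except ConnectionError:
                if attempt == retry:
                    raise  # out of attempts, surface the error

    def parse_http_response(self, response: Any) -> Any:
        return self._parser(response)

    def crawl(self, method: str, url: str, retry: int = 1) -> Any:
        # Only send and parse; no data processing or persistence happens here.
        response = self.send_http_request(method, url, retry)
        return self.parse_http_response(response)


attempts: List[str] = []


def flaky_sender(method: str, url: str) -> str:
    # Fails on the first call, then returns a fake response body.
    attempts.append(url)
    if len(attempts) == 1:
        raise ConnectionError("simulated failure")
    return f"{method} {url} -> 200"


crawler = SketchCrawler(sender=flaky_sender, parser=str.upper)
result = crawler.crawl("get", "http://example.com", retry=1)
```

The first attempt fails, the retry succeeds, and the parser (here simply `str.upper`) is applied to the fake response.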
- send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]
Send the HTTP request. You can override this function to implement your own logic for sending the HTTP request.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
An HTTP response object.
- parse_http_response(response: Generic[T]) → Generic[T]
Parse the HTTP response. You can override this function to implement your own logic for parsing the HTTP response.
- Parameters
response – The HTTP response.
- Returns
The result parsed from the HTTP response. The data type is Generic[T].
- data_process(parsed_response: Generic[T]) → Generic[T]
Process the data that has been parsed from the HTTP response object. You can override this function to implement your own data-processing logic.
- Parameters
parsed_response – The data that has been parsed from the HTTP response object.
- Returns
The result of the data process. The data type is Generic[T].
MultiRunnable Crawler
- class smoothcrawler.crawler.MultiRunnableCrawler(factory: Optional[BaseFactory] = None)
- property persistence_factory: PersistenceFacade
Get the instance of the persistence factory object.
- Returns
A PersistenceFacade type object.
- process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) → List[Any]
Handle the crawler process with a list of URLs.
- Parameters
method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
A list of data-process results.
- process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) → List[Any]
Handle the crawler process with a Queue that holds URLs.
- Parameters
method – HTTP method.
url – Queue of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
A list of data-process results.
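A queue-driven loop of this kind might look like the sketch below. It assumes the queue is drained until empty and that each URL is passed to a single handler; the free function and its `handle` parameter are hypothetical and only stand in for the crawl and data-process steps.

```python
from queue import Empty, Queue
from typing import Any, Callable, List


def process_with_queue(url: Queue, handle: Callable[[str], Any]) -> List[Any]:
    """Drain the queue and collect one result per URL (illustrative only)."""
    results: List[Any] = []
    while True:
        try:
            target = url.get_nowait()
        except Empty:
            break  # nothing left to crawl
        results.append(handle(target))
    return results


urls: Queue = Queue()
for path in ("a", "b", "c"):
    urls.put(f"http://example.com/{path}")

# len() stands in for "crawl and process this URL".
results = process_with_queue(urls, handle=len)
```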
- static _get_lock_feature(lock: bool = True, sema_value: int = 1) → Union[LockFactory, BoundedSemaphoreFactory]
Initialize a Lock or a Semaphore. It is needed for the persistence process.
- Parameters
lock – Initializes a Lock if True; otherwise initializes a Semaphore.
sema_value – The value of the Semaphore. This argument only takes effect when option lock is False.
- Returns
A LockFactory if option lock is True; otherwise a BoundedSemaphoreFactory.
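The choice between the two primitives can be illustrated with the standard library. This sketch swaps in `threading.Lock` and `threading.BoundedSemaphore` for LockFactory and BoundedSemaphoreFactory, which is an assumption made to keep the example self-contained; the `persist` helper is hypothetical as well.

```python
import threading
from typing import ContextManager, List


def get_lock_feature(lock: bool = True, sema_value: int = 1) -> ContextManager:
    # Standard-library primitives are used here in place of the library's factories.
    if lock:
        return threading.Lock()
    return threading.BoundedSemaphore(value=sema_value)


saved: List[int] = []
guard = get_lock_feature(lock=False, sema_value=2)


def persist(item: int) -> None:
    # At most 'sema_value' threads may be inside this block at once.
    with guard:
        saved.append(item)


workers = [threading.Thread(target=persist, args=(i,)) for i in range(5)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

A Lock admits one writer at a time, while a semaphore with a value above 1 lets several persistence calls overlap.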
- static _divide_urls(urls: List[str], executor_number: int) → List[List[str]]
Divide the list of URLs into a list of multiple URL lists.
- Parameters
urls – A collection of URLs.
executor_number – How many executors you activate to run.
- Returns
A collection whose elements are themselves collections of URLs.
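One possible way to divide the URLs is a round-robin split, sketched below. The source does not say which strategy the library uses, so the round-robin rule is an assumption chosen for illustration.

```python
from typing import List


def divide_urls(urls: List[str], executor_number: int) -> List[List[str]]:
    # Round-robin split: the URL at position i goes to group i % executor_number.
    groups: List[List[str]] = [[] for _ in range(executor_number)]
    for index, url in enumerate(urls):
        groups[index % executor_number].append(url)
    return groups


urls = [f"http://example.com/{n}" for n in range(5)]
groups = divide_urls(urls, executor_number=2)
```

With five URLs and two executors, one group receives three URLs and the other two.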
Implementation Modules
Simple Crawler
- class smoothcrawler.crawler.SimpleCrawler(factory: Optional[BaseFactory] = None)
- run_and_save(method: str, url: Union[str, list]) → None
In addition to crawling and handling the data from the web, it persists the data.
- Parameters
method – HTTP method.
url – One or more URLs (a collection of URLs).
- Returns
None
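The crawl, process, persist chain for one URL or several can be sketched as follows. `SketchSimpleCrawler` is a hypothetical stand-in whose methods fake each step, and the in-memory `saved` list replaces a real persistence target.

```python
from typing import Any, List, Union


class SketchSimpleCrawler:
    """Hypothetical stand-in; each method fakes the matching SimpleCrawler step."""

    def __init__(self) -> None:
        self.saved: List[Any] = []  # in-memory substitute for a real persistence target

    def crawl(self, method: str, url: str) -> str:
        return f"{method}:{url}"  # pretend this is the parsed HTTP response

    def data_process(self, parsed_response: str) -> str:
        return parsed_response.lower()

    def persist(self, data: Any) -> None:
        self.saved.append(data)

    def run_and_save(self, method: str, url: Union[str, list]) -> None:
        # Normalise a single URL to a one-element list so both cases share one loop.
        targets = [url] if isinstance(url, str) else url
        for target in targets:
            parsed = self.crawl(method, target)
            self.persist(self.data_process(parsed))


crawler = SketchSimpleCrawler()
crawler.run_and_save("GET", "http://example.com/A")
crawler.run_and_save("GET", ["http://example.com/B", "http://example.com/C"])
```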
- crawl(method: str, url: str, retry: int = 1, *args, **kwargs) → Any
Crawl web data. It sends the HTTP request, receives the HTTP response and parses the content. It does ONLY that and nothing else.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
The result parsed from the HTTP response. The data type is Any.
- data_process(parsed_response: Generic[T]) → Generic[T]
Process the data that has been parsed from the HTTP response object. You can override this function to implement your own data-processing logic.
- Parameters
parsed_response – The data that has been parsed from the HTTP response object.
- Returns
The result of the data process. The data type is Generic[T].
- parse_http_response(response: Generic[T]) → Generic[T]
Parse the HTTP response. You can override this function to implement your own logic for parsing the HTTP response.
- Parameters
response – The HTTP response.
- Returns
The result parsed from the HTTP response. The data type is Generic[T].
- persist(data: Any) → None
Persist the data. You can override this function to implement your own logic for saving data.
- Parameters
data – The target data to persist. Generally, this is the data that has been parsed and handled.
- Returns
None
- register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None
Register SmoothCrawler’s component(s) to the CrawlerFactory instance.
- Parameters
http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.
- Returns
None
- send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]
Send the HTTP request. You can override this function to implement your own logic for sending the HTTP request.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
An HTTP response object.
Asynchronous Simple Crawler
- class smoothcrawler.crawler.AsyncSimpleCrawler(executors: int, factory: Optional[AsyncCrawlerFactory] = None)
- async crawl(url: str, method: str, retry: int = 1, *args, **kwargs) → Any
Crawl web data. It sends the HTTP request, receives the HTTP response and parses the content. It does ONLY that and nothing else.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
The result parsed from the HTTP response. The data type is Any.
- async send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]
The asynchronous version of BaseCrawler.send_http_request.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
An HTTP response object.
- async parse_http_response(response: Generic[T]) → Generic[T]
The asynchronous version of BaseCrawler.parse_http_response.
- Parameters
response – The HTTP response.
- Returns
The result parsed from the HTTP response. The data type is Generic[T].
- async data_process(parsed_response: Generic[T]) → Generic[T]
The asynchronous version of BaseCrawler.data_process.
- Parameters
parsed_response – The data that has been parsed from the HTTP response object.
- Returns
The result of the data process. The data type is Generic[T].
- async persist(data: Any) → None
The asynchronous version of BaseCrawler.persist.
- Parameters
data – The target data to persist. Generally, this is the data that has been parsed and handled.
- Returns
None
- async process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) → Any
The asynchronous version of MultiRunnableCrawler.process_with_list.
- Parameters
method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
A list of data-process results.
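An asynchronous pass over a list of URLs could be sketched with plain asyncio. The coroutine names mirror the documented ones but are stand-ins, and capping concurrency with an `asyncio.Semaphore` sized by `executors` is an assumption about how the executor count might be applied.

```python
import asyncio
from typing import Any, List


async def crawl(method: str, url: str) -> str:
    await asyncio.sleep(0)  # stands in for a non-blocking HTTP call
    return f"{method} {url}"


async def process_with_list(method: str, url: List[str], executors: int = 2) -> List[Any]:
    # Assumption: 'executors' limits how many crawls run concurrently.
    limiter = asyncio.Semaphore(executors)

    async def _one(target: str) -> Any:
        async with limiter:
            return await crawl(method, target)

    # gather() returns the results in the same order as the input URLs.
    return await asyncio.gather(*(_one(target) for target in url))


results = asyncio.run(
    process_with_list("GET", ["http://example.com/1", "http://example.com/2", "http://example.com/3"])
)
```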
- async process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) → Any
The asynchronous version of MultiRunnableCrawler.process_with_queue.
- Parameters
method – HTTP method.
url – Queue of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
A list of data-process results.
- map(method: str, url: List[str], retry: int = 1, lock: bool = True, sema_value: int = 1) → Optional
The asynchronous version of ExecutorCrawler.map.
- Parameters
method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
lock – Initializes a Lock if True; otherwise initializes a Semaphore.
sema_value – The value of the Semaphore. This argument only takes effect when option lock is False.
- Returns
The result of the data process from the parsed HTTP response object.
- run(method: str, url: Union[List[str], Queue], retry: int = 1, lock: bool = True, sema_value: int = 1) → Optional
The asynchronous version of ExecutorCrawler.run.
- Parameters
method – HTTP method.
url – A collection of URLs (a list or a Queue).
retry – How many times to retry sending the HTTP request if sending fails.
lock – Initializes a Lock if True; otherwise initializes a Semaphore.
sema_value – The value of the Semaphore. This argument only takes effect when option lock is False.
- Returns
The result of the data process from the parsed HTTP response object.
- property persistence_factory: PersistenceFacade
Get the instance of the persistence factory object.
- Returns
A PersistenceFacade type object.
- register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None
Register SmoothCrawler’s component(s) to the CrawlerFactory instance.
- Parameters
http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.
- Returns
None
Executor Crawler
- class smoothcrawler.crawler.ExecutorCrawler(mode: RunningMode, executors: int, factory: CrawlerFactory)
- run(method: str, url: Union[List[str], Queue], retry: int = 1, lock: bool = True, sema_value: int = 1) → Optional
Run the crawl process directly with multiple executors. It may run slightly differently depending on the option url. Consider the scenarios below:
Option url is a list type value:
If the size of the value is bigger than the executor number: separate the collection of URLs and activate that number of executors.
If the size of the value is smaller than the executor number: activate the executors as function map does.
Option url is a Queue type value:
Run the executors with the Queue object.
- Parameters
method – HTTP method.
url – A collection of URLs (a list or a Queue).
retry – How many times to retry sending the HTTP request if sending fails.
lock – Initializes a Lock if True; otherwise initializes a Semaphore.
sema_value – The value of the Semaphore. This argument only takes effect when option lock is False.
- Returns
The result of the data process from the parsed HTTP response object.
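The scenarios listed for run can be condensed into a small dispatch function. `choose_strategy` and its string labels are hypothetical, and treating a list whose size equals the executor number like the smaller case is an assumption, since the text does not cover that case.

```python
from queue import Queue
from typing import List, Union


def choose_strategy(url: Union[List[str], Queue], executors: int) -> str:
    """Return which of the described code paths would apply (illustrative only)."""
    if isinstance(url, Queue):
        return "queue"   # run the executors with the Queue object
    if len(url) > executors:
        return "divide"  # split the URLs and activate 'executors' executors
    return "map"         # one executor per URL, as function map does


urls = ["http://example.com/1", "http://example.com/2", "http://example.com/3"]
few_executors = choose_strategy(urls, executors=2)
many_executors = choose_strategy(urls, executors=5)
queued = choose_strategy(Queue(), executors=2)
```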
- map(method: str, url: List[str], retry: int = 1, lock: bool = True, sema_value: int = 1) → Optional
The crawler version of the builtin function map. It activates as many executors as the size of the collection of URLs.
- Parameters
method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
lock – Initializes a Lock if True; otherwise initializes a Semaphore.
sema_value – The value of the Semaphore. This argument only takes effect when option lock is False.
- Returns
The result of the data process from the parsed HTTP response object.
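The idea of one executor per URL can be sketched with threads. This assumes thread-based executors and a hypothetical `crawler_map` helper; the real class selects its executor type through RunningMode, which is not modeled here.

```python
import threading
from typing import Any, Callable, Dict, List


def crawler_map(handle: Callable[[str], Any], urls: List[str]) -> List[Any]:
    """Start one thread per URL and return the results in input order."""
    results: Dict[int, Any] = {}
    lock = threading.Lock()

    def _task(index: int, target: str) -> None:
        value = handle(target)
        with lock:  # protect the shared dict while storing the result
            results[index] = value

    workers = [threading.Thread(target=_task, args=(i, u)) for i, u in enumerate(urls)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return [results[i] for i in range(len(urls))]


# str.upper stands in for "crawl and process this URL".
mapped = crawler_map(str.upper, ["http://example.com/a", "http://example.com/b"])
```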
- crawl(method: str, url: str, retry: int = 1, *args, **kwargs) → Any
Crawl web data. It sends the HTTP request, receives the HTTP response and parses the content. It does ONLY that and nothing else.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
The result parsed from the HTTP response. The data type is Any.
- data_process(parsed_response: Generic[T]) → Generic[T]
Process the data that has been parsed from the HTTP response object. You can override this function to implement your own data-processing logic.
- Parameters
parsed_response – The data that has been parsed from the HTTP response object.
- Returns
The result of the data process. The data type is Generic[T].
- parse_http_response(response: Generic[T]) → Generic[T]
Parse the HTTP response. You can override this function to implement your own logic for parsing the HTTP response.
- Parameters
response – The HTTP response.
- Returns
The result parsed from the HTTP response. The data type is Generic[T].
- persist(data: Any) → None
Persist the data. You can override this function to implement your own logic for saving data.
- Parameters
data – The target data to persist. Generally, this is the data that has been parsed and handled.
- Returns
None
- property persistence_factory: PersistenceFacade
Get the instance of the persistence factory object.
- Returns
A PersistenceFacade type object.
- process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) → List[Any]
Handle the crawler process with a list of URLs.
- Parameters
method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
A list of data-process results.
- process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) → List[Any]
Handle the crawler process with a Queue that holds URLs.
- Parameters
method – HTTP method.
url – Queue of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
A list of data-process results.
- register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None
Register SmoothCrawler’s component(s) to the CrawlerFactory instance.
- Parameters
http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.
- Returns
None
- send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]
Send the HTTP request. You can override this function to implement your own logic for sending the HTTP request.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
An HTTP response object.
Pool Crawler
- class smoothcrawler.crawler.PoolCrawler(mode: RunningMode, pool_size: int, factory: CrawlerFactory)
- init(lock: bool = True, sema_value: int = 1) → None
Initialize whatever is needed before instantiating the Pool object.
- Parameters
lock – Initializes a Lock if True; otherwise initializes a Semaphore.
sema_value – The value of the Semaphore. This argument only takes effect when option lock is False.
- Returns
None
- apply(method: str, urls: List[str], retry: int = 1) → Optional
Run the crawl process with multiple executors of the Pool.
- Parameters
method – HTTP method.
urls – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
- async_apply(method: str, urls: List[str], retry: int = 1, callbacks: Optional[Union[Callable, List[Callable]]] = None, error_callbacks: Optional[Union[Callable, List[Callable]]] = None) → Optional
Asynchronous version of PoolCrawler.apply.
- Parameters
method – HTTP method.
urls – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
callbacks – A Callable (or a list of Callables) that runs after the task is done.
error_callbacks – A Callable (or a list of Callables) that runs if any exception is raised while running.
- Returns
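The callback and error-callback behavior resembles what the standard library offers through `apply_async`. The sketch below uses `multiprocessing.pool.ThreadPool` directly, as an assumption about the kind of pool involved; `fetch` and the two result lists are hypothetical.

```python
from multiprocessing.pool import ThreadPool
from typing import Any, List


def fetch(method: str, url: str) -> str:
    # Fake crawl step: URLs containing "bad" simulate a failing request.
    if "bad" in url:
        raise ValueError(f"cannot crawl {url}")
    return f"{method} {url}"


done: List[Any] = []               # filled by the success callback
failed: List[BaseException] = []   # filled by the error callback

with ThreadPool(processes=2) as pool:
    pending = [
        pool.apply_async(fetch, args=("GET", url), callback=done.append, error_callback=failed.append)
        for url in ["http://example.com/ok", "http://example.com/bad"]
    ]
    # Wait inside the 'with' block; leaving it terminates the pool.
    for task in pending:
        task.wait()
```

The success callback receives the return value of the task, while the error callback receives the raised exception instance.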
- map(method: str, urls: List[str], retry: int = 1) → Optional
The Pool version of ExecutorCrawler.map.
- Parameters
method – HTTP method.
urls – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
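A pool-based map over URLs can be sketched with the standard library's `ThreadPool.map`. Using a thread pool and binding the HTTP method with `functools.partial` are assumptions for this example; `fetch` is a hypothetical stand-in for the crawl step.

```python
from functools import partial
from multiprocessing.pool import ThreadPool


def fetch(method: str, url: str) -> str:
    return f"{method} {url}"  # stands in for crawl plus data process


urls = ["http://example.com/1", "http://example.com/2", "http://example.com/3"]

with ThreadPool(processes=2) as pool:
    # map() blocks until every URL is handled and keeps the input order.
    results = pool.map(partial(fetch, "GET"), urls)
```

Unlike the one-executor-per-URL approach, the pool size stays fixed and the workers pick up URLs as they become free.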
- async_map(method: str, urls: List[str], retry: int = 1, callbacks: Optional[Union[Callable, List[Callable]]] = None, error_callbacks: Optional[Union[Callable, List[Callable]]] = None) → Optional
Asynchronous version of PoolCrawler.map.
- Parameters
method – HTTP method.
urls – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
callbacks – A Callable (or a list of Callables) that runs after the task is done.
error_callbacks – A Callable (or a list of Callables) that runs if any exception is raised while running.
- Returns
- crawl(method: str, url: str, retry: int = 1, *args, **kwargs) → Any
Crawl web data. It sends the HTTP request, receives the HTTP response and parses the content. It does ONLY that and nothing else.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
The result parsed from the HTTP response. The data type is Any.
- data_process(parsed_response: Generic[T]) → Generic[T]
Process the data that has been parsed from the HTTP response object. You can override this function to implement your own data-processing logic.
- Parameters
parsed_response – The data that has been parsed from the HTTP response object.
- Returns
The result of the data process. The data type is Generic[T].
- parse_http_response(response: Generic[T]) → Generic[T]
Parse the HTTP response. You can override this function to implement your own logic for parsing the HTTP response.
- Parameters
response – The HTTP response.
- Returns
The result parsed from the HTTP response. The data type is Generic[T].
- persist(data: Any) → None
Persist the data. You can override this function to implement your own logic for saving data.
- Parameters
data – The target data to persist. Generally, this is the data that has been parsed and handled.
- Returns
None
- property persistence_factory: PersistenceFacade
Get the instance of the persistence factory object.
- Returns
A PersistenceFacade type object.
- process_with_list(method: str, url: List[str], retry: int = 1, *args, **kwargs) → List[Any]
Handle the crawler process with a list of URLs.
- Parameters
method – HTTP method.
url – A collection of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
A list of data-process results.
- process_with_queue(method: str, url: Queue, retry: int = 1, *args, **kwargs) → List[Any]
Handle the crawler process with a Queue that holds URLs.
- Parameters
method – HTTP method.
url – Queue of URLs.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
A list of data-process results.
- register_factory(http_req_sender: Optional[BaseHTTP] = None, http_resp_parser: Optional[BaseHTTPResponseParser] = None, data_process: Optional[Union[BaseDataHandler, BaseAsyncDataHandler]] = None, persistence: Optional[PersistenceFacade] = None) → None
Register SmoothCrawler’s component(s) to the CrawlerFactory instance.
- Parameters
http_req_sender – The Sender component, which sends the HTTP request.
http_resp_parser – The Parser component, which handles the HTTP response.
data_process – The Handler component, which processes the data generated from the HTTP response.
persistence – The Persistence component, which is responsible for saving data.
- Returns
None
- send_http_request(method: str, url: str, retry: int = 1, *args, **kwargs) → Generic[T]
Send the HTTP request. You can override this function to implement your own logic for sending the HTTP request.
- Parameters
method – HTTP method.
url – URL.
retry – How many times to retry sending the HTTP request if sending fails.
- Returns
An HTTP response object.