URL
Before a web spider task runs, all of its target URLs must be prepared. URLs vary widely: they may carry options such as a date or datetime, a Unix timestamp, or a specific index. This module handles and generates the URLs for the web spider to use.
- smoothcrawler.urls.OPTION_VAR_INDEX: str = 'index'
The option name used for the index placeholder.
A URL object can generate URLs with an index (0, 1, 2, …) by iteration.
from smoothcrawler.urls import URL, OPTION_VAR_INDEX
# Build a base URL that contains the {index} placeholder.
_target_url = "http://www.test.com?index={" + OPTION_VAR_INDEX + "}"
_index_urls = URL(_target_url, start=0, end=5)
_urls = _index_urls.generate()
print(_urls)
# ['http://www.test.com?index=0', 'http://www.test.com?index=1', 'http://www.test.com?index=2', 'http://www.test.com?index=3', 'http://www.test.com?index=4', 'http://www.test.com?index=5']
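As a rough illustration of what placeholder expansion amounts to, the sketch below fills an {index} placeholder using only the standard library. The helper name expand_index_urls is hypothetical and is not part of smoothcrawler; the inclusive end bound is an assumption taken from the example output above.

```python
from typing import List

# Mirrors the documented value of smoothcrawler.urls.OPTION_VAR_INDEX.
OPTION_VAR_INDEX = "index"


def expand_index_urls(base: str, start: int, end: int) -> List[str]:
    """Hypothetical helper: fill the {index} placeholder for start..end (inclusive)."""
    return [base.format(**{OPTION_VAR_INDEX: i}) for i in range(start, end + 1)]


urls = expand_index_urls("http://www.test.com?index={index}", start=0, end=2)
print(urls)
# ['http://www.test.com?index=0', 'http://www.test.com?index=1', 'http://www.test.com?index=2']
```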
- smoothcrawler.urls.OPTION_VAR_DATE: str = 'date'
The option name used for the date placeholder.
A URL object can generate URLs with a date by iteration. The formatter argument sets the format in which each date value is written into the URL.
from smoothcrawler.urls import URL, OPTION_VAR_DATE
# Build a base URL that contains the {date} placeholder.
_target_url = "http://www.test.com?date={" + OPTION_VAR_DATE + "}"
_date_urls = URL(_target_url, start="20220601", end="20220603", formatter="yyyymmdd")
_urls = _date_urls.generate()
print(_urls)
# ['http://www.test.com?date=20220601', 'http://www.test.com?date=20220602', 'http://www.test.com?date=20220603']
- smoothcrawler.urls.OPTION_VAR_DATETIME: str = 'datetime'
The option name used for the datetime placeholder.
A URL object can generate URLs with a datetime by iteration. The formatter argument sets the format in which each datetime value is written into the URL.
from smoothcrawler.urls import URL, OPTION_VAR_DATETIME
# Build a base URL that contains the {datetime} placeholder.
_target_url = "http://www.test.com?datetime={" + OPTION_VAR_DATETIME + "}"
_datetime_urls = URL(_target_url, start="2022/06/01 00:00:00", end="2022/06/03 00:00:00", formatter="yyyy/mm/dd HH:MM:SS")
_urls = _datetime_urls.generate()
print(_urls)
# ['http://www.test.com?datetime=20220601000000', 'http://www.test.com?datetime=20220602000000', 'http://www.test.com?datetime=20220603000000']
- smoothcrawler.urls.OPTION_VAR_ITERATOR: str = 'iterator'
The option name used for the iterator placeholder.
A URL object can generate URLs from one specific iterable object.
from smoothcrawler.urls import URL, OPTION_VAR_ITERATOR
# Build a base URL that contains the {iterator} placeholder.
_target_url = "http://www.test.com?index_with_iter={" + OPTION_VAR_ITERATOR + "}"
_iter_urls = URL(_target_url, iter=list(range(1, 4)))
_urls = _iter_urls.generate()
print(_urls)
# ['http://www.test.com?index_with_iter=1', 'http://www.test.com?index_with_iter=2', 'http://www.test.com?index_with_iter=3']
- smoothcrawler.urls.get_option() Tuple[source]
Get all option types that can be written in a URL string.
- Returns
A tuple of all option types.
- smoothcrawler.urls.set_index_rule() str[source]
Get the index option. The index option is iterated to generate URLs with an index (1, 2, 3, …).
- Returns
The setting string of index option.
- smoothcrawler.urls.set_date_rule() str[source]
Get the date option. In general, its format is yyyymmdd: it has only year, month and day. It can be iterated to generate URLs with a date (for example, 20210101, 20210102, …).
- Returns
The setting string of date option.
- smoothcrawler.urls.set_datetime_rule() str[source]
Get the datetime option. In general, its format is yyyymmddhhMMss: it has year, month, day, hour, minute and second. It can be iterated to generate URLs with a datetime (for example, 20210101000000, 20210101000001, …).
- Returns
The setting string of datetime option.
- smoothcrawler.urls.set_iterator_rule() str[source]
Get the iterator option. It can be iterated to generate URLs from the target iterable object.
- Returns
The setting string of iterator option.
URL
- class smoothcrawler.urls.URL(base: str, start: Optional[Union[int, str]] = None, end: Optional[Union[int, str]] = None, formatter: str = 'yyyymmdd', iter: Optional[Iterable] = None)[source]
- property base_url: str
A URL value. It may contain one of the specific options (OPTION_VAR_INDEX, OPTION_VAR_DATE, OPTION_VAR_DATETIME or OPTION_VAR_ITERATOR), which is replaced with the corresponding value when URLs are generated. For example, if base_url is set to ‘https://www.google.com?date={date}’, a generated URL would look like ‘https://www.google.com?date=20220601’.
- Returns
A URL string value.
- is_index_rule() bool[source]
Check whether the option setting of the current URL object is the index type.
- Returns
True if it is; otherwise False.
- is_date_rule() bool[source]
Check whether the option setting of the current URL object is the date type.
- Returns
True if it is; otherwise False.
- is_datetime_rule() bool[source]
Check whether the option setting of the current URL object is the datetime type.
- Returns
True if it is; otherwise False.
- is_iterator_rule() bool[source]
Check whether the option setting of the current URL object is the iterator type.
- Returns
True if it is; otherwise False.
- is_valid() bool[source]
Check whether the option setting of the current URL object is valid.
- Returns
True if it is; otherwise False.
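To make the idea of detecting an option in a base URL concrete, here is a minimal standard-library sketch. The helpers find_options and has_single_option are hypothetical names, and treating "exactly one known option" as the validity rule is an assumption; the library's actual checks may differ.

```python
from string import Formatter
from typing import Set

# Documented values of the OPTION_VAR_* constants.
KNOWN_OPTIONS = ("index", "date", "datetime", "iterator")


def find_options(base_url: str) -> Set[str]:
    """Hypothetical helper: return the known placeholder names found in base_url."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    fields = {name for _, name, _, _ in Formatter().parse(base_url) if name}
    return fields & set(KNOWN_OPTIONS)


def has_single_option(base_url: str) -> bool:
    """Assumed validity rule: the URL carries exactly one known option."""
    return len(find_options(base_url)) == 1


print(find_options("https://www.google.com?date={date}"))
# {'date'}
```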
- property period_days: int
Get the day value of the period.
- Returns
The day value, as an int.
- property period_hours: int
Get the hour value of the period.
- Returns
The hour value, as an int.
- property period_minutes: int
Get the minute value of the period.
- Returns
The minute value, as an int.
- property period_seconds: int
Get the second value of the period.
- Returns
The second value, as an int.
- set_period(days: Optional[int] = None, hours: Optional[int] = None, minutes: Optional[int] = None, seconds: Optional[int] = None) None[source]
Configure the period settings: how many days, hours, minutes or seconds to advance between generated values.
- Parameters
days – Number of days to advance to the next value.
hours – Number of hours to advance to the next value.
minutes – Number of minutes to advance to the next value.
seconds – Number of seconds to advance to the next value.
- Returns
None
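One way to picture a period is as a fixed step between consecutive datetime values. The sketch below uses datetime.timedelta for that; step_datetimes is a hypothetical helper, and both the inclusive end bound and the rejection of a non-positive period are assumptions rather than documented behaviour of set_period.

```python
from datetime import datetime, timedelta
from typing import List


def step_datetimes(start: datetime, end: datetime, days: int = 0, hours: int = 0,
                   minutes: int = 0, seconds: int = 0) -> List[datetime]:
    """Hypothetical helper: list every value from start to end (inclusive) at a fixed period."""
    step = timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    if step <= timedelta(0):
        # A zero or negative step would never reach the end value.
        raise ValueError("the period must be positive")
    values = []
    current = start
    while current <= end:
        values.append(current)
        current += step
    return values


stamps = step_datetimes(datetime(2022, 6, 1), datetime(2022, 6, 2), hours=12)
print([s.strftime("%Y%m%d%H%M%S") for s in stamps])
# ['20220601000000', '20220601120000', '20220602000000']
```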
- generate() List[str][source]
Generate all the required URLs based on the options.
- Returns
A collection of URLs.
- _index_handling(index: int) None[source]
The main process that generates URLs with an index.
- Parameters
index – The index value. It should be the starting number.
- Returns
None
- static _is_py_datetime_format(formatter: str) bool[source]
Check whether the datetime formatter string is valid.
- Parameters
formatter – The format string of the datetime formatter. It can use the following directives:
Formatter    Meaning
%Y           year
%m           month
%d           day
%H           hour
%M           minute
%S           second
- Returns
True if the datetime formatter string is valid; otherwise False.
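The check described here can be approximated with a regular expression over the six directives listed. The criterion below (at least one of those directives is present) is an assumption, and looks_like_strftime is a hypothetical name; the library's private method may apply stricter rules.

```python
import re

# Matches any of the six directives listed for the formatter parameter.
_DIRECTIVE = re.compile(r"%[YmdHMS]")


def looks_like_strftime(formatter: str) -> bool:
    """Hypothetical check: True if formatter contains at least one listed directive."""
    return bool(_DIRECTIVE.search(formatter))


print(looks_like_strftime("%Y%m%d"))    # True
print(looks_like_strftime("yyyymmdd"))  # False
```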
- static _convert_formatter(formatter: str) str[source]
Convert the formatter parameter. The formatter can take forms such as:
1. yyyymmdd, for example 20210101
2. yyyy/mm/dd, for example 2021/01/01
3. yyyy-mm-dd, for example 2021-01-01
- Parameters
formatter – The format string of the datetime formatter.
- Returns
A string value formatted with the date or datetime format.
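A conversion of this kind can be sketched as a token-by-token replacement from the yyyy/mm/dd style into strftime directives. The token table and the helper name to_strftime are assumptions made for illustration, based on the formatter strings shown in this document; they are not taken from the library's source.

```python
from datetime import datetime

# Assumed mapping from documented formatter tokens to strftime directives.
# Matching is case-sensitive, so "mm" (month) and "MM" (minute) stay distinct.
_TOKENS = (
    ("yyyy", "%Y"),
    ("mm", "%m"),
    ("dd", "%d"),
    ("HH", "%H"),
    ("MM", "%M"),
    ("SS", "%S"),
)


def to_strftime(formatter: str) -> str:
    """Hypothetical helper: rewrite e.g. 'yyyy/mm/dd' as '%Y/%m/%d'."""
    for token, directive in _TOKENS:
        formatter = formatter.replace(token, directive)
    return formatter


print(to_strftime("yyyy/mm/dd HH:MM:SS"))
# %Y/%m/%d %H:%M:%S
print(datetime(2021, 1, 1).strftime(to_strftime("yyyy-mm-dd")))
# 2021-01-01
```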
- _date_handling(_date: datetime, days: int) None[source]
The main process that generates URLs with a date.
- Parameters
_date – A datetime object.
days – Number of days to advance to the next value.
- Returns
None
- _datetime_handling(_datetime: datetime, days: Optional[int] = None, hours: Optional[int] = None, minutes: Optional[int] = None, seconds: Optional[int] = None) None[source]
The main process that generates URLs with a datetime.
- Parameters
_datetime – A datetime object.
days – Number of days to advance to the next value.
hours – Number of hours to advance to the next value.
minutes – Number of minutes to advance to the next value.
seconds – Number of seconds to advance to the next value.
- Returns
None