URL

Before a web spider runs its tasks, it must prepare all the target URLs. URLs vary widely: they may carry options such as a date or datetime, a Unix time, a specific index, etc. This module handles and generates the URLs for the web spider to use.

smoothcrawler.urls.OPTION_VAR_INDEX: str = 'index'

The option string for the index type.

A URL object can generate URLs with an index (0, 1, 2, …) by iterating.

from smoothcrawler.urls import URL, OPTION_VAR_INDEX

_target_url = "http://www.test.com?index={" + OPTION_VAR_INDEX + "}"
_index_urls = URL(_target_url, start=0, end=5)
_urls = _index_urls.generate()
print(_urls)
# ['http://www.test.com?index=0', 'http://www.test.com?index=1', 'http://www.test.com?index=2', 'http://www.test.com?index=3', 'http://www.test.com?index=4', 'http://www.test.com?index=5']

smoothcrawler.urls.OPTION_VAR_DATE: str = 'date'

The option string for the date type.

A URL object can generate URLs with a date by iterating. A format can be set so that each URL is generated from the date value rendered in that format.

from smoothcrawler.urls import URL, OPTION_VAR_DATE

_target_url = "http://www.test.com?date={" + OPTION_VAR_DATE + "}"
_date_urls = URL(_target_url, start="20220601", end="20220603", formatter="yyyymmdd")
_urls = _date_urls.generate()
print(_urls)
# ['http://www.test.com?date=20220601', 'http://www.test.com?date=20220602', 'http://www.test.com?date=20220603']

smoothcrawler.urls.OPTION_VAR_DATETIME: str = 'datetime'

The option string for the datetime type.

A URL object can generate URLs with a datetime by iterating. A format can be set so that each URL is generated from the datetime value rendered in that format.

from smoothcrawler.urls import URL, OPTION_VAR_DATETIME

_target_url = "http://www.test.com?datetime={" + OPTION_VAR_DATETIME + "}"
_datetime_urls = URL(_target_url, start="2022/06/01 00:00:00", end="2022/06/03 00:00:00", formatter="yyyy/mm/dd HH:MM:SS")
_urls = _datetime_urls.generate()
print(_urls)
# ['http://www.test.com?datetime=20220601000000', 'http://www.test.com?datetime=20220602000000', 'http://www.test.com?datetime=20220603000000']

smoothcrawler.urls.OPTION_VAR_ITERATOR: str = 'iterator'

The option string for the iterator type.

A URL object can generate URLs from one specific iterator object.

from smoothcrawler.urls import URL, OPTION_VAR_ITERATOR

_target_url = "http://www.test.com?index_with_iter={" + OPTION_VAR_ITERATOR + "}"
_iter_urls = URL(_target_url, iter=list(range(1, 4)))
_urls = _iter_urls.generate()
print(_urls)
# ['http://www.test.com?index_with_iter=1', 'http://www.test.com?index_with_iter=2', 'http://www.test.com?index_with_iter=3']

smoothcrawler.urls.get_option() -> Tuple

Get all the option types that can be written in a URL string.

Returns

A tuple of all option types.

smoothcrawler.urls.set_index_rule() -> str

Get the index option. The index option iterates to generate URLs with an index (1, 2, 3, …).

Returns

The setting string of the index option.

smoothcrawler.urls.set_date_rule() -> str

Get the date option. In general, its format is yyyymmdd: it has only year, month and day. It can be iterated to generate URLs with a date (for example, 20210101, 20210102, …).

Returns

The setting string of the date option.

smoothcrawler.urls.set_datetime_rule() -> str

Get the datetime option. In general, its format is yyyymmddhhMMss: it has year, month, day, hour, minute and second. It can be iterated to generate URLs with a datetime (for example, 20210101000000, 20210101000001, …).

Returns

The setting string of the datetime option.
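
The date and datetime layouts described above correspond to standard strftime patterns. The following is a minimal sketch using only Python's standard library; it does not call smoothcrawler, and the variable names are illustrative.

```python
from datetime import datetime

# An arbitrary sample moment, chosen to match the documented examples.
moment = datetime(2021, 1, 1, 0, 0, 1)

# yyyymmdd: year, month and day only.
date_text = moment.strftime("%Y%m%d")

# yyyymmddhhMMss: year, month, day, hour, minute and second.
datetime_text = moment.strftime("%Y%m%d%H%M%S")

print(date_text)      # 20210101
print(datetime_text)  # 20210101000001
```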

smoothcrawler.urls.set_iterator_rule() -> str

Get the iterator option. It iterates over the target iterator object to generate URLs.

Returns

The setting string of the iterator option.

URL

class smoothcrawler.urls.URL(base: str, start: Optional[Union[int, str]] = None, end: Optional[Union[int, str]] = None, formatter: str = 'yyyymmdd', iter: Optional[Iterable] = None)

property base_url: str

A URL value. It may contain one of the specific options (OPTION_VAR_INDEX, OPTION_VAR_DATE, OPTION_VAR_DATETIME or OPTION_VAR_ITERATOR), and each generated URL replaces the option with its corresponding value. For example, if base_url is set to 'https://www.google.com?date={date}', a generated URL would look like 'https://www.google.com?date=20220601'.

Returns

A URL string value.
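
To illustrate the substitution described for base_url, here is a minimal sketch that fills a {date} placeholder with plain str.format. It only mirrors the documented example and is not the library's implementation.

```python
# The base URL from the documented example, with a {date} placeholder.
base_url = "https://www.google.com?date={date}"

# Substitute one concrete date value into the placeholder.
url = base_url.format(date="20220601")

print(url)  # https://www.google.com?date=20220601
```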

is_index_rule() -> bool

Check whether the option setting of the current URL object is the index type.

Returns

True if it is; otherwise False.

is_date_rule() -> bool

Check whether the option setting of the current URL object is the date type.

Returns

True if it is; otherwise False.

is_datetime_rule() -> bool

Check whether the option setting of the current URL object is the datetime type.

Returns

True if it is; otherwise False.

is_iterator_rule() -> bool

Check whether the option setting of the current URL object is the iterator type.

Returns

True if it is; otherwise False.

is_valid() -> bool

Check whether the option setting of the current URL object is valid.

Returns

True if it is; otherwise False.
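
One way to picture these checks is to look for the option placeholder inside the URL template. The sketch below uses the standard library's string.Formatter to list placeholders. The helper name find_options is hypothetical, and the library may detect options differently.

```python
from string import Formatter
from typing import List

# The four option strings documented for this module.
OPTIONS = ("index", "date", "datetime", "iterator")


def find_options(template: str) -> List[str]:
    """Return the known option names used as {placeholders} in template."""
    fields = (name for _, name, _, _ in Formatter().parse(template))
    return [name for name in fields if name in OPTIONS]


date_template = "http://www.test.com?date={date}"
plain_template = "http://www.test.com"

print(find_options(date_template) == ["date"])  # True: a date-type template
print(find_options(plain_template))             # []: no option present
```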

property period_days: int

Get the day value of the period.

Returns

The day value, as an int.

property period_hours: int

Get the hour value of the period.

Returns

The hour value, as an int.

property period_minutes: int

Get the minute value of the period.

Returns

The minute value, as an int.

property period_seconds: int

Get the second value of the period.

Returns

The second value, as an int.

set_period(days: Optional[int] = None, hours: Optional[int] = None, minutes: Optional[int] = None, seconds: Optional[int] = None) -> None

Configure the period settings, that is, how many days, hours, minutes or seconds separate one generated value from the next.

Parameters
  • days – How many days to advance to reach the next value.

  • hours – How many hours to advance to reach the next value.

  • minutes – How many minutes to advance to reach the next value.

  • seconds – How many seconds to advance to reach the next value.

Returns

None
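
The effect of a period can be sketched with datetime.timedelta: start at the first value and keep adding the step until the end value is passed. The function name datetime_range is hypothetical, and treating the end value as inclusive is an assumption based on the example output shown earlier in this document.

```python
from datetime import datetime, timedelta
from typing import List


def datetime_range(start: datetime, end: datetime, days: int = 0,
                   hours: int = 0, minutes: int = 0, seconds: int = 0) -> List[datetime]:
    """Collect values from start to end (inclusive), one period apart."""
    step = timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    if step <= timedelta(0):
        raise ValueError("the period must be positive")
    values = []
    current = start
    while current <= end:
        values.append(current)
        current += step
    return values


stamps = datetime_range(datetime(2022, 6, 1), datetime(2022, 6, 3), days=1)
formatted = [stamp.strftime("%Y%m%d%H%M%S") for stamp in stamps]
print(formatted)
# ['20220601000000', '20220602000000', '20220603000000']
```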

generate() -> List[str]

Generate all the required URLs based on the options.

Returns

A collection of URLs.
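
As a rough picture of what generation amounts to, the sketch below fills one placeholder once per value. The helper name expand is hypothetical and the code does not use smoothcrawler. The inclusive end bound is an assumption taken from the index example output earlier in this document.

```python
from typing import Iterable, List


def expand(template: str, name: str, values: Iterable) -> List[str]:
    """Fill the {name} placeholder in template once per value."""
    return [template.format(**{name: value}) for value in values]


# Index option: both start and end are included (assumed from the documented output).
start, end = 0, 5
index_urls = expand("http://www.test.com?index={index}", "index", range(start, end + 1))

# Iterator option: any iterable supplies the values.
iter_urls = expand("http://www.test.com?index_with_iter={iterator}", "iterator", [1, 2, 3])

print(index_urls[0])   # http://www.test.com?index=0
print(index_urls[-1])  # http://www.test.com?index=5
print(iter_urls)
```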

_index_handling(index: int) -> None

The main process for generating URLs with an index.

Parameters

index – The index value. It should be the starting number.

Returns

None

static _is_py_datetime_format(formatter: str) -> bool

Check whether the datetime formatter string is valid.

Parameters

formatter – The datetime formatter string. The supported directives are listed below:

  Formatter   Meaning
  %Y          year
  %m          month
  %d          day
  %H          hour
  %M          minute
  %S          second

Returns

True if the datetime formatter string is valid; otherwise False.
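
A check of this kind could be sketched as follows: collect every %-directive in the string and accept it only if all of them are among the six listed above. The function name and the exact acceptance rule are assumptions made for illustration; the library's own rule may differ.

```python
import re

# The six directives listed in the table above.
KNOWN_DIRECTIVES = {"%Y", "%m", "%d", "%H", "%M", "%S"}


def looks_like_py_datetime_format(formatter: str) -> bool:
    """Return True if formatter uses at least one directive and only known ones."""
    directives = re.findall(r"%.", formatter)
    return bool(directives) and all(d in KNOWN_DIRECTIVES for d in directives)


print(looks_like_py_datetime_format("%Y%m%d"))    # True
print(looks_like_py_datetime_format("yyyymmdd"))  # False: no % directives
print(looks_like_py_datetime_format("%Y-%q"))     # False: %q is not listed
```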

static _convert_formatter(formatter: str) -> str

The formatter parameter can take forms such as the following:

  1. yyyymmdd, example: 20210101
  2. yyyy/mm/dd, example: 2021/01/01
  3. yyyy-mm-dd, example: 2021-01-01

Parameters

formatter – The datetime formatter string.

Returns

A string value formatted according to the date or datetime format.
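
One plausible reading of this conversion is a mapping from tokens such as yyyy, mm and dd to Python's strftime directives. The sketch below is written under that assumption. The token table and the function name convert_formatter are hypothetical, and only the tokens that appear in this document's examples are covered.

```python
import re
from datetime import datetime

# Assumed token table: lowercase mm is month, uppercase MM is minute.
TOKEN_TO_DIRECTIVE = {
    "yyyy": "%Y",
    "mm": "%m",
    "dd": "%d",
    "HH": "%H",
    "MM": "%M",
    "SS": "%S",
}

_TOKEN_PATTERN = re.compile("|".join(TOKEN_TO_DIRECTIVE))


def convert_formatter(formatter: str) -> str:
    """Replace each known token with its strftime directive in a single pass."""
    return _TOKEN_PATTERN.sub(lambda match: TOKEN_TO_DIRECTIVE[match.group()], formatter)


print(convert_formatter("yyyymmdd"))             # %Y%m%d
print(convert_formatter("yyyy/mm/dd HH:MM:SS"))  # %Y/%m/%d %H:%M:%S
print(datetime(2021, 1, 1).strftime(convert_formatter("yyyy-mm-dd")))  # 2021-01-01
```

A single regular-expression pass avoids re-scanning text that an earlier replacement has already produced.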

_date_handling(_date: datetime, days: int) -> None

The main process for generating URLs with a date.

Parameters
  • _date – A datetime object.

  • days – How many days to advance to reach the next value.

Returns

None

_datetime_handling(_datetime: datetime, days: Optional[int] = None, hours: Optional[int] = None, minutes: Optional[int] = None, seconds: Optional[int] = None) -> None

The main process for generating URLs with a datetime.

Parameters
  • _datetime – A datetime object.

  • days – How many days to advance to reach the next value.

  • hours – How many hours to advance to reach the next value.

  • minutes – How many minutes to advance to reach the next value.

  • seconds – How many seconds to advance to reach the next value.

Returns

None

static _add_flag(option: str) -> str

Get the string for the option, written in the specific format the option defines.

Parameters

option – The option setting.

Returns

A string value.
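
The examples earlier in this document build placeholders by wrapping the option string in curly braces. Assuming that is the format meant here, a sketch looks like the following. The function add_flag is a hypothetical stand-in, not the library's private method.

```python
def add_flag(option: str) -> str:
    """Wrap an option string in curly braces (assumed placeholder format)."""
    return "{" + option + "}"


# Build a URL template the same way the documented examples do.
template = "http://www.test.com?index=" + add_flag("index")
print(template)  # http://www.test.com?index={index}
```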