Module grab.spider
class grab.spider.base.Spider(thread_number=None, network_try_limit=None, task_try_limit=None, request_pause=<object object>, priority_mode='random', meta=None, only_cache=False, config=None, args=None, taskq=None, network_result_queue=None, parser_result_queue=None, is_parser_idle=None, shutdown_event=None, mp_mode=False, parser_pool_size=None, parser_mode=False, parser_requests_per_process=10000, http_api_port=None, transport='multicurl', grab_transport='pycurl')

Asynchronous scraping framework.
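A minimal spider subclass might look like the sketch below; the URL, task names, and handler bodies are illustrative and not part of this reference:

    from grab.spider import Spider, Task

    class ExampleSpider(Spider):
        # Requests for these URLs are created automatically and routed
        # to the task_initial handler
        initial_urls = ['http://example.com/']

        def task_initial(self, grab, task):
            # print the page title (illustrative handler body)
            print(grab.doc.select('//title').text())
            # queue a follow-up request handled by task_page (URL is illustrative)
            yield Task('page', url='http://example.com/page2')

        def task_page(self, grab, task):
            print('fetched', task.url)

    bot = ExampleSpider(thread_number=2)
    bot.run()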
check_task_limits(task)

Check that the task's network and try counters do not exceed the limits.

Returns:
- on success: (True, None)
- on error: (False, reason)
is_valid_network_response_code(code, task)

Decide whether the response can be handled by the usual task handler, or the task has failed and should be processed as an error.
load_proxylist(source, source_type=None, proxy_type='http', auto_init=True, auto_change=True)

Load proxy list.

Parameters:
- source – Proxy source. Accepts a string (file path, URL) or a BaseProxySource instance.
- source_type – The type of the specified source. Should be one of the following: 'text_file' or 'url'.
- proxy_type – Should be one of the following: 'socks4', 'socks5' or 'http'.
- auto_change – If set to True then automatic random proxy rotation will be used.

Proxy source format should be one of the following (for each line):
- ip:port
- ip:port:login:password
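For example, a proxy file can be loaded on the spider instance before it starts; the file path below is an assumption:

    bot = ExampleSpider(thread_number=2)
    # proxies.txt is a hypothetical local file with one ip:port per line
    bot.load_proxylist('proxies.txt', source_type='text_file',
                       proxy_type='http', auto_change=True)
    bot.run()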
prepare()

You can do additional spider customization here before it has started working. Simply redefine this method in your Spider class.
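For instance, prepare() can be redefined to open resources that handlers will use; the file name and attribute are illustrative:

    from grab.spider import Spider

    class CsvSpider(Spider):
        initial_urls = ['http://example.com/']

        def prepare(self):
            # runs once before the spider starts fetching
            self.result_file = open('result.csv', 'w')

        def task_initial(self, grab, task):
            self.result_file.write(task.url + '\n')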
prepare_parser()

You can do additional spider customization here before it has started working. Simply redefine this method in your Spider class.

This method is called only from a Spider working in parser mode which, in turn, is spawned automatically by the main spider process working in multiprocess mode.
process_handler_result(result, task)

Process result received from the task handler.

Result could be:
- None
- Task instance
- Data instance
- dict: {type: "stat", counters: [], collections: []}
- ResponseNotValid-based exception
- Arbitrary exception
process_next_page(grab, task, xpath, resolve_base=False, **kwargs)

Generate task for the next page.

Parameters:
- grab – Grab instance
- task – Task object which should be assigned to the next page URL
- xpath – xpath expression which calculates the list of URLs
- **kwargs – extra settings for the new task object

Example:

    self.process_next_page(grab, task, '//div[@class="topic"]/a/@href')
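Inside a page handler this would typically look like the following sketch; the xpath of the pager link is an assumption about the target site:

    def task_page(self, grab, task):
        # ... extract items from the current page ...
        # queue a task for the next page if the pager link is present
        self.process_next_page(grab, task, '//a[@class="next"]/@href')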
setup_cache(backend='mongo', database=None, use_compression=True, **kwargs)

Setup cache.

Parameters:
- backend – Backend name. Should be one of the following: 'mongo', 'mysql' or 'postgresql'.
- database – Database name.
- kwargs – Additional credentials for backend.
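A typical call is made on the spider instance (ExampleSpider from the sketch above) before run(); the database name is an assumption:

    bot = ExampleSpider(thread_number=2)
    # cache responses in a (hypothetical) local MongoDB database
    bot.setup_cache(backend='mongo', database='grab_cache', use_compression=True)
    bot.run()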
setup_queue(backend='memory', **kwargs)

Setup queue.

Parameters:
- backend – Backend name. Should be one of the following: 'memory', 'redis' or 'mongo'.
- kwargs – Additional credentials for backend.
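For example, the default in-memory queue can be replaced with a persistent one; the connection kwargs below are assumptions about the chosen backend's settings:

    bot = ExampleSpider(thread_number=2)
    # store pending tasks in redis instead of memory (host/port kwargs are assumptions)
    bot.setup_queue(backend='redis', host='localhost', port=6379)
    bot.run()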
shutdown()

You can override this method to do some final actions after parsing has been done.
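Continuing the CsvSpider sketch from prepare() above, shutdown() is the natural place to release such resources:

    def shutdown(self):
        # called once after the spider has finished its work
        self.result_file.close()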
stop()

This method sets an internal flag which signals the spider to stop processing new tasks and shut down.
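A handler can call stop() once some condition is met; the page limit below is illustrative:

    from grab.spider import Spider

    class LimitedSpider(Spider):
        initial_urls = ['http://example.com/']

        def prepare(self):
            self.pages_seen = 0

        def task_initial(self, grab, task):
            self.pages_seen += 1
            # hypothetical limit: stop accepting new tasks after 100 pages
            if self.pages_seen >= 100:
                self.stop()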
task_generator()

You can override this method to load new tasks smoothly.

It will be used each time the number of tasks in the task queue is less than the number of threads multiplied by 2. This allows you to avoid consuming all free memory when the total number of tasks is big.
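A common pattern is to stream task URLs from a large file so that only a small batch of tasks is held in memory at any time; the file name is an assumption:

    from grab.spider import Spider, Task

    class FileSpider(Spider):
        def task_generator(self):
            # urls.txt is a hypothetical file with one URL per line
            with open('urls.txt') as inp:
                for line in inp:
                    yield Task('page', url=line.strip())

        def task_page(self, grab, task):
            print('fetched', task.url)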