Link Extractors¶
Link extractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response
objects) which will be eventually followed.
There is scrapy.linkextractors.LinkExtractor available in Scrapy, but you can create your own custom link extractors to suit your needs by implementing a simple interface.
The only public method that every link extractor has is extract_links, which receives a Response
object and returns a list of scrapy.link.Link
objects. Link extractors are meant to be instantiated once, and their extract_links
method called several times with different responses to extract the links to follow.
Link extractors are used in the CrawlSpider
class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even if you don't subclass from CrawlSpider
, as their purpose is very simple: to extract links.
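The extract_links contract described above can be sketched without Scrapy, using only the standard library. Link, SimpleLinkExtractor, and the (base_url, body) signature below are hypothetical simplifications standing in for scrapy.link.Link, a real extractor class, and the Response argument:

```python
# A minimal, Scrapy-free sketch of the extract_links contract.
# Link and SimpleLinkExtractor are hypothetical stand-ins, not Scrapy APIs.
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass(frozen=True)
class Link:
    url: str
    text: str = ""


class _AnchorParser(HTMLParser):
    """Collects href values from <a> and <area> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


class SimpleLinkExtractor:
    """Instantiated once; extract_links() is called on many responses."""

    def extract_links(self, base_url, body):
        parser = _AnchorParser()
        parser.feed(body)
        # Resolve relative hrefs against the page URL and deduplicate,
        # preserving first-seen order.
        seen, links = set(), []
        for href in parser.hrefs:
            url = urljoin(base_url, href)
            if url not in seen:
                seen.add(url)
                links.append(Link(url))
        return links


extractor = SimpleLinkExtractor()
links = extractor.extract_links(
    "http://example.com/index.html",
    '<a href="/about">About</a> <area href="/about"> <a href="other.html">Other</a>',
)
print([l.url for l in links])
# → ['http://example.com/about', 'http://example.com/other.html']
```

A real extractor additionally filters by domain, extension, and regex, as the LxmlLinkExtractor parameters below describe.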
Built-in link extractors reference¶
Link extractor classes bundled with Scrapy are provided in the scrapy.linkextractors
module.
The default link extractor is LinkExtractor
, which is the same as LxmlLinkExtractor
:
from scrapy.linkextractors import LinkExtractor
There used to be other link extractor classes in previous Scrapy versions, but they are deprecated now.
LxmlLinkExtractor¶
- class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)¶
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.
Parameters:
- allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
- deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won't exclude any links.
- allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links
- deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
- deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package.
- restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
- restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
- tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
- attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
- canonicalize (boolean) – canonicalize each extracted url (using w3lib.url.canonicalize_url). Defaults to True.
- unique (boolean) – whether duplicate filtering should be applied to extracted links.
- process_value (callable) – a function which receives each value extracted from the tag and attributes scanned, and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x. For example, to extract links from this code:
<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
You can use the following function in process_value:

import re

def process_value(value):
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
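Running that function against the sample href value shows the javascript: wrapper being stripped; URLs that don't match the pattern yield None, which (per the parameter description above) drops the link entirely:

```python
import re


def process_value(value):
    # Pull the real URL out of a javascript:goToPage('...') wrapper.
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
    # Implicitly returns None for non-matching values, so those links
    # are ignored by the extractor.


href = "javascript:goToPage('../other/page.html'); return false"
print(process_value(href))  # → ../other/page.html
print(process_value("mailto:nobody@example.com"))  # → None
```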