Link Extractors¶
Link extractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response
objects) which will be eventually followed.
There is scrapy.linkextractors.LinkExtractor available in Scrapy, but you can create your own custom link extractors to suit your needs by implementing a simple interface.
The only public method that every link extractor has is extract_links, which receives a Response
object and returns a list of scrapy.link.Link
objects. Link extractors are meant to be instantiated once, and their extract_links
method called several times with different responses to extract the links to follow.
Link extractors are used in the CrawlSpider
class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even if you don't subclass from CrawlSpider
, as their purpose is very simple: to extract links.
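The extract_links contract described above can be sketched without Scrapy, using only the standard library. Link, SimpleLinkExtractor, and the (base_url, body) signature below are hypothetical simplifications standing in for scrapy.link.Link, a real extractor class, and the Response argument:

```python
# A minimal, Scrapy-free sketch of the extract_links contract.
# Link and SimpleLinkExtractor are hypothetical stand-ins, not Scrapy APIs.
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass(frozen=True)
class Link:
    url: str
    text: str = ""


class _AnchorParser(HTMLParser):
    """Collects href values from <a> and <area> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


class SimpleLinkExtractor:
    """Instantiated once; extract_links() is called on many responses."""

    def extract_links(self, base_url, body):
        parser = _AnchorParser()
        parser.feed(body)
        # Resolve relative hrefs against the page URL and deduplicate,
        # preserving first-seen order.
        seen, links = set(), []
        for href in parser.hrefs:
            url = urljoin(base_url, href)
            if url not in seen:
                seen.add(url)
                links.append(Link(url))
        return links


extractor = SimpleLinkExtractor()
links = extractor.extract_links(
    "http://example.com/index.html",
    '<a href="/about">About</a> <area href="/about"> <a href="other.html">Other</a>',
)
print([l.url for l in links])
# → ['http://example.com/about', 'http://example.com/other.html']
```

A real extractor additionally filters by domain, extension, and regex, as the LxmlLinkExtractor parameters below describe.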
Built-in link extractors reference¶
Link extractor classes bundled with Scrapy are provided in the scrapy.linkextractors
module.
The default link extractor is LinkExtractor
, which is the same as LxmlLinkExtractor
:
from scrapy.linkextractors import LinkExtractor
There used to be other link extractor classes in previous Scrapy versions, but they are deprecated now.
LxmlLinkExtractor¶
- class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)¶
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.
Parameters:
- allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
- deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won't exclude any links.
- allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links
- deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
- deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package.
- restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
- restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
- tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
- attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
- canonicalize (boolean) – canonicalize each extracted url (using w3lib.url.canonicalize_url). Defaults to True.
- unique (boolean) – whether duplicate filtering should be applied to extracted links.
- process_value (callable) – a function which receives each value extracted from the tag and attributes scanned, and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x. For example, to extract links from this code:
<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
You can use the following function in process_value:

import re

def process_value(value):
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
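Running that function against the sample href value shows the javascript: wrapper being stripped; URLs that don't match the pattern yield None, which (per the parameter description above) drops the link entirely:

```python
import re


def process_value(value):
    # Pull the real URL out of a javascript:goToPage('...') wrapper.
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
    # Implicitly returns None for non-matching values, so those links
    # are ignored by the extractor.


href = "javascript:goToPage('../other/page.html'); return false"
print(process_value(href))  # → ../other/page.html
print(process_value("mailto:nobody@example.com"))  # → None
```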