Scrapy shell

The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having to run the spider. It’s meant to be used for testing data extraction code, but you can actually use it for testing any kind of code as it is also a regular Python shell.

The shell is used for testing XPath or CSS expressions and see how they work and what data they extract from the web pages you’re trying to scrape. It allows you to interactively test your expressions while you’re writing your spider, without having to run the spider to test every change.

Once you get familiarized with the Scrapy shell, you’ll see that it’s an invaluable tool for developing and debugging your spiders.

配置shell

如果你安装了IPython,Scrapy shell将使用它(替代标准Python控制台)。The IPython console is much more powerful and provides smart auto-completion and colorized output, among other things.

We highly recommend you install IPython, specially if you’re working on Unix systems (where IPython excels). See the IPython installation guide for more info.

Scrapy还支持bpython,并且在IPython不可用时请尝试使用bpython。

通过scrapy的设置,您可以使用ipythonbpython或标准python shell中的任一种方法去配置shell。通过设置 SCRAPY_PYTHON_SHELL环境变量; 或者在你的scrapy.cfg中定义:

[settings]
shell = bpython

启动shell

To launch the Scrapy shell you can use the shell command like this:

scrapy shell <url>

Where the <url> is the URL you want to scrape.

shell还适用于本地文件。This can be handy if you want to play around with a local copy of a web page. shell understands the following syntaxes for local files:

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

Note

When using relative file paths, be explicit and prepend them with ./ (or ../ when relevant). scrapy shell index.html will not work as one might expect (and this is by design, not a bug).

Because shell favors HTTP URLs over File URIs, and index.html being syntactically similar to example.com, shell will treat index.html as a domain name and trigger a DNS lookup error:

$ scrapy shell index.html
[ ... scrapy shell starts ... ]
[ ... traceback ... ]
twisted.internet.error.DNSLookupError: DNS lookup failed:
address 'index.html' not found: [Errno -5] No address associated with hostname.

shell will not test beforehand if a file called index.html exists in the current directory. 再次,请更显式。

使用shell

Scrapy shell仅仅是一个普通的Python控制台(或IPython 控制台),它提供一些方便的额外快捷方式。

可用的快捷方式

  • shelp() - print a help with the list of available objects and shortcuts
  • fetch(request_or_url) - fetch a new response from the given request or URL and update all related objects accordingly.
  • view(response) - open the given response in your local web browser, for inspection. This will add a <base> tag to the response body in order for external links (such as images and style sheets) to display properly. Note, however, that this will create a temporary file in your computer, which won’t be removed automatically.

可用的Scrapy对象

The Scrapy shell automatically creates some convenient objects from the downloaded page, like the Response object and the Selector objects (for both HTML and XML content).

Those objects are:

  • crawler - the current Crawler object.
  • spider - the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL
  • request - a Request object of the last fetched page. 您可以使用replace()修改该request。或者 使用 fetch 快捷方式来获取新的request。
  • response - a Response object containing the last fetched page
  • settings - the current Scrapy settings

shell会话示例

下面给出一个典型的终端会话的例子, 我们首先爬取了http://scrapy.org的页面,而后接着爬取https://reddit.com的页面。Finally, we modify the (Reddit) request method to POST and re-fetch it getting an error. We end the session by typing Ctrl-D (in Unix systems) or Ctrl-Z in Windows.

Keep in mind that the data extracted here may not be the same when you try it, as those pages are not static and could have changed by the time you test this. The only purpose of this example is to get you familiarized with how the Scrapy shell works.

First, we launch the shell:

scrapy shell 'http://scrapy.org' --nolog

Then, the shell fetches the URL (using the Scrapy downloader) and prints the list of available objects and useful shortcuts (you’ll notice that these lines all start with the [s] prefix):

[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x2bfd650>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>>

After that, we can start playing with the objects:

>>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fb3ed9c9c90>
[s]   item       {}
[s]   request    <GET http://reddit.com>
[s]   response   <200 https://www.reddit.com/>
[s]   settings   <scrapy.settings.Settings object at 0x7fb3ed9c9c10>
[s]   spider     <DefaultSpider 'default' at 0x7fb3ecdd3390>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']

>>> request = request.replace(method="POST")

>>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

>>>

在spider中启动shell来查看response

Sometimes you want to inspect the responses that are being processed in a certain point of your spider, if only to check that response you expect is getting there.

This can be achieved by using the scrapy.shell.inspect_response function.

以下是如何在spider中调用该函数的例子:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

When you run the spider, you will get something similar to this:

2014-01-23 17:48:31-0400 [scrapy] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

>>> response.url
'http://example.org'

Then, you can check if the extraction code is working:

>>> response.xpath('//h1[@class="fn"]')
[]

Nope, it doesn’t. So you can open the response in your web browser and see if it’s the response you were expecting:

>>> view(response)
True

Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:

>>> ^D
2014-01-23 17:50:03-0400 [scrapy] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...

Note that you can’t use the fetch shortcut here since the Scrapy engine is blocked by the shell. However, after you leave the shell, the spider will continue crawling where it stopped, as shown above.