Items

The main goal in scraping is to extract structured data from unstructured sources, typically, web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.

To define common output data format Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. 提供了类似于词典的API以及用于声明可用字段的简单语法。

Various Scrapy components use extra information provided by Items: exporters look at declared fields to figure out columns to export, serialization can be customized using Item fields metadata, trackref tracks Item instances to help finding memory leaks (see Debugging memory leaks with trackref), etc.

声明Item

Items are declared using a simple class definition syntax and Field objects. 下面是一个示例︰

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

熟悉 Django 的朋友一定会注意到Scrapy Item定义方式与Django Models很类似, 不过没有那么多不同的字段类型(Field type),更为简单。

Item字段

Field objects are used to specify metadata for each field. For example, the serializer function for the last_updated field illustrated in the example above.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For this same reason, there is no reference list of all available metadata keys. Each key defined in Field objects could be used by a different component, and only those components know about it. You can also define and use any other Field key in your project too, for your own needs. The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, those components whose behaviour depends on each field use certain field keys to configure that behaviour. You must refer to their documentation to see which metadata keys are used by each component.

It’s important to note that the Field objects used to declare the item do not stay assigned as class attributes. Instead, they can be accessed through the Item.fields attribute.

与Item配合

接下来使用上述声明Product item来演示一些item的操作。你会注意到该API非常类似于 dict API.

创建item

>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)

获取字段值

>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC

>>> product['price']
1000

>>> product['last_updated']
Traceback (most recent call last):
    ...
KeyError: 'last_updated'

>>> product.get('last_updated', 'not set')
not set

>>> product['lala'] # getting unknown field
Traceback (most recent call last):
    ...
KeyError: 'lala'

>>> product.get('lala', 'unknown field')
'unknown field'

>>> 'name' in product  # is name field populated?
True

>>> 'last_updated' in product  # is last_updated populated?
False

>>> 'last_updated' in product.fields  # is last_updated a declared field?
True

>>> 'lala' in product.fields  # is lala a declared field?
False

设置字段值

>>> product['last_updated'] = 'today'
>>> product['last_updated']
today

>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

访问所有填充的值

若要访问所有填充的值,只需使用典型的dict API:

>>> product.keys()
['price', 'name']

>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]

其他常见的任务

Copying items:

>>> product2 = Product(product)
>>> print product2
Product(name='Desktop PC', price=1000)

>>> product3 = product2.copy()
>>> print product3
Product(name='Desktop PC', price=1000)

Creating dicts from items:

>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}

Creating items from dicts:

>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')

>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

扩展Item

You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.

For example:

class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()

You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values, like this:

class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)

That adds (or replaces) the serializer metadata key for the name field, keeping all the previously existing metadata values.

Item对象

class scrapy.item.Item([arg])

Return a new Item optionally initialized from the given argument.

Item复制了标准的dict API,包括初始化函数也相同。The only additional attribute provided by Items is:

fields

一个包含了item所有声明的字段的字典,而不仅仅是获取到的字段。The keys are the field names and the values are the Field objects used in the Item declaration.

字段对象

class scrapy.item.Field([arg])

Field类仅仅是内置dict类的一个别名,并没有提供额外的方法或者属性。In other words, Field objects are plain-old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.