Scrapy Step by Step

Scrapy Version: 0.14.4
Example Site: https://wedge.applicantpro.com/jobs/

This is an easy-to-understand guide to how Scrapy works. We need just 3 simple steps to cover the basic functionality of Scrapy. I have added a short explanation at the end of each step.

STEP: 1

- Open a terminal and create a new project:

scrapy startproject parser_wedge

Explanation: This command creates a new project. parser_wedge is the project name, and the command creates a directory with that name.

STEP: 2

- Now open the parser_wedge directory and open /parser_wedge/items.py.

- Import the Item and Field classes at the top of the file:

from scrapy.item import Item, Field

- Now add the following 2 fields to the body of the item class (ParserWedgeItem):

title = Field()
link = Field()

- Comment out the "pass" keyword by putting a # in front of it.

Explanation: Here I have declared the fields I want to parse. I needed 2 fields (job title and link); you can declare more fields as your requirements grow.
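To show why the fields are declared up front, here is a minimal stand-in written in plain Python. It is not Scrapy's implementation; SketchItem and JobSketch are hypothetical names I made up for this sketch. It only mimics the behaviour that, as far as I know, a Scrapy item works like a dictionary and rejects keys that were not declared as fields.

```python
class SketchItem(dict):
    """Stand-in for an item class. NOT Scrapy's code, just an illustration."""

    fields = ()  # names this item accepts

    def __setitem__(self, key, value):
        # Refuse any key that was not declared as a field.
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


class JobSketch(SketchItem):
    # Same two fields as in items.py above.
    fields = ("title", "link")


job = JobSketch()
job["title"] = ["Example title"]        # made-up value
job["link"] = ["/jobs/example.html"]    # made-up value

try:
    job["salary"] = ["unknown"]         # "salary" was never declared
    rejected = False
except KeyError:
    rejected = True
```

So if you want to store one more value in the spider, the matching field has to be added in items.py first.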

STEP: 3

- Create a new file called wedge.py and save it in the spiders folder (/parser_wedge/spiders/).

- Copy and paste the following code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from parser_wedge.items import ParserWedgeItem


class WedgeSpider(BaseSpider):
    # name used on the command line: scrapy crawl wedge
    name = "wedge"
    # the attribute must be spelled allowed_domains
    allowed_domains = ["wedge.applicantpro.com"]
    start_urls = ["http://wedge.applicantpro.com/jobs/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # every link inside the job listings block
        sites = hxs.select('//div[@id="job_listings"]//a')
        items = []
        for site in sites:
            data = ParserWedgeItem()
            # extract() returns a list of matching strings
            data['title'] = site.select('.//h4/text()').extract()
            data['link'] = site.select('./@href').extract()
            items.append(data)
        return items

- Now go to the terminal and run the following 2 commands:
cd parser_wedge
scrapy crawl wedge

Explanation: This is the main part of the project; it crawls the site based on the code. We need to import the following modules:
- BaseSpider: The base class that our spider inherits from; it marks the class as a basic spider. You will get more information later.
- HtmlXPathSelector: It specifies the method/language used to parse the page; here it is an XPath selector.
- ParserWedgeItem: The item class defined in items.py; it gives you access to the "link" and "title" fields.

Now, name = "wedge" is the name of the spider, which is used when you run the spider (scrapy crawl wedge).
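To see what the XPath expressions in the spider pick out, here is a small sketch that runs without Scrapy. It uses Python's standard xml.etree.ElementTree module, which understands a subset of XPath. The markup in SAMPLE is made up by me for illustration; I am only assuming the real page has a similar shape. Also note that Scrapy's extract() returns a list of strings, while this sketch stores single values.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, shaped the way the spider's XPath expects.
SAMPLE = """
<html><body>
  <div id="job_listings">
    <a href="/jobs/101.html"><h4>Warehouse Associate</h4></a>
    <a href="/jobs/102.html"><h4>Delivery Driver</h4></a>
  </div>
  <div id="footer"><a href="/about.html">About</a></div>
</body></html>
"""


def extract_jobs(markup):
    root = ET.fromstring(markup.strip())
    jobs = []
    # Counterpart of //div[@id="job_listings"]//a in the spider
    for link in root.findall(".//div[@id='job_listings']//a"):
        heading = link.find(".//h4")      # counterpart of .//h4/text()
        jobs.append({
            "title": heading.text if heading is not None else None,
            "link": link.get("href"),     # counterpart of ./@href
        })
    return jobs


jobs = extract_jobs(SAMPLE)
for job in jobs:
    print(job["title"], job["link"])
```

Only the links inside the job_listings block are collected; the footer link is skipped because it sits in a different div.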

Tuesday, 10 November 2015
