Tuesday 10 November 2015

Step by Step Example

Scrapy Version: 0.14.4
Example Site: https://wedge.applicantpro.com/jobs/

This is an easy-to-understand example of how Scrapy works. We need just 3 simple steps to learn the basic functionality of Scrapy. I have provided a short explanation at the end of each step.


STEP: 1

- Open a terminal and create a new project:

scrapy startproject parser_wedge

Explanation: This command creates a new project. parser_wedge is the project name, and Scrapy creates a directory with this name.
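If the command succeeds, Scrapy generates a project skeleton roughly like the following (a sketch of the default 0.14 project template; the exact files may vary slightly between versions):

  parser_wedge/
      scrapy.cfg
      parser_wedge/
          __init__.py
          items.py
          pipelines.py
          settings.py
          spiders/
              __init__.py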

STEP: 2


- Now open the project directory parser_wedge and open parser_wedge/items.py.

- Import the item classes at the top of the file:

from scrapy.item import Item, Field

- Now, add the following 2 fields inside the item class:

    title = Field() 
    link = Field()

- Comment out the "pass" keyword (put a # in front of it), since the class now defines fields.

Explanation: I have defined the item fields I want to parse. Here I needed 2 fields (job title and link); you can add more fields as per your requirements.
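For reference, after these edits the whole items.py should look roughly like this (a sketch based on the default template; startproject names the item class ParserWedgeItem after the project, and that name is what the spider imports in the next step):

from scrapy.item import Item, Field

class ParserWedgeItem(Item):
    # fields we want to scrape for each job listing
    title = Field()
    link = Field()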

STEP: 3

- Create a new file called wedge.py and save it in the parser_wedge/spiders folder.

- Copy and paste the following code:


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from parser_wedge.items import ParserWedgeItem

class WedgeSpider(BaseSpider):
    name = "wedge"
    allowed_domains = ["wedge.applicantpro.com"]
    start_urls = ["http://wedge.applicantpro.com/jobs/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # every anchor inside the job listings container
        sites = hxs.select('//div[@id="job_listings"]//a')
        items = []
        for site in sites:
            data = ParserWedgeItem()
            data['title'] = site.select('.//h4/text()').extract()
            data['link'] = site.select('./@href').extract()
            items.append(data)
        return items

- Now, go to the terminal and run the following 2 commands:
  cd parser_wedge
  scrapy crawl wedge

Explanation: This is the main part of the project, which crawls the site based on the code. We need to import the following modules:
    - BaseSpider: the base class every spider inherits from; it identifies this class as a spider. You will get more information about it later.
    - HtmlXPathSelector: the selector used to parse items from the HTML response with XPath expressions.
    - ParserWedgeItem: the item class defined in items.py, which gives access to the "link" and "title" fields.

Now, name = "wedge" is the name of the spider, which is used when you run the spider with scrapy crawl.
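If you want to save the scraped items instead of only printing them to the console, Scrapy's feed exports can write them to a file. For example, assuming the default feed export settings, the following command stores the items as JSON:

  scrapy crawl wedge -o items.json -t json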