Scrapy Version: 0.14.4
Example Site: https://wedge.applicantpro.com/jobs/
This tutorial is an easy way to understand how Scrapy works. We need just three simple steps to learn the basic functionality of Scrapy. I have provided a short explanation at the end of each step.
STEP: 1
- Open the terminal and create a new project:
scrapy startproject parser_wedge
Explanation: This command creates a new project. parser_wedge is the project name, and a directory with that name is created.
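For reference, this is the layout the startproject command generates in Scrapy 0.14:

```
parser_wedge/
    scrapy.cfg          # project configuration file
    parser_wedge/       # the project's Python module
        __init__.py
        items.py        # item definitions (edited in Step 2)
        pipelines.py
        settings.py
        spiders/        # your spiders go here (Step 3)
            __init__.py
```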
STEP: 2
- Now open the parser_wedge directory and edit /parser_wedge/items.py.
- Import the item module at the top of the file:
from scrapy.item import Item, Field
- Now, add the following two fields to the item class:
title = Field()
link = Field()
- Comment out (or remove) the "pass" keyword.
Explanation: These are the items I want to parse. Here I needed two fields (the job link and title); you can add more fields as per your requirements.
STEP: 3
- Create a new file called wedge.py and save it in the spiders folder.
- Copy and Paste the following code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from parser_wedge.items import ParserWedgeItem

class WedgeSpider(BaseSpider):
    name = "wedge"
    allowed_domains = ["wedge.applicantpro.com"]
    start_urls = ["http://wedge.applicantpro.com/jobs/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select every job link inside the listings container
        sites = hxs.select('//div[@id="job_listings"]//a')
        items = []
        for site in sites:
            item = ParserWedgeItem()
            item['title'] = site.select('.//h4/text()').extract()
            item['link'] = site.select('./@href').extract()
            items.append(item)
        return items
- Now, go to the terminal and run the following two commands:
cd parser_wedge
scrapy crawl wedge
Explanation: This is the main part of the project, which crawls the site based on the code. We need to import the following modules:
- BaseSpider: the simplest kind of spider; our spider subclasses it. You will get more information about spider types later.
- HtmlXPathSelector: the selector used to extract items from the HTML response with XPath expressions.
- ParserWedgeItem: the item class defined in items.py, which gives access to the "link" and "title" fields.
Finally, name = "wedge" is the name of the spider, which is used when you run the spider.
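As a rough, self-contained illustration of what those two XPath expressions select, here is the same extraction done with the standard library's xml.etree on an invented, well-formed snippet shaped like the jobs page (the real spider uses HtmlXPathSelector, which also tolerates messy HTML):

```python
import xml.etree.ElementTree as ET

# Invented markup shaped like the jobs page, for demonstration only.
html = """
<div id="job_listings">
  <a href="/jobs/1234/welder"><h4>Welder</h4></a>
  <a href="/jobs/5678/machinist"><h4>Machinist</h4></a>
</div>
"""

root = ET.fromstring(html)            # the <div id="job_listings"> element
items = []
for a in root.findall(".//a"):        # mirrors //div[@id="job_listings"]//a
    items.append({
        "title": a.findtext("h4"),    # mirrors .//h4/text()
        "link": a.get("href"),        # mirrors ./@href
    })
print(items)
```

Each anchor tag becomes one item with the heading text as the title and the href attribute as the link, which is exactly what parse() returns to Scrapy.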