web scraping - Using python scrapy to extract link and text -

May 15, 2012

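I am new to Python and Scrapy. I want to extract information from the website http://www.vodafone.com.au/about/legal/critical-information-summary/plans including the link to each document, its name, and its valid-to date.

I tried the following code, but it does not work. I would appreciate it if someone could explain why and help me.

Here is my file vodafone.py: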

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider
    from vodafone_scraper.items import VodafoneScraperItem


    class VodafoneSpider(scrapy.Spider):
        name = 'vodafone'
        allowed_domains = ['vodafone.com.au']
        start_urls = ['http://www.vodafone.com.au/about/legal/critical-information-summary/plans']

        def parse(self, response):
            # Each link in the first column of the table should become one item.
            for sel in response.xpath('//tbody/tr/td[1]/a'):
                item = VodafoneScraperItem()
                # '@href' selects the attribute; a bare 'href' would look for a child element.
                item['link'] = sel.xpath('@href').extract()
                item['name'] = sel.xpath('text()').extract_first()

                yield item

It doesn't work because the page content is generated dynamically with JavaScript. The elements you are trying to extract data from are not present in the HTML source that Scrapy receives as the response (you can see this when you open the page source code in your browser).
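One way to confirm this, sketched below, is to count the table links in the raw HTML before any JavaScript has run. The class and function names (`TableLinkCounter`, `count_table_links`) are hypothetical helpers, not part of Scrapy, and the check is deliberately looser than the XPath in the question: it counts any `<a>` inside a `<td>`.

```python
from html.parser import HTMLParser


class TableLinkCounter(HTMLParser):
    """Count <a> tags that appear inside a <td> in raw, unrendered HTML."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0  # how many <td> elements are currently open
        self.links = 0     # number of <a> tags seen inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "a" and self.td_depth > 0:
            self.links += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth > 0:
            self.td_depth -= 1


def count_table_links(html):
    """Return the number of table-cell links found in the given HTML text."""
    counter = TableLinkCounter()
    counter.feed(html)
    counter.close()
    return counter.links
```

Feeding it the body that Scrapy downloaded (for example `response.text` in the spider's `parse` method) should give zero if the table is only built in the browser.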

You have two options:

1. Check whether you can find an API that the page uses. Look in your browser's developer tools for XHR requests on the Network tab. Luckily, this particular page seems to request its data from http://www.vodafone.com.au/rest/cis?field:plancategory:equals=mobile%20plans&field:planfromdate:lessthaneq=20/08/2017 and that URL returns JSON that you can parse.
2. The other option is to render the page, including its JavaScript, and parse it afterwards. I recommend using Splash for its seamless integration with Scrapy via the scrapy-splash library.
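The first option could be sketched roughly as follows. The endpoint and query keys are copied from the URL above, but the structure of the returned JSON is not shown here, so the helper simply searches the payload for objects that carry the keys you name. The function names and the example keys in the usage note (`url`, `title`, `validTo`) are assumptions for illustration; inspect the real response and substitute the actual field names.

```python
import json
from urllib.parse import quote, urlencode

# Endpoint taken from the URL quoted in the answer.
API_BASE = "http://www.vodafone.com.au/rest/cis"


def build_api_url(category, from_date):
    """Build the query URL for a plan category and a dd/mm/yyyy date."""
    params = {
        "field:plancategory:equals": category,
        "field:planfromdate:lessthaneq": from_date,
    }
    # quote_via=quote encodes spaces as %20; safe=":/" leaves colons and slashes as-is.
    return API_BASE + "?" + urlencode(params, safe=":/", quote_via=quote)


def iter_records(payload, required_keys):
    """Yield every dict nested anywhere in payload that has all required_keys."""
    if isinstance(payload, dict):
        if all(key in payload for key in required_keys):
            yield payload
        for value in payload.values():
            yield from iter_records(value, required_keys)
    elif isinstance(payload, list):
        for value in payload:
            yield from iter_records(value, required_keys)


def parse_plans(json_text, field_map):
    """Map JSON records to flat dicts.

    field_map maps output names to JSON keys, e.g. {"link": "url"}.
    The JSON keys are hypothetical until checked against the real response.
    """
    data = json.loads(json_text)
    source_keys = tuple(field_map.values())
    return [
        {out_name: record[src_key] for out_name, src_key in field_map.items()}
        for record in iter_records(data, source_keys)
    ]
```

In a spider, `start_urls` could hold `build_api_url("mobile plans", "20/08/2017")`, and `parse` could yield each dict from `parse_plans(response.text, {"link": "url", "name": "title", "valid_to": "validTo"})`, with those three JSON keys replaced by whatever the response really contains.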
