Web scraping: using Python Scrapy to extract link and text
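I am new to Python and Scrapy. I want to extract information from the website http://www.vodafone.com.au/about/legal/critical-information-summary/plans, including the link to each document, its name, and its "valid to" date.
I tried the code below, but it does not work. I would appreciate it if someone could explain why and help me.
Here is the file vodafone.py: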

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider
    from vodafone_scraper.items import VodafoneScraperItem


    class VodafoneSpider(scrapy.Spider):
        name = 'vodafone'
        allowed_domains = ['vodafone.com.au']
        start_urls = ['http://www.vodafone.com.au/about/legal/critical-information-summary/plans']

        def parse(self, response):
            # Each link is expected in the first cell of a table row.
            for sel in response.xpath('//tbody/tr/td[1]/a'):
                item = VodafoneScraperItem()
                item['link'] = sel.xpath('@href').extract()
                item['name'] = sel.xpath('text()').extract_first()
                yield item

It doesn't work because the page content is generated dynamically by JavaScript. The elements you are trying to extract data from are not present in the HTML source that Scrapy receives as the response (you can see this when you open the page source code in your browser).
You have two options:
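If you want to confirm this from code rather than by eye, here is a minimal sketch that counts links inside table cells in a raw HTML string. It uses only the standard library, and the names `TableLinkCounter` and `count_table_links` are made up for this illustration. It is also a looser check than the XPath in the question, since it counts links in any `<td>`, not just the first cell of each row.

```python
from html.parser import HTMLParser


class TableLinkCounter(HTMLParser):
    """Counts <a> tags that appear inside a <td> element."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0  # how many <td> elements are currently open
        self.links = 0     # <a> tags seen while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "a" and self.td_depth > 0:
            self.links += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth > 0:
            self.td_depth -= 1


def count_table_links(html):
    """Return the number of links found inside table cells of `html`."""
    counter = TableLinkCounter()
    counter.feed(html)
    return counter.links
```

Inside a spider callback you could pass `response.text` to `count_table_links`. A result of 0 for a page that shows a full table in the browser suggests the rows are being added later by JavaScript.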
- Try to find out whether there is an API that the page uses. Look in your browser's developer tools for XHR requests on the Network tab. Luckily, this particular page seems to request its data from http://www.vodafone.com.au/rest/cis?field:plancategory:equals=mobile%20plans&field:planfromdate:lessthaneq=20/08/2017. It returns JSON that you can parse.
- The other option is to render the page, including its JavaScript, and parse it afterwards. I recommend using Splash for its seamless integration with Scrapy via the scrapy-splash library.
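For the first option, a rough sketch of the JSON parsing step might look like the following. I have not checked the actual shape of the response, so treat these as assumptions: that the payload is a JSON array of objects, and that the key names `planName`, `documentUrl` and `planToDate` exist. They are hypothetical placeholders, as is the helper name `extract_plans`; inspect the real response in the Network tab and adjust them.

```python
import json
from urllib.parse import urljoin

BASE_URL = "http://www.vodafone.com.au"


def extract_plans(json_text,
                  name_key="planName",        # hypothetical key name
                  link_key="documentUrl",     # hypothetical key name
                  valid_to_key="planToDate"):  # hypothetical key name
    """Turn the JSON text returned by the API into a list of plain dicts.

    ASSUMPTION: the payload is a JSON array of objects. The default key
    names are placeholders and must be checked against the real response.
    """
    records = json.loads(json_text)
    plans = []
    for record in records:
        link = record.get(link_key)
        plans.append({
            "name": record.get(name_key),
            # Resolve relative document paths against the site root.
            "link": urljoin(BASE_URL, link) if link else None,
            "valid_to": record.get(valid_to_key),
        })
    return plans
```

In the spider, you would put the API URL in `start_urls` and, in `parse`, loop over `extract_plans(response.text)` and yield each dict (or copy its values into your item). Keeping the parsing in a plain function makes it easy to try out on a saved copy of the response before running the crawl.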