python - Scrape Pinboard recursively with Scrapy - "Spider must return Request" error


In an effort to hone my Python and Spark GraphX skills, I have been trying to build a graph of Pinboard users and bookmarks. In order to do so, I scrape Pinboard bookmarks recursively in the following fashion:

  1. Start with a user and scrape their bookmarks.
  2. For each bookmark, identified by its url_slug, find the users who have saved the same bookmark.
  3. For each user found in step 2, repeat the process (go to 1, ...).

Despite having tried suggestions from several threads here (including using Rules), when I try to implement this logic I get the following error:

    ERROR: Spider must return Request, BaseItem, dict or None, got 'generator'

I suspect this has to do with the mix of yield/return in my code.
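A quick experiment outside of Scrapy seems to support this suspicion: any function whose body contains a yield returns a generator object when called, even if it also has a return statement. (This is a toy example, not taken from the spider:)

    def mixed():
        yield 1
        return 2  # inside a generator, return only stops the iteration

    result = mixed()
    print(type(result))  # <class 'generator'>
    print(list(result))  # [1] -- the returned 2 is never part of the yielded values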

Here is a quick description of my code:

My main parse method finds the bookmark items of one user (also following the previous pages of bookmarks of the same user) and yields to the parse_bookmark method, which scrapes these bookmarks:

    import json
    import re

    import scrapy
    from bs4 import BeautifulSoup

    class PinSpider(scrapy.Spider):
        name = 'pinboard'

        # before = datetime after 1970-01-01, in seconds, used to separate bookmark pages of a user
        def __init__(self, user='notiv', before='3000000000', *args, **kwargs):
            super(PinSpider, self).__init__(*args, **kwargs)
            self.start_urls = ['https://pinboard.in/u:%s/before:%s' % (user, before)]
            self.before = before

        def parse(self, response):
            # fetches the JSON representation of the bookmarks instead of using CSS or XPath
            bookmarks = re.findall(r'bmarks\[\d+\] = (\{.*?\});', response.body.decode('utf-8'), re.DOTALL | re.MULTILINE)

            for b in bookmarks:
                bookmark = json.loads(b)
                yield self.parse_bookmark(bookmark)

            # bookmarks in previous pages
            previous_page = response.css('a#top_earlier::attr(href)').extract_first()
            if previous_page:
                previous_page = response.urljoin(previous_page)
                yield scrapy.Request(previous_page, callback=self.parse)

This method scrapes the bookmark's information, including the corresponding url_slug, stores it in a PinscrapyItem and yields a scrapy.Request to parse the url_slug:

    def parse_bookmark(self, bookmark):
        pin = PinscrapyItem()

        pin['url_slug'] = bookmark['url_slug']
        pin['title'] = bookmark['title']
        pin['author'] = bookmark['author']

        # if I remove the following line, the parsing of one user works (step 1) but step 2 is not performed
        yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

        return pin

Finally, the parse_url_slug method finds other users who saved the bookmark and recursively yields a scrapy.Request to parse each one of them:

    def parse_url_slug(self, response):
        url_slug = UrlSlugItem()

        if response.body:
            soup = BeautifulSoup(response.body, 'html.parser')

            users = soup.find_all("div", class_="bookmark")
            user_list = [re.findall('/u:(.*)/t:', element.a['href'], re.DOTALL) for element in users]
            user_list_flat = sum(user_list, [])  # change list of lists to a flat list

            url_slug['user_list'] = user_list_flat

            for user in user_list:
                yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)

        return url_slug

(In order to present the code in a more concise manner, I removed the parts that store other interesting fields or check for duplicates, etc.)

Any help is appreciated!

The problem is with the below block of code:

    yield self.parse_bookmark(bookmark)

since in parse_bookmark you have these two lines:

    # if I remove the following line, the parsing of one user works (step 1) but step 2 is not performed
    yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

    return pin

Since the function has a yield in it, its return value is a generator. You then yield that generator to Scrapy, and Scrapy doesn't know what to do with it.

The fix is simple. Change your code to the below:

    yield from self.parse_bookmark(bookmark)

This will yield one value at a time from the generator instead of the generator itself. Or you can do it like this:

    for ret in self.parse_bookmark(bookmark):
        yield ret
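To see what yield from buys you, here is a minimal sketch with a toy generator standing in for parse_bookmark (the names are illustrative only, not part of the spider):

    def items():
        # toy stand-in for parse_bookmark: a generator yielding two values
        yield 'request'
        yield 'item'

    def parse_with_yield_from():
        yield from items()   # delegate: values come out one at a time

    def parse_with_loop():
        for ret in items():  # equivalent explicit loop
            yield ret

    print(list(parse_with_yield_from()))  # ['request', 'item']
    print(list(parse_with_loop()))        # ['request', 'item']

Note that yield from requires Python 3.3+; the explicit loop works everywhere.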

Edit-1

Change your functions to yield the items first:

    yield pin
    yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

And in the other one too:

    url_slug['user_list'] = user_list_flat
    yield url_slug

    for user in user_list:
        yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)

Yielding the items later would schedule a lot of other requests first, so it would take time before you started seeing scraped items. I ran your code with the above changes and it scrapes fine for me:

    2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/u:%5B'semanticdreamer'%5D/before:3000000000>
    {'url_slug': 'e1ff3a9fb18873e494ec47d806349d90fec33c66', 'title': 'flair conky offers dark & light version linux distributions - noobslab | ubuntu/linux news, reviews, tutorials, apps', 'author': 'semanticdreamer'}
    2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/url:d9c16292ec9019fdc8411e02fe4f3d6046185c58>
    {'user_list': ['ronert', 'notiv']}
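One detail worth noting in the first log line: the URL contains a percent-encoded Python list (%5B'semanticdreamer'%5D). That appears to come from parse_url_slug iterating over user_list, which is a list of lists, instead of the flattened user_list_flat. A likely one-line fix (untested against the site):

    # iterate over the flattened list so each request targets a single username
    for user in user_list_flat:
        yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)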
