python - Scrape Pinboard recursively with Scrapy - "Spider must return Request" error
In an effort to hone my Python and Spark GraphX skills, I have been trying to build a graph of Pinboard users and bookmarks. In order to do so, I scrape Pinboard bookmarks recursively in the following fashion:
- Start with a user and scrape their bookmarks.
- For each bookmark, identified by its url_slug, find the users who have saved the same bookmark.
- For each user found in step 2, repeat the process (go to step 1, ...).
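Outside of Scrapy, the recursion above amounts to a breadth-first traversal of the user/bookmark graph with a visited set. A minimal sketch, where `get_bookmarks` and `get_users_for_slug` are hypothetical stand-ins for the actual HTTP requests:

```python
from collections import deque

def crawl(start_user, get_bookmarks, get_users_for_slug):
    """Breadth-first traversal of the user/bookmark graph.

    get_bookmarks(user) -> url_slugs saved by that user (step 1)
    get_users_for_slug(slug) -> users who saved that slug (step 2)
    """
    edges = []                      # (user, url_slug) pairs discovered
    seen_users = {start_user}
    queue = deque([start_user])
    while queue:
        user = queue.popleft()
        for slug in get_bookmarks(user):            # step 1
            edges.append((user, slug))
            for other in get_users_for_slug(slug):  # step 2
                if other not in seen_users:         # step 3: recurse on new users only
                    seen_users.add(other)
                    queue.append(other)
    return edges

# Tiny fabricated graph for illustration:
bookmarks = {'a': ['s1'], 'b': ['s1', 's2'], 'c': ['s2']}
savers = {'s1': ['a', 'b'], 's2': ['b', 'c']}
edges = crawl('a', lambda u: bookmarks.get(u, []), lambda s: savers.get(s, []))
print(edges)  # [('a', 's1'), ('b', 's1'), ('b', 's2'), ('c', 's2')]
```

The visited set is what keeps the recursion from looping forever when two users share a bookmark.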
Despite having tried the suggestions from several threads here (including using rules), when I try to implement this logic I get the following error:

ERROR: Spider must return Request, BaseItem, dict or None, got 'generator'

which I suspect has to do with the mix of yield/return in my code.
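The error is reproducible outside of Scrapy: any function whose body contains `yield` returns a generator object when called, even if it also has a `return` statement. A minimal illustration of the same shape as the spider code:

```python
import types

def parse_bookmark(bookmark):
    # The presence of `yield` anywhere in the body makes this a generator function
    yield 'request-for-' + bookmark

def parse(bookmarks):
    for b in bookmarks:
        # Same bug shape as in the spider: this yields the generator
        # object itself, not the values inside it
        yield parse_bookmark(b)

results = list(parse(['slug1']))
print(isinstance(results[0], types.GeneratorType))  # True
```

The consumer (here `list`, in the spider Scrapy) receives a generator instead of an item or a Request, which is exactly what the error message complains about.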
Here is a quick description of my code:

My main parse method finds the bookmark items for one user (also following the previous pages of bookmarks of the same user) and yields to the parse_bookmark method, which scrapes these bookmarks.
```python
import json
import re

import scrapy


class PinSpider(scrapy.Spider):
    name = 'pinboard'

    # before = datetime after 1970-01-01 in seconds, used to separate bookmark pages of a user
    def __init__(self, user='notiv', before='3000000000', *args, **kwargs):
        super(PinSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://pinboard.in/u:%s/before:%s' % (user, before)]
        self.before = before

    def parse(self, response):
        # Fetches the JSON representation of bookmarks instead of using css or xpath
        bookmarks = re.findall(r'bmarks\[\d+\] = (\{.*?\});',
                               response.body.decode('utf-8'), re.DOTALL | re.MULTILINE)

        for b in bookmarks:
            bookmark = json.loads(b)
            yield self.parse_bookmark(bookmark)

        # Bookmarks in previous pages
        previous_page = response.css('a#top_earlier::attr(href)').extract_first()
        if previous_page:
            previous_page = response.urljoin(previous_page)
            yield scrapy.Request(previous_page, callback=self.parse)
```
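The regex-and-JSON extraction in parse can be checked in isolation; the page body below is fabricated for illustration, mimicking how a Pinboard profile page embeds bookmarks as JavaScript assignments:

```python
import json
import re

# Fabricated excerpt of a profile page body (illustrative only)
body = '''
bmarks[0] = {"url_slug": "abc123", "title": "First", "author": "notiv"};
bmarks[1] = {"url_slug": "def456", "title": "Second", "author": "notiv"};
'''

# Same pattern as in the spider: capture each {...} object literal
bookmarks = re.findall(r'bmarks\[\d+\] = (\{.*?\});', body, re.DOTALL | re.MULTILINE)
parsed = [json.loads(b) for b in bookmarks]
print([p['url_slug'] for p in parsed])  # ['abc123', 'def456']
```

The non-greedy `.*?` together with `re.DOTALL` keeps each match confined to one `bmarks[n] = {...};` assignment even if the JSON spans multiple lines.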
This method scrapes the information of a bookmark, including the corresponding url_slug, stores it in a PinscrapyItem, and yields a scrapy.Request to parse the url_slug:
```python
def parse_bookmark(self, bookmark):
    pin = PinscrapyItem()

    pin['url_slug'] = bookmark['url_slug']
    pin['title'] = bookmark['title']
    pin['author'] = bookmark['author']

    # If I remove the following line the parsing of one user works (step 1) but no step 2 is performed
    yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

    return pin
```
Finally, the parse_url_slug method finds the other users who saved the bookmark and recursively yields a scrapy.Request to parse each one of them:
```python
def parse_url_slug(self, response):
    url_slug = UrlSlugItem()

    if response.body:
        soup = BeautifulSoup(response.body, 'html.parser')

        users = soup.find_all("div", class_="bookmark")
        user_list = [re.findall('/u:(.*)/t:', element.a['href'], re.DOTALL) for element in users]
        user_list_flat = sum(user_list, [])  # Change from a list of lists to a flat list

        url_slug['user_list'] = user_list_flat

        for user in user_list:
            yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before),
                                 callback=self.parse)

        return url_slug
```
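One detail worth isolating: re.findall returns a list per link, so user_list is a list of lists and sum(user_list, []) flattens it. Note that the loop above iterates user_list rather than user_list_flat, so each `user` interpolated into the URL is still a one-element list (this is visible later in the scraped URL containing `u:%5b'semanticdreamer'%5d`, i.e. a URL-encoded `['...']`). A quick check with fabricated hrefs:

```python
import re

hrefs = ['/u:ronert/t:python', '/u:notiv/t:scrapy']  # fabricated examples
user_list = [re.findall('/u:(.*)/t:', href, re.DOTALL) for href in hrefs]
print(user_list)            # [['ronert'], ['notiv']] -- one sub-list per href
user_list_flat = sum(user_list, [])
print(user_list_flat)       # ['ronert', 'notiv']

# Interpolating a sub-list instead of a string produces a malformed URL:
print('https://pinboard.in/u:%s' % user_list[0])  # https://pinboard.in/u:['ronert']
```

Iterating over user_list_flat instead would put plain usernames into the URLs.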
(In order to present the code in a more concise manner, I removed the parts that store other interesting fields, check for duplicates, etc.)

Any help is appreciated!
The problem is in the block of code below:

```python
yield self.parse_bookmark(bookmark)
```

since parse_bookmark contains these two lines:

```python
# If I remove the following line the parsing of one user works (step 1) but no step 2 is performed
yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

return pin
```
Because it contains a yield, the return value of the function is a generator, and you yield that generator to Scrapy, which doesn't know what to do with it.

The fix is simple. Change the code to:

```python
yield from self.parse_bookmark(bookmark)
```

This yields one value at a time from the generator instead of the generator itself. Or you can do this:
```python
for ret in self.parse_bookmark(bookmark):
    yield ret
```
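Both fixes are equivalent (`yield from` requires Python 3.3+); a self-contained comparison of the broken and fixed shapes:

```python
import types

def inner():
    yield 1
    yield 2

def broken():
    yield inner()          # yields the generator object itself

def fixed_yield_from():
    yield from inner()     # delegates: yields 1, then 2

def fixed_loop():
    for ret in inner():    # equivalent explicit loop
        yield ret

print(isinstance(next(broken()), types.GeneratorType))  # True
print(list(fixed_yield_from()))                          # [1, 2]
print(list(fixed_loop()))                                # [1, 2]
```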
Edit-1

Change the functions to yield the items first:

```python
yield pin
yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)
```

and in the other one too:

```python
url_slug['user_list'] = user_list_flat
yield url_slug

for user in user_list:
    yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)
```
Yielding the items later would schedule a lot of other requests first, and it would take time before you started seeing scraped items. I ran the code with the above changes and it scrapes fine for me:
```
2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/u:%5b'semanticdreamer'%5d/before:3000000000>
{'url_slug': 'e1ff3a9fb18873e494ec47d806349d90fec33c66', 'title': 'flair conky offers dark & light version linux distributions - noobslab | ubuntu/linux news, reviews, tutorials, apps', 'author': 'semanticdreamer'}
2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/url:d9c16292ec9019fdc8411e02fe4f3d6046185c58>
{'user_list': ['ronert', 'notiv']}
```