javascript - Download entire webpage (html, image, JS) by Selenium Python

July 15, 2013

I have to download the source code of the website www.humkinar.pk in simple HTML form. The content on the site is dynamically generated. I have tried the driver.page_source function of Selenium, but it does not download the page completely: images and JavaScript files are left out. How can I download the complete page? Is there a better and easier solution available in Python?

Using Selenium

I know your question is about Selenium, but in my experience Selenium is recommended for testing, not for scraping. It is very slow. Even with multiple instances of headless browsers (Chrome, in your situation), the results are delayed too much.

Recommendation

Python 2, 3

This trio will help you a lot and save you a bunch of time.

Do not use dryscrape's parser; it is slow and buggy. For this situation, one can use BeautifulSoup with the lxml parser. Use dryscrape to scrape JavaScript-generated content, plain HTML, and images.

If you are scraping a lot of links simultaneously, I highly recommend using ThreadPoolExecutor.
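As a rough illustration of that advice, here is a minimal sketch of fetching several links in parallel with ThreadPoolExecutor from the standard library. The names download and fetch_all are hypothetical helpers made up for this example, and plain urllib stands in for whatever fetcher you actually use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def download(url, timeout=10):
    """Hypothetical fetcher: return the raw bytes served at `url`."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=download, max_workers=8):
    """Fetch every URL concurrently; return {url: result or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be labeled.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep going with the other links.
                results[url] = exc
    return results
```

Calling fetch_all(list_of_links) returns a dictionary keyed by URL. A failed link shows up as an exception object in that dictionary instead of aborting the whole batch. The max_workers value of 8 is an arbitrary starting point, not a tuned figure.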

Edit #1

dryscrape + BeautifulSoup usage (Python 3+)

    from dryscrape import start_xvfb
    from dryscrape.session import Session
    from dryscrape.mixins import WaitTimeoutError
    from bs4 import BeautifulSoup


    def new_session():
        session = Session()
        # skip image loading to speed up page visits
        session.set_attribute('auto_load_images', False)
        session.set_header('User-Agent', 'someuseragent')
        return session


    def session_reset(session):
        return session.reset()


    def session_visit(session, url, check):
        session.visit(url)
        # ensure market table is visible first
        if check:
            try:
                session.wait_for(lambda: session.at_css(
                    'some#css.selector.here'))
            except WaitTimeoutError:
                pass
        body = session.body()
        session_reset(session)
        return body


    # start xvfb in case no X is running (server)
    start_xvfb()

    session = new_session()
    url = 'https://stackoverflow.com/questions/45796411/download-entire-webpage-html-image-js-by-selenium-python/45824047#45824047'
    check = False

    body = session_visit(session, url, check)
    soup = BeautifulSoup(body, 'lxml')

    result = soup.find('div', {'id': 'answer-45824047'})

    print(result)
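The question also asks about the images and JavaScript files that the page source alone does not include. One possible next step, once the rendered body is available, is to list the URLs of those files so they can be downloaded separately. The sketch below is only an illustration: AssetCollector and collect_assets are hypothetical names, and it uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without third-party packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs from the src attribute of <img> and <script>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "script"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.assets.append(urljoin(self.base_url, src))


def collect_assets(body, base_url):
    """Return asset URLs in document order for the given HTML body."""
    parser = AssetCollector(base_url)
    parser.feed(body)
    parser.close()
    return parser.assets
```

With the body and url variables from the example above, collect_assets(body, url) would give a list of links that can then be fetched one by one or in parallel. Inline scripts and images without a src attribute are skipped.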
