javascript - Download entire webpage (html, image, JS) by Selenium Python
I have to download the source code of the website www.humkinar.pk in simple HTML form. The content on the site is dynamically generated. I have tried the driver.page_source function of Selenium, but it does not download the page completely: images and JavaScript files are left out. How can I download the complete page? Is there a better and easier solution available in Python?
Using Selenium
I know your question is about Selenium, but in my experience Selenium is recommended for testing, not for scraping. It is very slow. Even with multiple instances of headless browsers (Chrome in your situation), the results are delayed too much.
Recommendation
Python 2, 3
This trio will help you a lot and save you a bunch of time.
Do not use the parser of dryscrape; it is slow and buggy. For this situation, one can use BeautifulSoup with the lxml parser. Use dryscrape to scrape JavaScript-generated content, plain HTML, and images.
If you are scraping a lot of links simultaneously, I highly recommend using something like ThreadPoolExecutor.
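To illustrate the ThreadPoolExecutor suggestion, here is a minimal sketch using only the standard library. The names `fetch`, `fetch_all`, `fetch_one`, and `max_workers=8` are my own choices for illustration, and `fetch` does a plain HTTP download with no JavaScript rendering; for rendered pages you would pass in your own function instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its raw bytes (plain HTTP, no JavaScript)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Run fetch_one over urls in a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while fetching it, so one bad link does not abort the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every URL up front and remember which future belongs to which URL.
        future_to_url = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[url] = exc
    return results
```

Calling `fetch_all(list_of_urls)` returns once every download has finished or failed. Threads suit this job because the work is mostly waiting on the network.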
Edit #1
dryscrape + BeautifulSoup usage (Python 3+)
    from dryscrape import start_xvfb
    from dryscrape.session import Session
    from dryscrape.mixins import WaitTimeoutError
    from bs4 import BeautifulSoup


    def new_session():
        session = Session()
        # Skip image loading to speed up rendering
        session.set_attribute('auto_load_images', False)
        session.set_header('User-Agent', 'SomeUserAgent')
        return session


    def session_reset(session):
        return session.reset()


    def session_visit(session, url, check):
        session.visit(url)
        # ensure the market table is visible first
        if check:
            try:
                session.wait_for(lambda: session.at_css(
                    'SOME#CSS.SELECTOR.HERE'))
            except WaitTimeoutError:
                pass
        body = session.body()
        session_reset(session)
        return body


    # start xvfb in case no X is running (server)
    start_xvfb()
    session = new_session()
    url = 'https://stackoverflow.com/questions/45796411/download-entire-webpage-html-image-js-by-selenium-python/45824047#45824047'
    check = False
    body = session_visit(session, url, check)
    soup = BeautifulSoup(body, 'lxml')
    result = soup.find('div', {'id': 'answer-45824047'})
    print(result)
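The code above only returns the rendered HTML in `body`. To get closer to the "entire webpage" asked about in the question, you also need the URLs of the images, scripts, and stylesheets the page refers to. The sketch below is one hedged way to collect them. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages; `AssetCollector` and `collect_asset_urls` are hypothetical names of my own, and it only looks at `<img>`, `<script>`, and stylesheet `<link>` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts, and stylesheets."""

    # tag -> attribute that holds the asset URL
    ASSET_ATTRS = {'img': 'src', 'script': 'src', 'link': 'href'}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        attrs = dict(attrs)
        # For <link>, keep only stylesheets (skip canonical links, icons, etc.)
        if tag == 'link' and 'stylesheet' not in (attrs.get('rel') or '').lower():
            return
        value = attrs.get(wanted)
        if value:
            # Resolve relative paths against the page URL
            url = urljoin(self.base_url, value)
            if url not in self.assets:
                self.assets.append(url)


def collect_asset_urls(html, base_url):
    """Return the de-duplicated asset URLs found in html, in document order."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    return parser.assets
```

Calling `collect_asset_urls(body, url)` gives a list of absolute URLs that can then be downloaded and saved next to the HTML file. Assets that are only requested later by running scripts will not show up in this list.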