
Running JS with pyppeteer to bypass webdriver detection (Part 1)

2023-05-26 443 views


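Introduction to Pyppeteer

Puppeteer is a tool developed by Google on top of Node.js. It lets us control Chrome from JavaScript, and it can also be used for web scraping; its API is very complete and powerful. Pyppeteer is a Python implementation of Puppeteer. It is not developed by Google: it is an unofficial port written by an engineer from Japan, based on part of Puppeteer's functionality.

Official documentation: https://miyakogi.github.io/pyppeteer/reference.html

Installation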

pip install pyppeteer

Opening a page and taking a screenshot

import asyncio
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Evaluating a script on the page

import asyncio
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})

    dimensions = await page.evaluate('''() => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }''')

    print(dimensions)
    # >>> {'width': 800, 'height': 600, 'deviceScaleFactor': 1}
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Keyword argument options

{'headless': True}   # the default: headless, no browser window
{'headless': False}  # set to False to show the browser window
browser = await launch({'headless': False})
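Since `launch()` takes its options as a dict, it can be convenient to assemble that dict in one place. The helper below is only a sketch: `build_launch_options` is a hypothetical name, and it covers just the options that appear in this article (`headless`, `executablePath`, `slowMo`, `args`).

```python
from typing import Any, Dict, List, Optional


def build_launch_options(
    headless: bool = True,
    executable_path: Optional[str] = None,
    slow_mo: Optional[int] = None,
    args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Collect launch options into a dict (hypothetical helper, not part of pyppeteer)."""
    options: Dict[str, Any] = {"headless": headless}
    # Only include optional keys that were actually provided.
    if executable_path is not None:
        options["executablePath"] = executable_path
    if slow_mo is not None:
        options["slowMo"] = slow_mo
    if args:
        options["args"] = list(args)
    return options
```

Under these assumptions the result would be passed straight through, for example `browser = await launch(build_launch_options(headless=False))`.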

Selectors

Page.querySelector()     # CSS selector, first match
Page.querySelectorAll()  # CSS selector, all matches
Page.xpath()             # XPath selector
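As a rough illustration of how the selector methods combine with property access, the sketch below collects the text of every element matching a CSS selector. `texts_for_selector` is a hypothetical helper name, and `page` is assumed to be a pyppeteer `Page` (any object with a compatible `querySelectorAll` coroutine would do).

```python
from typing import Any, List


async def texts_for_selector(page: Any, selector: str) -> List[str]:
    """Return the stripped textContent of each element matching `selector`."""
    elements = await page.querySelectorAll(selector)
    texts = []
    for element in elements:
        # getProperty returns a JS handle; jsonValue converts it to a Python value.
        handle = await element.getProperty("textContent")
        value = await handle.jsonValue()
        texts.append((value or "").strip())
    return texts
```

Inside a coroutine this might be called as `titles = await texts_for_selector(page, 'h1')`.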

Arguments of Page.evaluate() and Page.querySelectorEval()

Passing force_expr=True forces pyppeteer to treat the string as an expression rather than as a function.

Example of getting the page content:

content = await page.evaluate('document.body.textContent', force_expr=True)

import asyncio
from pyppeteer import launch


async def main():
    browser = await launch({'headless': False})
    page = await browser.newPage()
    await page.goto('https://www.cnblogs.com/guyouyin123/p/12669430.html#selenium%E9%80%89%E6%8B%A9%E5%99%A8%E9%80%89%E6%8B%A9')
    content = await page.evaluate('document.body.textContent', force_expr=True)
    print(content)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Example of getting an element's inner text:

element = await page.querySelector('h1')
title = await page.evaluate('(element) => element.textContent', element)
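The two styles of call shown above (a plain expression versus a function that receives arguments) can be wrapped so that each call site states its intent. This is a minimal sketch: `eval_expression` and `eval_function` are hypothetical names, and `page` is assumed to be a pyppeteer `Page`.

```python
from typing import Any


async def eval_expression(page: Any, expression: str) -> Any:
    """Evaluate a JS expression such as 'document.body.textContent'."""
    return await page.evaluate(expression, force_expr=True)


async def eval_function(page: Any, function_source: str, *args: Any) -> Any:
    """Call a JS function such as '(element) => element.textContent' with arguments."""
    return await page.evaluate(function_source, *args, force_expr=False)
```

With these helpers the element example would read `title = await eval_function(page, '(element) => element.textContent', element)`.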

Basic usage

import asyncio
from pyppeteer import launch


async def main():
    # Setting headless to False runs the browser with a visible window.
    # Pyppeteer accepts options as a dict or as keyword arguments; Puppeteer only accepts a dict.
    # To point at a specific browser executable:
    # exepath = r'C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\575458\chrome-win32/chrome.exe'
    # browser = await launch({'executablePath': exepath, 'headless': False, 'slowMo': 30})
    browser = await launch(
        # headless=False,
        {'headless': False}
    )
    page = await browser.newPage()

    # Set the viewport size
    await page.setViewport(viewport={'width': 1280, 'height': 800})

    # Enable or disable JavaScript; with enabled=False the page is not rendered by scripts
    await page.setJavaScriptEnabled(enabled=True)

    # Navigation timeout of 1000 milliseconds
    res = await page.goto('https://www.toutiao.com/', options={'timeout': 1000})
    resp_headers = res.headers  # response headers
    resp_status = res.status    # response status

    # Waiting, option 1: a fixed sleep
    await asyncio.sleep(2)
    # Waiting, option 2: poll for an element in a while loop
    while not await page.querySelector('.t'):
        await asyncio.sleep(0.1)  # pause between polls instead of spinning

    # Scroll to the bottom of the page
    await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')
    await asyncio.sleep(2)

    # Take a screenshot and save it
    await page.screenshot({'path': 'toutiao.png'})

    # Print the page cookies
    print(await page.cookies())

    # Print the full HTML of the page
    print(await page.content())

    # Run a JS function in the page
    dimensions = await page.evaluate(pageFunction='''() => {
        return {
            width: document.documentElement.clientWidth,    // page width
            height: document.documentElement.clientHeight,  // page height
            deviceScaleFactor: window.devicePixelRatio,     // pixel ratio, e.g. 1.0000000149011612
        }
    }''', force_expr=False)  # force_expr=False: the string is run as a function
    print(dimensions)

    # Get only the text; with force_expr=True the string is evaluated as an expression
    content = await page.evaluate(pageFunction='document.body.textContent', force_expr=True)
    print(content)

    # Print the title of the current page
    print(await page.title())

    # Extract news items; XPath expressions can also be used
    """
    # Pyppeteer offers three ways to query the page
    Page.querySelector()     # CSS selector
    Page.querySelectorAll()
    Page.xpath()             # XPath expression
    # Shorthand forms: Page.J(), Page.JJ(), and Page.Jx()
    """
    element = await page.querySelector(".feed-infinite-wrapper > ul>li")  # grabs a single element
    print(element)
    # Get its text content by running JS
    content = await page.evaluate('(element) => element.textContent', element)
    print(content)

    # elements = await page.xpath('//div[@class="title-box"]/a')
    elements = await page.querySelectorAll(".title-box a")
    for item in elements:
        print(await item.getProperty('textContent'))
        # Get the text
        title_str = await (await item.getProperty('textContent')).jsonValue()
        # Get the link
        title_link = await (await item.getProperty('href')).jsonValue()
        print(title_str)
        print(title_link)

    # Close the browser
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

import asyncio
import pyppeteer
from collections import namedtuple

Response = namedtuple("rs", "title url html cookies headers history status")


async def get_html(url):
    browser = await pyppeteer.launch(headless=True, args=['--no-sandbox'])
    page = await browser.newPage()
    res = await page.goto(url, options={'timeout': 3000})
    data = await page.content()
    title = await page.title()
    resp_cookies = await page.cookies()  # cookies
    resp_headers = res.headers           # response headers
    resp_status = res.status             # response status
    print(data)
    print(title)
    print(resp_headers)
    print(resp_status)
    await browser.close()  # close the browser so the Chromium process is not left running
    return title


if __name__ == '__main__':
    url_list = ["https://www.toutiao.com/",
                "http://jandan.net/ooxx/page-8#comments",
                "https://www.12306.cn/index/"
                ]
    task = [get_html(url) for url in url_list]

    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(asyncio.gather(*task))
    for res in results:
        print(res)

headers = {'date': 'Sun, 28 Apr 2019 06:50:20 GMT',
           'server': 'Cmcc',
           'x-frame-options': 'SAMEORIGIN\nSAMEORIGIN',
           'last-modified': 'Fri, 26 Apr 2019 09:58:09 GMT',
           'accept-ranges': 'bytes',
           'cache-control': 'max-age=43200',
           'expires': 'Sun, 28 Apr 2019 18:50:20 GMT',
           'vary': 'Accept-Encoding,User-Agent',
           'content-encoding': 'gzip',
           'content-length': '19823',
           'content-type': 'text/html',
           'connection': 'Keep-alive',
           'via': '1.1 ID-0314217270751344 uproxy-17'}
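The second script starts one browser per URL and gathers all of them at once, which can get heavy as the URL list grows. One possible way to cap this is a semaphore around each task. The sketch below is an assumption rather than part of the original script: `gather_limited` is a hypothetical helper, and `fetch` stands for any coroutine function such as `get_html`.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def gather_limited(
    urls: List[str],
    fetch: Callable[[str], Awaitable[Any]],
    limit: int = 2,
) -> List[Any]:
    """Run fetch(url) for every URL, with at most `limit` running at a time.

    Results are returned in the same order as `urls`. A failed fetch yields
    its exception object instead of aborting the whole batch.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> Any:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )
```

Under these assumptions the main block could become `results = loop.run_until_complete(gather_limited(url_list, get_html, limit=2))`, checking each result with `isinstance(res, Exception)` before using it.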

Simulating input

# Simulate typing the account name and password; {'delay': rand_int()} sets the typing delay
await page.type('#TPL_username_1', "sadfasdfasdf")
await page.type('#TPL_password_1', "123456789", )
await page.waitFor(1000)
await page.click("#J_SubmitStatic")
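The comment above mentions a `{'delay': rand_int()}` option, but the snippet never passes it. A minimal sketch of what that might look like follows; `type_with_delay` and its default range are hypothetical, and `page` is assumed to be a pyppeteer `Page`, whose `type` method accepts an options dict with a `delay` in milliseconds.

```python
import random
from typing import Any


async def type_with_delay(
    page: Any,
    selector: str,
    text: str,
    min_delay: int = 50,
    max_delay: int = 150,
) -> int:
    """Type `text` into `selector` with a randomly chosen per-key delay.

    Returns the delay (in milliseconds) that was used.
    """
    delay = random.randint(min_delay, max_delay)  # inclusive on both ends
    await page.type(selector, text, {"delay": delay})
    return delay
```

For example, `await type_with_delay(page, '#TPL_username_1', "sadfasdfasdf")` would replace the first `page.type` call above.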

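Getting the screen width and height with tkinter

def screen_size():
    """Get the screen size using tkinter."""
    import tkinter
    tk = tkinter.Tk()
    width = tk.winfo_screenwidth()
    height = tk.winfo_screenheight()
    tk.destroy()  # destroy the root window; quit() only stops the main loop
    return width, height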
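`tkinter.Tk()` raises `tkinter.TclError` when no display is available, which is common on servers where a headless browser runs. A variant with a fallback could look like the sketch below; `screen_size_or_default` and the 1280x800 default are assumptions chosen for illustration.

```python
from typing import Tuple


def screen_size_or_default(default: Tuple[int, int] = (1280, 800)) -> Tuple[int, int]:
    """Return (width, height) of the screen, or `default` if it cannot be read."""
    try:
        import tkinter
    except ImportError:
        # tkinter is not installed in this Python build.
        return default
    try:
        root = tkinter.Tk()
    except tkinter.TclError:
        # No display available (for example, a headless server).
        return default
    try:
        root.withdraw()  # keep the empty root window from appearing
        return root.winfo_screenwidth(), root.winfo_screenheight()
    finally:
        root.destroy()
```

The returned pair can be fed to `page.setViewport(viewport={"width": width, "height": height})` in the same way as `screen_size()`.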

Scraping JD.com

import requests
from bs4 import BeautifulSoup
from pyppeteer import launch
import asyncio


def screen_size():
    """Get the screen size using tkinter."""
    import tkinter
    tk = tkinter.Tk()
    width = tk.winfo_screenwidth()
    height = tk.winfo_screenheight()
    tk.destroy()  # destroy the root window; quit() only stops the main loop
    return width, height


async def main(url):
    # browser = await launch({'headless': False, 'args': ['--no-sandbox'], })
    browser = await launch({'args': ['--no-sandbox'], })
    page = await browser.newPage()
    width, height = screen_size()
    await page.setViewport(viewport={"width": width, "height": height})
    await page.setJavaScriptEnabled(enabled=True)
    await page.setUserAgent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299')
    await page.goto(url)

    # await asyncio.sleep(2)

    await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')

    await asyncio.sleep(1)

    # content = await page.content()
    li_list = await page.xpath('//*[@id="J_goodsList"]/ul/li')

    # print(li_list)
    item_list = []
    for li in li_list:
        a = await li.xpath('.//div[@class="p-img"]/a')
        detail_url = await (await a[0].getProperty("href")).jsonValue()
        promo_words = await (await a[0].getProperty("title")).jsonValue()
        a_ = await li.xpath('.//div[@class="p-commit"]/strong/a')
        p_commit = await (await a_[0].getProperty("textContent")).jsonValue()
        i = await li.xpath('./div/div[3]/strong/i')
        price = await (await i[0].getProperty("textContent")).jsonValue()
        em = await li.xpath('./div/div[4]/a/em')
        title = a
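The extraction loop indexes `a[0]`, `a_[0]` and `i[0]` directly, so a list item that lacks one of those nodes raises `IndexError`. One way to guard against that is a small helper that returns `None` when nothing matches. This is a sketch under assumptions: `first_property` is a hypothetical name, and `element` is assumed to be a pyppeteer element handle with an `xpath` coroutine, as used in the loop.

```python
from typing import Any, Optional


async def first_property(element: Any, xpath: str, name: str) -> Optional[Any]:
    """Return property `name` of the first node matching `xpath`, or None."""
    matches = await element.xpath(xpath)
    if not matches:
        # Nothing matched; let the caller decide how to handle the gap.
        return None
    handle = await matches[0].getProperty(name)
    return await handle.jsonValue()
```

Inside the loop this might be used as `price = await first_property(li, './div/div[3]/strong/i', "textContent")`.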