[2024 Remake] Python Web Scraping: This One Article Is Enough, Part 03: selenium4.5 Simulated Clicks

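After going through 2 Python projects (about 2k lines of code combined), I looked back at my earlier articles and really could not stand reading them, so I am remaking this series whenever I have spare time.

I have not used selenium4 much; the last time was to scrape every unregistered 4-letter Outlook mailbox.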

Environment: Python 3.11

Documentation: https://www.selenium.dev/zh-cn/documentation/webdriver/

Basic usage

Installation

pip install selenium==4.20.0

from selenium import webdriver

# Create an Options object
options = webdriver.ChromeOptions()

# Keep the browser open after the script finishes
options.add_experimental_option("detach", True)

browser = webdriver.Chrome(options=options)

browser.get('https://baidu.com')

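Setting the UA, cookies, and proxy

One thing to note here: you must load the page with get(url) before you start adding cookies, otherwise the cookie domain will not match the current page and an error is raised.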

from selenium import webdriver

# Create an Options object
options = webdriver.ChromeOptions()

# Keep the browser open after the script finishes
options.add_experimental_option("detach", True)

# The UA is optional
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")

# HTTP proxy
options.add_argument("--proxy-server=http://127.0.0.1:7778")

browser = webdriver.Chrome(options=options)

browser.get('https://google.com')

cookies = [
    {"name": "123", "value": "123", }
]

# Add the cookies
for cookie in cookies:
    browser.add_cookie(cookie)

# Load the page again so the cookies take effect
browser.get('https://google.com')
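
If you copy a raw Cookie header string out of the browser's DevTools, it has to be turned into the list of dicts that add_cookie expects. Here is a minimal sketch of that conversion. cookies_from_header is a hypothetical helper name of mine, not part of selenium, and it only fills in name and value.

```python
def cookies_from_header(header: str) -> list[dict[str, str]]:
    """Split a 'name=value; name2=value2' string into add_cookie-style dicts.

    Hypothetical helper for illustration; not part of selenium.
    """
    cookies = []
    for pair in header.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # the value itself may contain '='
        name = name.strip()
        if name:
            cookies.append({"name": name, "value": value.strip()})
    return cookies


print(cookies_from_header("sid=abc123; theme=dark"))
```

The resulting list can be passed to the same for loop as above, after the first browser.get(url).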

EC&Wait

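Now for the key part: how do we get elements?

Below is the most basic way to do it. If the page loads asynchronously, the element may not be there yet and the lookup fails.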

span_element = browser.find_element(By.CSS_SELECTOR, '')
span_element.click()

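This is where EC&Wait comes in
- What is it for? Many pages today load asynchronously, so if we extract content the moment the page starts loading, the lookup may fail with an error.
- So we need to wait until the content has loaded before reading it.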
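
To make the idea concrete, here is a rough sketch of the polling loop that an explicit wait boils down to. It is a simplified stand-in written with the standard library only, not selenium's actual implementation; wait_until, WaitTimeout, and SlowPage are names I made up for the demo.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (stand-in for TimeoutException)."""


def wait_until(driver, condition, timeout=10.0, poll=0.5, ignored=(LookupError,)):
    """Call condition(driver) repeatedly until it returns something truthy."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition(driver)
            if result:
                return result
        except ignored:
            pass  # "element not found yet" is expected while the page loads
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


class SlowPage:
    """Fake page whose element only 'appears' on the third lookup."""

    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.calls = 0

    def find(self):
        self.calls += 1
        if self.calls < self.ready_after:
            raise LookupError("element not in the DOM yet")
        return "<span>"


page = SlowPage(ready_after=3)
print(wait_until(page, lambda p: p.find(), timeout=2, poll=0.01))  # <span>
```

The real WebDriverWait(browser, 10).until(...) follows the same shape: it keeps calling the condition with the driver until the result is truthy or the timeout expires.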

Documentation: https://www.selenium.dev/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html?highlight=expected

Let's first use the Ant Design website as an example: https://ant-design.antgroup.com/components/checkbox-cn


I want to know whether the input element has a type attribute

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Create an Options object
options = webdriver.ChromeOptions()

# Keep the browser open after the script finishes
options.add_experimental_option("detach", True)

browser = webdriver.Chrome(options=options)

browser.get('https://ant-design.antgroup.com/components/checkbox-cn')

try:
    # Check whether the input element has a type attribute
    has_type_attribute = WebDriverWait(browser, 10).until(
        EC.element_attribute_to_include(
            (By.CSS_SELECTOR, '#components-checkbox-demo-basic input.ant-checkbox-input'), 'type')
    )
    print(has_type_attribute)  # True
except Exception as e:
    print(e)

A few extra notes
- WebElement: the object returned by browser.find_element()
- locator: a tuple such as (By.CSS_SELECTOR, ''), that is, a lookup strategy plus a value
- attribute_: the name of an attribute
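
Since a condition is just a callable that receives the driver, and a locator is just a tuple that gets unpacked into find_element, you can write your own condition when the built-in ones do not fit. Below is a minimal sketch; attribute_equals is a hypothetical name of mine, not a selenium function.

```python
def attribute_equals(locator, name, expected):
    """Build a wait condition that is truthy once the attribute equals `expected`.

    Hypothetical custom condition for illustration; not part of selenium.
    """
    def condition(driver):
        # locator is a (strategy, value) tuple, so it unpacks into find_element
        element = driver.find_element(*locator)
        if element.get_attribute(name) == expected:
            return element  # truthy result ends the wait
        return False        # falsy result makes the wait poll again
    return condition


# Intended usage with a real driver (not executed here):
# WebDriverWait(browser, 10).until(
#     attribute_equals((By.CSS_SELECTOR, 'input.ant-checkbox-input'), 'type', 'checkbox'))
```

Returning the element instead of True is a design choice: the caller gets something it can click right away.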

Example

From here on we will stop testing against a live website and instead create an index.html in the local directory to test with

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>Title</title>
  </head>
  <body>
    <div id="sex_radio">
      <input type="radio" name="性别" value="男" />男<br />
      <input type="radio" name="性别" value="女" checked="checked" />女<br />
    </div>
    <div id="fruit_checkbox">
      <input type="checkbox" name="fruit" value="香蕉" checked="checked" />
      香蕉
      <br />
      <input type="checkbox" name="fruit" value="苹果" />苹果<br />
      <input type="checkbox" name="fruit" value="水蜜桃" />水蜜桃<br />
      <input type="checkbox" name="fruit" value="樱桃" checked="checked" />
      樱桃
      <br />
    </div>
    <select name="selectomatic">
      <option selected="selected" id="non_multi_option" value="one">One</option>
      <option value="two">Two</option>
      <option value="four">Four</option>
      <option value="still learning how to count, apparently">
        Still learning how to count, apparently
      </option>
    </select>
    <select name="multi" id="multi" multiple="multiple">
      <option selected="selected" value="eggs">Eggs</option>
      <option value="ham">Ham</option>
      <option selected="selected" value="sausages">Sausages</option>
      <option value="onion gravy">Onion gravy</option>
    </select>
    <button id="submit_button">submit_button</button>
  </body>
</html>

from pathlib import Path
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Create an Options object
options = webdriver.ChromeOptions()

# Keep the browser open after the script finishes
options.add_experimental_option("detach", True)

browser = webdriver.Chrome(options=options)

# Absolute path of the directory that contains this script
current_dir = Path(__file__).resolve().parent

# as_uri() builds a valid file:// URL on every platform
browser.get((current_dir / 'index.html').as_uri())

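Getting elements

Checking text content

EC.text_to_be_present_in_element(locator, text): checks whether the given text is present inside the element
- It checks the element's text content, not its attributes
EC.text_to_be_present_in_element_value(locator, text)
- It checks the element's value attribute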

from selenium.common.exceptions import TimeoutException

# Wait until the text "女" appears anywhere inside the #sex_radio element
try:
    element = WebDriverWait(browser, 5).until(
        EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, '#sex_radio'), '女')
    )
    print(element)  # True
except TimeoutException:
    print("Timed out waiting")
    element = False

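Checking whether an element is present or visible

presence_of_element_located(locator): checks whether the element exists in the page's DOM tree
- Returns the element itself if it exists; otherwise the wait times out with an error

# presence_of_element_located: check whether the element exists in the page's DOM tree
fruit_checkbox = WebDriverWait(browser, 5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#fruit_checkbox'))
)
print("fruit_checkbox element exists:", fruit_checkbox)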

presence_of_all_elements_located(locator): matches every element on the page that fits the locator
- Returns a list of elements

# presence_of_all_elements_located: get all matching elements
checkboxes = WebDriverWait(browser, 5).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'input[type="checkbox"]'))
)
print("Number of checkboxes found:", len(checkboxes))  # 4

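visibility_of_element_located(locator): checks that the element is in the DOM and visible, meaning it is displayed and its width and height are greater than 0
- Returns the element (a single element); otherwise the wait times out with an error

# visibility_of_element_located: check whether the element is visible
selectomatic_dropdown = WebDriverWait(browser, 5).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, 'select[name="selectomatic"]'))
)
print("selectomatic dropdown is visible:", selectomatic_dropdown)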

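invisibility_of_element_located(locator): checks that the element is either invisible or not in the DOM at all; the opposite of the above
- Returns True when the element is not in the DOM, or the element itself when it is present but hidden

# invisibility_of_element_located: wait until the element is invisible or absent from the DOM
non_existent_element = WebDriverWait(browser, 5).until(
    EC.invisibility_of_element_located((By.ID, 'non_existent_element'))
)
print("non_existent_element is not in the DOM:", non_existent_element)  # True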

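visibility_of(element): checks whether an element that is already in the DOM tree is visible; it takes a WebElement rather than a locator
- Returns the element (a single element); otherwise the wait times out with an error

# visibility_of: check whether an element in the DOM tree is visible
visible_select_element = WebDriverWait(browser, 5).until(
    EC.visibility_of(selectomatic_dropdown)
)
print("selectomatic_dropdown is visible in the DOM tree:", visible_select_element)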

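element_to_be_clickable(locator): checks whether the element can be clicked (visible and enabled); typically used for buttons
- Returns the element itself if it is clickable, otherwise False

# element_to_be_clickable: check whether the element is clickable
submit_button = WebDriverWait(browser, 5).until(
    EC.element_to_be_clickable((By.ID, 'submit_button'))
)
print("submit_button is clickable:", submit_button)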

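Checking whether an element is selected

element_to_be_selected(element): checks whether an element is selected; typically used with the options of a select dropdown
- Takes a WebElement and returns True if it is selected, otherwise False
element_located_to_be_selected(locator): same as above and also used with select, but takes a locator
element_selection_state_to_be(element, is_selected): checks whether the element's selection state equals is_selected; useful for radio buttons
- Returns True if they match, otherwise False
element_located_selection_state_to_be(locator, is_selected): same as above, but takes a locator instead of a WebElement
element_attribute_to_include(locator, attribute_): checks whether the element has the given attribute
frame_to_be_available_and_switch_to_it(locator): checks whether the frame is available and, if it is, switches into it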
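Combining multiple conditions

EC (the expected_conditions module) also provides functions for combining conditions. Here are some commonly used ones
- all_of: checks that all of the given conditions are true
- any_of: checks that at least one of the given conditions is true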

from selenium.common.exceptions import TimeoutException

try:
    # Define several conditions
    condition1 = EC.presence_of_element_located((By.CSS_SELECTOR, '#sex_radio'))
    condition2 = EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#sex_radio'), '女')

    # Combine them with EC.all_of
    all_conditions = EC.all_of(condition1, condition2)

    # Wait with WebDriverWait until every condition holds, for at most 5 seconds
    element = WebDriverWait(browser, 5).until(all_conditions)
    print("Element exists and its text contains '女':", element)

except TimeoutException:
    print("Timed out: element not found, or its text does not contain '女'")
    element = False

except Exception as e:
    print("An exception occurred:", e)
Operating on elements

Now that you know how to get hold of an element, let's look at how to click it and read its content

Dialogs

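alert = browser.switch_to.alert    # Get the dialog (the browser's built-in one)
print(alert.text)                  # Read and print the dialog's text
alert.send_keys("text to enter")   # Type into the dialog's input box
alert.accept()                     # Confirm and close the dialog (a dialog with no confirm/cancel buttons is simply closed)
alert.dismiss()                    # Cancel and close the dialog; call this instead of accept(), not after it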

Radio buttons

# Locate the selected element (the one with the checked attribute) and print its value
element = browser.find_element(
    By.CSS_SELECTOR, '#sex_radio input[checked=checked]')
print('Currently selected sex: ' + element.get_attribute('value'))

# Click the option you want to switch to
browser.find_element(By.CSS_SELECTOR, '#sex_radio input[value="男"]').click()
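
One caveat: the [checked=checked] selector matches the HTML attribute, which describes the initial state of the page. After a click, the live state can differ from that attribute, so asking each element via is_selected() is the safer route. A minimal sketch follows; selected_value is a hypothetical helper name of mine, and it works with any object that exposes find_elements.

```python
def selected_value(driver, locator):
    """Return the value of the first currently selected element, or None.

    Hypothetical helper for illustration; not part of selenium.
    """
    for element in driver.find_elements(*locator):
        if element.is_selected():  # live state, not the HTML attribute
            return element.get_attribute("value")
    return None


# Intended usage with a real driver (not executed here):
# selected_value(browser, (By.CSS_SELECTOR, '#sex_radio input[type="radio"]'))
```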

Checkboxes

# Locate the checked elements and click each one to uncheck it
elements = browser.find_elements(
    By.CSS_SELECTOR, '#fruit_checkbox input[checked=checked]')
for element in elements:
    element.click()

# Click the options you want to check
browser.find_element(
    By.CSS_SELECTOR, '#fruit_checkbox input[value="苹果"]').click()
browser.find_element(
    By.CSS_SELECTOR, '#fruit_checkbox input[value="水蜜桃"]').click()
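
Because click() toggles a checkbox, running the same script twice can leave it in the opposite state. A small guard that clicks only when the current state differs from the wanted one avoids that. This is just a sketch; set_checked is a hypothetical helper name, not a selenium API.

```python
def set_checked(element, checked):
    """Click the checkbox only if its state differs; return True if it clicked.

    Hypothetical helper for illustration; not part of selenium.
    """
    if element.is_selected() != checked:
        element.click()
        return True
    return False


# Intended usage with a real driver (not executed here):
# apple = browser.find_element(By.CSS_SELECTOR, '#fruit_checkbox input[value="苹果"]')
# set_checked(apple, True)
```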

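Select dropdowns

from selenium.webdriver.support.ui import Select

# Create a Select object
select_element = browser.find_element(By.NAME, 'selectomatic')
select = Select(select_element)

# 1. Option lists
# Get the list of all options in the element
option_list = select.options

# Get the list of currently selected options
selected_option_list = select.all_selected_options

# 2. Selecting
# Select an option by its visible text
select.select_by_visible_text('Four')

# Select an option by its value attribute
select.select_by_value('two')

# Select an option by its position in the list
select.select_by_index(3)

# 3. Deselecting only works on multi-selects, so use the #multi element here
multi_select = Select(browser.find_element(By.ID, 'multi'))
multi_select.deselect_by_value('eggs')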

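Switching frames and windows

from selenium.webdriver.common.by import By

browser.switch_to.frame("frame id or name attribute")  # Switch into the frame
browser.switch_to.default_content()                     # Switch back out to the top-level document

# Remember the current window so it is easy to switch back to it
main_window = browser.current_window_handle

# Click a link that opens a new window
link = browser.find_element(By.CSS_SELECTOR, 'footer > div > a[href]')
link.click()

# browser.title is the title-bar text of the current window
print(browser.title)

# browser.current_url is the URL of the current window
print(browser.current_url)

# Use the title-bar text to tell whether we have reached the target page
for handle in browser.window_handles:
    browser.switch_to.window(handle)
    if '百度' in browser.title:
        break

# Switch back to the 360 window by its title
for handle in browser.window_handles:
    browser.switch_to.window(handle)
    if '360' in browser.title:
        break

# Since the original window was recorded, we can also switch back to it directly
browser.switch_to.window(main_window)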
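
The title-matching loop appears twice above, so it can be folded into one function. The sketch below also restores the original window when nothing matches, which the plain loop does not do. switch_to_window_titled is a hypothetical helper name of mine; it only relies on the window_handles, title, current_window_handle, and switch_to.window members used above.

```python
def switch_to_window_titled(driver, fragment):
    """Switch to the first window whose title contains `fragment`.

    Returns the matching handle, or None after restoring the original window.
    Hypothetical helper for illustration; not part of selenium.
    """
    original = driver.current_window_handle
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
        if fragment in driver.title:
            return handle
    driver.switch_to.window(original)  # nothing matched: go back where we started
    return None


# Intended usage with a real driver (not executed here):
# switch_to_window_titled(browser, '百度')
```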

Hiding the browser fingerprint

This is currently the only approach I recommend: https://github.com/requireCool/stealth.min.js/blob/main/stealth.min.js

https://cdn.jsdelivr.net/gh/requireCool/stealth.min.js/stealth.min.j

from selenium import webdriver
import time

# Create an Options object
options = webdriver.ChromeOptions()

options.add_argument('--enable-chrome-browser-cloud-management')
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")


options.add_argument("--disable-blink-features=AutomationControlled")

# Launch the browser
driver = webdriver.Chrome(options=options)

with open('stealth.min.js') as f:
    js = f.read()

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
  "source": js
})


time.sleep(2)

driver.get('https://bot.sannysoft.com/')

# Resize the window to the full page height so the screenshot covers the whole page
driver.set_window_size(1600, driver.execute_script("return document.body.scrollHeight"))

time.sleep(2)

driver.save_screenshot('screenshot.png')