Python + Selenium: Automated Scraping of Tuniu's Dynamic Web Pages

admin 2025-11-14 21:45:26

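1. Introduction

In web data collection, scraping dynamic pages (pages that load their data asynchronously through JavaScript) has always been a challenge. The traditional requests + BeautifulSoup combination works for static pages, but it cannot directly retrieve dynamically rendered content such as the hotels, attractions, and reviews on the Tuniu travel site.

Selenium is a powerful browser automation tool that can simulate user actions (clicking, scrolling, typing, and so on) and return the complete HTML after dynamic rendering. This article explains in detail how to use Python + Selenium to automatically scrape dynamic data from Tuniu, and provides a complete code implementation.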

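2. Environment Setup

Before starting, install the required Python libraries (the code in this article uses selenium, beautifulsoup4, and pandas):

pip install selenium beautifulsoup4 pandas

Selenium also needs a browser driver such as ChromeDriver. Make sure the Chrome browser is installed, and download the ChromeDriver version that matches it. Selenium 4.6 and later can also locate a matching driver automatically through Selenium Manager when no driver path is given.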

3. Selenium Basics

3.1 Initializing the browser driver

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Path to ChromeDriver
driver_path = "path/to/your/chromedriver"  # e.g. /usr/local/bin/chromedriver
service = Service(driver_path)

# Start the browser (headless mode is optional)
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # headless mode: no visible browser window
driver = webdriver.Chrome(service=service, options=options)
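A browser started this way keeps running until driver.quit() is called, so an exception halfway through a scrape can leave Chrome processes behind. The sketch below is one way to guarantee cleanup. The helper name managed_driver and the factory argument are illustrative assumptions, not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Create a driver with the given factory and always quit it afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions, so the browser never lingers.
        driver.quit()


# Usage with Selenium (assumes `service` and `options` are built as shown earlier):
#     with managed_driver(lambda: webdriver.Chrome(service=service, options=options)) as driver:
#         driver.get("https://www.tuniu.com/")
```

Passing a factory rather than a ready-made driver keeps browser startup inside the guarded block, and it lets the helper be exercised with a stand-in object.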

3.2 Visiting a page and waiting for it to load

url = "https://www.tuniu.com/"
driver.get(url)
time.sleep(3)  # fixed pause while the page loads
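A fixed time.sleep() is either too long on a fast connection or too short on a slow one. Selenium ships WebDriverWait together with expected_conditions for waiting on a concrete condition. The sketch below shows the same polling idea written by hand, so it runs without a browser. The helper name wait_for_elements and the ".trip-item" selector are assumptions for illustration.

```python
import time

CSS_SELECTOR = "css selector"  # the string value behind Selenium's By.CSS_SELECTOR


def wait_for_elements(driver, by, value, timeout=10.0, poll=0.5):
    """Poll driver.find_elements until it returns a match or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        elements = driver.find_elements(by, value)  # empty list when nothing matches
        if elements:
            return elements
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {by}={value!r} within {timeout}s")
        time.sleep(poll)


# Usage with a real driver (".trip-item" is a hypothetical selector):
#     items = wait_for_elements(driver, CSS_SELECTOR, ".trip-item", timeout=10)
#
# Built-in equivalent:
#     from selenium.webdriver.support.ui import WebDriverWait
#     from selenium.webdriver.support import expected_conditions as EC
#     WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.CSS_SELECTOR, ".trip-item")))
```

The function returns as soon as the content appears, so the timeout only costs time when the page really fails to load.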

3.3 Finding elements and interacting with them

Selenium offers several ways to locate elements:

find_element(By.ID, "id")
find_element(By.CLASS_NAME, "class")
find_element(By.XPATH, "xpath")

For example, to search for "北京" (Beijing) travel routes:

search_box = driver.find_element(By.ID, "search-input")
search_box.send_keys("北京")
search_box.send_keys(Keys.RETURN)  # simulate pressing Enter
time.sleep(5)  # wait for the search results to load
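Element IDs and class names on a live site change over time, and find_element raises NoSuchElementException when a locator no longer matches. One defensive pattern is to try several candidate locators in order. In the sketch below, find_first is a hypothetical helper, and every locator except "search-input" is an invented placeholder that should be checked against the real page in the browser's developer tools.

```python
def find_first(driver, locators):
    """Return the first element matched by any (by, value) pair, or None."""
    for by, value in locators:
        # find_elements returns an empty list instead of raising when nothing matches.
        matches = driver.find_elements(by, value)
        if matches:
            return matches[0]
    return None


# The strings "id", "css selector" and "xpath" are the values of
# By.ID, By.CSS_SELECTOR and By.XPATH in Selenium.
SEARCH_BOX_LOCATORS = [
    ("id", "search-input"),                      # locator used in the article
    ("css selector", "input[name='keyword']"),   # hypothetical fallback
    ("xpath", "//input[@type='text']"),          # hypothetical last resort
]

# Usage with a real driver:
#     search_box = find_first(driver, SEARCH_BOX_LOCATORS)
#     if search_box is not None:
#         search_box.send_keys("北京")
```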

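4. Scraping Tuniu Travel Data in Practice

4.1 Target analysis

Suppose we want to scrape the popular travel routes on Tuniu, including:

Route name
Price
Departure city
Trip length in days
User rating

4.2 Getting the dynamically rendered HTML

Because Tuniu loads its data dynamically, a plain requests.get() cannot retrieve the complete HTML. Use Selenium to get the rendered page:

html = driver.page_source

4.3 Parsing the data (BeautifulSoup)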

from bs4 import BeautifulSoup
import pandas as pd

def text_of(item, selector):
    # Return the stripped text of the first match, or None if nothing matches
    node = item.select_one(selector)
    return node.text.strip() if node is not None else None

soup = BeautifulSoup(html, 'html.parser')
tours = []

for item in soup.select('.trip-item'):  # adjust the selectors to the actual HTML structure
    tours.append({
        'name': text_of(item, '.title'),
        'price': text_of(item, '.price'),
        'departure': text_of(item, '.departure'),
        'days': text_of(item, '.days'),
        'rating': text_of(item, '.rating'),
    })

# Store as a DataFrame
df = pd.DataFrame(tours)
print(df.head())
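The fields extracted this way are raw strings, which makes sorting or filtering by price or trip length awkward. The sketch below converts them to numbers with regular expressions. The sample formats ("¥1,299起", "5天4晚") are assumptions about how the site renders these values, and parse_price and parse_days are hypothetical helper names.

```python
import re


def parse_price(text):
    """Extract the first number from a price string such as '¥1,299起'; None if absent."""
    if not text:
        return None
    # Digits with optional thousands separators and an optional decimal part.
    match = re.search(r"\d+(?:,\d{3})*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def parse_days(text):
    """Extract the day count from text such as '5天4晚'; None if absent."""
    if not text:
        return None
    match = re.search(r"(\d+)\s*(?:天|日)", text)
    return int(match.group(1)) if match else None


# Usage on the scraped rows:
#     for tour in tours:
#         tour["price"] = parse_price(tour["price"])
#         tour["days"] = parse_days(tour["days"])
```

Returning None for missing or unparseable values keeps one bad row from aborting the whole run, and pandas treats None as a missing value.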

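4.4 Scraping multiple pages

Tuniu travel data is usually loaded page by page, so we can simulate clicking "Next page":

from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        next_page = driver.find_element(By.CSS_SELECTOR, '.next-page')
        next_page.click()
        time.sleep(3)  # wait for the new page to load
        html = driver.page_source
        # continue parsing...
    except NoSuchElementException:
        break  # no next page: exit the loop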
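A while True loop keeps going for as long as a "next" button exists, which can run indefinitely if the site still renders the button on its last page. The sketch below wraps pagination in a generator with a hard page limit. iter_page_sources is a hypothetical helper, and '.next-page' remains an assumed selector.

```python
import time


def iter_page_sources(driver, next_selector, max_pages=10, pause=0.0):
    """Yield page_source for the current page, then follow the next button.

    Stops after max_pages pages or when no next button is found.
    """
    for page_number in range(1, max_pages + 1):
        yield driver.page_source
        if page_number == max_pages:
            return
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
        buttons = driver.find_elements("css selector", next_selector)
        if not buttons:
            return  # no next button: this was the last page
        buttons[0].click()
        time.sleep(pause)  # give the next page time to render


# Usage with a real driver:
#     for html in iter_page_sources(driver, ".next-page", max_pages=3, pause=3):
#         ...  # parse html with BeautifulSoup
```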

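5. Dealing with Anti-Scraping Measures

Tuniu may detect Selenium-based scrapers. Common countermeasures include:

Changing the User-Agent
Disabling the automation flags
Using proxy IPs
Waiting for random intervals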
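The four measures above map onto a handful of Chrome switches plus a randomized pause. The sketch below collects them in plain helper functions so they can be inspected without launching a browser. The function names, the default delay range, and the example proxy address are illustrative assumptions, and none of these settings guarantees that a site will not detect automation.

```python
import random


def build_chrome_arguments(user_agent=None, proxy=None, headless=True):
    """Return Chrome command-line switches for the measures listed above."""
    # Asks Blink not to advertise automation control (affects navigator.webdriver).
    args = ["--disable-blink-features=AutomationControlled"]
    if headless:
        args.append("--headless")
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    if proxy:
        args.append(f"--proxy-server={proxy}")  # e.g. "http://127.0.0.1:8080"
    return args


def random_delay(low=2.0, high=5.0):
    """Return a random pause length in seconds between low and high."""
    return random.uniform(low, high)


# Usage with Selenium:
#     options = webdriver.ChromeOptions()
#     for arg in build_chrome_arguments(user_agent="Mozilla/5.0 ...", proxy="http://127.0.0.1:8080"):
#         options.add_argument(arg)
#     options.add_experimental_option("excludeSwitches", ["enable-automation"])
#     time.sleep(random_delay())
```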

6. Complete Code Example

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

# Proxy configuration
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

# Initialize the browser
driver_path = "path/to/your/chromedriver"
service = Service(driver_path)
options = webdriver.ChromeOptions()

# Set the proxy. Chrome does not honor credentials embedded in --proxy-server,
# so only host and port are passed here; proxyUser/proxyPass must be supplied
# another way (for example IP whitelisting or a proxy-auth browser extension).
options.add_argument(f"--proxy-server=http://{proxyHost}:{proxyPort}")

# Other options
options.add_argument('--headless')  # headless mode
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
# Do not add --proxy-bypass-list=*: that pattern bypasses the proxy for every host.
options.add_argument('--ignore-certificate-errors')

driver = webdriver.Chrome(service=service, options=options)

# Visit Tuniu
url = "https://www.tuniu.com/"
driver.get(url)
time.sleep(3)

# Search for "北京" (Beijing) travel routes
search_box = driver.find_element(By.ID, "search-input")
search_box.send_keys("北京")
search_box.send_keys(Keys.RETURN)
time.sleep(5)

def text_of(item, selector):
    # Return the stripped text of the first match, or None if nothing matches
    node = item.select_one(selector)
    return node.text.strip() if node is not None else None

# Scrape several pages
tours = []
for _ in range(3):  # scrape 3 pages
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.select('.trip-item'):
        tours.append({
            'name': text_of(item, '.title'),
            'price': text_of(item, '.price'),
            'departure': text_of(item, '.departure'),
            'days': text_of(item, '.days'),
            'rating': text_of(item, '.rating'),
        })

    # Go to the next page
    try:
        next_page = driver.find_element(By.CSS_SELECTOR, '.next-page')
        next_page.click()
        time.sleep(random.uniform(2, 5))
    except NoSuchElementException:
        break

# Save the data
df = pd.DataFrame(tours)
df.to_csv('tuniu_tours.csv', index=False, encoding='utf-8-sig')

# Close the browser
driver.quit()
print("Scraping finished; data saved to tuniu_tours.csv")
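The utf-8-sig encoding used when saving writes a byte order mark at the start of the file, which helps Excel recognize the file as UTF-8 and display Chinese text correctly. The sketch below shows the same idea with the standard csv module, for cases where pandas is not needed. save_tours and the FIELDS list are illustrative names that mirror the columns scraped above.

```python
import csv

FIELDS = ["name", "price", "departure", "days", "rating"]


def save_tours(rows, path):
    """Write tour dictionaries to a CSV file that Excel opens as UTF-8."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)  # missing keys and None values become empty cells


# Usage:
#     save_tours(tours, "tuniu_tours.csv")
```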

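7. Summary

This article showed how to use Python + Selenium to automatically scrape dynamic data from Tuniu, covering:

1. Selenium basics (starting the browser, finding elements, simulating clicks)

2. Parsing dynamic pages (extracting data with BeautifulSoup)

3. Pagination (automatically clicking "Next page")

4. Anti-scraping countermeasures (User-Agent, proxy IPs, random waits)

Selenium is powerful but slow, which makes it best suited to small-scale scraping. For higher throughput, consider Playwright or a Scrapy + Splash setup.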