Using Selenium to scrape some sample data for my Flask product center service - SRE回忆录


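A few days ago I mocked up a simple product center service with Flask. It had no data to work with, so I scraped some from 某东, a well-known e-commerce site.

Scraping data from 某东

How to scrape it? My first choice was to fetch the pages directly with requests, but the responses contained no product data: the data only appears once the page is rendered in a browser. So Selenium it is.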

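Headless browser setup

I use Google Chrome here, so chromedriver is required. Download it from: https://registry.npmmirror.com/binary.html?path=chromedriver/

Pick the chromedriver that matches your current browser version, extract it, and place it in the project's Scripts directory.

Once it is in the Scripts directory it is ready to use. A simple example: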

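import time

from selenium import webdriver

# Start Chrome via chromedriver and open a page
wd = webdriver.Chrome()
wd.get("http://www.baidu.com")

# Keep the window open for a few seconds, then close the browser
time.sleep(8)
wd.quit()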
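
The snippet above opens a visible browser window, while this section's title mentions a headless browser. As a minimal sketch of what a headless setup could look like, the following passes `--headless=new` to Chrome and also creates the `WebDriverWait` object that the later snippets call `mywait`. The helper names `chrome_arguments` and `make_driver`, the window size, and the 10-second timeout are assumptions of this sketch, not part of the original code.

```python
def chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Return the Chrome command-line flags for this sketch (values are assumptions)."""
    width, height = window_size
    args = [f"--window-size={width},{height}"]
    if headless:
        # "--headless=new" selects Chrome's newer headless mode; older Chrome
        # versions may only accept plain "--headless".
        args.append("--headless=new")
    return args


def make_driver(headless=True, timeout=10):
    """Create a Chrome driver plus a WebDriverWait (the `mywait` used later)."""
    # Imported lazily so chrome_arguments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    wd = webdriver.Chrome(options=options)
    mywait = WebDriverWait(wd, timeout)
    return wd, mywait
```

With this in place, `wd, mywait = make_driver()` would give both objects used in the rest of the post; pass `headless=False` while debugging to watch the browser.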

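Locating the search box and searching

Locate the input box

Inspecting the element shows that its id is key, so our code becomes:

# Wait for the element with id "key" via presence_of_element_located
# (mywait is a WebDriverWait instance bound to the driver)
mywait.until(EC.presence_of_element_located((By.ID, 'key')))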

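Once the input box is located, we can type into it:

# Search for the products we want
mywait.until(EC.presence_of_element_located((By.ID, 'key'))).send_keys('笔记本 电脑整机')

Locate the search button and click it

# Click the located search button
mywait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'button'))).click()

At this point the automated search is complete. The next step is parsing the page content.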

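Getting the HTML

The current page's HTML is available directly from the webdriver object's page_source attribute. bs4 is used here to parse the HTML, so install it beforehand.

from bs4 import BeautifulSoup

def get_html(wd):
    # Parse the currently rendered page into a BeautifulSoup object
    html = wd.page_source
    # print(html)
    soup_obj = BeautifulSoup(html, "html.parser")
    return soup_obj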

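There is a problem at this step: after Selenium lands on the product listing page it stays at the top, and the page content is still incomplete. We therefore need to scroll the page to the bottom to load the rest of the data.

Scrolling the window

Scroll to the bottom

# Scroll the page to the bottom, as a mouse wheel would
js = "window.scrollTo(0, document.body.scrollHeight)"
wd.execute_script(js)

Scroll to the top

# Scroll the page back to the top
js = "window.scrollTo(0, 0)"
wd.execute_script(js)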

Scroll to a specific position

wd.execute_script("window.scrollTo(arguments[0], arguments[1])", x, y)  # scroll to position (x, y)
js = "window.scrollBy(0, 500)"   # scroll down 500 pixels
js = "window.scrollBy(0, -500)"  # scroll up 500 pixels
js = "window.scrollBy(500, 0)"   # scroll right 500 pixels
js = "window.scrollBy(-500, 0)"  # scroll left 500 pixels
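
The coordinates in the snippets above are either hard-coded or placeholders. A small sketch of how these calls could be wrapped so the values come from Python: extra positional arguments to `execute_script` are exposed to the script as the `arguments` array. `scroll_to` and `scroll_by` are hypothetical helper names introduced here for illustration.

```python
def scroll_to(wd, x, y):
    """Scroll the page to an absolute position (x, y) in pixels."""
    # Extra positional arguments to execute_script are exposed to the
    # JavaScript snippet as the `arguments` array.
    wd.execute_script("window.scrollTo(arguments[0], arguments[1])", x, y)


def scroll_by(wd, dx, dy):
    """Scroll the page by an offset relative to the current position."""
    wd.execute_script("window.scrollBy(arguments[0], arguments[1])", dx, dy)
```

For example, `scroll_by(wd, 0, 500)` scrolls down 500 pixels and `scroll_to(wd, 0, 0)` returns to the top.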

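When I scrolled manually, I noticed that reaching the middle of the page triggers one more asynchronous load, so two scrolls are needed to get the page loaded all the way to the bottom:

js1 = "window.scrollTo(0, 5000)"
js2 = "window.scrollTo(0, document.body.scrollHeight)"
# First scroll part of the way down to trigger the second asynchronous load
wd.execute_script(js1)
time.sleep(3)
# Then scroll to the (now larger) bottom of the page
wd.execute_script(js2)
time.sleep(3)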

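At this point I expected to have all the data on the page, but some of it was still missing. The cause was scrolling too fast: these items only exist once their part of the page has been rendered, and when you jump straight to the bottom they are never generated in the HTML at all. So the scrolling has to be rewritten to move slowly:

def scroll_slowly(wd):
    # Scroll slowly to the bottom so that asynchronously rendered data is not
    # skipped, which would leave it missing from the HTML
    js = "return document.body.scrollHeight"
    # Current scroll position, starting at 0
    height = 0
    # Total height of the page right now
    new_height = wd.execute_script(js)
    # print(new_height)
    while height < new_height:
        # Move toward the bottom in 100-pixel steps
        for i in range(height, new_height, 100):
            # print(i)
            wd.execute_script('window.scrollTo(0, {})'.format(i))
            time.sleep(0.1)
        # The height reached becomes the new starting position
        height = new_height
        time.sleep(1)
        # Re-read the total height, which grows when more content has loaded
        new_height = wd.execute_script(js)
    time.sleep(3)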
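
One thing to watch with this loop is that it keeps going for as long as the page keeps growing. The following is a hedged variant, not the author's code, that adds an upper bound on the number of rounds and makes the step and pauses configurable. The name `scroll_until_stable` and all default values are assumptions of this sketch.

```python
import time


def scroll_until_stable(wd, step=100, pause=0.1, settle=1.0, max_rounds=20):
    """Scroll down in small steps until the page height stops growing.

    Returns the last page height seen. `max_rounds` (an assumed safety limit)
    stops the loop on pages that keep loading content indefinitely.
    """
    get_height = "return document.body.scrollHeight"
    position = 0
    total = wd.execute_script(get_height)
    for _ in range(max_rounds):
        if position >= total:
            break  # no new content appeared since the last round
        for y in range(position, total, step):
            wd.execute_script("window.scrollTo(0, arguments[0])", y)
            time.sleep(pause)
        position = total
        time.sleep(settle)  # give lazy-loaded content time to render
        total = wd.execute_script(get_height)
    return total
```

Smaller `step` and larger `pause` values trade speed for a better chance that every lazily rendered item makes it into the HTML.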

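Parsing the page and filtering the data

With the full page loaded, we can locate and parse the data we want. There is not much error handling: a field may come back empty, which raises an error when selecting its child elements, so I simply wrap the whole thing in try ... except ... and discard the bad records.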

def get_products(soup_obj):
    # Each search result is an <li class="gl-item"> element
    products_items = soup_obj.select('li.gl-item')
    products_list = []

    for i in products_items:
        # print("=" * 200)
        try:
            product = {
                "spu": i.get('data-spu'),
                "sku": i.get('data-sku'),
                "img": "https:" + i.select_one('.p-img > a > img').get('src'),
                "price": i.select_one('.p-price > strong > i').get_text(),
                "title": i.select_one('.p-name > a > em').get_text(),
                "description": i.select_one('.p-name > a').get('title'),
                "commit": i.select_one('.p-commit > strong > a').get_text(),
                "shop": i.select_one('.p-shop > span > a').get_text(),
            }
            products_list.append(product)
            print('='*120)
            print(product)
        except (AttributeError, TypeError):
            # A missing element (select_one returned None) or a missing src:
            # skip this record
            continue
    return products_list
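
The dictionaries built above hold raw strings, whereas the it_products table shown further down stores price as decimal(10,2). As a rough sketch of a cleanup step that could sit between parsing and storage, the following converts the price to `Decimal` and collapses stray whitespace in the text fields. `normalize_product` is a hypothetical helper, and the choice to drop records with an unusable price is an assumption in the same spirit as the try/except above.

```python
from decimal import Decimal, InvalidOperation


def normalize_product(raw):
    """Return a cleaned copy of a scraped product dict, or None if unusable."""
    try:
        price = Decimal(raw["price"].strip())
    except (KeyError, AttributeError, InvalidOperation):
        return None  # missing or non-numeric price: drop the record

    product = dict(raw)
    product["price"] = price
    # Collapse the whitespace and newlines that get_text() may leave in text fields.
    for field in ("title", "commit", "shop"):
        value = product.get(field)
        if isinstance(value, str):
            product[field] = " ".join(value.split())
    return product
```

It could be applied as `cleaned = [p for p in map(normalize_product, products_list) if p is not None]`.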

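For now we collect each product's spu, sku, img, price, title, description, commit and shop; other fields can wait until the other services are built. These fields correspond to our product center table: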

mysql> desc products.it_products;
+-------------+---------------+------+-----+---------+----------------+
| Field       | Type          | Null | Key | Default | Extra          |
+-------------+---------------+------+-----+---------+----------------+
| id          | int(11)       | NO   | PRI | NULL    | auto_increment |
| spu         | varchar(255)  | NO   |     | NULL    |                |
| sku         | varchar(255)  | NO   | UNI | NULL    |                |
| img         | varchar(255)  | NO   |     | NULL    |                |
| price       | decimal(10,2) | NO   |     | NULL    |                |
| title       | varchar(255)  | NO   |     | NULL    |                |
| description | text          | YES  |     | NULL    |                |
| commit      | varchar(255)  | NO   |     | NULL    |                |
| shop        | varchar(255)  | NO   |     | NULL    |                |
+-------------+---------------+------+-----+---------+----------------+
9 rows in set (0.01 sec)

Saving the data to MySQL

The table above declares sku as unique: a row whose sku already exists is updated, otherwise a new row is inserted.

import pymysql

def insert_mysql(product):
    conn = pymysql.connect(
        host='',
        port=3306,
        user='products',
        password='',
        database='products',
        charset='utf8mb4'
    )
    # Create a cursor object
    cursor = conn.cursor()

    # SQL template: insert a new row, or update the existing row when the sku is a duplicate
    sql_template = '''
        INSERT INTO it_products (spu, sku, img, price, title, description, commit, shop)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE
        img=VALUES(img), price=VALUES(price), title=VALUES(title), description=VALUES(description), commit=VALUES(commit), shop=VALUES(shop)
    '''

    # The dict keys were created in the same order as the column list above
    data = tuple(product.values())
    # Execute the SQL statement
    cursor.execute(sql_template, data)

    # Commit the transaction
    conn.commit()

    # Close the cursor and the database connection
    cursor.close()
    conn.close()
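
insert_mysql opens a new connection for every product and relies on the dict's key order matching the column list. A hedged sketch of an alternative: build each row from an explicit column tuple and write a whole page of products on one connection. `COLUMNS`, `UPSERT_SQL`, `product_to_row` and `save_products` are hypothetical names, and `conn` is assumed to be an already open pymysql connection.

```python
COLUMNS = ("spu", "sku", "img", "price", "title", "description", "commit", "shop")

UPSERT_SQL = (
    "INSERT INTO it_products ({cols}) VALUES ({marks}) "
    "ON DUPLICATE KEY UPDATE {updates}"
).format(
    cols=", ".join(COLUMNS),
    marks=", ".join(["%s"] * len(COLUMNS)),
    # sku is the unique key and spu is left untouched, as in the original statement
    updates=", ".join(f"{c}=VALUES({c})" for c in COLUMNS if c not in ("spu", "sku")),
)


def product_to_row(product):
    """Order the values by COLUMNS instead of relying on dict insertion order."""
    return tuple(product[c] for c in COLUMNS)


def save_products(conn, products):
    """Upsert all products on one open connection and commit once."""
    rows = [product_to_row(p) for p in products]
    cursor = conn.cursor()
    try:
        cursor.executemany(UPSERT_SQL, rows)
        conn.commit()
    finally:
        cursor.close()
    return len(rows)
```

The caller stays responsible for opening and closing `conn`, which keeps connection details such as host and password out of this helper.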

Fetching the next page

def next_page():
    # Wait for the "next page" button at the bottom of the list, then click it
    button_next = mywait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.pn-next')))
    button_next.click()
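
How the individual steps are chained across several pages is not shown in the post. One possible way to drive them is sketched below. `crawl_pages` and its parameters are hypothetical: `scrape_page` and `go_next` are callables supplied by the caller, and the sketch assumes `go_next` raises an exception when there is no further page (with Selenium, `mywait.until` raises `TimeoutException` if the button never becomes available).

```python
def crawl_pages(scrape_page, go_next, max_pages=5):
    """Collect products from up to `max_pages` result pages.

    scrape_page() returns the list of products on the current page;
    go_next() moves to the next page and raises if there is none.
    """
    collected = []
    for page_no in range(1, max_pages + 1):
        collected.extend(scrape_page())
        if page_no == max_pages:
            break  # page limit reached, do not click "next" again
        try:
            go_next()
        except Exception:
            # e.g. Selenium's TimeoutException when the button never appears
            break
    return collected
```

Here `scrape_page` could be a small function that calls scroll_slowly(wd) and returns get_products(get_html(wd)), with next_page passed as `go_next`.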

Final sample data

Below is the sample data