雪重 Python Developer 知识 汗水 灵感 机遇
2023年10月7日   similarity

基于结构和样式相似度的html聚类研究

参考论文:https://sci-hub.se/10.1109/IRI.2016.30

论文中提到的几点,网页结构相似度,样式相似度,聚类。本文依据论文思路和相关算法,用代码大致复现。

结构相似度

文中提到了APTED全路径树编辑距离算法。

from lxml import html
from apted import APTED, helpers


def lxml_to_tree_structure(html_content):
    tree = html.fromstring(html_content)

    def parse_node(node):
        if isinstance(node, html.HtmlElement):
            tag = node.tag
            children = [parse_node(child) for child in node.iterchildren()]
            return '{%s%s}' % (tag, ''.join(children))
        elif isinstance(node, html.HtmlComment):
            return ""
        else:
            return ""

    return parse_node(tree)


def structural_similarity(html_1, html_2):
    # 转换为APTED的树表示
    tree1 = helpers.Tree.from_text(lxml_to_tree_structure(html_1))
    tree2 = helpers.Tree.from_text(lxml_to_tree_structure(html_2))

    apted_class = APTED(tree1, tree2)

    # 计算编辑距离
    distance = apted_class.compute_edit_distance()
    # print(distance)

    # 计算相似度
    return 1 - distance / (apted_class.it1.tree_size + apted_class.it2.tree_size)

但当我们实际使用时,发现算法复杂度极高,尤其对于当前时代的网页结构,dom节点更为丰富的情况下,可尝试优化为使用 html 标签的序列比较来计算相似度。

样式相似度

from lxml import etree

def get_classes(html):
    doc = etree.HTML(html)
    classes = set(doc.xpath('//*[@class]/@class'))
    result = set()
    for cls in classes:
        for _cls in cls.split():
            result.add(_cls)
    return result


def jaccard_similarity(set1, set2):
    set1 = set(set1)
    set2 = set(set2)
    if len(set1) == 0 and len(set2) == 0:
        return 1
    intersection = set1.intersection(set2)
    union = set1.union(set2)

    return len(intersection) / len(union)


def style_similarity(page1, page2):
    """
    Computes CSS style Similarity between two DOM trees

    A = classes(Document_1)
    B = classes(Document_2)

    style_similarity = |A & B| / (|A| + |B| - |A & B|)

    :param page1: html of the page1
    :param page2: html of the page2
    :return: Number between 0 and 1. If the number is next to 1 the page are really similar.
    """
    classes_page1 = get_classes(page1)
    classes_page2 = get_classes(page2)
    return jaccard_similarity(classes_page1, classes_page2)

聚类

文中使用的是共享最近邻算法,复现时使用DBSCAN,一种基于密度的聚类方法。

import numpy as np
from sklearn.cluster import DBSCAN
htmls = ["<html></html>",]
threshold = 0.9
similarity_matrix = np.zeros((len(htmls), len(htmls)))

# 需要计算相似度矩阵, similarity_matrix由结构相似度和样式相似度计算得到
"""
code...
"""


# 假设你有一个距离矩阵
# 注意,DBSCAN需要一个距离矩阵,而不是相似度矩阵
# 你可能需要将你的相似度矩阵转换为距离矩阵,例如,通过 d = 1 - s
distance_matrix = 1 - similarity_matrix

# 使用DBSCAN进行聚类
# eps是邻域的大小,min_samples是形成核心对象所需要的最小邻居数目
# 你可能需要调整这些参数来得到好的聚类结果
clustering = DBSCAN(eps=1 - threshold, min_samples=2, metric='precomputed')
labels = clustering.fit_predict(distance_matrix)

结果

从cc数据中抽取了部分域名的网站,聚类结果大致如下:

image-20231007145932631

image-20231007150022247