Bloom过滤器算法的实现

题目

采用Bloom过滤器设计一个邮件地址过滤器。要求：存储1000000个垃圾邮件地址，检测错误率小于0.00001，如果采用k=8个映射函数，则Bloom过滤器的最小长度是多少？

思路

创建一个rubbish.txt文件储存垃圾邮件地址，垃圾邮件地址由程序自动生成，然后生成一个input.txt文件用于储存输入的邮箱地址，过滤后邮箱地址由output.txt储存。

使用python的bitarray来进行位操作，更加方便，主要是使用的空间少，如何使用numpy.array需要34GiB，电脑无法分配如此巨量的空间。然后使用开源的8种哈希函数进行散列，然后取最后N位，N为位的个数，进行下标设置1操作。

原理

布隆过滤器的核心就是哈希函数，可以将其看作是对 bitmap 的拓展，通过将添加的元素经过 k 个哈希函数映射到一个很长的 bit 向量中的 k 个位置，并将它们置为 1。检索时，通过判断这个位置是否都为 1 就能大概知道集合中是否存在查询的元素。具体规则是：

如果这些位置中有任何一个位置为 0，那么被检索的元素一定不在
如果这些位置都是 1，那么被检索的元素可能存在。

运行结果

*Warning：关于测试方法，由于需要用8个散列，而且测试数据为百万级，输入数据为十万级（因为要求误差不能超过0.00001），每次运行会消耗较长时间，因此本文首先测试一万条数据，筛选出合适范围，然后进行更高级别测试*

1、一万条数据测试

设置位数为1000000（一百万位）

一万组正常数据只有866个通过过滤，误删率极高

设置位数为1000000000（十亿位）

一万组数据全部通过，正确率为100%

设置位数为100000000（一亿位）

一万组数据全部通过，正确率为100%

设置位数为10000000（一千万）

一万组数据通过9962条，正确率为99.62%

通过以上测试，决定使用一亿位进行测试

2、十万条数据测试

设置位数为100000000位（一亿位）

经过运行，正确过滤数量为1000000条，正确率为100%，符合运算要求

设置位数为90000000位（9千万位）

正确率为0.008341

结论

由上文可知，至少将位数大小设置为100000000（一亿）

源代码

from bitarray import bitarray
import random

length = 90000000  # 位数
input_number = 1000000  # 输入数量


# 以下为开源的hash函数

def rs_hash(key):
    a = 378551
    b = 63689
    hash_value = 0
    for i in range(len(key)):
        hash_value = hash_value * a + ord(key[i])
        a = a * b
    return hash_value


def js_hash(key):
    hash_value = 1315423911
    for i in range(len(key)):
        hash_value ^= ((hash_value << 5) + ord(key[i]) + (hash_value >> 2))
    return hash_value


def pjw_hash(key):
    bits_in_unsigned_int = 4 * 8
    three_quarters = (bits_in_unsigned_int * 3) / 4
    one_eighth = bits_in_unsigned_int / 8
    high_bits = 0xFFFFFFFF << int(bits_in_unsigned_int - one_eighth)
    hash_value = 0
    test = 0

    for i in range(len(key)):
        hash_value = (hash_value << int(one_eighth)) + ord(key[i])
        test = hash_value & high_bits
    if test != 0:
        hash_value = ((hash_value ^ (test >> int(three_quarters))) & (~high_bits))
    return hash_value & 0x7FFFFFFF


def elf_hash(key):
    hash_value = 0
    for i in range(len(key)):
        hash_value = (hash_value << 4) + ord(key[i])
        x = hash_value & 0xF0000000
        if x != 0:
            hash_value ^= (x >> 24)
        hash_value &= ~x
    return hash_value


def bkdr_hash(key):
    seed = 131  # 31 131 1313 13131 131313 etc..
    hash_value = 0
    for i in range(len(key)):
        hash_value = (hash_value * seed) + ord(key[i])
    return hash_value


def sdbm_hash(key):
    hash_value = 0
    for i in range(len(key)):
        hash_value = ord(key[i]) + (hash_value << 6) + (hash_value << 16) - hash_value;
    return hash_value


def djb_hash(key):
    hash_value = 5381
    for i in range(len(key)):
        hash_value = ((hash_value << 5) + hash_value) + ord(key[i])
    return hash_value


def dek_hash(key):
    hash_value = len(key);
    for i in range(len(key)):
        hash_value = ((hash_value << 5) ^ (hash_value >> 27)) ^ ord(key[i])
    return hash_value


def bp_hash(key):
    hash_value = 0
    for i in range(len(key)):
        hash_value = hash_value << 7 ^ ord(key[i])
    return hash_value


def fnv_hash(key):
    fnv_prime = 0x811C9DC5
    hash_value = 0
    for i in range(len(key)):
        hash_value *= fnv_prime
        hash_value ^= ord(key[i])
    return hash_value


def ap_hash(key):
    hash_value = 0xAAAAAAAA
    for i in range(len(key)):
        if (i & 1) == 0:
            hash_value ^= ((hash_value << 7) ^ ord(key[i]) * (hash_value >> 3))
        else:
            hash_value ^= (~((hash_value << 11) + ord(key[i]) ^ (hash_value >> 5)))
    return hash_value


def get_index(hash_value):
    return int((hash_value % 10000000)/10*9)


if __name__ == '__main__':
    # 创建输入文件，会有input_number个
    file = open('input.txt', 'w')
    for i in range(input_number):
        file.write(str(random.randint(0, 1000000000)) + "@163.com\n")
    file.close()

    print("input file have created\n")
    file = open('rubbish.txt', 'r')
    hash_map = bitarray(length)
    hash_map.setall(0)
    # 哈希函数表
    hash_list = [rs_hash, js_hash, pjw_hash, elf_hash, bkdr_hash, sdbm_hash, djb_hash, dek_hash]
    cnt = 0
    for i in range(1000000):
        cnt += 1
        if cnt == 10000:
            print("process is " + str(int(i / 10000)) + "%")
            cnt = 0
        key = file.readline()
        for hash_function in hash_list:
            hash_value = hash_function(key)
            index = get_index(hash_value)
            hash_map[index] = True

    file.close()
    print("hash_map have created\n")
    file_input = open('input.txt', 'r')
    output_file = open('output.txt', 'w')
    cnt = 0
    process = 0
    while True:
        cnt += 1
        if int(cnt / (input_number / 100)) > process:
            print("process is " + str(process) + "%")
            process = cnt / (input_number / 100)
        key = file_input.readline()
        judge = 0
        if not key:
            break
        for hash_function in hash_list:
            hash_value = hash_function(key)
            index = get_index(hash_value)
            if hash_map[index]:
                judge += 1
        if judge != 8:
            output_file.write(key)
    file.close()
    file = open('hash_map.txt', 'w')
    for i in range(100000):
        file.write(str(hash_map[i]))  # 查看散列表的前100000位，方便了解散列密度

    print("output file has created\n")