Ex­plor­ing Prob­a­bilis­tic Data Struc­tures: Bloom Fil­ters

Bloom fil­ters, con­ceived by Bur­ton Howard Bloom in 1970, are prob­a­bilis­tic data struc­tures that are used to test whether an el­e­ment be­longs to a set or not.

OpenSource For You - - Contents -

Pro­cess­ing a large set of en­ti­ties be­comes an ex­pen­sive and te­dious task, par­tic­u­larly for set op­er­a­tions, such as the union and in­ter­sec­tion of el­e­ments in a par­tic­u­lar set. An ex­am­ple would be the pro­cess­ing of a very large set of num­bers (e.g., 100 million) and ex­e­cut­ing a query to de­ter­mine whether a num­ber be­longs to the group or not. The naive method of ex­e­cut­ing such a query is by it­er­at­ing over all set mem­bers. How­ever, this is ex­pen­sive, con­sumes pre­cious CPU cy­cles and also pol­lutes the cache mem­ory.

Prob­a­bilis­tic data struc­tures el­e­gantly solve such prob­lems with min­i­mal re­sources and with an ac­cept­able er­ror rate. They out­per­form reg­u­lar so­lu­tions like HashMaps and HashTables, which have large mem­ory foot­prints. In this ar­ti­cle, we will dis­cuss one such data struc­ture—the Bloom fil­ter. It is one of the most pop­u­lar data struc­tures due to its sim­plic­ity of im­ple­men­ta­tion, small mem­ory foot­print, and ef­fi­ciency. It is used for re­quest fil­ter­ing over a large set of data, and used by Web browsers such as Chrome, Web caches of Aka­mai, and web­sites like Quora and Medium.

What is a Bloom fil­ter

A Bloom fil­ter was in­vented by a sci­en­tist called Bur­ton Howard Bloom in 1970, to test whether or not an el­e­ment be­longs to a set. The prin­ci­pal prop­erty of a Bloom fil­ter is based on prob­a­bil­ity, in which a pos­i­tive re­sponse of the fil­ter means that the given el­e­ment could be­long to the set. How­ever, a neg­a­tive re­sponse guar­an­tees that the el­e­ment is not present in the set. The ter­mi­nol­ogy for the sce­nario in which an el­e­ment is present in the set but does not re­turn ‘True’ is termed a false pos­i­tive or the er­ror rate of the fil­ter.

Defin­ing a Bloom fil­ter

As­sume we have an in­put set S = [ x1, x2,...xn ] of n el­e­ments. A Bloom fil­ter B can be rep­re­sented by an ar­ray of m bits (ze­roini­tialised). The size of Bloom fil­ter is a ma­jor fac­tor in­flu­enc­ing its er­ror rate. A large size fil­ter usu­ally has a lower er­ror rate. B= [ b1, b2 ...bm-1, bm ]

A Bloom fil­ter uses k in­de­pen­dent hash func­tions h1...hk that are as­sumed to be uni­formly dis­trib­uted over the range 1 . ... m.

A stan­dard Bloom fil­ter has two op­er­a­tions de­fined as fol­lows:

ƒ add(x) ­ Add to the Bloom fil­ter. For 1≤i≤kB [hi(x)]=1 x ∈S ƒ (x) - Check if x is in the Bloom fil­ter. For 1 ≤ i ≤ k ifB[hi(x)]=1 re­turnTrue, else re­turn False. To rep­re­sent set S by a Bloom fil­ter B, we need to process each mem­ber of S with add func­tion. Here is a sim­ple

ex­am­ple to un­der­stand the op­er­a­tions of Bloom fil­ter. ƒ

For an in­put set, S = [1,2,3,4]

ƒNum­ber of hash func­tions, k=2 ;

ƒSize of Bloom Fil­ter, m=8 ;

ƒWe have two hash func­tions: h1,h2

Bloom fil­ter prepa­ra­tion

The prepa­ra­tion phase in­volves set­ting an en­try us­ing the add op­er­a­tion in the Bloom fil­ter cor­re­spond­ing to a set mem­ber. The add op­er­a­tion maps the in­put el­e­ment us­ing the two hash func­tions. Each hash func­tion out­put is an in­dex into the Bloom fil­ter ar­ray B. After cal­cu­lat­ing the in­dex, it is set to one. This process will be re­peated for all the in­put en­tries. The add op­er­a­tion is de­fined as fol­lows:

add(x): i = h1(x) j = h2(x) B[i] = 1 B[j] = 1

The Bloom fil­ter bit ar­ray is as fol­lows:

add (2) h1(2) = 6 h2(2) = 7

add (3) h1(3) = 4 h2(3) = 1

To de­ter­mine the mem­ber­ship of an el­e­ment in an in­put set S, the check op­er­a­tion is used. This op­er­a­tion is de­fined as fol­lows:

bool check(x): i = h1(x) j = h2(x) re­turn i && j

The in­put value is sub­jected to the same hash func­tions, and the cor­re­spond­ing bits at the in­dices are checked if they are set. If all the val­ues are set, then the in­put value might be present in the set. This is al­ways prob­a­ble, but not cer­tain be­cause the bits on the po­si­tions in ques­tion might be set by other add op­er­a­tions of other el­e­ments.

When the fil­ter re­turns False for a given in­put, it is cer­tain that the in­put el­e­ment does not ex­ist in the set. The con­di­tion of a false re­turn will oc­cur when one or more hashed in­dices have zero value. No add op­er­a­tion ever sets those po­si­tions to be 1, and hence the in­put el­e­ment is not present in the set.

check (7) h1(3) = 4 h2(3) = 0

In the above op­er­a­tion, since a check would re­turn False, the el­e­ment is def­i­nitely not present in the set. This is the util­ity of the Bloom fil­ter, which elim­i­nates spu­ri­ous lookup op­er­a­tions.

check (5) h1(3) = 6 h2(3) = 7

In the op­er­a­tion given above, since the check re­turns True, it is pos­si­ble that 5 is present in the set. This is a false pos­i­tive re­sult and an er­ror of the fil­ter.

The er­ror rate of a Bloom fil­ter

The ex­pected value of false pos­i­tive prob­a­bil­ity p de­cides the op­ti­mal val­ues for Bloom fil­ter size m, and the num­ber of hash func­tions k, for an in­put list of el­e­ments. We as­sume that all the k hash func­tions are uni­form and will choose all m po­si­tions in the bit ar­ray B of Bloom Fil­ter with equal prob­a­bil­ity of 1 . m

The prob­a­bil­ity of choos­ing a po­si­tion among m po­si­tions is m. 1 The prob­a­bil­ity of not choos­ing a po­si­tion is (1-(m)). 1 Since there are k hash func­tions, the prob­a­bil­ity of not choos­ing that po­si­tion for a po­si­tion is (1-(m))-k 1 .

For n en­tries, we re­peat the above process n times and get the prob­a­bil­ity of not set­ting that po­si­tion as (1-(m))-kn. 1 So, the prob­a­bil­ity that the po­si­tion is set after the above process is (1-(1(m))-kn) 1 . Con­sid­er­ing all the po­si­tions that are set, the fi­nal prob­a­bil­ity we get is (1­(1(m))-kn)k 1 which is ap­prox­i­mately (1-e ( -kn m ))k.

The ef­fec­tive­ness of the Bloom fil­ter de­pends on the val­ues of p, k, n and m. We need to find op­ti­mal val­ues of m and k de­pend­ing on the in­put val­ues of p and n. By the use of dif­fer­en­tial cal­cu­lus, we get the fol­low­ing re­sults, for

op­ti­mal val­ues.

Ex­per­i­men­tal re­sults

The Bloom fil­ter tech­nique is used by Web browsers to fil­ter ma­li­cious URLs. It helps to solve the prob­lem of fast lookup from a list of such URLs. A browser has to get the in­for­ma­tion from a re­mote server, with­out a lo­cal lookup so­lu­tion, which is slow and ex­pen­sive. One so­lu­tion is to use a HashTable to store these URLs. But HashTables have large mem­ory re­quire­ments for huge lists such as those con­tain­ing 10 million en­tries, which need 250MB to 300MB of mem­ory.

A Bloom fil­ter ef­fec­tively han­dles such prob­lems and of­fers con­stant time O(1) op­er­a­tions while us­ing neg­li­gi­ble mem­ory. A Bloom fil­ter of a size of one million en­tries would need only 1MB of mem­ory. An in­put URL is checked in the Bloom fil­ter em­bed­ded in the browser. If the fil­ter re­turns False, then one can be sure that the URL is not present in the list of ma­li­cious sites, since the false neg­a­tive case is not pos­si­ble in a Bloom fil­ter. If the fil­ter re­turns True, the in­put URL could or could not be present in the ma­li­cious URL list. So the browser has to make a re­mote call to de­ter­mine the va­lid­ity of the URL.

The Bloom fil­ter is very use­ful es­pe­cially when the in­put search is a spu­ri­ous value be­cause the fil­ter will iden­tify it as ‘not present’ in the tar­get set. This re­duces the load of re­mote calls and de­creases la­tency of the browser lookup op­er­a­tions.

We ran an ex­per­i­ment in which there was a list of one million ma­li­cious URLs gen­er­ated ran­domly. The in­com­ing URL was checked against the ma­li­cious URL list us­ing a Bloom fil­ter. We found that the Bloom fil­ter suc­cess­fully iden­ti­fied 88 per cent of ma­li­cious URLs. The code for the Bloom fil­ter im­ple­men­ta­tion and the test ex­per­i­ment is avail­able at GitHub (https://github.com/souravchk/Al­go­rithm­sand­Data­Struc­tures/tree/master/bloom­fil­ters).

Newspapers in English

Newspapers from India

© PressReader. All rights reserved.