public class MinHash extends Object implements Serializable
| Constructor | Description |
|---|---|
| `MinHash(double error, int dict_size)` | Initializes hash functions to compute MinHash signatures for sets built from a dictionary of `dict_size` elements, with a given similarity estimation error. |
| `MinHash(double error, int dict_size, long seed)` | Initializes hash functions to compute MinHash signatures for sets built from a dictionary of `dict_size` elements, with a given similarity estimation error. |
| `MinHash(int size, int dict_size)` | Initializes hash functions to compute MinHash signatures for sets built from a dictionary of `dict_size` elements. |
| `MinHash(int size, int dict_size, long seed)` | Initializes hash functions to compute MinHash signatures for sets built from a dictionary of `dict_size` elements. |
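
The constructors accept either a signature size or a target estimation error, so the two quantities must be convertible into each other. This page does not state the formula the class uses. The sketch below assumes the usual MinHash rule of thumb that the expected error with n hash functions is about 1/sqrt(n). The class `SignatureSizeSketch` and its methods `sizeFor` and `errorFor` are hypothetical names used only for illustration.

```java
/** Hypothetical helper, not part of the documented MinHash class. */
final class SignatureSizeSketch {

    /**
     * Returns a number of hash functions n such that 1/sqrt(n) <= error.
     * Assumption: the expected error behaves like 1/sqrt(n).
     */
    static int sizeFor(double error) {
        if (error <= 0.0 || error > 1.0) {
            throw new IllegalArgumentException("error must be in (0, 1]");
        }
        // Round up so the assumed bound is met rather than narrowly missed.
        return (int) Math.ceil(1.0 / (error * error));
    }

    /** Returns the assumed expected error for a signature of the given size. */
    static double errorFor(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return 1.0 / Math.sqrt(size);
    }

    public static void main(String[] args) {
        System.out.println(sizeFor(0.5));   // prints 4
        System.out.println(errorFor(400));
    }
}
```

Under this assumption, halving the error quadruples the signature size. The library's own rounding may differ from the `Math.ceil` used here.
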
| Modifier and Type | Method | Description |
|---|---|---|
| `static Set<Integer>` | `convert2Set(boolean[] array)` | Convert a set represented as an array of booleans to a set of integers. |
| `double` | `error()` | Computes the expected error of similarity computed using signatures. |
| `long[][]` | `getCoefficients()` | Get the coefficients used by the hash functions h_i. |
| `static double` | `jaccardIndex(boolean[] s1, boolean[] s2)` | Compute the exact Jaccard index between two sets, represented as arrays of booleans. |
| `static double` | `jaccardIndex(Set<Integer> s1, Set<Integer> s2)` | Compute the Jaccard index between two sets. |
| `int[]` | `signature(boolean[] vector)` | Computes the signature for this set. The input set is represented as a vector of booleans. |
| `int[]` | `signature(Set<Integer> set)` | Computes the signature for this set. |
| `double` | `similarity(int[] sig1, int[] sig2)` | Computes an estimate of the Jaccard similarity (the size of the intersection divided by the size of the union) between two sets, using the MinHash signatures of these two sets. |
| `static int` | `size(double error)` | Computes the size of the signature required to achieve a given error in similarity estimation. |
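
To make the two set representations and the exact Jaccard index concrete, here is a minimal self-contained sketch. It is not the library's implementation. The class `JaccardSketch` and its methods `toSet` and `jaccard` are hypothetical names, and the value returned for two empty sets is a convention chosen here, not something this page documents.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper illustrating the exact Jaccard index. */
final class JaccardSketch {

    /** Collects the indices of the true entries, mirroring the boolean[] representation. */
    static Set<Integer> toSet(boolean[] array) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < array.length; i++) {
            if (array[i]) {
                set.add(i);
            }
        }
        return set;
    }

    /** Exact Jaccard index: |s1 ∩ s2| / |s1 ∪ s2|. */
    static double jaccard(Set<Integer> s1, Set<Integer> s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0; // assumed convention: two empty sets are treated as identical
        }
        Set<Integer> intersection = new HashSet<>(s1);
        intersection.retainAll(s2);
        int unionSize = s1.size() + s2.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Integer> a = toSet(new boolean[] {true, false, true, true});  // {0, 2, 3}
        Set<Integer> b = toSet(new boolean[] {true, true, false, false}); // {0, 1}
        System.out.println(jaccard(a, b)); // prints 0.25
    }
}
```
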
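public MinHash(int size, int dict_size)
Parameters:
size - the number of hash functions (and the size of resulting signatures)
dict_size - the number of elements in the dictionary from which sets are built

public MinHash(double error, int dict_size)
Parameters:
error - the similarity estimation error
dict_size - the number of elements in the dictionary from which sets are built

public MinHash(int size, int dict_size, long seed)
Parameters:
size - the number of hash functions (and the size of resulting signatures)
dict_size - the number of elements in the dictionary from which sets are built
seed - random number generator seed; using the same value guarantees identical hashes across object instantiations

public MinHash(double error, int dict_size, long seed)
Parameters:
error - the similarity estimation error
dict_size - the number of elements in the dictionary from which sets are built
seed - random number generator seed; using the same value guarantees identical hashes across object instantiations

public static double jaccardIndex(Set<Integer> s1, Set<Integer> s2)
Parameters:
s1 - the first set
s2 - the second set

public static double jaccardIndex(boolean[] s1, boolean[] s2)
Parameters:
s1 - the first set, represented as an array of booleans
s2 - the second set, represented as an array of booleans

public static Set<Integer> convert2Set(boolean[] array)
Parameters:
array - the set, represented as an array of booleans

public static int size(double error)
Parameters:
error - the similarity estimation error to achieve

public final int[] signature(boolean[] vector)
Parameters:
vector - the input set, represented as a vector of booleans

public final int[] signature(Set<Integer> set)
Parameters:
set - the input set

public final double similarity(int[] sig1, int[] sig2)
Parameters:
sig1 - MinHash signature of set1
sig2 - MinHash signature of set2 (produced using the same coefficients)

public final double error()

public final long[][] getCoefficients()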
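
This page does not describe how signatures are computed internally. The sketch below shows one common MinHash construction, assuming hash functions of the form h_i(x) = (a_i * x + b_i) mod p with seeded random coefficients, which would be consistent with a `long[][]` coefficient table and a `seed` parameter. The class `MinHashSketch`, the prime `PRIME`, and the coefficient layout are all hypothetical and may differ from the actual implementation. Set elements are assumed to be non-negative integers.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;

/** Hypothetical illustration of MinHash signatures; not the library's code. */
final class MinHashSketch {

    private static final long PRIME = 2147483647L; // 2^31 - 1, assumed modulus
    private final long[][] coef;                   // coef[i] = {a_i, b_i}

    MinHashSketch(int size, long seed) {
        Random random = new Random(seed); // same seed gives the same coefficients
        coef = new long[size][2];
        for (int i = 0; i < size; i++) {
            coef[i][0] = 1 + random.nextInt(Integer.MAX_VALUE - 1); // a_i is non-zero
            coef[i][1] = random.nextInt(Integer.MAX_VALUE);
        }
    }

    /** For each hash function, keeps the minimum hash value over the set. */
    int[] signature(Set<Integer> set) {
        int[] sig = new int[coef.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int x : set) { // assumes x >= 0
            for (int i = 0; i < coef.length; i++) {
                int h = (int) ((coef[i][0] * x + coef[i][1]) % PRIME);
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    /** Fraction of positions where the two signatures agree. */
    static double similarity(int[] sig1, int[] sig2) {
        if (sig1.length != sig2.length) {
            throw new IllegalArgumentException("signatures must have the same length");
        }
        int equal = 0;
        for (int i = 0; i < sig1.length; i++) {
            if (sig1[i] == sig2[i]) {
                equal++;
            }
        }
        return (double) equal / sig1.length;
    }
}
```

The fraction of matching positions estimates the Jaccard index only when both signatures come from the same coefficients, which is why the `sig2` parameter above carries that requirement.
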
Copyright © 2019. All rights reserved.