There is given an alphabet of C characters, (i.e. "abcde"), and I sample up to S characters from this alphabet (with replacement), i.e. getting words like "cacb".
From the sampled characters the order is unimportant (i.e. "cacd" is the same as "dacc", but different from "daac" or "acd").
I want to "hash" each possible word, i.e. give them a unique number (disregarding to the ordering). As the hash will be used as a memory address, the hash should be efficient to calculate (from word->hash, the other direction is never used) - and more important as dense as possible (i.e. after all possible words of S characters are hashed, all numbers should fill an interval completely - to save memory).
The unimportant ordering can be dealt with, as one could normalize each word in a sequence of raising characters (i..e. "cacd" is normalized to "accd") or counting the occurences of each character (i.e. "cacd" would be then the tuple (1,0,2,1,0) or something similar).
Does someone know of some simple scheme like this ?
Edit: oops, I missed the efficient hash and limited length requirement. Look into combinatorial number systems and partitions.
Edit 2: And look into multisets. I recall this was mentioned recently here somewhere. See:
http://en.wikipedia.org/wiki/Multiset
http://en.wikipedia.org/wiki/Combinatorial_number_system
I think what you want to do is (a) select your string, which is really a multiset, (b) map the multiset to a combination (see the cites), then (c) use a combinatorial number system to convert the combination to an integer. That should generate a perfect hash.
Select row_number() over(order by a.letter,b.letter,c.letter,d.letter) rownum,
a.letter c1,
b.letter c2,
c.letter c3,
d.letter c4
from HASH a
inner join HASH b
on a.letter <= b.letter
inner join HASH c
on b.letter <= c.letter
inner join HASH d
on c.letter <= d.letter
Where d.letter < 'f'
Order By 1,2,3,4
Rownum | c1 | c2 | c3 | c4 |
---|---|---|---|---|
1 | a | a | a | a |
2 | a | a | a | b |
3 | a | a | a | c |
4 | a | a | a | d |
5 | a | a | a | e |
6 | a | a | b | b |
7 | a | a | b | c |
8 | a | a | b | d |
9 | a | a | b | e |
10 | a | a | c | c |
11 | a | a | c | d |
12 | a | a | c | e |
13 | a | a | d | d |
14 | a | a | d | e |
15 | a | a | e | e |
16 | a | b | b | b |
17 | a | b | b | c |
18 | a | b | b | d |
19 | a | b | b | e |
20 | a | b | c | c |
21 | a | b | c | d |
22 | a | b | c | e |
23 | a | b | d | d |
24 | a | b | d | e |
25 | a | b | e | e |
26 | a | c | c | c |
27 | a | c | c | d |
28 | a | c | c | e |
29 | a | c | d | d |
30 | a | c | d | e |
31 | a | c | e | e |
32 | a | d | d | d |
33 | a | d | d | e |
34 | a | d | e | e |
35 | a | e | e | e |
36 | b | b | b | b |
37 | b | b | b | c |
38 | b | b | b | d |
39 | b | b | b | e |
40 | b | b | c | c |
41 | b | b | c | d |
42 | b | b | c | e |
43 | b | b | d | d |
44 | b | b | d | e |
45 | b | b | e | e |
46 | b | c | c | c |
47 | b | c | c | d |
48 | b | c | c | e |
49 | b | c | d | d |
50 | b | c | d | e |
51 | b | c | e | e |
52 | b | d | d | d |
53 | b | d | d | e |
54 | b | d | e | e |
55 | b | e | e | e |
56 | c | c | c | c |
57 | c | c | c | d |
58 | c | c | c | e |
59 | c | c | d | d |
60 | c | c | d | e |
61 | c | c | e | e |
62 | c | d | d | d |
63 | c | d | d | e |
64 | c | d | e | e |
65 | c | e | e | e |
66 | d | d | d | d |
67 | d | d | d | e |
68 | d | d | e | e |
69 | d | e | e | e |
70 | e | e | e | e |
Does your problem have anything to do with quick poker-hand evaluation?Quote: MangoJI have a small problem
Quote: MathExtremistIt's just a base C number system. 8 letters would map to octal numbers, etc., using the tuple construct you already have, and then you can convert to decimal to work with (for array indexing, etc.) Each base C number is S digits.
Yes this would be the brute force method, but will require space of C^S, which is ... large, and most of the space will be empty due to the ordering identity when S is comparable to C.
I'm currently thinking about some kind of tree with C links at each node. One would traverse the tree following the lowest characters in the given word, discard the character from the word, and then proceed from there recursively until the word is empty. the good point would be: access time would be linear to S, but addressing is slow, with S reads access across the whole memory. Space efficiency would depend on C compared to the size of the data stored for each word.
Not sure if I like it enough to settle for this approach - need to sleep over it.
Edit:
@MathExtremist: Wow, this is spot on. The word is indeed a mutiset. I didn't fully understood your second article as there are many unfamiliar terms used - Seems I will just play around with the stated bijective formula.
Edit: Argh.... the ordering bijection is only for distinct characters....
Quote: teliotDoes your problem have anything to do with quick poker-hand evaluation?
Not sure if it could be applied to addressing poker hands.
The project is about money management in tournaments for a game of chance, where your performance would depends on how well you bet in relation to each others current bankrolls. It is pretty simple to calculate the optimum betsize for the final round of the tournament, and giving you good estimate of success.
The possible bankrolls form the alphabet, each opponent is thus represented by a character, and all opponents left form the word. "character ordering" is discarded as you don't care about the specific identity of your opponents. I just picked "words", "alphabet" and "characters" for a clearer concept of what the problem is.
Anyway, once you know all EV estimates of success for the final round for all (reasonable) bankroll configurations, you can calculate optimum betsize for the round before the final. Do it again for any bankroll configuration - and in principle you could calculate the overall tournament strategy down to the very first bet.
Fast caching is crucial here, but once you have a good cache routine you can calculate any round from the round following. Cache must also be small enough so you can make the bankroll binning finer (i.e. more accurate). The last issue is calculating time preparing the caches...
Quote: MangoJ@MathExtremist: Wow, this is spot on. The word is indeed a mutiset. I didn't fully understood your second article as there are many unfamiliar terms used - Seems I will just play around with the stated bijective formula.
Edit: Argh.... the ordering bijection is only for distinct characters....
No, distinct characters makes it a regular set, not a multiset. You're going from right-to-left per the Wikipedia picture. The bijection is from a first multiset (which you already have) to a second set of combinations (of same size, just from a larger range). So you just need to write a conversion function from your multiset item to a corresponding combination item, then run the combination through the combinadic to return its index.
Quote: MathExtremistNo, distinct characters makes it a regular set, not a multiset. You're going from right-to-left per the Wikipedia picture. The bijection is from a first multiset (which you already have) to a second set of combinations (of same size, just from a larger range). So you just need to write a conversion function from your multiset item to a corresponding combination item, then run the combination through the combinadic to return its index.
I thought only English was to be spoken in the forum ?
Quote: MathExtremistNo, distinct characters makes it a regular set, not a multiset. You're going from right-to-left per the Wikipedia picture. The bijection is from a first multiset (which you already have) to a second set of combinations (of same size, just from a larger range). So you just need to write a conversion function from your multiset item to a corresponding combination item, then run the combination through the combinadic to return its index.
Thanks, I missed the bijection between the multiset and the combination. In the meanwhile I found a much simpler approach in some research paper. If someone cares, here is the relevant code (for an alphabet of 9 characters and words up to 6 characters).
const unsigned long int magic[6][9] = {
{3003, 1726, 924, 462, 210, 84, 28, 7, 1},
{ 1287, 792, 462, 252, 126, 56, 21, 6, 1},
{ 495, 330, 210, 126, 70, 35, 15, 5, 1},
{ 165, 120, 84, 56, 35, 20, 10, 4, 1},
{ 45, 36, 28, 21, 15, 10, 6, 3, 1},
{ 9, 8, 7, 6, 5, 4, 3, 2, 1}
};
unsigned long int address = 0;
for(int i=0, int k=0; k<9; k++) {
for(int c = 0; c < word[k]; c++) {
address += magic[i++][k];
}
}
where word[k] is the word as a normalized tuple, i.e. "bacc" is the array with elements {1,1,2,0,0,0}, and "address" will be the resulting natural number.
The magic numbers table is constructed quite simple: The last row and rightmost column is trivial, and all other entries follow magic[n][m] = magic[n][m+1] + magic[n+1][m]. (Of course this can be initialized to any given fixed dimensions in a loop.)
Quote: MangoJNot sure if it could be applied to addressing poker hands.
I suspect he asked because that is a common usage: http://www.codingthewheel.com/archives/poker-hand-evaluator-roundup
Quote: MangoJ
{3003, 1726, 924, 462, 210, 84, 28, 7, 1},
The 1726 should be 1716.
I love when this site opens my eyes to new things. Thanks MangoJ and ME.
int calcMultiSetAddress2(int[] word, int maxWordLength)
{
int address = 0;
int alphabetSize = word.length;
for(int k=0; k<alphabetSize; k++)
{
int i = maxWordLength;
for(int c = word[k]-1; c >=0; c--)
{
address += combin(alphabetSize-k+i-1,i--);
}
}
return address;
}
Did you test your code ? I see no problem iterating i and c backwards, but it seems in your code i gets initialized for every iteration of k, while the original code doesn't do it. Also, I'm more of a fan of precomputed lookup-tables than live function calls when it comes to performance, especially when the parameter space is well defined and small enough. Of course unless there are mistakes in the table^^
Since you understand the algorithm much better than I do, is it possible to have addiitional properties like: for n<m the address of a n-character word is always less than the address of am m-character word ? I.e. in address space address 0 would be assigned to empty word, address range 1..S would be assigned by the single-character word, address (S+1)...(something) is assigned by all 2-character words ?
This way the "maxWordLength" would not be fixed beforehand, and the increase in maxWordLength would only increase the total address space, but no rearrangement. This could be a very handy property....
Quote: MangoJThanks for the correction!
Did you test your code ?
I messed up the code with declaring i as I did. It should be as you had it. The results of my function match yours.
Quote:
I see no problem iterating i and c backwards, but it seems in your code i gets initialized for every iteration of k, while the original code doesn't do it.
see above
Quote:
Also, I'm more of a fan of precomputed lookup-tables than live function calls when it comes to performance, especially when the parameter space is well defined and small enough. Of course unless there are mistakes in the table^^
I completely agree. If you know the size of the alphabet and the max word size, you should precompute.
Quote:
Since you understand the algorithm much better than I do, is it possible to have addiitional properties like: for n<m the address of a n-character word is always less than the address of am m-character word ? I.e. in address space address 0 would be assigned to empty word, address range 1..S would be assigned by the single-character word, address (S+1)...(something) is assigned by all 2-character words ?
This way the "maxWordLength" would not be fixed beforehand, and the increase in maxWordLength would only increase the total address space, but no rearrangement. This could be a very handy property....
I'll look at this.
int calcMultiSetAddress(int[] word)
{
int address = 0;
int alphabetSize = word.length-1;
int wordSize = 0;
for(int i = 0; i < word.length; i++) wordSize+=word;
for(int i = 0; i < wordSize; i++) address += combin(alphabetSize+i,i);
for(int i = wordSize, k=0; k<alphabetSize; k++)
{
for(int c = 0; c < word[k]; c++)
{
address += combin(alphabetSize-k+i-1,i--);
}
}
return address;
}
word address
[0, 0, 0, 0, 0, 0] 0
[0, 0, 0, 0, 0, 1] 1
[0, 0, 0, 0, 1, 0] 2
[0, 0, 0, 1, 0, 0] 3
[0, 0, 1, 0, 0, 0] 4
[0, 1, 0, 0, 0, 0] 5
[1, 0, 0, 0, 0, 0] 6
[0, 0, 0, 0, 0, 2] 7
[0, 0, 0, 0, 1, 1] 8
[0, 0, 0, 0, 2, 0] 9
[0, 0, 0, 1, 0, 1] 10
[0, 0, 0, 1, 1, 0] 11
[0, 0, 0, 2, 0, 0] 12
[0, 0, 1, 0, 0, 1] 13
[0, 0, 1, 0, 1, 0] 14
[0, 0, 1, 1, 0, 0] 15
[0, 0, 2, 0, 0, 0] 16
[0, 1, 0, 0, 0, 1] 17
[0, 1, 0, 0, 1, 0] 18
[0, 1, 0, 1, 0, 0] 19
[0, 1, 1, 0, 0, 0] 20
[0, 2, 0, 0, 0, 0] 21
[1, 0, 0, 0, 0, 1] 22
[1, 0, 0, 0, 1, 0] 23
[1, 0, 0, 1, 0, 0] 24
[1, 0, 1, 0, 0, 0] 25
[1, 1, 0, 0, 0, 0] 26
[2, 0, 0, 0, 0, 0] 27
[0, 0, 0, 0, 0, 3] 28
[0, 0, 0, 0, 1, 2] 29
[0, 0, 0, 0, 2, 1] 30
[0, 0, 0, 0, 3, 0] 31
[0, 0, 0, 1, 0, 2] 32
[0, 0, 0, 1, 1, 1] 33
[0, 0, 0, 1, 2, 0] 34
[0, 0, 0, 2, 0, 1] 35
[0, 0, 0, 2, 1, 0] 36
[0, 0, 0, 3, 0, 0] 37
[0, 0, 1, 0, 0, 2] 38
[0, 0, 1, 0, 1, 1] 39
[0, 0, 1, 0, 2, 0] 40
[0, 0, 1, 1, 0, 1] 41
[0, 0, 1, 1, 1, 0] 42
[0, 0, 1, 2, 0, 0] 43
[0, 0, 2, 0, 0, 1] 44
[0, 0, 2, 0, 1, 0] 45
[0, 0, 2, 1, 0, 0] 46
[0, 0, 3, 0, 0, 0] 47
[0, 1, 0, 0, 0, 2] 48
[0, 1, 0, 0, 1, 1] 49
[0, 1, 0, 0, 2, 0] 50
[0, 1, 0, 1, 0, 1] 51
[0, 1, 0, 1, 1, 0] 52
[0, 1, 0, 2, 0, 0] 53
[0, 1, 1, 0, 0, 1] 54
[0, 1, 1, 0, 1, 0] 55
[0, 1, 1, 1, 0, 0] 56
[0, 1, 2, 0, 0, 0] 57
[0, 2, 0, 0, 0, 1] 58
[0, 2, 0, 0, 1, 0] 59
[0, 2, 0, 1, 0, 0] 60
[0, 2, 1, 0, 0, 0] 61
[0, 3, 0, 0, 0, 0] 62
[1, 0, 0, 0, 0, 2] 63
[1, 0, 0, 0, 1, 1] 64
[1, 0, 0, 0, 2, 0] 65
[1, 0, 0, 1, 0, 1] 66
[1, 0, 0, 1, 1, 0] 67
[1, 0, 0, 2, 0, 0] 68
[1, 0, 1, 0, 0, 1] 69
[1, 0, 1, 0, 1, 0] 70
[1, 0, 1, 1, 0, 0] 71
[1, 0, 2, 0, 0, 0] 72
[1, 1, 0, 0, 0, 1] 73
[1, 1, 0, 0, 1, 0] 74
[1, 1, 0, 1, 0, 0] 75
[1, 1, 1, 0, 0, 0] 76
[1, 2, 0, 0, 0, 0] 77
[2, 0, 0, 0, 0, 1] 78
[2, 0, 0, 0, 1, 0] 79
[2, 0, 0, 1, 0, 0] 80
[2, 0, 1, 0, 0, 0] 81
[2, 1, 0, 0, 0, 0] 82
[3, 0, 0, 0, 0, 0] 83
As far as I see, the second loop could be eliminated by "address += combin(alphabetSize + wordSize, wordSize - 1)".
Then it would be efficient to keep track of wordSize in the word array itself (i.e. as [0] index, with the character frequencies starting at [1])
Then the code could look like
int calcMultiSetAddress(int[] word) {
int address = combin(word.length + word[0], word[0]-1);
for(int i = word[0], k=1; k<=word.length; k++) {
for(int c = 0; c < word[k]; c++) {
address += combin(word.length-k+i,i--);
}
}
return address;
}
What do you think ?
Quote: MangoJHey, that's pretty awesome man!
What do you think ?
Thanks!
I like removing the extra loops. Your comment about removing the address+= loop jogged my memory that I don't need to do the loop at all (but it ended up a little different from your suggestion), I just had to realize that c(n,m) + c(n-1,m-1) + c(n-2,m-2) ... = c(n+1,m).
So, here is what I came up with and tested, with identical results as before:
public static int calcMultiSetAddress(int[] word) {
int address = (int) combin(word.length - 2 + word[0], word[0] - 1);
for (int i = word[0], k = 1; k < word.length - 1; k++) {
for (int c = 0; c < word[k]; c++) {
address += combin(word.length - 2 - k + i, i--);
}
}
return address;
}
I'm happy to help, but I'm sure it will help me out some day too.