Enormous node memory usage for tiny values

zv commented

2013-02-21 23:33:09 +00:00

(Migrated from github.com)

Hi gburd, I'm ecstatic that someone built this for and would love to use this in place of an existing gb_tree storing and searching thousands of keys but I noticed that I start getting pagefaults at storing 12k dwords on a linux machine with 16GB of RAM running R15B, making each hamt_node clock in at something like 1250000/8 bytes. I haven't yet run it through erlang:debugger(), any idea what might give?

gburd commented

2013-02-21 23:49:03 +00:00

(Migrated from github.com)

I just got it working today, it's only a week old! ;-P There are a ton of places to improve performance, this really is just my learning exercise at this point. I'll continue to make this better as I work toward building a real ctrie for Erlang. We (the Erlang community) need HAMT/Ctrie, no more being jealous of Haskell and Clojure. I did manage in tests to store 1M {int(), int()} pairs, but that was just a quick test.

zv commented

2013-02-21 23:53:27 +00:00

(Migrated from github.com)

Ok, I am going to look into this tonight and see if I can figure out the origin of this and replicate it on R16.

gburd commented

2013-02-22 00:54:51 +00:00

(Migrated from github.com)

You are the man, thanks for the help. BTW, there is at least one issue I know in the delete code where an cnode isn't collapsed when it only has one Node left. Here's how to trigger it:

2> hamt:delete(<<5>>, hamt:put([{<>, <>} || X <- lists:seq(1,6)], hamt:new())).
{hamt,{cnode,17629440,
[{snode,<<3>>,<<3>>},
{snode,<<6>>,<<6>>},
{snode,<<1>>,<<1>>},
{cnode,131072,[{snode,<<2>>,<<2>>}]},
{snode,<<4>>,<<4>>}]}}

the {cnode,131072,[{snode,<<2>>,<<2>>}]} should have collapsed to an {snode, <<2>>, <<2>>}.

You are the man, thanks for the help. BTW, there is at least one issue I know in the delete code where an cnode isn't collapsed when it only has one Node left. Here's how to trigger it: 2> hamt:delete(<<5>>, hamt:put([{<<X>>, <<X>>} || X <- lists:seq(1,6)], hamt:new())). {hamt,{cnode,17629440, [{snode,<<3>>,<<3>>}, {snode,<<6>>,<<6>>}, {snode,<<1>>,<<1>>}, {cnode,131072,[{snode,<<2>>,<<2>>}]}, {snode,<<4>>,<<4>>}]}} the {cnode,131072,[{snode,<<2>>,<<2>>}]} should have collapsed to an {snode, <<2>>, <<2>>}.

gburd commented

2013-02-22 19:46:18 +00:00

(Migrated from github.com)

BTW, FWIW I've fixed the delete issue but not yet looked closely at memory consumption yet.

zv commented

2013-02-22 19:55:36 +00:00

(Migrated from github.com)

Awesome!I couldn't replicate the issue on either R16 or R15 at home (very strange), I'm working on a NIF now to try to implement CTPOP to see if that helps me here.

gburd commented

2013-02-22 20:16:15 +00:00

(Migrated from github.com)

Cool, you should consider this C implementation (http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) or, if you're game for optimizing down to the ASM instruction to do the popcnt on those architectures that support it.
https://github.com/coapp-packages/nasm/blob/master/test/popcnt.asm
https://patchwork.kernel.org/patch/81479/
https://subversion.assembla.com/svn/min-loss-hashing/trunk/utils/bitops.h using __builtin_popcount*
or on some architectures there exists a POPCNT instruction in the CPU. This is a bit like the wonderful world of crc... it's deeper than expected.

The thing to do is to lobby for a popcnt BIF. :)

Cool, you should consider this C implementation (http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) or, if you're game for optimizing down to the ASM instruction to do the popcnt on those architectures that support it. https://github.com/coapp-packages/nasm/blob/master/test/popcnt.asm https://patchwork.kernel.org/patch/81479/ https://subversion.assembla.com/svn/min-loss-hashing/trunk/utils/bitops.h using __builtin_popcount* or on some architectures there exists a POPCNT instruction in the CPU. This is a bit like the wonderful world of crc... it's deeper than expected. The thing to do is to lobby for a popcnt BIF. :)

gburd commented

2013-02-22 21:15:16 +00:00

(Migrated from github.com)

Last thing regarding bitpop, I never did fully debug why this wasn't working, I could spend a bit more time looking into that... to avoid the need for a NIF.

bitpop(0) -> 0;
bitpop(X) when is_integer(X), X > 0, X =< 16#FFFFFFFF ->
X1 = X - ((X bsr 1) band 16#55555555),
X2 = (X1 band 16#33333333) + ((X1 bsr 2) band 16#33333333),
((X2 + (X2 bsr 4) band 16#0F0F0F0F) * 16#01010101) bsr 24;
bitpop(X) when is_integer(X) -> %, X > 16#FFFFFFFF, X =< 16#FFFFFFFFFFFFFFFF ->
M1 = 16#5555555555555555, %% 0101...
M2 = 16#3333333333333333, %% 00110011..
M4 = 16#0f0f0f0f0f0f0f0f, %% 4 zeros, 4 ones ...
H01 = 16#0101010101010101, %% the sum of 256 to the power of 0,1,2,3...

%% 1. put count of each 2 bits into those 2 bits
X1 = X - (X bsr 1) band M1,
%% 2. put count of each 4 bits into those 4 bits
X2 = (X1 band M2) + ((X1 bsr 2) band M2),
%% 3. put count of each 8 bits into those 8 bits
X3 = (X2 + (X2 bsr 4)) band M4,
%% 4. returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
(X3 * H01) bsr 56.

Last thing regarding bitpop, I never did fully debug why this wasn't working, I could spend a bit more time looking into that... to avoid the need for a NIF. bitpop(0) -> 0; bitpop(X) when is_integer(X), X > 0, X =< 16#FFFFFFFF -> X1 = X - ((X bsr 1) band 16#55555555), X2 = (X1 band 16#33333333) + ((X1 bsr 2) band 16#33333333), ((X2 + (X2 bsr 4) band 16#0F0F0F0F) \* 16#01010101) bsr 24; bitpop(X) when is_integer(X) -> %, X > 16#FFFFFFFF, X =< 16#FFFFFFFFFFFFFFFF -> M1 = 16#5555555555555555, %% 0101... M2 = 16#3333333333333333, %% 00110011.. M4 = 16#0f0f0f0f0f0f0f0f, %% 4 zeros, 4 ones ... H01 = 16#0101010101010101, %% the sum of 256 to the power of 0,1,2,3... ``` %% 1. put count of each 2 bits into those 2 bits X1 = X - (X bsr 1) band M1, %% 2. put count of each 4 bits into those 4 bits X2 = (X1 band M2) + ((X1 bsr 2) band M2), %% 3. put count of each 8 bits into those 8 bits X3 = (X2 + (X2 bsr 4)) band M4, %% 4. returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ... (X3 * H01) bsr 56. ```

Enormous node memory usage for tiny values #1