Unable to preload libumem to replace libc malloc #9

Closed
opened 2016-11-25 03:02:46 +00:00 by kellabyte · 7 comments
kellabyte commented 2016-11-25 03:02:46 +00:00 (Migrated from github.com)

I'm trying to use LD_PRELOAD with libumem to replace the libc malloc(). I can see it being loaded, but when I performance profile with perf top, libc's malloc is still the one being used.

I'm benchmarking using [Haywire](http://github.com/haywire/haywire)

git clone https://github.com/gburd/libumem.git lib/libumem
cd lib/libumem
./autogen.sh
./configure
make
cd ..

LD_PRELOAD=./lib//libumem/.libs/libumem.so /usr/bin/ldd ./build/hello_world
        linux-vdso.so.1 =>  (0x00007fffdfdfb000)
	../libumem/.libs/libumem.so (0x00007f4a45ec0000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4a45c9c000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f4a45a93000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4a456c9000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4a454c5000)
	/lib64/ld-linux-x86-64.so.2 (0x0000555dc31f9000)

perf top

   9.70%  hello_world         [.] http_parser_execute
   8.16%  libc-2.21.so        [.] malloc
   4.62%  libc-2.21.so        [.] free
   3.68%  libc-2.21.so        [.] __libc_calloc
   3.18%  hello_world         [.] http_request_buffer_reassign_pin
   3.14%  hello_world         [.] http_request_buffer_pin
   2.95%  libc-2.21.so        [.] 0x00000000001452a0
   2.43%  hello_world         [.] http_request_buffer_locate
   2.04%  libc-2.21.so        [.] 0x000000000014d6b0
   1.89%  [kernel]            [k] native_queued_spin_lock_slowpath
   1.67%  libc-2.21.so        [.] 0x000000000014d86c
   1.36%  [kernel]            [k] tcp_sendmsg
   1.34%  libc-2.21.so        [.] 0x00000000000806dd
   1.29%  libc-2.21.so        [.] 0x0000000000081890

I thought perhaps I needed to load libumem_malloc.so as well, but if I try the following, calloc() runs over and over even before I run any kind of benchmark against the service, and the service never starts correctly.

LD_PRELOAD="./lib/libumem/.libs/libumem.so ./lib/libumem/.libs/libumem_malloc.so" ./build/hello_world

perf top
  96.40%  libumem_malloc.so.0.0.0  [.] calloc
   0.60%  libumem_malloc.so.0.0.0  [.] calloc@plt
   0.21%  [kernel]                 [k] native_write_msr_safe
   0.20%  [kernel]                 [k] rcu_check_callbacks
   0.15%  [kernel]                 [k] ktime_get
   0.12%  [kernel]                 [k] apic_timer_interrupt
   0.11%  [kernel]                 [k] cpu_needs_another_gp
   0.08%  [kernel]                 [k] update_sd_lb_stats
   0.08%  [kernel]                 [k] ktime_get_update_offsets_now
   0.07%  [kernel]                 [k] menu_select
   0.07%  [kernel]                 [k] native_queued_spin_lock_slowpath
   0.07%  [kernel]                 [k] cpuidle_enter_state
   0.06%  [kernel]                 [k] _raw_spin_lock_irqsave
   0.06%  [kernel]                 [k] note_gp_changes
gburd commented 2016-11-30 03:18:25 +00:00 (Migrated from github.com)

I've replicated your issue by doing exactly what you did (and what anyone would do; it's in the INSTALL directions, after all!) and discovered that changing one step in your process fixes it. Why, I don't know yet, but I have some guesses I'm looking into. Configuring with CFLAGS set to -O0 or -Og eliminates the issue (I've been testing with gcc version 6.2.0 20160901 (Ubuntu 6.2.0-3ubuntu11~16.04)).

$ ./configure CFLAGS="-g -Og -march=native -mtune=native"

A simple "hello, world!" application with a single 1 KB malloc() on entry to main() was used to test this.

$ gcc -g -Og hello.c -o hello -L/usr/local/lib -lumem_malloc
$ ldd hello
	linux-vdso.so.1 =>  (0x00007ffd86395000)
	libumem_malloc.so.0 => /usr/local/lib/libumem_malloc.so.0 (0x00007ff4f1b94000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff4f17cb000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff4f15ad000)
	libumem.so.0 => /usr/local/lib/libumem.so.0 (0x00007ff4f130d000)
	/lib64/ld-linux-x86-64.so.2 (0x000055621bad8000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff4f1109000)
$ ./hello
Hello World
$ rm hello
$ gcc hello.c -o hello
$ ldd hello
	linux-vdso.so.1 =>  (0x00007ffec14e6000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f021517f000)
	/lib64/ld-linux-x86-64.so.2 (0x000055ccfbc26000)
$ LD_PRELOAD="${PWD}/.libs/libumem_malloc.so" ./hello
Hello World
$ LD_PRELOAD="${PWD}/.libs/libumem_malloc.so" printenv LD_PRELOAD
/home/ubuntu/ws/libumem/.libs/libumem_malloc.so

So, somewhere GCC is optimizing (reordering, eliminating or whatever) some critical section of the code. Feel free to test Haywire with umem built without optimizations while I continue to look for the root cause.

kellabyte commented 2016-11-30 07:19:13 +00:00 (Migrated from github.com)

Yup I think that's working now. Performance is terrible but we did turn off optimizations :)

gburd commented 2016-11-30 14:57:13 +00:00 (Migrated from github.com)

Any profiling data that pointed at hotspots in umem would be helpful to me, if you have the time to collect the data.

rayrapetyan commented 2018-04-18 18:27:36 +00:00 (Migrated from github.com)

The -O0 trick doesn't work with the clang compiler :(
What I discovered is: when the app starts, there is a call to calloc() (from thread creation, before main()). calloc() then calls umem_alloc(), which in turn calls umem_cache_alloc(), which creates a thread, which calls calloc()... an infinite loop, and the app crashes with a stack overflow.
So we need to link our app with a pthread compiled against a non-umem calloc somehow. Any thoughts?

jepio commented 2018-06-08 21:07:04 +00:00 (Migrated from github.com)

I stumbled across the issue of getting stuck in calloc. A recent gcc converts a malloc+memset pair into a call to calloc, which makes calloc recurse into itself. I renamed `malloc` to `malloc_rpl`, called that from `calloc`, and added a `malloc` that forwards to `malloc_rpl`; that seems to work around the issue.

earthelf commented 2020-06-22 12:27:18 +00:00 (Migrated from github.com)

Just try `-fno-builtin-calloc` to prevent gcc from optimizing malloc+memset into calloc.

gburd commented 2020-06-24 13:46:17 +00:00 (Migrated from github.com)

fixed e1eb2c84132e3c623d492195e66f3d0facbec6d5
Reference: greg/libumem#9