
There's a patch to make the current conservative GC a precise GC, which will totally fix this problem. It's in the process of being reviewed and integrated into the core.

http://codereview.appspot.com/6114046/

FWIW, the panelists on the recent "Go in production" panel at I/O responded that garbage collection wasn't an issue for them on 64-bit systems. http://www.youtube.com/watch?v=kKQLhGZVN4A



That's awesome, kudos. How do you deal with maintaining the distinction between pointers and values on the stack and in registers? I didn't see any code to do that, skimming through that patch. Since it's a precise GC, you need to maintain the distinction.

In Rust we have a very large patchset to LLVM to allow us to do this; it involves modifying nearly every stage of the LLVM pipeline. Curious how you did that in gccgo.

Edit: I just looked at the source and it still seems conservative for the stack and registers. So it's not a precise GC yet and will not eliminate false positives. (This is of course totally fine, I just wanted to clarify for myself.)


This isn't my patch, so I can't speak to the specifics of it. Also, it isn't for gccgo, but the gc compiler.


That's too bad. Hopefully someone will improve `gccgo` as well. I actually started out experimenting with Go using the `gc` compiler, and found performance (basically manipulating large arrays sequentially) atrocious. Moving to `gccgo` improved performance four-fold, making my unoptimized Go test code in some cases faster than my unoptimized C test code.


How long ago was that? The gc compilers have seen lots of improvement recently, and more will be merged soon.

Also, if the problem is that pathological, it might be worth filing an issue.


Go 1.0.2, which is the newest stable release. I will see if I have some extra time to file a report.

Slightly off topic (but relevant to my comment above), do you know why this:

    func TraverseInline() {
      ints := make([]int, N)
      sum := 0
      for i := range ints {
        sum += i
      }
    }
should be slower than:

    func TraverseArray(ints []int) int {
      sum := 0
      for i := range ints {
        sum += i
      }
      return sum
    }

    func TraverseWithFunc() {
      ints := make([]int, N)
      sum := TraverseArray(ints)
      _ = sum
    }
Seeing a huge difference here. With an array of 300000000 ints, TraverseInline() takes about 750ms, whereas TraverseWithFunc() takes 394ms. With gccgo (GCC 4.7.1), the difference is much smaller, but it's still there: 196ms compared to 172ms. (These are best-case figures; I'm doing lots of loops.) Go is compiled, not JITted, so I don't see why calling through a function should offer better performance.


I don't have any immediate insight for the difference in performance you're seeing (my hunch is that it's to do with where and how memory is being allocated), but I would suggest you ask about it on the Go-Nuts mailing list: https://groups.google.com/forum/?fromgroups#!forum/golang-nu...

The Go authors should be very receptive and informative on such an issue.


Unrelated to your question, but your code doesn't traverse the array. All it does is:

      for i := 0; i < len(ints); i++ {
        sum += i
      }


Sorry, that was a typo. The actual test program I use is correct. It was obviously supposed to be:

    sum += ints[i]


Not relevant to the performance test, but you can do this:

    for index, value := range mySlice { }

and you get the index and the value, or you can do

    for _, value := range mySlice { }

if you only need the value.


Not sure why there's a difference in performance between the function and non-function versions, particularly since this looks inline-able.

Two theories on why gccgo is doing better: maybe it's unrolling the loop and avoiding a bunch of jumps. Alternatively, it could be using SSE, but that theory is much less likely, because I would expect it to be four times faster, not twice as fast. gc doesn't use SSE instructions yet.


I'm not seeing any difference between the two functions using the 1.0.2 x64 gc compiler. Here's the code I used (slightly modified to compile): http://play.golang.org/p/gcf0BOGncQ

For more complex functions, you could try profiling using the runtime/pprof package. More info here: http://blog.golang.org/2011/06/profiling-go-programs.html

If you want to dig deeper and are not averse to assembly, you can pass -gcflags -S to 'go build' and see the compiler output. Here's what I got for the above code:

  --- prog list "TraverseInline" ---
  0000 (/tmp/tmp.rkM3kSJ6Hd/trav.go:10) TEXT    TraverseInline+0(SB),$40-8
  0001 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVQ    $type.[]int+0(SB),(SP)
  0002 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVQ    $300000000,8(SP)
  0003 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVQ    $300000000,16(SP)
  0004 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) CALL    ,runtime.makeslice+0(SB)
  0005 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVQ    24(SP),BX
  0006 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVL    32(SP),SI
  0007 (/tmp/tmp.rkM3kSJ6Hd/trav.go:11) MOVL    36(SP),BX
  0008 (/tmp/tmp.rkM3kSJ6Hd/trav.go:12) MOVL    $0,DX
  0009 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) MOVL    $0,AX
  0010 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) JMP     ,12
  0011 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) INCL    ,AX
  0012 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) CMPL    AX,SI
  0013 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) JGE     ,16
  0014 (/tmp/tmp.rkM3kSJ6Hd/trav.go:14) ADDL    AX,DX
  0015 (/tmp/tmp.rkM3kSJ6Hd/trav.go:13) JMP     ,11
  0016 (/tmp/tmp.rkM3kSJ6Hd/trav.go:16) MOVL    DX,.noname+0(FP)
  0017 (/tmp/tmp.rkM3kSJ6Hd/trav.go:16) RET     ,

  --- prog list "TraverseArray" ---
  0018 (/tmp/tmp.rkM3kSJ6Hd/trav.go:19) TEXT    TraverseArray+0(SB),$0-24
  0019 (/tmp/tmp.rkM3kSJ6Hd/trav.go:20) MOVL    $0,DX
  0020 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) MOVL    $0,AX
  0021 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) MOVL    ints+8(FP),SI
  0022 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) JMP     ,24
  0023 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) INCL    ,AX
  0024 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) CMPL    AX,SI
  0025 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) JGE     ,28
  0026 (/tmp/tmp.rkM3kSJ6Hd/trav.go:22) ADDL    AX,DX
  0027 (/tmp/tmp.rkM3kSJ6Hd/trav.go:21) JMP     ,23
  0028 (/tmp/tmp.rkM3kSJ6Hd/trav.go:24) MOVL    DX,.noname+16(FP)
  0029 (/tmp/tmp.rkM3kSJ6Hd/trav.go:24) RET     ,

  --- prog list "TraverseWithFunc" ---
  0030 (/tmp/tmp.rkM3kSJ6Hd/trav.go:27) TEXT    TraverseWithFunc+0(SB),$40-8
  0031 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVQ    $type.[]int+0(SB),(SP)
  0032 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVQ    $300000000,8(SP)
  0033 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVQ    $300000000,16(SP)
  0034 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) CALL    ,runtime.makeslice+0(SB)
  0035 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVQ    24(SP),DX
  0036 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVL    32(SP),CX
  0037 (/tmp/tmp.rkM3kSJ6Hd/trav.go:28) MOVL    36(SP),AX
  0038 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) LEAQ    (SP),BX
  0039 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) MOVQ    DX,(BX)
  0040 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) MOVL    CX,8(BX)
  0041 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) MOVL    AX,12(BX)
  0042 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) CALL    ,TraverseArray+0(SB)
  0043 (/tmp/tmp.rkM3kSJ6Hd/trav.go:29) MOVL    16(SP),AX
  0044 (/tmp/tmp.rkM3kSJ6Hd/trav.go:30) MOVL    AX,.noname+0(FP)
  0045 (/tmp/tmp.rkM3kSJ6Hd/trav.go:30) RET     ,


Here is my test, with test output for gc and gccgo:

https://gist.github.com/3053930

The numbers are from Go 1.0 as that's what is available on Ubuntu Precise, my test box. (Getting similar numbers with 1.0.2 on a different box where I don't have gccgo.)

I realized that the first loop was using "i < len(ints)" as a loop test, which turns out to be more expensive than I thought, and the compiler doesn't optimize it into a constant (which would be expecting too much, I guess). After rewriting the test, the function call case is only slightly faster, although it is still significantly faster with gccgo.


Is this patch going to be part of LLVM? Is one out of luck if they want to implement a precise collector in LLVM today?


I've talked with Chris Lattner about upstreaming it and the LLVM developers seem open to it when it's ready. In the meantime you can find the most recent work here:

https://github.com/elliottslaughter/llvm/tree/noteroots-ir

Currently, you are mostly out of luck if you try to use mainline, unless you're okay with the llvm.gcroot intrinsics, which pin things to the stack so they can never live in registers (causing a corresponding performance loss). Probably your best bet is to do what Go will do once the patch enneff was referring to is merged: scan conservatively on the stack and precisely on the heap. You can still get false positives and leaks, and you can't do moving GC that way, but it's better than fully conservative GC.

By far, the hardest part of GC is precisely marking the stack and registers. Yet you have to be able to do it if you want to prevent leaks. Our goal is to put in the hard work so that new languages will be able to have proper GC without having to roll a custom code generator or target the JVM or CLR.


> Our goal is to put in the hard work so that new languages will be able to have proper GC without having to roll a custom code generator or target the JVM or CLR.

As someone who has ambitions of writing a fun little language this is exactly what I want out of LLVM. Can't wait for your patches to hopefully get to mainline.


32bit memleak is by far the main reason I stopped looking at go. Many people still have plenty of 32 bit hosts.

When will this be in the official release??


It'll be in Go 1.1, around the end of the year.



