Go -调度器
JulianneRit
6年前
<p>写得稍微有点乱,主要是按自己看代码的顺序来记录的,也不是出书,就这样吧。</p> <p>PS: 新人不推荐刚学 Golang 就去看调度器代码,这部分代码个人觉得写得很乱。</p> <h2>调度</h2> <h2>基本数据结构</h2> <p>goroutine 在 runtime 中的数据结构:</p> <pre> <code class="language-go">// stack 描述的是 Go 的执行栈,下界和上界分别为 [lo, hi] // 如果从传统内存布局的角度来讲,Go 的栈实际上是分配在 C 语言中的堆区的 // 所以才能比 ulimit -s 的 stack size 还要大(1GB) type stack struct { lo uintptr hi uintptr } // g 的运行现场 type gobuf struct { sp uintptr // sp 寄存器 pc uintptr // pc 寄存器 g guintptr // g 指针 ctxt unsafe.Pointer // 这个似乎是用来辅助 gc 的 ret sys.Uintreg lr uintptr // 这是在 arm 上用的寄存器,不用关心 bp uintptr // 开启 GOEXPERIMENT=framepointer,才会有这个 } type g struct { // 简单数据结构,lo 和 hi 成员描述了栈的下界和上界内存地址 stack stack // 在函数的栈增长 prologue 中用 sp 寄存器和 stackguard0 来做比较 // 如果 sp 比 stackguard0 小(因为栈向低地址方向增长),那么就触发栈拷贝和调度 // 正常情况下 stackguard0 = stack.lo + StackGuard // 不过 stackguard0 在需要进行调度时,会被修改为 StackPreempt // 以触发抢占s stackguard0 uintptr // stackguard1 是在 C 栈增长 prologue 作对比的对象 // 在 g0 和 gsignal 栈上,其值为 stack.lo+StackGuard // 在其它的栈上这个值是 ~0(按 0 取反)以触发 morestack 调用(并 crash) stackguard1 uintptr _panic *_panic _defer *_defer m *m // 当前与 g 绑定的 m sched gobuf // goroutine 的现场 syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc stktopsp uintptr // expected sp at top of stack, to check in traceback param unsafe.Pointer // wakeup 时的传入参数 atomicstatus uint32 stackLock uint32 // sigprof/scang lock; TODO: fold in to atomicstatus goid int64 // goroutine id waitsince int64 // g 被阻塞之后的近似时间 waitreason string // if status==Gwaiting schedlink guintptr preempt bool // 抢占标记,这个为 true 时,stackguard0 是等于 stackpreempt 的 throwsplit bool // must not split stack raceignore int8 // ignore race detection events sysblocktraced bool // StartTrace has emitted EvGoInSyscall about this goroutine sysexitticks int64 // syscall 返回之后的 cputicks,用来做 tracing traceseq uint64 // trace event sequencer tracelastp puintptr // last P emitted an event for this goroutine lockedm muintptr // 如果调用了 LockOsThread,那么这个 g 会绑定到某个 m 上 sig uint32 writebuf []byte sigcode0 uintptr sigcode1 uintptr sigpc uintptr gopc uintptr // 创建该 goroutine 的语句的指令地址 startpc uintptr // goroutine 函数的指令地址 racectx uintptr waiting *sudog // sudog structures this g is waiting on (that have a valid elem ptr); in lock order cgoCtxt []uintptr // cgo traceback context labels unsafe.Pointer // profiler labels timer *timer // time.Sleep 缓存的定时器 selectDone uint32 // 该 g 是否正在参与 select,是否已经有人从 select 中胜出 }</code></pre> <p>当 g 遇到阻塞,或需要等待的场景时,会被打包成 sudog 这样一个结构。一个 g 可能被打包为多个 sudog 分别挂在不同的等待队列上:</p> <pre> <code class="language-go">// sudog 代表在等待列表里的 g,比如向 channel 发送/接收内容时 // 之所以需要 sudog 是因为 g 和同步对象之间的关系是多对多的 // 一个 g 可能会在多个等待队列中,所以一个 g 可能被打包为多个 sudog // 多个 g 也可以等待在同一个同步对象上 // 因此对于一个同步对象就会有很多 sudog 了 // sudog 是从一个特殊的池中进行分配的。用 acquireSudog 和 releaseSudog 来分配和释放 sudog type sudog struct { // 之后的这些字段都是被该 g 所挂在的 channel 中的 hchan.lock 来保护的 // shrinkstack depends on // this for sudogs involved in channel ops. g *g // isSelect 表示一个 g 是否正在参与 select 操作 // 所以 g.selectDone 必须用 CAS 来操作,以胜出唤醒的竞争 isSelect bool next *sudog prev *sudog elem unsafe.Pointer // data element (may point to stack) // 下面这些字段则永远都不会被并发访问 // 对于 channel 来说,waitlink 只会被 g 访问 // 对于信号量来说,所有的字段,包括上面的那些字段都只在持有 semaRoot 锁时才可以访问 acquiretime int64 releasetime int64 ticket uint32 parent *sudog // semaRoot binary tree waitlink *sudog // g.waiting list or semaRoot waittail *sudog // semaRoot c *hchan // channel }</code></pre> <p>线程在 runtime 中的结构,对应一个 pthread,pthread 也会对应唯一的内核线程(task_struct):</p> <pre> <code class="language-go">type m struct { g0 *g // 用来执行调度指令的 goroutine morebuf gobuf // gobuf arg to morestack divmod uint32 // div/mod denominator for arm - known to liblink // Fields not known to debuggers. procid uint64 // for debuggers, but offset not hard-coded gsignal *g // signal-handling g goSigStack gsignalStack // Go-allocated signal handling stack sigmask sigset // storage for saved signal mask tls [6]uintptr // thread-local storage (for x86 extern register) mstartfn func() curg *g // 当前运行的用户 goroutine caughtsig guintptr // goroutine running during fatal signal p puintptr // attached p for executing go code (nil if not executing go code) nextp puintptr id int64 mallocing int32 throwing int32 preemptoff string // 该字段不等于空字符串的话,要保持 curg 始终在这个 m 上运行 locks int32 softfloat int32 dying int32 profilehz int32 helpgc int32 spinning bool // m 失业了,正在积极寻找工作~ blocked bool // m 正阻塞在 note 上 inwb bool // m 正在执行 write barrier newSigstack bool // minit on C thread called sigaltstack printlock int8 incgo bool // m 正在执行 cgo call freeWait uint32 // if == 0, safe to free g0 and delete m (atomic) fastrand [2]uint32 needextram bool traceback uint8 ncgocall uint64 // cgo 调用总计数 ncgo int32 // 当前正在执行的 cgo 订单计数 cgoCallersUse uint32 // if non-zero, cgoCallers in use temporarily cgoCallers *cgoCallers // cgo traceback if crashing in cgo call park note alllink *m // on allm schedlink muintptr mcache *mcache lockedg guintptr createstack [32]uintptr // stack that created this thread. freglo [16]uint32 // d[i] lsb and f[i] freghi [16]uint32 // d[i] msb and f[i+16] fflag uint32 // floating point compare flags lockedExt uint32 // tracking for external LockOSThread lockedInt uint32 // tracking for internal lockOSThread nextwaitm muintptr // 正在等待锁的下一个 m waitunlockf unsafe.Pointer // todo go func(*g, unsafe.pointer) bool waitlock unsafe.Pointer waittraceev byte waittraceskip int startingtrace bool syscalltick uint32 thread uintptr // thread handle freelink *m // on sched.freem // these are here because they are too large to be on the stack // of low-level NOSPLIT functions. libcall libcall libcallpc uintptr // for cpu profiler libcallsp uintptr libcallg guintptr syscall libcall // 存储 windows 平台的 syscall 参数 mOS }</code></pre> <p>抽象数据结构,可以认为是 processor 的抽象,代表了任务执行时的上下文,m 必须获得 p 才能执行:</p> <pre> <code class="language-go">type p struct { lock mutex id int32 status uint32 // one of pidle/prunning/... link puintptr schedtick uint32 // 每次调用 schedule 时会加一 syscalltick uint32 // 每次系统调用时加一 sysmontick sysmontick // 上次 sysmon 观察到的 tick 时间 m muintptr // 和相关联的 m 的反向指针,如果 p 是 idle 的话,那这个指针是 nil mcache *mcache racectx uintptr deferpool [5][]*_defer // pool of available defer structs of different sizes (see panic.go) deferpoolbuf [5][32]*_defer // Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen. goidcache uint64 goidcacheend uint64 // runnable 状态的 goroutine。访问时是不加锁的 runqhead uint32 runqtail uint32 runq [256]guintptr // runnext 非空时,代表的是一个 runnable 状态的 G, // 这个 G 是被 当前 G 修改为 ready 状态的, // 并且相比在 runq 中的 G 有更高的优先级 // 如果当前 G 的还有剩余的可用时间,那么就应该运行这个 G // 运行之后,该 G 会继承当前 G 的剩余时间 // If a set of goroutines is locked in a // communicate-and-wait pattern, this schedules that set as a // unit and eliminates the (potentially large) scheduling // latency that otherwise arises from adding the ready'd // goroutines to the end of the run queue. runnext guintptr // Available G's (status == Gdead) gfree *g gfreecnt int32 sudogcache []*sudog sudogbuf [128]*sudog tracebuf traceBufPtr // traceSweep indicates the sweep events should be traced. // This is used to defer the sweep start event until a span // has actually been swept. traceSweep bool // traceSwept and traceReclaimed track the number of bytes // swept and reclaimed by sweeping in the current sweep loop. traceSwept, traceReclaimed uintptr palloc persistentAlloc // per-P to avoid mutex // Per-P GC state gcAssistTime int64 // Nanoseconds in assistAlloc gcFractionalMarkTime int64 // Nanoseconds in fractional mark worker gcBgMarkWorker guintptr gcMarkWorkerMode gcMarkWorkerMode // 当前标记 worker 的开始时间,单位纳秒 gcMarkWorkerStartTime int64 // gcw is this P's GC work buffer cache. The work buffer is // filled by write barriers, drained by mutator assists, and // disposed on certain GC state transitions. gcw gcWork // wbBuf is this P's GC write barrier buffer. // // TODO: Consider caching this in the running G. wbBuf wbBuf runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point pad [sys.CacheLineSize]byte }</code></pre> <p>全局调度器,全局只有一个 schedt 类型的实例:</p> <pre> <code class="language-go">type schedt struct { // 下面两个变量需以原子访问访问。保持在 struct 顶部,以使其在 32 位系统上可以对齐 goidgen uint64 lastpoll uint64 lock mutex // 当修改 nmidle,nmidlelocked,nmsys,nmfreed 这些数值时 // 需要记得调用 checkdead midle muintptr // idle m's waiting for work nmidle int32 // 当前等待工作的空闲 m 计数 nmidlelocked int32 // 当前等待工作的被 lock 的 m 计数 mnext int64 // 当前预缴创建的 m 数,并且该值会作为下一个创建的 m 的 ID maxmcount int32 // 允许创建的最大的 m 数量 nmsys int32 // number of system m's not counted for deadlock nmfreed int64 // cumulative number of freed m's ngsys uint32 // number of system goroutines; updated atomically pidle puintptr // 空闲 p's npidle uint32 nmspinning uint32 // See "Worker thread parking/unparking" comment in proc.go. // 全局的可运行 g 队列 runqhead guintptr runqtail guintptr runqsize int32 // dead G 的全局缓存 gflock mutex gfreeStack *g gfreeNoStack *g ngfree int32 // sudog 结构的集中缓存 sudoglock mutex sudogcache *sudog // 不同大小的可用的 defer struct 的集中缓存池 deferlock mutex deferpool [5]*_defer // 被设置了 m.exited 标记之后的 m,这些 m 正在 freem 这个链表上等待被 free // 链表用 m.freelink 字段进行链接 freem *m gcwaiting uint32 // gc is waiting to run stopwait int32 stopnote note sysmonwait uint32 sysmonnote note // safepointFn should be called on each P at the next GC // safepoint if p.runSafePointFn is set. safePointFn func(*p) safePointWait int32 safePointNote note profilehz int32 // cpu profiling rate procresizetime int64 // 上次修改 gomaxprocs 的纳秒时间 totaltime int64 // ∫gomaxprocs dt up to procresizetime }</code></pre> <h2>g/p/m 的关系</h2> <p>Go 实现了所谓的 M:N 模型,执行用户代码的 goroutine 可以认为都是对等的 goroutine。不考虑 g0 和 gsignal 的话,我们可以简单地认为调度就是将 m 绑定到 p,然后在 m 中不断循环执行调度函数(runtime.schedule),寻找可用的 g 来执行,下图为 m 绑定到 p 时,可能得到的 g 的来源:</p> <pre> <code class="language-go">+--------------+ | binded +-------------------------------------+ +-------+------+ | +------------------------------------+ | v +------------------------------------+ | | | +------------------------------------+ | | | +------------------+ | | | | | +------------------+ | | | Local Run Queue | | | | +------------------+ | | | Global Run Queue | | | other P +-+-+-+-+-+-+-+-+--+ | | | | Local Run Queue | | | schedt +--+-+-+-+-+-+-+---+ | | |G|G|G|G|G|G|G| | | | P +-+-+-+-+-+-+-+-+--+ | | |G|G|G|G|G|G| | | +-+-+-+-+-+-+-+ | | | |G|G|G|G|G|G|G| | | +-+-+-+-+-+-+ | | ^ | | | +-+-+-+-+-+-+-+ | | ^ | +----------------+-------------------+ | | ^ | +----------------+-------------------+ | | +----------------+-------------------+ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | v | | +------+-------+ .-. +----------------+ | | | steal +----------------------------( M )-----+ runqget +-----------------+ | +--------------+ `-' +----------------+ | | | | +-----------+-----+ +---------------------------------------------------------------------------+ globrunqget | | +-----------------+ | | | | | | +----------+--------+ | get netpoll g | +----------+--------+ | | | | | +--------------+--------------------+ | | | | | | | netpoll v | | +-+-+-+-+ | | |G|G|G|G| | | +-+-+-+-+ | | | +-----------------------------------+</code></pre> <p>这张图展示了 g、p、m 三者之间的大致关系。m 是执行实体,对应的是操作系统线程。可以看到 m 会从绑定的 p 的本地队列、sched 中的全局队列、netpoll 中获取可运行的 g,实在找不着还会去其它的 p 那里去偷。</p> <h2>p 如何初始化</h2> <p>程序启动时,会依次调用:</p> <pre> <code class="language-go">graph TD runtime.schedinit --> runtime.procresize</code></pre> <p>在 procresize 中会将全局 p 数组初始化,并将这些 p 串成链表放进 sched 全局调度器的 pidle 队列中:</p> <pre> <code class="language-go">for i := nprocs - 1; i >= 0; i-- { p := allp[i] // ... // 设置 p 的状态 p.status = _Pidle // 初始化时,所有 p 的 runq 都是空的,所以一定会走这个 if if runqempty(p) { // 将 p 放到全局调度器的 pidle 队列中 pidleput(p) } else { // ... } }</code></pre> <p>pidleput 也比较简单,没啥可说的:</p> <pre> <code class="language-go">func pidleput(_p_ *p) { if !runqempty(_p_) { throw("pidleput: P has non-empty run queue") } // 简单的链表操作 _p_.link = sched.pidle sched.pidle.set(_p_) // pidle count + 1 atomic.Xadd(&sched.npidle, 1) }</code></pre> <p>所有 p 在程序启动的时候就已经被初始化完毕了,除非手动调用 runtime.GOMAXPROCS。</p> <pre> <code class="language-go">func GOMAXPROCS(n int) int { lock(&sched.lock) ret := int(gomaxprocs) unlock(&sched.lock) if n <= 0 || n == ret { return ret } stopTheWorld("GOMAXPROCS") // newprocs will be processed by startTheWorld newprocs = int32(n) startTheWorld() return ret }</code></pre> <p>在 startTheWorld 中会调用 procresize。</p> <h2>g 如何创建</h2> <p>在用户代码里一般这么写:</p> <pre> <code class="language-go">go func() { // do the stuff }()</code></pre> <p>实际上会被翻译成 runtime.newproc ,特权语法只是个语法糖。如果你要在其它语言里实现类似的东西,只要实现编译器翻译之后的内容就好了。具体流程:</p> <pre> <code class="language-go">graph TD runtime.newproc --> runtime.newproc1</code></pre> <p>newproc 干的事情也比较简单</p> <pre> <code class="language-go">func newproc(siz int32, fn *funcval) { // add 是一个指针运算,跳过函数指针 // 把栈上的参数起始地址找到 argp := add(unsafe.Pointer(&fn), sys.PtrSize) pc := getcallerpc() systemstack(func() { newproc1(fn, (*uint8)(argp), siz, pc) }) } // funcval 是一个变长结构,第一个成员是函数指针 // 所以上面的 add 是跳过这个 fn type funcval struct { fn uintptr // variable-size, fn-specific data here }</code></pre> <p>runtime 里比较常见的 getcallerpc 和 getcallersp,代码里的注释写的比较明白了:</p> <pre> <code class="language-go">// For example: // // func f(arg1, arg2, arg3 int) { // pc := getcallerpc() // sp := getcallersp(unsafe.Pointer(&arg1)) //} // // These two lines find the PC and SP immediately following // the call to f (where f will return). //</code></pre> <p>getcallerpc 返回的是调用函数之后的那条程序指令的地址,即 callee 函数返回时要执行的下一条指令的地址。</p> <p>systemstack 在 runtime 中用的也比较多,其功能为让 m 切换到 g0 上执行各种调度函数。至于啥是 g0,在讲 m 的时候再说。</p> <p>newproc1 的工作流程也比较简单:</p> <pre> <code class="language-go">graph TD newproc1 --> newg newg[gfget] --> nil{is nil?} nil -->|yes|E[init stack] nil -->|no|C[malg] C --> D[set g status=> idle->dead] D --> allgadd E --> G[set g status=> dead-> runnable] allgadd --> G G --> runqput</code></pre> <p>删掉了不关心的细节后的代码:</p> <pre> <code class="language-go">func newproc1(fn *funcval, argp *uint8, narg int32, callerpc uintptr) { _g_ := getg() if fn == nil { _g_.m.throwing = -1 // do not dump full stacks throw("go of nil func value") } _g_.m.locks++ // disable preemption because it can be holding p in a local var siz := narg siz = (siz + 7) &^ 7 _p_ := _g_.m.p.ptr() newg := gfget(_p_) if newg == nil { newg = malg(_StackMin) casgstatus(newg, _Gidle, _Gdead) allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack. } totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize // extra space in case of reads slightly beyond frame totalSize += -totalSize & (sys.SpAlign - 1) // align to spAlign sp := newg.stack.hi - totalSize spArg := sp // 初始化 g,g 的 gobuf 现场,g 的 m 的 curg // 以及各种寄存器 memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched)) newg.sched.sp = sp newg.stktopsp = sp newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function newg.sched.g = guintptr(unsafe.Pointer(newg)) gostartcallfn(&newg.sched, fn) newg.gopc = callerpc newg.startpc = fn.fn if _g_.m.curg != nil { newg.labels = _g_.m.curg.labels } casgstatus(newg, _Gdead, _Grunnable) newg.goid = int64(_p_.goidcache) _p_.goidcache++ runqput(_p_, newg, true) if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 && mainStarted { wakep() } _g_.m.locks-- if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack _g_.stackguard0 = stackPreempt } }</code></pre> <p>所以 go func 执行的结果是调用 runqput 将 g 放进了执行队列。但在放队列之前还做了点小动作:</p> <pre> <code class="language-go">newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function</code></pre> <h3>gostartcallfn</h3> <pre> <code class="language-go">// adjust Gobuf as if it executed a call to fn // and then did an immediate gosave. func gostartcallfn(gobuf *gobuf, fv *funcval) { var fn unsafe.Pointer if fv != nil { fn = unsafe.Pointer(fv.fn) } else { fn = unsafe.Pointer(funcPC(nilfunc)) } gostartcall(gobuf, fn, unsafe.Pointer(fv)) } // adjust Gobuf as if it executed a call to fn with context ctxt // and then did an immediate gosave. func gostartcall(buf *gobuf, fn, ctxt unsafe.Pointer) { sp := buf.sp if sys.RegSize > sys.PtrSize { sp -= sys.PtrSize *(*uintptr)(unsafe.Pointer(sp)) = 0 } sp -= sys.PtrSize *(*uintptr)(unsafe.Pointer(sp)) = buf.pc // 注意这里,这个,这里的 buf.pc 实际上是 goexit 的 pc buf.sp = sp buf.pc = uintptr(fn) buf.ctxt = ctxt }</code></pre> <p>在 gostartcall 中把 newproc1 时设置到 buf.pc 中的 goexit 的函数地址放到了 goroutine 的栈顶,然后重新设置 buf.pc 为 goroutine 函数的位置。这样做的目的是为了在执行完任何 goroutine 的函数时,通过 RET 指令,都能从栈顶把 sp 保存的 goexit 的指令 pop 到 pc 寄存器,效果相当于任何 goroutine 执行函数执行完之后,都会去执行 runtime.goexit,完成一些清理工作后再进入 schedule。</p> <p>在之后的 m 的 schedule 讲解中会看到更详细的调度循环过程。</p> <h3>runqput</h3> <p>因为是放 runq 而不是直接执行,因而什么时候开始执行并不是用户代码能决定得了的。再看看 runqput 这个函数:</p> <pre> <code class="language-go">// runqput 尝试把 g 放到本地执行队列中 // next 参数如果是 false 的话,runqput 会将 g 放到运行队列的尾部 // If next if false, runqput adds g to the tail of the runnable queue. // If next is true, runqput puts g in the _p_.runnext slot. // If the run queue is full, runnext puts g on the global queue. // Executed only by the owner P. func runqput(_p_ *p, gp *g, next bool) { if randomizeScheduler && next && fastrand()%2 == 0 { next = false } if next { retryNext: oldnext := _p_.runnext if !_p_.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) { goto retryNext } if oldnext == 0 { return } // 把之前的 runnext 踢到正常的 runq 中 gp = oldnext.ptr() } retry: h := atomic.Load(&_p_.runqhead) // load-acquire, synchronize with consumers t := _p_.runqtail if t-h < uint32(len(_p_.runq)) { _p_.runq[t%uint32(len(_p_.runq))].set(gp) atomic.Store(&_p_.runqtail, t+1) // store-release, makes the item available for consumption return } if runqputslow(_p_, gp, h, t) { return } // 队列没有满的话,上面的 put 操作会成功 goto retry }</code></pre> <h3>runqputslow</h3> <pre> <code class="language-go">// 因为 slow,所以会一次性把本地队列里的多个 g (包含当前的这个) 放到全局队列 // 只会被 g 的 owner P 执行 func runqputslow(_p_ *p, gp *g, h, t uint32) bool { var batch [len(_p_.runq)/2 + 1]*g // 先从本地队列抓一批 g n := t - h n = n / 2 if n != uint32(len(_p_.runq)/2) { throw("runqputslow: queue is not full") } for i := uint32(0); i < n; i++ { batch[i] = _p_.runq[(h+i)%uint32(len(_p_.runq))].ptr() } if !atomic.Cas(&_p_.runqhead, h, h+n) { // cas-release, commits consume return false } batch[n] = gp if randomizeScheduler { for i := uint32(1); i <= n; i++ { j := fastrandn(i + 1) batch[i], batch[j] = batch[j], batch[i] } } // 把这些 goroutine 构造成链表 for i := uint32(0); i < n; i++ { batch[i].schedlink.set(batch[i+1]) } // 将链表放到全局队列中 lock(&sched.lock) globrunqputbatch(batch[0], batch[n], int32(n+1)) unlock(&sched.lock) return true }</code></pre> <p>操作全局 sched 时,需要获取全局 sched.lock 锁,全局锁争抢的开销较大,所以才称之为 slow。p 和 g 在 m 中交互时,因为现场永远是单线程,所以很多时候不用加锁。</p> <h2>m 工作机制</h2> <p>在 runtime 中有三种线程,一种是主线程,一种是用来跑 sysmon 的线程,一种是普通的用户线程。主线程在 runtime 由对应的全局变量: runtime.m0 来表示。用户线程就是普通的线程了,和 p 绑定,执行 g 中的任务。虽然说是有三种,实际上前两种线程整个 runtime 就只有一个实例。用户线程才会有很多实例。</p> <h3>主线程 m0</h3> <p>主线程中用来跑 runtime.main ,流程线性执行,没有跳转:</p> <pre> <code class="language-go">graph TD runtime.main --> A[init max stack size] A --> B[systemstack execute -> newm -> sysmon] B --> runtime.lockOsThread runtime.lockOsThread --> runtime.init runtime.init --> runtime.gcenable runtime.gcenable --> main.init main.init --> main.main</code></pre> <h3>sysmon 线程</h3> <p>sysmon 是在 runtime.main 中启动的,不过需要注意的是 sysmon 并不是在 m0 上执行的。因为:</p> <pre> <code class="language-go">systemstack(func() { newm(sysmon, nil) })</code></pre> <p>创建了新的 m,但这个 m 又与普通的线程不一样,因为不需要绑定 p 就可以执行。是与整个调度系统脱离的。</p> <p>sysmon 内部是个死循环,主要负责以下几件事情:</p> <ol> <li> <p>checkdead,检查是否所有 goroutine 都已经锁死,如果是的话,直接调用 runtime.throw,强制退出。这个操作只在启动的时候做一次</p> </li> <li> <p>将 netpoll 返回的结果注入到全局 sched 的任务队列</p> </li> <li> <p>收回因为 syscall 而长时间阻塞的 p,同时抢占那些执行时间过长的 g</p> </li> <li> <p>如果 span 内存闲置超过 5min,那么释放掉</p> </li> </ol> <p>流程图:</p> <pre> <code class="language-go">graph TD sysmon --> usleep usleep --> checkdead checkdead --> |every 10ms|C[netpollinited && lastpoll != 0] C --> |yes|netpoll netpoll --> injectglist injectglist --> retake C --> |no|retake retake --> A[check forcegc needed] A --> B[scavenge heap once in a while] B --> usleep</code></pre> <pre> <code class="language-go">// sysmon 不需要绑定 P 就可以运行,所以不允许 write barriers // //go:nowritebarrierrec func sysmon() { lock(&sched.lock) sched.nmsys++ checkdead() unlock(&sched.lock) // 如果一个 heap span 在一次GC 之后 5min 都没有被使用过 // 那么把它交还给操作系统 scavengelimit := int64(5 * 60 * 1e9) if debug.scavenge > 0 { // Scavenge-a-lot for testing. forcegcperiod = 10 * 1e6 scavengelimit = 20 * 1e6 } lastscavenge := nanotime() nscavenge := 0 lasttrace := int64(0) idle := 0 // how many cycles in succession we had not wokeup somebody delay := uint32(0) for { if idle == 0 { // 初始化时 20us sleep delay = 20 } else if idle > 50 { // start doubling the sleep after 1ms... delay *= 2 } if delay > 10*1000 { // 最多到 10ms delay = 10 * 1000 } usleep(delay) if debug.schedtrace <= 0 && (sched.gcwaiting != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs)) { lock(&sched.lock) if atomic.Load(&sched.gcwaiting) != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs) { atomic.Store(&sched.sysmonwait, 1) unlock(&sched.lock) // Make wake-up period small enough // for the sampling to be correct. maxsleep := forcegcperiod / 2 if scavengelimit < forcegcperiod { maxsleep = scavengelimit / 2 } shouldRelax := true if osRelaxMinNS > 0 { next := timeSleepUntil() now := nanotime() if next-now < osRelaxMinNS { shouldRelax = false } } if shouldRelax { osRelax(true) } notetsleep(&sched.sysmonnote, maxsleep) if shouldRelax { osRelax(false) } lock(&sched.lock) atomic.Store(&sched.sysmonwait, 0) noteclear(&sched.sysmonnote) idle = 0 delay = 20 } unlock(&sched.lock) } // trigger libc interceptors if needed if *cgo_yield != nil { asmcgocall(*cgo_yield, nil) } // 如果 10ms 没有 poll 过 network,那么就 netpoll 一次 lastpoll := int64(atomic.Load64(&sched.lastpoll)) now := nanotime() if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now { atomic.Cas64(&sched.lastpoll, uint64(lastpoll), uint64(now)) gp := netpoll(false) // 非阻塞 -- 返回一个 goroutine 的列表 if gp != nil { // Need to decrement number of idle locked M's // (pretending that one more is running) before injectglist. // Otherwise it can lead to the following situation: // injectglist grabs all P's but before it starts M's to run the P's, // another M returns from syscall, finishes running its G, // observes that there is no work to do and no other running M's // and reports deadlock. incidlelocked(-1) injectglist(gp) incidlelocked(1) } } // 接收在 syscall 状态阻塞的 P // 抢占长时间运行的 G if retake(now) != 0 { idle = 0 } else { idle++ } // 检查是否需要 force GC(两分钟一次的) if t := (gcTrigger{kind: gcTriggerTime, now: now}); t.test() && atomic.Load(&forcegc.idle) != 0 { lock(&forcegc.lock) forcegc.idle = 0 forcegc.g.schedlink = 0 injectglist(forcegc.g) unlock(&forcegc.lock) } // 每过一段时间扫描一次堆 if lastscavenge+scavengelimit/2 < now { mheap_.scavenge(int32(nscavenge), uint64(now), uint64(scavengelimit)) lastscavenge = now nscavenge++ } if debug.schedtrace > 0 && lasttrace+int64(debug.schedtrace)*1000000 <= now { lasttrace = now schedtrace(debug.scheddetail > 0) } } }</code></pre> <p>checkdead</p> <pre> <code class="language-go">// 检查死锁的场景 // 该检查基于当前正在运行的 M 的数量,如果 0,那么就是 deadlock 了 // 检查的时候必须持有 sched.lock 锁 func checkdead() { // 对于 -buildmode=c-shared 或者 -buildmode=c-archive 来说 // 没有 goroutine 正在运行也是 OK 的。因为调用这个库的程序应该是在运行的 if islibrary || isarchive { return } // If we are dying because of a signal caught on an already idle thread, // freezetheworld will cause all running threads to block. // And runtime will essentially enter into deadlock state, // except that there is a thread that will call exit soon. if panicking > 0 { return } run := mcount() - sched.nmidle - sched.nmidlelocked - sched.nmsys if run > 0 { return } if run < 0 { print("runtime: checkdead: nmidle=", sched.nmidle, " nmidlelocked=", sched.nmidlelocked, " mcount=", mcount(), " nmsys=", sched.nmsys, "\n") throw("checkdead: inconsistent counts") } grunning := 0 lock(&allglock) for i := 0; i < len(allgs); i++ { gp := allgs[i] if isSystemGoroutine(gp) { continue } s := readgstatus(gp) switch s &^ _Gscan { case _Gwaiting: grunning++ case _Grunnable, _Grunning, _Gsyscall: unlock(&allglock) print("runtime: checkdead: find g ", gp.goid, " in status ", s, "\n") throw("checkdead: runnable g") } } unlock(&allglock) if grunning == 0 { // possible if main goroutine calls runtime·Goexit() throw("no goroutines (main called runtime.Goexit) - deadlock!") } // Maybe jump time forward for playground. gp := timejump() if gp != nil { casgstatus(gp, _Gwaiting, _Grunnable) globrunqput(gp) _p_ := pidleget() if _p_ == nil { throw("checkdead: no p for timer") } mp := mget() if mp == nil { // There should always be a free M since // nothing is running. throw("checkdead: no m for timer") } mp.nextp.set(_p_) notewakeup(&mp.park) return } getg().m.throwing = -1 // do not dump full stacks throw("all goroutines are asleep - deadlock!") }</code></pre> <p>retake</p> <pre> <code class="language-go">// forcePreemptNS is the time slice given to a G before it is // preempted. const forcePreemptNS = 10 * 1000 * 1000 // 10ms func retake(now int64) uint32 { n := 0 // Prevent allp slice changes. This lock will be completely // uncontended unless we're already stopping the world. lock(&allpLock) // We can't use a range loop over allp because we may // temporarily drop the allpLock. Hence, we need to re-fetch // allp each time around the loop. for i := 0; i < len(allp); i++ { _p_ := allp[i] if _p_ == nil { // 在 procresize 修改了 allp 但还没有创建新的 p 的时候 // 会有这种情况 continue } pd := &_p_.sysmontick s := _p_.status if s == _Psyscall { // 从 syscall 接管 P,如果它进行 syscall 已经经过了一个 sysmon 的 tick(至少 20us) t := int64(_p_.syscalltick) if int64(pd.syscalltick) != t { pd.syscalltick = uint32(t) pd.syscallwhen = now continue } // 一方面如果没有其它工作可做的话,我们不想接管 p // 但另一方面为了避免 sysmon 线程陷入沉睡,我们最终还是会接管这些 p if runqempty(_p_) && atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0 && pd.syscallwhen+10*1000*1000 > now { continue } // 解开 allplock 的锁,然后就可以持有 sched.lock 锁了 unlock(&allpLock) // Need to decrement number of idle locked M's // (pretending that one more is running) before the CAS. // Otherwise the M from which we retake can exit the syscall, // increment nmidle and report deadlock. incidlelocked(-1) if atomic.Cas(&_p_.status, s, _Pidle) { if trace.enabled { traceGoSysBlock(_p_) traceProcStop(_p_) } n++ _p_.syscalltick++ handoffp(_p_) } incidlelocked(1) lock(&allpLock) } else if s == _Prunning { // 如果 G 运行时间太长,那么抢占它 t := int64(_p_.schedtick) if int64(pd.schedtick) != t { pd.schedtick = uint32(t) pd.schedwhen = now continue } if pd.schedwhen+forcePreemptNS > now { continue } preemptone(_p_) } } unlock(&allpLock) return uint32(n) }</code></pre> <h3>普通线程</h3> <p>普通线程就是我们 G/P/M 模型里的 M 了,M 对应的就是操作系统的线程。</p> <p>线程创建</p> <p>上面在创建 sysmon 线程的时候也看到了,创建线程的函数是 newm。</p> <pre> <code class="language-go">graph TD newm --> newm1 newm1 --> newosproc newosproc --> clone</code></pre> <p>最终会走到 linux 创建线程的系统调用 clone ,代码里大段和 cgo 相关的内容我们就不关心了,摘掉 cgo 相关的逻辑后的代码如下:</p> <pre> <code class="language-go">// 创建一个新的 m。该 m 会在启动时调用函数 fn,或者 schedule 函数 // fn 需要是 static 类型,且不能是在堆上分配的闭包。 // 运行 m 时,m.p 是有可能为 nil 的,所以不允许 write barriers //go:nowritebarrierrec func newm(fn func(), _p_ *p) { mp := allocm(_p_, fn) mp.nextp.set(_p_) mp.sigmask = initSigmask newm1(mp) }</code></pre> <p>传入的 p 会被赋值给 m 的 nextp 成员,在 m 执行 schedule 时,会将 nextp 拿出来,进行之后真正的绑定操作(其实就是把 nextp 赋值为 nil,并把这个 nextp 赋值给 m.p,把 m 赋值给 p.m)。</p> <pre> <code class="language-go">func newm1(mp *m) { execLock.rlock() // Prevent process clone. newosproc(mp, unsafe.Pointer(mp.g0.stack.hi)) execLock.runlock() }</code></pre> <pre> <code class="language-go">func newosproc(mp *m, stk unsafe.Pointer) { // Disable signals during clone, so that the new thread starts // with signals disabled. It will enable them in minit. var oset sigset sigprocmask(_SIG_SETMASK, &sigset_all, &oset) ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart))) sigprocmask(_SIG_SETMASK, &oset, nil) if ret < 0 { print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, ")\n") if ret == -_EAGAIN { println("runtime: may need to increase max user processes (ulimit -u)") } throw("newosproc") } }</code></pre> <p>工作流程</p> <p>首先空闲的 m 会被丢进全局调度器的 midle 队列中,在需要 m 的时候,会先从这里取:</p> <pre> <code class="language-go">//go:nowritebarrierrec // 尝试从 midle 列表中获取一个 m // 必须锁全局的 sched // 可能在 STW 期间执行,所以不允许 write barriers func mget() *m { mp := sched.midle.ptr() if mp != nil { sched.midle = mp.schedlink sched.nmidle-- } return mp }</code></pre> <p>取不到的话就会调用之前提到的 newm 来创建新线程,创建的线程是不会被销毁的,哪怕之后不需要这么多 m 了,也就只是会把 m 放在 midle 中。</p> <p>什么时候会创建线程呢,可以追踪一下 newm 的调用方:</p> <pre> <code class="language-go">graph TD main --> |sysmon|newm startTheWorld --> startTheWorldWithSema gcMarkTermination --> startTheWorldWithSema gcStart--> startTheWorldWithSema startTheWorldWithSema --> |helpgc|newm startTheWorldWithSema --> |run p|newm startm --> mget mget --> |if no free m|newm startTemplateThread --> |templateThread|newm LockOsThread --> startTemplateThread main --> |iscgo|startTemplateThread handoffp --> startm wakep --> startm injectglist --> startm</code></pre> <p>基本上来讲,m 都是按需创建的。如果 sched.midle 中没有空闲的 m 了,现在又需要,那么就会去创建一个。</p> <p>创建好的线程需要绑定到 p 之后才会开始执行,执行过程中也可能被剥夺掉 p。比如前面 retake 的流程,就会将 g 的 stackguard0 修改为 stackPreempt,待下一次进入 newstack 时,会判断是否有该抢占标记,有的话,就会放弃运行。这也就是所谓的 协作式抢占 。</p> <p>工作线程执行的内容核心其实就只有俩: schedule() 和 findrunnable() 。</p> <p>schedule</p> <pre> <code class="language-go">graph TD schedule --> A[schedtick%61 == 0] A --> |yes|globrunqget A --> |no|runqget globrunqget --> C[gp == nil] C --> |no|execute C --> |yes|runqget runqget --> B[gp == nil] B --> |no|execute B --> |yes|findrunnable findrunnable --> execute</code></pre> <pre> <code class="language-go">// 调度器调度一轮要执行的函数: 寻找一个 runnable 状态的 goroutine,并 execute 它 // 调度函数是循环,永远都不会返回 func schedule() { _g_ := getg() if _g_.m.locks != 0 { throw("schedule: holding locks") } if _g_.m.lockedg != 0 { stoplockedm() execute(_g_.m.lockedg.ptr(), false) // Never returns. } // 执行 cgo 调用的 g 不能被 schedule 走 // 因为 cgo 调用使用 m 的 g0 栈 if _g_.m.incgo { throw("schedule: in cgo") } top: if sched.gcwaiting != 0 { gcstopm() goto top } if _g_.m.p.ptr().runSafePointFn != 0 { runSafePointFn() } var gp *g var inheritTime bool if trace.enabled || trace.shutdown { gp = traceReader() if gp != nil { casgstatus(gp, _Gwaiting, _Grunnable) traceGoUnpark(gp, 0) } } if gp == nil && gcBlackenEnabled != 0 { gp = gcController.findRunnableGCWorker(_g_.m.p.ptr()) } if gp == nil { // 每调度几次就检查一下全局的 runq 来确保公平 // 否则两个 goroutine 就可以通过互相调用 // 完全占用本地的 runq 了 if _g_.m.p.ptr().schedtick%61 == 0 && sched.runqsize > 0 { lock(&sched.lock) gp = globrunqget(_g_.m.p.ptr(), 1) unlock(&sched.lock) } } if gp == nil { gp, inheritTime = runqget(_g_.m.p.ptr()) if gp != nil && _g_.m.spinning { throw("schedule: spinning with local work") } } if gp == nil { gp, inheritTime = findrunnable() // 在找到 goroutine 之前会一直阻塞下去 } // 当前线程将要执行 goroutine,并且不会再进入 spinning 状态 // 所以如果它被标记为 spinning,我们需要 reset 这个状态 // 可能会重启一个新的 spinning 状态的 M if _g_.m.spinning { resetspinning() } if gp.lockedm != 0 { // Hands off own p to the locked m, // then blocks waiting for a new p. startlockedm(gp) goto top } execute(gp, inheritTime) }</code></pre> <p>m 中所谓的调度循环实际上就是一直在执行下图中的 loop:</p> <pre> <code class="language-go">graph TD schedule --> execute execute --> gogo gogo --> goexit goexit --> goexit1 goexit1 --> goexit0 goexit0 --> schedule</code></pre> <p>execute</p> <pre> <code class="language-go">// Schedules gp to run on the current M. // If inheritTime is true, gp inherits the remaining time in the // current time slice. Otherwise, it starts a new time slice. // Never returns. // // Write barriers are allowed because this is called immediately after // acquiring a P in several places. // //go:yeswritebarrierrec func execute(gp *g, inheritTime bool) { _g_ := getg() // 这个可能是 m 的 g0 casgstatus(gp, _Grunnable, _Grunning) gp.waitsince = 0 gp.preempt = false gp.stackguard0 = gp.stack.lo + _StackGuard if !inheritTime { _g_.m.p.ptr().schedtick++ } _g_.m.curg = gp // 把当前 g 的位置让给 m gp.m = _g_.m // 把 gp 指向 m,建立双向关系 gogo(&gp.sched) }</code></pre> <p>比较简单,绑定 g 和 m,然后 gogo 执行绑定的 g 中的函数。</p> <p>gogo</p> <p>runtime.gogo 是汇编完成的,功能就是执行 go func() 的这个 func() ,可以看到功能主要是把 g 对象的 gobuf 里的内容搬到寄存器里。然后从 gobuf.pc 寄存器存储的指令位置开始继续向后执行。</p> <pre> <code class="language-go">// void gogo(Gobuf*) // restore state from Gobuf; longjmp TEXT runtime·gogo(SB), NOSPLIT, $16-8 MOVQ buf+0(FP), BX // gobuf MOVQ gobuf_g(BX), DX MOVQ 0(DX), CX // make sure g != nil get_tls(CX) MOVQ DX, g(CX) MOVQ gobuf_sp(BX), SP // restore SP MOVQ gobuf_ret(BX), AX MOVQ gobuf_ctxt(BX), DX MOVQ gobuf_bp(BX), BP MOVQ $0, gobuf_sp(BX) // clear to help garbage collector MOVQ $0, gobuf_ret(BX) MOVQ $0, gobuf_ctxt(BX) MOVQ $0, gobuf_bp(BX) MOVQ gobuf_pc(BX), BX JMP BX</code></pre> <p>当然,这里还是有一些和手写汇编不太一样的,看着比较奇怪的地方, gobuf_sp(BX) 这种写法按说标准 plan9 汇编中 gobuf_sp 只是个 symbol ,没有任何偏移量的意思,但这里却用名字来代替了其偏移量,这是怎么回事呢?</p> <p>实际上这是 runtime 的特权,是需要链接器配合完成的,再来看看 gobuf 在 runtime 中的 struct 定义开头部分的注释:</p> <pre> <code class="language-go">// The offsets of sp, pc, and g are known to (hard-coded in) libmach.</code></pre> <p>这下知道怎么回事了吧,链接器会帮助我们把这个换成偏移量。。</p> <p>Goexit</p> <p>Goexit :</p> <pre> <code class="language-go">// Goexit terminates the goroutine that calls it. No other goroutine is affected. // Goexit runs all deferred calls before terminating the goroutine. Because Goexit // is not a panic, any recover calls in those deferred functions will return nil. // // Calling Goexit from the main goroutine terminates that goroutine // without func main returning. Since func main has not returned, // the program continues execution of other goroutines. // If all other goroutines exit, the program crashes. func Goexit() { // Run all deferred functions for the current goroutine. // This code is similar to gopanic, see that implementation // for detailed comments. gp := getg() for { d := gp._defer if d == nil { break } if d.started { if d._panic != nil { d._panic.aborted = true d._panic = nil } d.fn = nil gp._defer = d.link freedefer(d) continue } d.started = true reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz)) if gp._defer != d { throw("bad defer entry in Goexit") } d._panic = nil d.fn = nil gp._defer = d.link freedefer(d) // Note: we ignore recovers here because Goexit isn't a panic } goexit1() } // Finishes execution of the current goroutine. func goexit1() { if raceenabled { racegoend() } if trace.enabled { traceGoEnd() } mcall(goexit0) }</code></pre> <pre> <code class="language-go">// The top-most function running on a goroutine // returns to goexit+PCQuantum. TEXT runtime·goexit(SB),NOSPLIT,$0-0 BYTE $0x90 // NOP CALL runtime·goexit1(SB) // does not return // traceback from goexit1 must hit code range of goexit BYTE $0x90 // NOP</code></pre> <p>mcall :</p> <pre> <code class="language-go">// func mcall(fn func(*g)) // Switch to m->g0's stack, call fn(g). // Fn must never return. It should gogo(&g->sched) // to keep running g. TEXT runtime·mcall(SB), NOSPLIT, $0-8 MOVQ fn+0(FP), DI get_tls(CX) MOVQ g(CX), AX // save state in g->sched MOVQ 0(SP), BX // caller's PC MOVQ BX, (g_sched+gobuf_pc)(AX) LEAQ fn+0(FP), BX // caller's SP MOVQ BX, (g_sched+gobuf_sp)(AX) MOVQ AX, (g_sched+gobuf_g)(AX) MOVQ BP, (g_sched+gobuf_bp)(AX) // switch to m->g0 & its stack, call fn MOVQ g(CX), BX MOVQ g_m(BX), BX MOVQ m_g0(BX), SI CMPQ SI, AX // if g == m->g0 call badmcall JNE 3(PC) MOVQ $runtime·badmcall(SB), AX JMP AX MOVQ SI, g(CX) // g = m->g0 MOVQ (g_sched+gobuf_sp)(SI), SP // sp = m->g0->sched.sp PUSHQ AX MOVQ DI, DX MOVQ 0(DI), DI CALL DI POPQ AX MOVQ $runtime·badmcall2(SB), AX JMP AX RET</code></pre> <p>wakep</p> <pre> <code class="language-go">// Tries to add one more P to execute G's. // Called when a G is made runnable (newproc, ready). func wakep() { // be conservative about spinning threads if !atomic.Cas(&sched.nmspinning, 0, 1) { return } startm(nil, true) } // Schedules some M to run the p (creates an M if necessary). // If p==nil, tries to get an idle P, if no idle P's does nothing. // May run with m.p==nil, so write barriers are not allowed. // If spinning is set, the caller has incremented nmspinning and startm will // either decrement nmspinning or set m.spinning in the newly started M. //go:nowritebarrierrec func startm(_p_ *p, spinning bool) { lock(&sched.lock) if _p_ == nil { _p_ = pidleget() if _p_ == nil { unlock(&sched.lock) if spinning { // The caller incremented nmspinning, but there are no idle Ps, // so it's okay to just undo the increment and give up. if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 { throw("startm: negative nmspinning") } } return } } mp := mget() unlock(&sched.lock) if mp == nil { var fn func() if spinning { // The caller incremented nmspinning, so set m.spinning in the new M. fn = mspinning } newm(fn, _p_) return } if mp.spinning { throw("startm: m is spinning") } if mp.nextp != 0 { throw("startm: m has p") } if spinning && !runqempty(_p_) { throw("startm: p has runnable gs") } // The caller incremented nmspinning, so set m.spinning in the new M. mp.spinning = spinning mp.nextp.set(_p_) notewakeup(&mp.park) }</code></pre> <p>goroutine 挂起</p> <pre> <code class="language-go">// Puts the current goroutine into a waiting state and calls unlockf. // If unlockf returns false, the goroutine is resumed. // unlockf must not access this G's stack, as it may be moved between // the call to gopark and the call to unlockf. func gopark(unlockf func(*g, unsafe.Pointer) bool, lock unsafe.Pointer, reason string, traceEv byte, traceskip int) { mp := acquirem() gp := mp.curg status := readgstatus(gp) if status != _Grunning && status != _Gscanrunning { throw("gopark: bad g status") } mp.waitlock = lock mp.waitunlockf = *(*unsafe.Pointer)(unsafe.Pointer(&unlockf)) gp.waitreason = reason mp.waittraceev = traceEv mp.waittraceskip = traceskip releasem(mp) // can't do anything that might move the G between Ms here. mcall(park_m) } func goready(gp *g, traceskip int) { systemstack(func() { ready(gp, traceskip, true) }) } // Mark gp ready to run. func ready(gp *g, traceskip int, next bool) { if trace.enabled { traceGoUnpark(gp, traceskip) } status := readgstatus(gp) // Mark runnable. _g_ := getg() _g_.m.locks++ // disable preemption because it can be holding p in a local var if status&^_Gscan != _Gwaiting { dumpgstatus(gp) throw("bad g->status in ready") } // status is Gwaiting or Gscanwaiting, make Grunnable and put on runq casgstatus(gp, _Gwaiting, _Grunnable) runqput(_g_.m.p.ptr(), gp, next) if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 { wakep() } _g_.m.locks-- if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in Case we've cleared it in newstack _g_.stackguard0 = stackPreempt } }</code></pre> <pre> <code class="language-go">func notesleep(n *note) { gp := getg() if gp != gp.m.g0 { throw("notesleep not on g0") } ns := int64(-1) if *cgo_yield != nil { // Sleep for an arbitrary-but-moderate interval to poll libc interceptors. ns = 10e6 } for atomic.Load(key32(&n.key)) == 0 { gp.m.blocked = true futexsleep(key32(&n.key), 0, ns) if *cgo_yield != nil { asmcgocall(*cgo_yield, nil) } gp.m.blocked = false } } // One-time notifications. func noteclear(n *note) { n.key = 0 } func notewakeup(n *note) { old := atomic.Xchg(key32(&n.key), 1) if old != 0 { print("notewakeup - double wakeup (", old, ")\n") throw("notewakeup - double wakeup") } futexwakeup(key32(&n.key), 1) }</code></pre> <p>findrunnable</p> <p>findrunnable 比较复杂,流程图先把 gc 相关的省略掉了:</p> <pre> <code class="language-go">graph TD runqget --> A[gp == nil] A --> |no|return A --> |yes|globrunqget globrunqget --> B[gp == nil] B --> |no| return B --> |yes| C[netpollinited && lastpoll != 0] C --> |yes|netpoll netpoll --> K[gp == nil] K --> |no|return K --> |yes|runqsteal C --> |no|runqsteal runqsteal --> D[gp == nil] D --> |no|return D --> |yes|E[globrunqget] E --> F[gp == nil] F --> |no| return F --> |yes| G[check all p's runq] G --> H[runq is empty] H --> |no|runqget H --> |yes|I[netpoll] I --> J[gp == nil] J --> |no| return J --> |yes| stopm stopm --> runqget</code></pre> <pre> <code class="language-go">// 找到一个可执行的 goroutine 来 execute // 会尝试从其它的 P 那里偷 g,从全局队列中拿,或者 network 中 poll func findrunnable() (gp *g, inheritTime bool) { _g_ := getg() // The conditions here and in handoffp must agree: if // findrunnable would return a G to run, handoffp must start // an M. top: _p_ := _g_.m.p.ptr() if sched.gcwaiting != 0 { gcstopm() goto top } if _p_.runSafePointFn != 0 { runSafePointFn() } if fingwait && fingwake { if gp := wakefing(); gp != nil { ready(gp, 0, true) } } if *cgo_yield != nil { asmcgocall(*cgo_yield, nil) } // 本地 runq if gp, inheritTime := runqget(_p_); gp != nil { return gp, inheritTime } // 全局 runq if sched.runqsize != 0 { lock(&sched.lock) gp := globrunqget(_p_, 0) unlock(&sched.lock) if gp != nil { return gp, false } } // Poll network. // netpoll 是我们执行 work-stealing 之前的一个优化 // 如果没有任何的 netpoll 等待者,或者线程被阻塞在 netpoll 中,我们可以安全地跳过这段逻辑 // 如果在阻塞的线程中存在任何逻辑上的竞争(e.g. 已经从 netpoll 中返回,但还没有设置 lastpoll) // 该线程还是会将下面的 netpoll 阻塞住 if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Load64(&sched.lastpoll) != 0 { if gp := netpoll(false); gp != nil { // 非阻塞 // netpoll 返回 goroutine 链表,用 schedlink 连接 injectglist(gp.schedlink.ptr()) casgstatus(gp, _Gwaiting, _Grunnable) if trace.enabled { traceGoUnpark(gp, 0) } return gp, false } } // 从其它 p 那里偷 g procs := uint32(gomaxprocs) if atomic.Load(&sched.npidle) == procs-1 { // GOMAXPROCS=1 或者除了我们其它的 p 都是 idle // 新的工作可能从 syscall/cgocall,网络或者定时器中来。 // 上面这些任务都不会被放到本地的 runq,所有没有可以 stealing 的点 goto stop } // 如果正在自旋的 M 的数量 >= 忙着的 P,那么阻塞 // 这是为了 // 当 GOMAXPROCS 远大于 1,但程序的并行度又很低的时候 // 防止过量的 CPU 消耗 if !_g_.m.spinning && 2*atomic.Load(&sched.nmspinning) >= procs-atomic.Load(&sched.npidle) { goto stop } if !_g_.m.spinning { _g_.m.spinning = true atomic.Xadd(&sched.nmspinning, 1) } for i := 0; i < 4; i++ { for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() { if sched.gcwaiting != 0 { goto top } stealRunNextG := i > 2 // first look for ready queues with more than 1 g if gp := runqsteal(_p_, allp[enum.position()], stealRunNextG); gp != nil { return gp, false } } } stop: // 没有可以干的事情。如果我们正在 GC 的标记阶段,可以安全地扫描和加深对象的颜色, // 这样可以进行空闲时间的标记,而不是直接放弃 P if gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) { _p_.gcMarkWorkerMode = gcMarkWorkerIdleMode gp := _p_.gcBgMarkWorker.ptr() casgstatus(gp, _Gwaiting, _Grunnable) if trace.enabled { traceGoUnpark(gp, 0) } return gp, false } // Before we drop our P, make a snapshot of the allp slice, // which can change underfoot once we no longer block // safe-points. We don't need to snapshot the contents because // everything up to cap(allp) is immutable. allpSnapshot := allp // 返回 P 并阻塞 lock(&sched.lock) if sched.gcwaiting != 0 || _p_.runSafePointFn != 0 { unlock(&sched.lock) goto top } if sched.runqsize != 0 { gp := globrunqget(_p_, 0) unlock(&sched.lock) return gp, false } if releasep() != _p_ { throw("findrunnable: wrong p") } pidleput(_p_) unlock(&sched.lock) // Delicate dance: thread transitions from spinning to non-spinning state, // potentially concurrently with submission of new goroutines. We must // drop nmspinning first and then check all per-P queues again (with // #StoreLoad memory barrier in between). If we do it the other way around, // another thread can submit a goroutine after we've checked all run queues // but before we drop nmspinning; as the result nobody will unpark a thread // to run the goroutine. // If we discover new work below, we need to restore m.spinning as a signal // for resetspinning to unpark a new worker thread (because there can be more // than one starving goroutine). However, if after discovering new work // we also observe no idle Ps, it is OK to just park the current thread: // the system is fully loaded so no spinning threads are required. // Also see "Worker thread parking/unparking" comment at the top of the file. wasSpinning := _g_.m.spinning if _g_.m.spinning { _g_.m.spinning = false if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 { throw("findrunnable: negative nmspinning") } } // 再检查一下所有的 runq for _, _p_ := range allpSnapshot { if !runqempty(_p_) { lock(&sched.lock) _p_ = pidleget() unlock(&sched.lock) if _p_ != nil { acquirep(_p_) if wasSpinning { _g_.m.spinning = true atomic.Xadd(&sched.nmspinning, 1) } goto top } break } } // 再检查 gc 空闲 g if gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) { lock(&sched.lock) _p_ = pidleget() if _p_ != nil && _p_.gcBgMarkWorker == 0 { pidleput(_p_) _p_ = nil } unlock(&sched.lock) if _p_ != nil { acquirep(_p_) if wasSpinning { _g_.m.spinning = true atomic.Xadd(&sched.nmspinning, 1) } // Go back to idle GC check. goto stop } } // poll network if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Xchg64(&sched.lastpoll, 0) != 0 { if _g_.m.p != 0 { throw("findrunnable: netpoll with p") } if _g_.m.spinning { throw("findrunnable: netpoll with spinning") } gp := netpoll(true) // 阻塞到返回为止 atomic.Store64(&sched.lastpoll, uint64(nanotime())) if gp != nil { lock(&sched.lock) _p_ = pidleget() unlock(&sched.lock) if _p_ != nil { acquirep(_p_) injectglist(gp.schedlink.ptr()) casgstatus(gp, _Gwaiting, _Grunnable) if trace.enabled { traceGoUnpark(gp, 0) } return gp, false } injectglist(gp) } } stopm() goto top }</code></pre> <h2>m 和 p 解绑定</h2> <h3>handoffp</h3> <pre> <code class="language-go">graph TD mexit --> A[is m0?] A --> |yes|B[handoffp] A --> |no| C[iterate allm] C --> |m found|handoffp C --> |m not found| throw forEachP --> |p status == syscall| handoffp stoplockedm --> handoffp entersyscallblock --> entersyscallblock_handoff entersyscallblock_handoff --> handoffp retake --> |p status == syscall| handoffp</code></pre> <p>最终会把 p 放回全局的 pidle 队列中:</p> <pre> <code class="language-go">// Hands off P from syscall or locked M. // Always runs without a P, so write barriers are not allowed. //go:nowritebarrierrec func handoffp(_p_ *p) { // handoffp must start an M in any situation where // findrunnable would return a G to run on _p_. // if it has local work, start it straight away if !runqempty(_p_) || sched.runqsize != 0 { startm(_p_, false) return } // if it has GC work, start it straight away if gcBlackenEnabled != 0 && gcMarkWorkAvailable(_p_) { startm(_p_, false) return } // no local work, check that there are no spinning/idle M's, // otherwise our help is not required if atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) == 0 && atomic.Cas(&sched.nmspinning, 0, 1) { // TODO: fast atomic startm(_p_, true) return } lock(&sched.lock) if sched.gcwaiting != 0 { _p_.status = _Pgcstop sched.stopwait-- if sched.stopwait == 0 { notewakeup(&sched.stopnote) } unlock(&sched.lock) return } if _p_.runSafePointFn != 0 && atomic.Cas(&_p_.runSafePointFn, 1, 0) { sched.safePointFn(_p_) sched.safePointWait-- if sched.safePointWait == 0 { notewakeup(&sched.safePointNote) } } if sched.runqsize != 0 { unlock(&sched.lock) startm(_p_, false) return } // If this is the last running P and nobody is polling network, // need to wakeup another M to poll network. if sched.npidle == uint32(gomaxprocs-1) && atomic.Load64(&sched.lastpoll) != 0 { unlock(&sched.lock) startm(_p_, false) return } pidleput(_p_) unlock(&sched.lock) }</code></pre> <h2>g 的状态迁移</h2> <pre> <code class="language-go">graph LR start{newg} --> Gidle Gidle --> |oneNewExtraM|Gdead Gidle --> |newproc1|Gdead Gdead --> |newproc1|Grunnable Gdead --> |needm|Gsyscall Gscanrunning --> |scang|Grunning Grunnable --> |execute|Grunning Gany --> |casgcopystack|Gcopystack Gcopystack --> |todotodo|Grunning Gsyscall --> |dropm|Gdead Gsyscall --> |exitsyscall0|Grunnable Gsyscall --> |exitsyscall|Grunning Grunning --> |goschedImpl|Grunnable Grunning --> |goexit0|Gdead Grunning --> |newstack|Gcopystack Grunning --> |reentersyscall|Gsyscall Grunning --> |entersyscallblock|Gsyscall Grunning --> |markroot|Gwaiting Grunning --> |gcAssistAlloc1|Gwaiting Grunning --> |park_m|Gwaiting Grunning --> |gcMarkTermination|Gwaiting Grunning --> |gcBgMarkWorker|Gwaiting Grunning --> |newstack|Gwaiting Gwaiting --> |gcMarkTermination|Grunning Gwaiting --> |gcBgMarkWorker|Grunning Gwaiting --> |markroot|Grunning Gwaiting --> |gcAssistAlloc1|Grunning Gwaiting --> |newstack|Grunning Gwaiting --> |findRunnableGCWorker|Grunnable Gwaiting --> |ready|Grunnable Gwaiting --> |findrunnable|Grunnable Gwaiting --> |injectglist|Grunnable Gwaiting --> |schedule|Grunnable Gwaiting --> |park_m|Grunnable Gwaiting --> |procresize|Grunnable Gwaiting --> |checkdead|Grunnable</code></pre> <p>图上的 Gany 代表任意状态,GC 时的状态切换比较多,如果只关注正常情况下的状态转换,可以把 markroot、gcMark 之类的先忽略掉。</p> <h2>p 的状态迁移</h2> <pre> <code class="language-go">graph LR Pidle --> |acquirep1|Prunning Psyscall --> |retake|Pidle Psyscall --> |entersyscall_gcwait|Pgcstop Psyscall --> |exitsyscallfast|Prunning Pany --> |gcstopm|Pgcstop Pany --> |forEachP|Pidle Pany --> |releasep|Pidle Pany --> |handoffp|Pgcstop Pany --> |procresize release current p use allp 0|Pidle Pany --> |procresize when init|Pgcstop Pany --> |procresize when free old p| Pdead Pany --> |procresize after resize use current p|Prunning Pany --> |reentersyscall|Psyscall Pany --> |stopTheWorldWithSema|Pgcstop</code></pre> <h2>抢占流程</h2> <p>函数执行是在 goroutine 的栈上,这个栈在函数执行期间是有可能溢出的,我们前面也看到了,如果一个函数用到了栈,会将 stackguard0 和 sp 寄存器进行比较,如果 sp > stackguard0,说明栈已经增长到溢出,因为栈是从内存高地址向低地址方向增长的。</p> <p>那么这个比较过程是在哪里完成的呢?这一步是由编译器完成的,我们看看一个函数编译后的结果,这段代码来自 go-internals:</p> <pre> <code class="language-go">0x0000 TEXT "".main(SB), $24-0 ;; stack-split prologue 0x0000 MOVQ (TLS), CX 0x0009 CMPQ SP, 16(CX) 0x000d JLS 58 0x000f SUBQ $24, SP 0x0013 MOVQ BP, 16(SP) 0x0018 LEAQ 16(SP), BP ;; ...omitted FUNCDATA stuff... 0x001d MOVQ $137438953482, AX 0x0027 MOVQ AX, (SP) ;; ...omitted PCDATA stuff... 0x002b CALL "".add(SB) 0x0030 MOVQ 16(SP), BP 0x0035 ADDQ $24, SP 0x0039 RET ;; stack-split epilogue 0x003a NOP ;; ...omitted PCDATA stuff... 0x003a CALL runtime.morestack_noctxt(SB) 0x003f JMP 0</code></pre> <p>函数开头被插的这段指令,即是将 g struct 中的 stackguard 与 SP 寄存器进行对比,JLS 表示 SP < 16(CX) 的话即跳转。</p> <pre> <code class="language-go">;; stack-split prologue 0x0000 MOVQ (TLS), CX 0x0009 CMPQ SP, 16(CX) 0x000d JLS 58</code></pre> <p>这里因为 CX 寄存器存储的是 g 的起始地址,而 16(CX) 指的是 g 结构体偏移 16 个字节的位置,可以回顾一下 g 结构体定义,16 个字节恰好是跳过了第一个成员 stack(16字节) 之后的 stackguard0 的位置。</p> <p>58 转为 16 进制即是 0x3a。</p> <pre> <code class="language-go">;; stack-split epilogue 0x003a NOP ;; ...omitted PCDATA stuff... 0x003a CALL runtime.morestack_noctxt(SB) 0x003f JMP 0</code></pre> <p>morestack_noctxt:</p> <pre> <code class="language-go">// morestack but not preserving ctxt. TEXT runtime·morestack_noctxt(SB),NOSPLIT,$0 MOVL $0, DX JMP runtime·morestack(SB)</code></pre> <p>morestack:</p> <pre> <code class="language-go">TEXT runtime·morestack(SB),NOSPLIT,$0-0 // Cannot grow scheduler stack (m->g0). get_tls(CX) MOVQ g(CX), BX MOVQ g_m(BX), BX MOVQ m_g0(BX), SI CMPQ g(CX), SI JNE 3(PC) CALL runtime·badmorestackg0(SB) INT $3 // Cannot grow signal stack (m->gsignal). MOVQ m_gsignal(BX), SI CMPQ g(CX), SI JNE 3(PC) CALL runtime·badmorestackgsignal(SB) INT $3 // Called from f. // Set m->morebuf to f's caller. MOVQ 8(SP), AX // f's caller's PC MOVQ AX, (m_morebuf+gobuf_pc)(BX) LEAQ 16(SP), AX // f's caller's SP MOVQ AX, (m_morebuf+gobuf_sp)(BX) get_tls(CX) MOVQ g(CX), SI MOVQ SI, (m_morebuf+gobuf_g)(BX) // Set g->sched to context in f. MOVQ 0(SP), AX // f's PC MOVQ AX, (g_sched+gobuf_pc)(SI) MOVQ SI, (g_sched+gobuf_g)(SI) LEAQ 8(SP), AX // f's SP MOVQ AX, (g_sched+gobuf_sp)(SI) MOVQ BP, (g_sched+gobuf_bp)(SI) MOVQ DX, (g_sched+gobuf_ctxt)(SI) // Call newstack on m->g0's stack. MOVQ m_g0(BX), BX MOVQ BX, g(CX) MOVQ (g_sched+gobuf_sp)(BX), SP CALL runtime·newstack(SB) MOVQ $0, 0x1003 // crash if newstack returns RET</code></pre> <p>newstack:</p> <pre> <code class="language-go">// Called from runtime·morestack when more stack is needed. // Allocate larger stack and relocate to new stack. // Stack growth is multiplicative, for constant amortized cost. // // g->atomicstatus will be Grunning or Gscanrunning upon entry. // If the GC is trying to stop this g then it will set preemptscan to true. // // This must be nowritebarrierrec because it can be called as part of // stack growth from other nowritebarrierrec functions, but the // compiler doesn't check this. // //go:nowritebarrierrec func newstack() { thisg := getg() // TODO: double check all gp. shouldn't be getg(). if thisg.m.morebuf.g.ptr().stackguard0 == stackFork { throw("stack growth after fork") } if thisg.m.morebuf.g.ptr() != thisg.m.curg { print("runtime: newstack called from g=", hex(thisg.m.morebuf.g), "\n"+"\tm=", thisg.m, " m->curg=", thisg.m.curg, " m->g0=", thisg.m.g0, " m->gsignal=", thisg.m.gsignal, "\n") morebuf := thisg.m.morebuf traceback(morebuf.pc, morebuf.sp, morebuf.lr, morebuf.g.ptr()) throw("runtime: wrong goroutine in newstack") } gp := thisg.m.curg if thisg.m.curg.throwsplit { // Update syscallsp, syscallpc in case traceback uses them. morebuf := thisg.m.morebuf gp.syscallsp = morebuf.sp gp.syscallpc = morebuf.pc pcname, pcoff := "(unknown)", uintptr(0) f := findfunc(gp.sched.pc) if f.valid() { pcname = funcname(f) pcoff = gp.sched.pc - f.entry } print("runtime: newstack at ", pcname, "+", hex(pcoff), " sp=", hex(gp.sched.sp), " stack=[", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n", "\tmorebuf={pc:", hex(morebuf.pc), " sp:", hex(morebuf.sp), " lr:", hex(morebuf.lr), "}\n", "\tsched={pc:", hex(gp.sched.pc), " sp:", hex(gp.sched.sp), " lr:", hex(gp.sched.lr), " ctxt:", gp.sched.ctxt, "}\n") thisg.m.traceback = 2 // Include runtime frames traceback(morebuf.pc, morebuf.sp, morebuf.lr, gp) throw("runtime: stack split at bad time") } morebuf := thisg.m.morebuf thisg.m.morebuf.pc = 0 thisg.m.morebuf.lr = 0 thisg.m.morebuf.sp = 0 thisg.m.morebuf.g = 0 // NOTE: stackguard0 may change underfoot, if another thread // is about to try to preempt gp. Read it just once and use that same // value now and below. preempt := atomic.Loaduintptr(&gp.stackguard0) == stackPreempt // Be conservative about where we preempt. // We are interested in preempting user Go code, not runtime code. // If we're holding locks, mallocing, or preemption is disabled, don't // preempt. // This check is very early in newstack so that even the status change // from Grunning to Gwaiting and back doesn't happen in this case. // That status change by itself can be viewed as a small preemption, // because the GC might change Gwaiting to Gscanwaiting, and then // this goroutine has to wait for the GC to finish before continuing. // If the GC is in some way dependent on this goroutine (for example, // it needs a lock held by the goroutine), that small preemption turns // into a real deadlock. if preempt { if thisg.m.locks != 0 || thisg.m.mallocing != 0 || thisg.m.preemptoff != "" || thisg.m.p.ptr().status != _Prunning { // Let the goroutine keep running for now. // gp->preempt is set, so it will be preempted next time. gp.stackguard0 = gp.stack.lo + _StackGuard gogo(&gp.sched) // never return } } if gp.stack.lo == 0 { throw("missing stack in newstack") } sp := gp.sched.sp if sys.ArchFamily == sys.AMD64 || sys.ArchFamily == sys.I386 { // The call to morestack cost a word. sp -= sys.PtrSize } if stackDebug >= 1 || sp < gp.stack.lo { print("runtime: newstack sp=", hex(sp), " stack=[", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n", "\tmorebuf={pc:", hex(morebuf.pc), " sp:", hex(morebuf.sp), " lr:", hex(morebuf.lr), "}\n", "\tsched={pc:", hex(gp.sched.pc), " sp:", hex(gp.sched.sp), " lr:", hex(gp.sched.lr), " ctxt:", gp.sched.ctxt, "}\n") } if sp < gp.stack.lo { print("runtime: gp=", gp, ", gp->status=", hex(readgstatus(gp)), "\n ") print("runtime: split stack overflow: ", hex(sp), " < ", hex(gp.stack.lo), "\n") throw("runtime: split stack overflow") } if preempt { if gp == thisg.m.g0 { throw("runtime: preempt g0") } if thisg.m.p == 0 && thisg.m.locks == 0 { throw("runtime: g is running but p is not") } // Synchronize with scang. casgstatus(gp, _Grunning, _Gwaiting) if gp.preemptscan { for !castogscanstatus(gp, _Gwaiting, _Gscanwaiting) { // Likely to be racing with the GC as // it sees a _Gwaiting and does the // stack scan. If so, gcworkdone will // be set and gcphasework will simply // return. } if !gp.gcscandone { // gcw is safe because we're on the // system stack. gcw := &gp.m.p.ptr().gcw scanstack(gp, gcw) if gcBlackenPromptly { gcw.dispose() } gp.gcscandone = true } gp.preemptscan = false gp.preempt = false casfrom_Gscanstatus(gp, _Gscanwaiting, _Gwaiting) // This clears gcscanvalid. casgstatus(gp, _Gwaiting, _Grunning) gp.stackguard0 = gp.stack.lo + _StackGuard gogo(&gp.sched) // never return } // Act like goroutine called runtime.Gosched. casgstatus(gp, _Gwaiting, _Grunning) gopreempt_m(gp) // never return } // Allocate a bigger segment and move the stack. oldsize := gp.stack.hi - gp.stack.lo newsize := oldsize * 2 if newsize > maxstacksize { print("runtime: goroutine stack exceeds ", maxstacksize, "-byte limit\n") throw("stack overflow") } // The goroutine must be executing in order to call newstack, // so it must be Grunning (or Gscanrunning). casgstatus(gp, _Grunning, _Gcopystack) // The concurrent GC will not scan the stack while we are doing the copy since // the gp is in a Gcopystack status. copystack(gp, newsize, true) if stackDebug >= 1 { print("stack grow done\n") } casgstatus(gp, _Gcopystack, _Grunning) gogo(&gp.sched) }</code></pre> <p>总结一下流程:</p> <pre> <code class="language-go">graph TD start[entering func] --> cmp[sp < stackguard0] cmp --> |yes| morestack_noctxt cmp --> |no|final[execute func] morestack_noctxt --> morestack morestack --> newstack newstack --> preempt</code></pre> <p>抢占都是在 newstack 中完成,但抢占标记是在 Go 源代码中的其它位置来进行标记的:</p> <p>我们来看看 stackPreempt 是在哪些位置赋值给 stackguard0 的:</p> <pre> <code class="language-go">graph LR unlock --> |in case cleared in newstack|restorePreempt ready --> |in case cleared in newstack|restorePreempt startTheWorldWithSema --> |in case cleared in newstack|restorePreempt allocm --> |in case cleared in newstack|restorePreempt exitsyscall --> |in case cleared in newstack|restorePreempt newproc1--> |in case cleared in newstack|restorePreempt releasem --> |in case cleared in newstack|restorePreempt scang --> setPreempt reentersyscall --> setPreempt entersyscallblock --> setPreempt preemptone--> setPreempt enlistWorker --> preemptone retake --> preemptone preemptall --> preemptone freezetheworld --> preemptall stopTheWorldWithSema --> preemptall forEachP --> preemptall startpanic_m --> freezetheworld gcMarkDone --> forEachP</code></pre> <p>可见只有 gc 和 retake 才会去真正地抢占 g,并没有其它的入口,其它的地方就只是恢复一下可能在 newstack 中被清除掉的抢占标记。</p> <p>当然,这里 entersyscall 和 entersyscallblock 比较特殊,虽然这俩函数的实现中有设置抢占标记,但实际上这两段逻辑是不会被走到的。因为 syscall 执行时是在 m 的 g0 栈上,如果在执行时被抢占,那么会直接 throw,而无法恢复。</p> <p> </p> <p>来自:http://xargin.com/go-scheduler/</p> <p> </p>