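This is a note on the computation speed of the run described in my previous article, "ALB測量を使って水深平均二次元流計算をすると凄い結果がでた - 趣味で計算流砂水理" (a depth-averaged 2D flow computation using ALB survey data).
The topic is parallelization with numba. For numba itself, see the reference below.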
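Speed comparison with and without numba parallelization
These are not proper measurements, only rough estimates from file timestamps.
- Parallel, 6 cores / 12 threads: 31 hours
- Single thread: 64 hours
Speedup: about 2.06x (64 h / 31 h, i.e. 206% of the single-thread speed). As a parallel efficiency this is only about 17% over 12 threads, or about 34% over the 6 physical cores.
The mesh has 4.3 million cells, which is a fair number, but because the problem is two-dimensional the speedup is far smaller than the thread count would suggest. Around 4 threads would probably be enough.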
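The arithmetic behind those numbers can be written out as a short sketch. The two timings are the ones quoted above. Everything else is an assumption made for illustration: it applies Amdahl's law to back out a parallel fraction and ignores hyper-threading and memory bandwidth, so the per-thread predictions are only a rough guide. The helper name `amdahl_speedup` is hypothetical.

```python
# Timings quoted in the post (rough, from file timestamps).
t_single = 64.0    # hours, single thread
t_parallel = 31.0  # hours, 6 cores / 12 threads
n_threads = 12

speedup = t_single / t_parallel     # about 2.06
efficiency = speedup / n_threads    # about 0.17

# Assumption: Amdahl's law, S(N) = 1 / ((1 - p) + p / N).
# Solve it for the parallel fraction p using the measured 12-thread speedup.
p = (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_threads)


def amdahl_speedup(p, n):
    """Predicted speedup on n threads for a parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)


print(f"speedup = {speedup:.2f}x, efficiency = {efficiency:.1%}, p = {p:.4f}")
for n in (1, 2, 4, 6, 12):
    print(f"{n:2d} threads: predicted speedup {amdahl_speedup(p, n):.2f}x")
```

Under this simple model the parallel fraction comes out at about 0.56, and 4 threads would already give roughly 1.7x of the 2.06x seen on 12 threads, which is consistent with the impression that about 4 threads is enough.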
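Basics of parallelization with numba
The details are in the official documentation, Automatic parallelization with @jit — Numba 0.50.1 documentation, but I will summarize just the key points here.
Parallelization in numba comes in two kinds:
- automatic parallelization
- manual parallelization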
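Automatic parallelization
Automatic parallelization is enabled by adding the options nopython=True and parallel=True to the decorator of the function that numba should accelerate.
With these options numba parallelizes numpy functions, fuses multiple loops, and optimizes redundant code, so you get a reasonable speedup with little effort.
The result of the optimization can be inspected with the parallel_diagnostics(level) method.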
From this computation, let's look at the result for one of the functions I parallelized with numba.
First, set the numba options:
@numba.jit(nopython=True, parallel=True)
def conEq(dep, qx, qy, dzb, dt, dx, dy, ibx, hmin, hbuf, hdown, periodic=True):
    ......................
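After running the program once (so that the function has been compiled), check the result with the parallel_diagnostics method. level sets how much information is printed, from 1 (least) to 4 (most). Here I use 4.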
conEq.parallel_diagnostics(level=4)
================================================================================ Parallel Accelerator Optimizing: Function conEq, <ipython- input-5-cf997de63147> (1) ================================================================================ Parallel loop listing for Function conEq, <ipython-input-5-cf997de63147> (1) ------------------------------------------------------------------------------------------------------------------|loop #ID @numba.jit(nopython=True, parallel=True) | def conEq(dep, qx, qy, dzb, dt, dx, dy, ibx, hmin, hbuf, hdown, periodic=True): | # Qind = range(154,159) | | imax, jmax = len(dep), len(dep[0]) | depn = np.zeros_like(dep, dtype=np.float64) | fluxx = np.zeros((imax+1, jmax), dtype=np.float64)------------------------------------------------------------| #3 fluxy = np.zeros((imax, jmax+1), dtype=np.float64)------------------------------------------------------------| #4 modflux = np.full( (imax, jmax), False) | | gravity = float( 9.8 ) | | f = lambda Qp, Qm : Qm if Qp >= 0.0 and Qm >= 0.0 else (Qp if Qp <= 0.0 and Qm <= 0.0 else 0.5*Qp+0.5*Qm ) | | def flux(Qp, Qm, depp, depm, zbp, zbm, ib, delta) : | r = f(Qp, Qm) | # if ( (depm + zbm) < zbp - ib*delta) and (depp <= hbuf) : r = 0.0 | # if ( (depp + zbp) < zbm + ib*delta) and (depm <= hbuf) : r = 0.0 | if ( (depm + zbm) <= zbp + hbuf - ib*delta) and (depp <= hbuf) : r = 0.0 | if ( (depp + zbp) <= zbm + hbuf + ib*delta) and (depm <= hbuf) : r = 0.0 | | return r | | for i in numba.prange( imax ):--------------------------------------------------------------------------------| #8 for j in range( jmax ): | c, xm = (i,j), (i-1,j) | fluxx[c] = flux(qx[c], qx[xm], dep[c], dep[xm], dzb[c], dzb[xm], ibx, dx) | | if periodic : | # boundary : periodic | fluxx[-1,:] = fluxx[0,:] ---------------------------------------------------------------------------------| #0 else: | for j in numba.prange( jmax ): fluxx[-1,j] = fluxx[-2,j] # qx[-1,j] # if qx[-1,j] > 0.0 else qx[,j]-------| #5 # normal | # for 
j in numba.prange( jmax ): fluxx[-1,j] = qx[-1,j] if qx[-1,j] > 0.0 else 0.0 | | for i in numba.prange( imax ):--------------------------------------------------------------------------------| #7 for j in range( 1, jmax ): | c, ym = (i,j), (i,j-1) | fluxy[c] = flux(qy[c], qy[ym], dep[c], dep[ym], dzb[c], dzb[ym], 0.0, dy) | | # wall boundary | # fluxy[:,0] = 0.0 | # fluxy[:,-1] = 0.0 | | for i in numba.prange( imax ):--------------------------------------------------------------------------------| #6 fluxy[i,-1] = qy[i,-1] if qy[i,-1] > 0.0 else 0.0 | fluxy[i, 0] = qy[i, 0] if qy[i, 0] < 0.0 else 0.0 | | nis = 0 if periodic else 1 | # limiter -------------------------------------------------------------- | # 水深が負になる際に質量保存を満たすためにフラックスを修正する | for i in range(nis, imax): | for j in range(jmax): | if dep[c] > hmin : | c, xp, yp = (i, j), (i+1, j), (i, j+1) | fxp = fluxx[xp] if fluxx[xp] > 0.0 else 0.0 | fxm = -fluxx[c] if fluxx[c] < 0.0 else 0.0 | fyp = fluxy[yp] if fluxy[yp] > 0.0 else 0.0 | fym = -fluxy[c] if fluxy[c] < 0.0 else 0.0 | V = dep[c]*dx*dy - hmin*dx*dy | Vq = ( fxp*dy + fxm*dy + fyp*dx + fym*dx )*dt | if V < Vq: | alfa = V / Vq - 0.001 | if fluxx[xp] > 0.0 : fluxx[xp] *= alfa | if fluxx[c] < 0.0 : fluxx[c] *= alfa | if fluxy[yp] > 0.0 : fluxy[yp] *= alfa | if fluxy[c] < 0.0 : fluxy[c] *= alfa | | modflux[c] = True | # ------------------------------------------------------------------------ | n = 0 | for i in numba.prange(nis, imax):-----------------------------------------------------------------------------| #9 for j in range(jmax): | c, xp, yp = (i, j), (i+1, j), (i, j+1) | depn[c] = dep[c] - dt*(fluxx[xp] - fluxx[c])/dx - dt*(fluxy[yp] - fluxy[c])/dy | if depn[c] < hmin : | n += 1 | # print('dep-error') | # print( modflux[c], depn[c], fluxx[xp], fluxx[c], fluxy[yp], fluxy[c] ) | fxp = fluxx[xp] if fluxx[xp] > 0.0 else 0.0 | fxm = -fluxx[c] if fluxx[c] < 0.0 else 0.0 | fyp = fluxy[yp] if fluxy[yp] > 0.0 else 0.0 | fym = -fluxy[c] if fluxy[c] < 0.0 else 
0.0 | V = dep[c]*dx*dy - hmin*dx*dy | Vq = ( fxp*dy + fxm*dy + fyp*dx + fym*dx )*dt | # print(V,Vq) | | depn[c] = hmin | | # upstream boundary | # if periodic == False: depn[0][:] = depn[1][:] | if periodic == False: | # depn[0][:] = dep[0][:] | # for j in Qind : depn[0,j] = depn[1,j] | | depn[0,:] = depn[1,:] -----------------------------------------------------------------------------------| #1 | # downstream boundary | # depn[-1][:] = hdown | depn[-1][:] = depn[-2][:]-------------------------------------------------------------------------------------| #2 | return depn, n | --------------------------------- Fusing loops --------------------------------- Attempting fusion of parallel loops (combines loops with similar properties)... Trying to fuse loops #7 and #6: - fusion succeeded: parallel for-loop #6 is fused into for-loop #7. Trying to fuse loops #3 and #4: - fusion failed: loop dimension mismatched in axis 0. slice(0, $48binary_add.19, 1) != slice(0, dep_size0.183, 1) ----------------------------- Before Optimisation ------------------------------ Parallel region 0: +--7 (parallel) +--6 (parallel) -------------------------------------------------------------------------------- ------------------------------ After Optimisation ------------------------------ Parallel region 0: +--7 (parallel, fused with loop(s): 6) Parallel region 0 (loop #7) had 1 loop(s) fused. 
-------------------------------------------------------------------------------- -------------------------------------------------------------------------------- ---------------------------Loop invariant code motion--------------------------- Allocation hoisting: No allocation hoisting found Instruction hoisting: loop #3: Has the following hoisted: $expr_out_var.287 = const(float64, 0.0) Failed to hoist the following: dependency: $parfor_index_tuple_var.288 = build_tuple(items=[Var($parfor__index_281.743, <string>:2), Var($parfor__index_282.745, <string>:3)]) loop #4: Has the following hoisted: $expr_out_var.295 = const(float64, 0.0) Failed to hoist the following: dependency: $parfor_index_tuple_var.296 = build_tuple(items=[Var($parfor__index_289.789, <string>:2), Var($parfor__index_290.791, <string>:3)]) loop #8: Has the following hoisted: $28binary_multiply.12.112 = ibx * dx bool34.115 = global(bool: <class 'bool'>) bool42.120 = global(bool: <class 'bool'>) $const172.7 = const(int, 1) $const4.1.151 = const(float, 0.0) bool8.153 = global(bool: <class 'bool'>) r.2.145 = const(float, 0.0) $const42.0.174 = const(float, 0.5) $const48.3.177 = const(float, 0.5) r.1.144 = const(float, 0.0) $64binary_multiply.8.131 = ibx * dx bool70.134 = global(bool: <class 'bool'>) $const12.1.156 = const(float, 0.0) bool16.158 = global(bool: <class 'bool'>) $152load_global.1 = global(range: <class 'range'>) $const32.1.168 = const(float, 0.0) bool36.170 = global(bool: <class 'bool'>) $const24.1.163 = const(float, 0.0) bool28.165 = global(bool: <class 'bool'>) bool78.139 = global(bool: <class 'bool'>) Failed to hoist the following: dependency: $160for_iter.2 = iternext(value=$158get_iter.4) dependency: $j.328 = pair_first(value=$160for_iter.2) dependency: $160for_iter.4 = pair_second(value=$160for_iter.2) dependency: $16binary_add.6.106 = $depm.94.327 + $zbm.96.326 dependency: $22binary_add.9.109 = $zbp.95.322 + hbuf dependency: $30binary_subtract.13.113 = $22binary_add.9.109 - 
$28binary_multiply.12.112 dependency: $32compare_op.14.114 = $16binary_add.6.106 <= $30binary_subtract.13.113 dependency: $34pred.116 = call $push_global_to_block.837($32compare_op.14.114, func=$push_global_to_block.837, args=(Var($32compare_op.14.114, <ipython-input-5-cf997de63147>:19),), kws=(), vararg=None) dependency: $40compare_op.2.119 = $depp.93.319 <= hbuf dependency: $42pred.121 = call $push_global_to_block.838($40compare_op.2.119, func=$push_global_to_block.838, args=(Var($40compare_op.2.119, <ipython-input-5-cf997de63147>:19),), kws=(), vararg=None) dependency: $c.9.656 = build_tuple(items=[Var($parfor__index_303.828, <string>:3), Var($j.328, <ipython-input-5-cf997de63147>:25)]) dependency: $174binary_subtract.8 = $parfor__index_303.828 - $const172.7 dependency: $xm.323 = build_tuple(items=[Var($174binary_subtract.8, <ipython-input-5-cf997de63147>:26), Var($j.328, <ipython-input-5-cf997de63147>:25)]) dependency: $r.103.3.2.848 = getitem(value=qx, index=$c.9.656, fn=<built-in function getitem>) dependency: $r.103.3.2.850 = getitem(value=qx, index=$xm.323, fn=<built-in function getitem>) dependency: $depp.93.319 = getitem(value=dep, index=$c.9.656, fn=<built-in function getitem>) dependency: $depm.94.327 = getitem(value=dep, index=$xm.323, fn=<built-in function getitem>) dependency: $zbp.95.322 = getitem(value=dzb, index=$c.9.656, fn=<built-in function getitem>) dependency: $zbm.96.326 = getitem(value=dzb, index=$xm.323, fn=<built-in function getitem>) dependency: $6compare_op.2.152 = $r.103.3.2.848 >= $const4.1.151 dependency: $8pred.154 = call $push_global_to_block.839($6compare_op.2.152, func=$push_global_to_block.839, args=(Var($6compare_op.2.152, <ipython-input-5-cf997de63147>:13),), kws=(), vararg=None) dependency: $46binary_multiply.2.176 = $const42.0.174 * $r.103.3.2.848 dependency: $52binary_multiply.5.179 = $const48.3.177 * $r.103.3.2.850 dependency: $r.103.3.2.849 = $46binary_multiply.2.176 + $52binary_multiply.5.179 dependency: 
$52binary_add.2.125 = $depp.93.319 + $zbp.95.322 dependency: $58binary_add.5.128 = $zbm.96.326 + hbuf dependency: $66binary_add.9.132 = $58binary_add.5.128 + $64binary_multiply.8.131 dependency: $68compare_op.10.133 = $52binary_add.2.125 <= $66binary_add.9.132 dependency: $70pred.135 = call $push_global_to_block.840($68compare_op.10.133, func=$push_global_to_block.840, args=(Var($68compare_op.10.133, <ipython-input-5-cf997de63147>:20),), kws=(), vararg=None) dependency: $14compare_op.2.157 = $r.103.3.2.850 >= $const12.1.156 dependency: $16pred.159 = call $push_global_to_block.841($14compare_op.2.157, func=$push_global_to_block.841, args=(Var($14compare_op.2.157, <ipython-input-5-cf997de63147>:13),), kws=(), vararg=None) dependency: c_8 = c.9 dependency: i = $parfor__index_303.828 not pure: $156call_function.3 = call $push_global_to_block.836(_jmax_316, func=$push_global_to_block.836, args=[Var(_jmax_316, <ipython-input-5-cf997de63147>:5)], kws=(), vararg=None) dependency: $158get_iter.4 = getiter(value=$156call_function.3) dependency: $34compare_op.2.169 = $r.103.3.2.850 <= $const32.1.168 dependency: $36pred.171 = call $push_global_to_block.842($34compare_op.2.169, func=$push_global_to_block.842, args=(Var($34compare_op.2.169, <ipython-input-5-cf997de63147>:13),), kws=(), vararg=None) dependency: $26compare_op.2.164 = $r.103.3.2.848 <= $const24.1.163 dependency: $28pred.166 = call $push_global_to_block.843($26compare_op.2.164, func=$push_global_to_block.843, args=(Var($26compare_op.2.164, <ipython-input-5-cf997de63147>:13),), kws=(), vararg=None) dependency: $76compare_op.2.138 = $depm.94.327 <= hbuf dependency: $78pred.140 = call $push_global_to_block.844($76compare_op.2.138, func=$push_global_to_block.844, args=(Var($76compare_op.2.138, <ipython-input-5-cf997de63147>:20),), kws=(), vararg=None) loop #0: Failed to hoist the following: dependency: $value_var.274 = getitem(value=_262binary__subscr_7, index=$parfor__index_273.877, fn=<built-in function getitem>) loop 
#5: Has the following hoisted: $const298.3 = const(int, -2) $const308.8 = const(int, -1) Failed to hoist the following: dependency: $302build_tuple.5 = build_tuple(items=[Var($const298.3, <ipython-input-5-cf997de63147>:33), Var($parfor__index_297.903, <string>:2)]) dependency: $304binary_subscr.6 = getitem(value=fluxx, index=$302build_tuple.5, fn=<built-in function getitem>) dependency: $312build_tuple.10 = build_tuple(items=[Var($const308.8, <ipython-input-5-cf997de63147>:33), Var($parfor__index_297.903, <string>:2)]) loop #7: Has the following hoisted: $const24.1.72 = const(float, 0.0) bool28.74 = global(bool: <class 'bool'>) $const42.0.83 = const(float, 0.5) $const48.3.86 = const(float, 0.5) r.1.53 = const(float, 0.0) $const12.1.65 = const(float, 0.0) bool16.67 = global(bool: <class 'bool'>) $const364.8 = const(int, 1) $ib.6.334 = const(float, 0.0) $const4.1.60 = const(float, 0.0) bool8.62 = global(bool: <class 'bool'>) $64binary_multiply.8.40 = $ib.6.334 * dy bool70.43 = global(bool: <class 'bool'>) $const32.1.77 = const(float, 0.0) bool36.79 = global(bool: <class 'bool'>) $const458.4 = const(int, -1) $const464.7 = const(float, 0.0) bool468 = global(bool: <class 'bool'>) $28binary_multiply.12.21 = $ib.6.334 * dy bool34.24 = global(bool: <class 'bool'>) bool78.48 = global(bool: <class 'bool'>) r.2.54 = const(float, 0.0) bool42.29 = global(bool: <class 'bool'>) $340load_global.1 = global(range: <class 'range'>) $const342.2 = const(int, 1) $const532.4 = const(int, 0) $const518.3 = const(int, 0) $const476.3 = const(int, -1) $const526.1 = const(float, 0.0) $const490.4 = const(int, -1) $const500.8 = const(int, 0) $const506.11 = const(float, 0.0) bool510 = global(bool: <class 'bool'>) $const484.1 = const(float, 0.0) Failed to hoist the following: dependency: $26compare_op.2.73 = $r.12.3.2.962 <= $const24.1.72 dependency: $28pred.75 = call $push_global_to_block.948($26compare_op.2.73, func=$push_global_to_block.948, args=(Var($26compare_op.2.73, 
<ipython-input-5-cf997de63147>:13),), kws=(), vararg=None) dependency: $46binary_multiply.2.85 = $const42.0.83 * $r.12.3.2.962 dependency: $52binary_multiply.5.88 = $const48.3.86 * $r.12.3.2.963 dependency: $r.12.3.2.961 = $46binary_multiply.2.85 + $52binary_multiply.5.88 dependency: $350for_iter.2 = iternext(value=$348get_iter.5) dependency: $j.2.341 = pair_first(value=$350for_iter.2) dependency: $350for_iter.4 = pair_second(value=$350for_iter.2) dependency: $14compare_op.2.66 = $r.12.3.2.963 >= $const12.1.65 dependency: $16pred.68 = call $push_global_to_block.949($14compare_op.2.66, func=$push_global_to_block.949, args=(Var($14compare_op.2.66, <ipython-input-5-cf997de63147>:13),), kws=(), vararg=None) dependency: $c.7.659 = build_tuple(items=[Var($parfor__index_301.939, <string>:3), Var($j.2.341, <ipython-input-5-cf997de63147>:38)]) dependency: $366binary_subtract.9 = $j.2.341 - $const364.8 dependency: $ym.330 = build_tuple(items=[Var($parfor__index_301.939, <string>:3), Var($366binary_subtract.9, <ipython-input-5-cf997de63147>:39)]) dependency: $r.12.3.2.962 = getitem(value=qy, index=$c.7.659, fn=<built-in function getitem>) dependency: $r.12.3.2.963 = getitem(value=qy, index=$ym.330, fn=<built-in function getitem>) dependency: $depp.2.337 = getitem(value=dep, index=$c.7.659, fn=<built-in function getitem>) dependency: $depm.3.335 = getitem(value=dep, index=$ym.330, fn=<built-in function getitem>) dependency: $zbp.4.331 = getitem(value=dzb, index=$c.7.659, fn=<built-in function getitem>) dependency: $zbm.5.336 = getitem(value=dzb, index=$ym.330, fn=<built-in function getitem>) dependency: $6compare_op.2.61 = $r.12.3.2.962 >= $const4.1.60 dependency: $8pred.63 = call $push_global_to_block.950($6compare_op.2.61, func=$push_global_to_block.950, args=(Var($6compare_op.2.61, <ipython-input-5-cf997de63147>:13),), kws=(), vararg=None) dependency: $52binary_add.2.34 = $depp.2.337 + $zbp.4.331 dependency: $58binary_add.5.37 = $zbm.5.336 + hbuf dependency: 
$66binary_add.9.41 = $58binary_add.5.37 + $64binary_multiply.8.40 dependency: $68compare_op.10.42 = $52binary_add.2.34 <= $66binary_add.9.41 dependency: $70pred.44 = call $push_global_to_block.951($68compare_op.10.42, func=$push_global_to_block.951, args=(Var($68compare_op.10.42, <ipython-input-5-cf997de63147>:20),), kws=(), vararg=None) dependency: $34compare_op.2.78 = $r.12.3.2.963 <= $const32.1.77 dependency: $36pred.80 = call $push_global_to_block.952($34compare_op.2.78, func=$push_global_to_block.952, args=(Var($34compare_op.2.78, <ipython-input-5-cf997de63147>:13),), kws=(), vararg=None) dependency: c_6 = c.7 dependency: $460build_tuple.5 = build_tuple(items=[Var($parfor__index_301.939, <string>:3), Var($const458.4, <ipython-input-5-cf997de63147>:47)]) dependency: $462binary_subscr.6 = getitem(value=qy, index=$460build_tuple.5, fn=<built-in function getitem>) dependency: $466compare_op.8 = $462binary_subscr.6 > $const464.7 dependency: $468pred = call $push_global_to_block.953($466compare_op.8, func=$push_global_to_block.953, args=(Var($466compare_op.8, <ipython-input-5-cf997de63147>:47),), kws=(), vararg=None) dependency: $16binary_add.6.15 = $depm.3.335 + $zbm.5.336 dependency: $22binary_add.9.18 = $zbp.4.331 + hbuf dependency: $30binary_subtract.13.22 = $22binary_add.9.18 - $28binary_multiply.12.21 dependency: $32compare_op.14.23 = $16binary_add.6.15 <= $30binary_subtract.13.22 dependency: $34pred.25 = call $push_global_to_block.954($32compare_op.14.23, func=$push_global_to_block.954, args=(Var($32compare_op.14.23, <ipython-input-5-cf997de63147>:19),), kws=(), vararg=None) dependency: $76compare_op.2.47 = $depm.3.335 <= hbuf dependency: $78pred.49 = call $push_global_to_block.955($76compare_op.2.47, func=$push_global_to_block.955, args=(Var($76compare_op.2.47, <ipython-input-5-cf997de63147>:20),), kws=(), vararg=None) dependency: $40compare_op.2.28 = $depp.2.337 <= hbuf dependency: $42pred.30 = call $push_global_to_block.956($40compare_op.2.28, 
func=$push_global_to_block.956, args=(Var($40compare_op.2.28, <ipython-input-5-cf997de63147>:19),), kws=(), vararg=None) not pure: $346call_function.4 = call $push_global_to_block.947($const342.2, _jmax_316, func=$push_global_to_block.947, args=[Var($const342.2, <ipython-input-5-cf997de63147>:38), Var(_jmax_316, <ipython-input-5-cf997de63147>:5)], kws=(), vararg=None) dependency: $348get_iter.5 = getiter(value=$346call_function.4) dependency: $534build_tuple.5 = build_tuple(items=[Var($parfor__index_301.939, <string>:3), Var($const532.4, <ipython-input-5-cf997de63147>:48)]) dependency: $520build_tuple.4 = build_tuple(items=[Var($parfor__index_301.939, <string>:3), Var($const518.3, <ipython-input-5-cf997de63147>:48)]) dependency: $522binary_subscr.5 = getitem(value=qy, index=$520build_tuple.4, fn=<built-in function getitem>) dependency: $478build_tuple.4 = build_tuple(items=[Var($parfor__index_301.939, <string>:3), Var($const476.3, <ipython-input-5-cf997de63147>:47)]) dependency: $480binary_subscr.5 = getitem(value=qy, index=$478build_tuple.4, fn=<built-in function getitem>) dependency: $492build_tuple.5 = build_tuple(items=[Var($parfor__index_301.939, <string>:3), Var($const490.4, <ipython-input-5-cf997de63147>:47)]) dependency: $502build_tuple.9 = build_tuple(items=[Var($parfor__index_301.939, <string>:3), Var($const500.8, <ipython-input-5-cf997de63147>:48)]) dependency: $504binary_subscr.10 = getitem(value=qy, index=$502build_tuple.9, fn=<built-in function getitem>) dependency: $508compare_op.12 = $504binary_subscr.10 < $const506.11 dependency: $510pred = call $push_global_to_block.957($508compare_op.12, func=$push_global_to_block.957, args=(Var($508compare_op.12, <ipython-input-5-cf997de63147>:48),), kws=(), vararg=None) loop #9: Has the following hoisted: $const1202.6 = const(float, 0.0) bool1206 = global(bool: <class 'bool'>) $const1228.6 = const(float, 0.0) bool1232 = global(bool: <class 'bool'>) $1010load_global.1 = global(range: <class 'range'>) 
$const1032.7 = const(int, 1) $const1044.13 = const(int, 1) bool1130 = global(bool: <class 'bool'>) $const1136.3 = const(int, 1) $const1148.8 = const(float, 0.0) bool1152 = global(bool: <class 'bool'>) $const1174.6 = const(float, 0.0) bool1178 = global(bool: <class 'bool'>) Failed to hoist the following: dependency: $1200binary_subscr.5 = getitem(value=fluxy, index=$yp.1.355, fn=<built-in function getitem>) dependency: $1204compare_op.7 = $1200binary_subscr.5 > $const1202.6 dependency: $1206pred = call $push_global_to_block.1022($1204compare_op.7, func=$push_global_to_block.1022, args=(Var($1204compare_op.7, <ipython-input-5-cf997de63147>:83),), kws=(), vararg=None) dependency: $1226binary_subscr.5 = getitem(value=fluxy, index=$c.3.356, fn=<built-in function getitem>) dependency: $1230compare_op.7 = $1226binary_subscr.5 < $const1228.6 dependency: $1232pred = call $push_global_to_block.1023($1230compare_op.7, func=$push_global_to_block.1023, args=(Var($1230compare_op.7, <ipython-input-5-cf997de63147>:84),), kws=(), vararg=None) not pure: $1014call_function.3 = call $push_global_to_block.1021(_jmax_316, func=$push_global_to_block.1021, args=[Var(_jmax_316, <ipython-input-5-cf997de63147>:5)], kws=(), vararg=None) dependency: $1016get_iter.4 = getiter(value=$1014call_function.3) dependency: $1018for_iter.2 = iternext(value=$1016get_iter.4) dependency: $j.4.359 = pair_first(value=$1018for_iter.2) dependency: $1018for_iter.4 = pair_second(value=$1018for_iter.2) dependency: $c.3.356 = build_tuple(items=[Var($parfor__index_305.1017, <string>:3), Var($j.4.359, <ipython-input-5-cf997de63147>:74)]) dependency: $1034binary_add.8 = $parfor__index_305.1017 + $const1032.7 dependency: $xp.1.357 = build_tuple(items=[Var($1034binary_add.8, <ipython-input-5-cf997de63147>:75), Var($j.4.359, <ipython-input-5-cf997de63147>:74)]) dependency: $1046binary_add.14 = $j.4.359 + $const1044.13 dependency: $yp.1.355 = build_tuple(items=[Var($parfor__index_305.1017, <string>:3), 
Var($1046binary_add.14, <ipython-input-5-cf997de63147>:75)]) dependency: $1064binary_subscr.18 = getitem(value=dep, index=$c.3.356, fn=<built-in function getitem>) dependency: $1072binary_subscr.22 = getitem(value=fluxx, index=$xp.1.357, fn=<built-in function getitem>) dependency: $1078binary_subscr.25 = getitem(value=fluxx, index=$c.3.356, fn=<built-in function getitem>) dependency: $1080binary_subtract.26 = $1072binary_subscr.22 - $1078binary_subscr.25 dependency: $1082binary_multiply.27 = dt * $1080binary_subtract.26 dependency: $1086binary_true_divide.29 = $1082binary_multiply.27 / dx dependency: $1088binary_subtract.30 = $1064binary_subscr.18 - $1086binary_true_divide.29 dependency: $1096binary_subscr.34 = getitem(value=fluxy, index=$yp.1.355, fn=<built-in function getitem>) dependency: $1102binary_subscr.37 = getitem(value=fluxy, index=$c.3.356, fn=<built-in function getitem>) dependency: $1104binary_subtract.38 = $1096binary_subscr.34 - $1102binary_subscr.37 dependency: $1106binary_multiply.39 = dt * $1104binary_subtract.38 dependency: $1110binary_true_divide.41 = $1106binary_multiply.39 / dy dependency: $1112binary_subtract.42 = $1088binary_subtract.30 - $1110binary_true_divide.41 dependency: $1124binary_subscr.47 = getitem(value=depn, index=$c.3.356, fn=<built-in function getitem>) dependency: $1128compare_op.49 = $1124binary_subscr.47 < hmin dependency: $1130pred = call $push_global_to_block.1024($1128compare_op.49, func=$push_global_to_block.1024, args=(Var($1128compare_op.49, <ipython-input-5-cf997de63147>:77),), kws=(), vararg=None) dependency: $n.4.663 = inplace_binop(fn=<built-in function iadd>, immutable_fn=<built-in function add>, lhs=n.2, rhs=$const1136.3, static_lhs=Undefined, static_rhs=Undefined) dependency: $1146binary_subscr.7 = getitem(value=fluxx, index=$xp.1.357, fn=<built-in function getitem>) dependency: $1150compare_op.9 = $1146binary_subscr.7 > $const1148.8 dependency: $1152pred = call $push_global_to_block.1025($1150compare_op.9, 
func=$push_global_to_block.1025, args=(Var($1150compare_op.9, <ipython-input-5-cf997de63147>:81),), kws=(), vararg=None) dependency: n_3 = n.2 dependency: $1172binary_subscr.5 = getitem(value=fluxx, index=$c.3.356, fn=<built-in function getitem>) dependency: $1176compare_op.7 = $1172binary_subscr.5 < $const1174.6 dependency: $1178pred = call $push_global_to_block.1026($1176compare_op.7, func=$push_global_to_block.1026, args=(Var($1176compare_op.7, <ipython-input-5-cf997de63147>:82),), kws=(), vararg=None) loop #1: Failed to hoist the following: dependency: $value_var.277 = getitem(value=_1356binary__subscr_7, index=$parfor__index_276.1071, fn=<built-in function getitem>) loop #2: Failed to hoist the following: dependency: $value_var.280 = getitem(value=_1384binary__subscr_7, index=$parfor__index_279.1098, fn=<built-in function getitem>) --------------------------------------------------------------------------------
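The output is very long, but as the After Optimisation section shows, the only automatic optimization was that loops #7 and #6 were fused. The attempt to fuse loops #3 and #4 failed because their dimensions do not match along axis 0.
I write my code with parallelization in mind, so there apparently was not much left for numba to optimize.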
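Manual parallelization
For manual parallelization, start from the same options as before,
@numba.jit(nopython=True, parallel=True)
def conEq(dep, qx, qy, dzb, dt, dx, dy, ibx, hmin, hbuf, hdown, periodic=True):
    ......................
and, in addition, change the iteration range of each loop you want to parallelize from range to numba.prange.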
for i in numba.prange( imax ):
    ......................
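This parallelizes the specified loop. Because it is ordinary thread-based parallelization, you need to take great care with shared variables and the like: iterations must not depend on each other's results.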
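Summary
That covers the basics of parallelization with numba. It is easy to use, but as with other parallelization approaches, getting good parallel efficiency requires careful design at the coding stage.
I would like to study Python speedups a bit more. The book below looks thorough, but I have not read it yet.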