TODO: 1. Evaluate AVX-512 in a multi-threaded setup. 2. Use `_mm_prefetch`, but only if the code is bound by memory latency.
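
For the prefetch item, here is a minimal sketch of what software prefetching might look like in C. It assumes an x86 target with SSE and an indirect (gather-style) access pattern, which is the kind of loop where hardware prefetchers tend to struggle. The function `sum_indirect` and the constant `PREFETCH_DISTANCE` are hypothetical names for illustration, not taken from this code base, and the distance would need to be tuned by measurement.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 (x86 SSE) */

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
 * The right value depends on the hardware and has to be measured. */
#define PREFETCH_DISTANCE 16

/* Sum data[idx[i]] for i in [0, n).
 * The access to data[] is indirect, so each load may miss the cache.
 * Issuing a prefetch for a future element lets that miss overlap
 * with the work done on the current element. */
double sum_indirect(const double *data, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Hint only: it does not change the computed result. */
            _mm_prefetch((const char *)&data[idx[i + PREFETCH_DISTANCE]],
                         _MM_HINT_T0);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

A prefetch is only a hint, so the result is the same with or without it. Whether it helps is an empirical question: if the loop is bound by compute or by memory bandwidth rather than latency, the extra instructions can make it slower, so benchmark both variants before keeping it.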