January 2021

Using cmake && makefile

Author: leenldk
Date: 2021-01-22
Category: technique

cmake

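Notes on some commonly used cmake commands.
cmake normally uses an out-of-source build, which keeps build output outside the source tree.
For an out-of-source build, the source tree must not contain leftover cmake build files. If the source tree contains a CMakeCache.txt, cmake treats that directory as a build tree.
cmake records absolute paths in the build tree, so a build tree cannot be copied or moved to another location.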

set(CMAKE_CXX_FLAGS "-std=c++14 -O2 -g -Wall ${OpenMP_CXX_FLAGS}") # set C++ compile options
set(CUDA_NVCC_FLAGS "-Xcompiler -fopenmp -std=c++14 -O2 -g -arch=compute_70 -code=sm_70 --ptxas-options=-v -lineinfo -keep") # set CUDA compile options
option(SHOW_SCHEDULE "Print the schedule" ON) # define an ON/OFF option
add_definitions(-DBACKEND=0) # add a preprocessor define
if (BACKEND STREQUAL "serial")
    add_definitions(-DBACKEND=0)
elseif(BACKEND STREQUAL "group")
    add_definitions(-DBACKEND=1)
else()
    message(FATAL_ERROR "invalid mode") # FATAL_ERROR stops configuration; a bare ERROR is not a valid message mode
endif() # if usage; the endif() parentheses can be left empty
add_executable(${BENCHMARK} micro-benchmark/${BENCHMARK}.cpp) # target and source file names taken from a variable

cmake -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON # make the generated makefile print each full command line
set(CMAKE_CUDA_FLAGS "-Xcompiler -std=c++14 -O2 -g -arch=compute_70 -code=sm_70 -cudart=shared ") # set CUDA flags


cmake_minimum_required(VERSION 3.13) # minimum cmake version (required)
project(Demo1) # project information
add_executable(Demo a.cc b.cc) # build the executable Demo from a.cc and b.cc
aux_source_directory(. DIR_SRCS) # collect all source files in directory . into ${DIR_SRCS}
add_subdirectory(math) # add a subdirectory and process the CMakeLists.txt inside it
option (USE_MYMATH "Use provided math implementation" ON) # add an option

set(CMAKE_EXPORT_COMPILE_COMMANDS ON) # generate compile_commands.json, which lists every compile command
-DCMAKE_EXPORT_COMPILE_COMMANDS=on # the same setting passed on the cmake command line

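CMAKE_CURRENT_BINARY_DIR : build-tree directory of the subdirectory currently being processed
CMAKE_CURRENT_SOURCE_DIR : source directory currently being processed
CMAKE_BINARY_DIR : top-level directory of the build tree
CMAKE_SOURCE_DIR : top-level directory of the source tree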
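EXECUTABLE_OUTPUT_PATH : output directory for executables (legacy variable; CMAKE_RUNTIME_OUTPUT_DIRECTORY is the newer replacement)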

add_library()
set()
get_filename_component()
set_source_files_properties(GENERATED)

makefile

@ prefix: run the command without echoing it to the terminal

$@ : target being generated
$< : first prerequisite
$^ : all prerequisites

all: library.cpp main.cpp

$@ evaluates to all
$< evaluates to library.cpp
$^ evaluates to library.cpp main.cpp

example :

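# Define required macros here
SHELL = /bin/bash

OBJS = main.o factorial.o hello.o
CFLAGS = -Wall -g
# the sources are .cpp files, so compile and link with the C++ driver
CC = g++

# link the final executable ($@ expands to hello)
hello: ${OBJS}
	${CC} ${CFLAGS} -o $@ ${OBJS}

# the leading - lets make continue even if rm reports an error
clean:
	-rm -f *.o core *.core

# suffix rule: build each .o from the matching .cpp ($< is the source file)
.cpp.o:
	${CC} ${CFLAGS} -c $<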

QCSimulator notes

Author: leenldk
Date: 2021-01-19
Category: technique

compiler

Function getGroup: plans one group and returns a GateGroup

schedule

GateGroup: a group of gates
1. relatedQubits
2. state
3. cuttPlans

compile

backend options: group, mix, blas

Using CUDA libraries

Author: leenldk
Date: 2021-01-16
Category: technique

cublas

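First create a cuBLAS handle

#include <cublas_v2.h>
#include <sstream>
// FatalError() is assumed to be defined elsewhere (e.g. print the message and exit).
#define checkCudaErrors(status) do {                                   \
    auto _status = (status);  /* evaluate the call only once */        \
    if (_status != 0) {                                                \
      std::stringstream _error;                                        \
      _error << "Cuda failure: " << _status;                           \
      FatalError(_error.str());                                        \
    }                                                                  \
} while(0)


cublasHandle_t cublasH;
checkCudaErrors(cublasCreate(&cublasH));
// subsequent library function calls take the handle explicitly
cublasDestroy(cublasH);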

curand

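#include <curand.h>
curandGenerator_t curand;
curandCreateGenerator(&curand, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(curand, 123ULL);
// p: float buffer in device memory, size: number of floats to generate (both set up elsewhere)
curandGenerateUniform(curand, p, size);
curandDestroyGenerator(curand);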

NCCL

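NVIDIA's collective communication library, providing multi-GPU, multi-node communication primitives.
It supports all-reduce, all-gather, and similar collectives.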

Profiler tools

Author: leenldk
Date: 2021-01-15
Category: technique

vtune

Intel's profiler

source /home/leenldk/intel/oneapi/vtune/2021.2.0/env/vars.sh  # load the environment

gprof

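GNU's open-source profiling tool, used with gcc

Compile with the -pg option to add instrumentation.
Running the program then produces gmon.out.
Run gprof to turn it into a readable profiling report.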

gcc example.c -o temp -g -pg
./temp
gprof temp > profiling.out

nvprof

update: nvprof no longer supports the newest GPUs; use nsys and ncu instead.

A tool bundled with the CUDA toolkit.
Usage:

nvprof ./gemm # print the profiling result
# when unified memory is used, you may need to add --unified-memory-profiling off
nvprof --unified-memory-profiling off ./gemm

-o prof.nvvp : write the result to an nvvp file
--metrics [all/gld_throughput] : profile all metrics / Global Load Throughput (may require sudo)

Visualization: with X11 forwarding, run nvvp prof.nvvp
With CUDA 11 there may be a Java problem; in that case:
sudo apt install openjdk-8-jdk
nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java prof.nvvp

Windows:
.\nvvp.exe -vm 'D:\Program Files\Java\jdk1.8.0_311\jre\bin\java.exe'

nsys (nsight system)

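Coarse-grained timeline profiling

ncu (nsight compute)

Fine-grained profiling at the level of individual kernels
ncu --list-sets : list the supported metric section sets

--set full : collect the full section set
-o file : write the report to file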

MPI profile

For single-node profiling, nsys can go before mpirun:
nsys profile [nsys args] mpirun [mpirun args] ...

For multi-node profiling, nsys must go after mpirun:
mpirun [mpirun args] nsys profile [nsys args] ...

Weekly Tinkering Diary (5)

Author: leenldk
Date: 2021-01-14
Category: technique

1.14

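Taking a few notes so I don't fall asleep while reading papers.
Currently re-reading (?) the TACO paper.

TACO generates CPU code for tensor expressions over sparse and dense tensors.
The main contributions of the paper:
1. tensor storage formats
2. iteration graph
3. merge lattices
4. the code generation algorithm

The user decides whether to compute an expression as several simple kernels or as one merged kernel.
Why sparse kernels are complicated to compute:
1. the dimensions of a sparse data structure can be laid out in different ways
2. sparse indices have to be merged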

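Gustafson's law:
Let the total workload be W, where a fraction p can be sped up by parallelism and the remaining 1-p cannot.
When the parallel part is scaled by a factor s, the theoretical workload becomes:
$$ W(s) = (1-p)W + spW $$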

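upd: this should probably be understood as follows:
Amdahl's law:
Let the serial run take time $t$, of which a fraction $f$ can be parallelized. The speedup with $N$ threads is:
$$S = \frac{t}{(1-f)t + \frac{f}{N} t} = \frac{1}{(1-f) + \frac{f}{N}}$$

Gustafson's law:
Let the parallel run with $N$ threads take time $t_s$, of which a fraction $f$ is spent executing in parallel. The speedup relative to a single thread is:
$$S = \frac{(1-f)t_s + N f t_s}{t_s} = (1-f) + N f$$

The difference is one of viewpoint: Gustafson starts from the parallel run (parallelism increases the problem size that can be solved, which corresponds to weak scaling), while Amdahl starts from the serial run (the serial part is the bottleneck of parallelism, which corresponds to strong scaling).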
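
To get a feel for how differently the two formulas behave, here is a small Python sketch that just evaluates them. The function names and the sample value f = 0.9 are my own picks for illustration, not taken from any paper or library.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup with n threads when a fraction f of the serial run time is parallelizable."""
    return 1.0 / ((1.0 - f) + f / n)


def gustafson_speedup(f: float, n: int) -> float:
    """Scaled speedup when a fraction f of the n-thread run time is spent in parallel code."""
    return (1.0 - f) + n * f


if __name__ == "__main__":
    f = 0.9  # assumed parallel fraction, for illustration only
    for n in (1, 8, 64, 1024):
        print(n, round(amdahl_speedup(f, n), 2), round(gustafson_speedup(f, n), 2))
```

With f fixed, the Amdahl value saturates toward 1/(1-f) as N grows, while the Gustafson value keeps growing linearly in N, which matches the strong-scaling versus weak-scaling reading above.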

1.15

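On a GPU every active thread is allocated its own registers, so switching between threads does not require saving or restoring registers.

All kernel launches in CUDA are asynchronous.
When timing with a CPU timer, call cudaDeviceSynchronize() before both starting and stopping the timer.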

1.16

Theoretical GPU bandwidth:
The V100 uses HBM2 (double data rate) RAM with an 877 MHz memory clock and a 4096-bit memory interface.
The theoretical bandwidth is: $$(0.877 \times 10^9 \times (4096 / 8) \times 2)\div 10^9 = 898\ \text{GB/s}$$
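
The same arithmetic as a tiny Python helper, in case the numbers for another card need plugging in. The function name and parameter names are hypothetical; the V100 figures are the ones quoted above.

```python
def peak_bandwidth_gb_s(clock_hz: float, bus_width_bits: int, transfers_per_clock: int = 2) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8  # interface width in bytes
    return clock_hz * bytes_per_transfer * transfers_per_clock / 1e9


if __name__ == "__main__":
    # V100 figures from the text: 877 MHz memory clock, 4096-bit interface, double data rate.
    print(f"{peak_bandwidth_gb_s(877e6, 4096):.0f} GB/s")
```

The default transfers_per_clock = 2 encodes the double-data-rate assumption; pass 1 for single-data-rate memory.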

cudaMallocManaged( void** devPtr, size_t size, unsigned int flags = cudaMemAttachGlobal ) : allocates unified memory, which can be accessed from both the device and the host

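Requested Global Load/Store Throughput:
the global memory throughput the kernel actually asks for, corresponding to the effective bandwidth.
Global Load/Store Throughput: memory is transferred in fairly large minimum-size blocks, so the amount actually transferred can exceed what the kernel requested; this actual figure is the Global Load/Store Throughput.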
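
A toy model of the gap between the two numbers, as a rough Python sketch. It assumes one warp of 32 threads, each reading a 4-byte element at a fixed stride, and memory moved in aligned 32-byte sectors. The sector size and the whole model are simplifying assumptions of mine; real hardware behavior depends on the architecture and its caches.

```python
def load_efficiency(stride: int, threads: int = 32, elem_bytes: int = 4, sector_bytes: int = 32) -> float:
    """Requested bytes / transferred bytes for one warp-wide strided load.

    Simplified model: thread i reads elem_bytes at byte offset i * stride * elem_bytes,
    and memory is moved in whole aligned sectors of sector_bytes (assumed granularity).
    """
    requested = threads * elem_bytes
    sectors = set()
    for i in range(threads):
        start = i * stride * elem_bytes
        # Record the sector of the first and last byte of the element.
        sectors.add(start // sector_bytes)
        sectors.add((start + elem_bytes - 1) // sector_bytes)
    transferred = len(sectors) * sector_bytes
    return requested / transferred


if __name__ == "__main__":
    for stride in (1, 2, 8):
        print(stride, load_efficiency(stride))
```

In this model a unit-stride load moves exactly what was requested, while larger strides touch more sectors for the same 128 requested bytes, so the transferred amount grows and the ratio drops.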

GPU architecture

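On the V100, the theoretical peak bandwidth from device memory to the GPU is 898GB/s.
The peak bandwidth from host memory to device memory is 16GB/s.

Page-locked (pinned) memory gives higher host-to-device bandwidth, up to about 12GB/s.
Allocate it with cudaHostAlloc().
Use cudaHostRegister() to pin memory that has already been allocated.
Pinned memory should not be overused, and allocating it is a heavyweight operation.

cudaMemcpy() is a blocking operation.
cudaMemcpyAsync() is non-blocking. It requires pinned host memory and takes a stream ID. It can overlap with CPU functions on the host, but not with a kernel in the same stream.
Operations in two different non-default streams can overlap.

zero copy: requires mapped pinned (non-pageable) memory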

1.18

CUDA memory space: