This article compares a single-threaded, a multi-threaded and a CUDA program on the same task, to show how much time a GPU can save on parallel computation.
The image is tiled into 7x7-pixel blocks: horizontally there are about 105 of them, and vertically 458/7 ≈ 65, so the image can be divided into roughly 105*65 = 6825 blocks. Suppose a pixel sits at position (x, y) in the image and its gray value is I(x, y); the gradients in the x and y directions are computed as
g_x = I(x+1, y) - I(x-1, y)
g_y = I(x, y+1) - I(x, y-1)
Within each block we go over every pixel and find the one whose gradient norm sqrt(g_x^2 + g_y^2) is largest, i.e. the local maximum gradient point of that region.
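For example (purely illustrative numbers): if a pixel's four neighbours have intensities I(x+1, y) = 120, I(x-1, y) = 100, I(x, y+1) = 80 and I(x, y-1) = 90, then g_x = 20, g_y = -10 and the gradient norm is sqrt(20^2 + 10^2) ≈ 22.4; within each block we keep whichever pixel has the largest such norm.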
The original image is shown below.
Single-threaded
The single-threaded version is straightforward; the program is as follows:
#include <opencv2/opencv.hpp>
#include <Eigen/Core>
#include <iostream>
#include <vector>
int cell_ = 4;
int pattern_gradient_ = 3;//2*pattern + 1 = 7
std::vector<cv::KeyPoint> hg_keys_;
void ExtractHighGradient(const cv::Mat &im){
//cell row and cell columns
int rows = im.rows;
int cols = im.cols;
int c_row = im.rows/cell_;
int c_col = im.cols/cell_;
//divide the image into cell_ by cell_ parts; in each part we extract high gradient points in a 7 by 7 pattern and calculate one NID
for(int k = 0; k < cell_; k++)
for(int l = 0; l < cell_; l++)
for(int i = c_row*k; i < c_row*(k+1); i += (2*pattern_gradient_+1)){
if(i == 0 || i >= (rows - (2*pattern_gradient_ + 1)))
continue;
for(int j = c_col*l; j < c_col*(l+1); j += (2*pattern_gradient_+1)){
if(j == 0 || j>=(cols - (2*pattern_gradient_ + 1)) )
continue;
//find the pixel that has max gradient in the `2*pattern_gradient_+1` by `2*pattern_gradient_+1` pattern
float max_gradient_norm = 0.0;
float gradient_norm = 0.0;
cv::KeyPoint lmgp; //local_max_gradient_point
for(int y = i; y < i+2*pattern_gradient_+1; y++){
for(int x = j; x< j+2*pattern_gradient_+1; x++){
//if(y+1<rows && x+1<cols && y-1>=0 && x-1>=0){
Eigen::Vector2d gradient(
im.ptr<uchar>(y+1)[x] - im.ptr<uchar>(y-1)[x],
im.ptr<uchar>(y)[x+1] - im.ptr<uchar>(y)[x-1]
);
gradient_norm = gradient.norm();
if(gradient_norm > max_gradient_norm){
lmgp.pt.x = x;
lmgp.pt.y = y;
max_gradient_norm = gradient_norm;
}
}
}
hg_keys_.push_back(lmgp);
}
}
}
int main(int argc, char* argv[]){
cv::Mat test;
std::string add("../demo.png");
if(argc != 2)
std::cout<<"use default address "<<add.c_str()<<std::endl;
else{
add.clear();
add.append(argv[1]);
}
test = cv::imread(add, 0);//0 for gray and 1 for rgb
cv::Mat im_with_keypoints;
double extract_time = (double)cv::getTickCount();
ExtractHighGradient(test);
extract_time = ((double)cv::getTickCount() - extract_time)/cv::getTickFrequency();
std::cout<<"extract time is "<<extract_time<<std::endl;
std::cout<<"point size "<<hg_keys_.size()<<std::endl;
cv::drawKeypoints(test, hg_keys_, im_with_keypoints);
cv::namedWindow("show key points", cv::WINDOW_AUTOSIZE);
cv::imshow("show key points", im_with_keypoints);
cv::waitKey(0);
}
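For reference, one possible way to build this program (a sketch only, assuming the source file is named single_thread.cpp, OpenCV is installed with pkg-config support under the name opencv4, and the Eigen headers live in /usr/include/eigen3; adjust these to your setup):
g++ -O2 single_thread.cpp -I/usr/include/eigen3 `pkg-config --cflags --libs opencv4` -o single_thread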
Note the variable cell_: it further subdivides the whole image into 4x4 = 16 cell regions. For the multi-threaded version I need to split the image into several regions so that each thread can process one region in parallel; the single-threaded version does not need this, it is only done so that the comparison with the multi-threaded version below is fair. I draw the local maximum gradient point of every 7x7 block. Running the program gives the following result:
use default address ../demo.png
extract time is 0.101059
point size 6996
Multi-threaded
For the multi-threaded version, as mentioned above, I split the image into 4x4 cells. The code uses 8 threads, each responsible for two cells. The thing you constantly have to watch out for with multiple threads is that they must not conflict with each other. For example, my code has a global vector hg_keys_: every time a local maximum gradient point is found, it is pushed into hg_keys_. If two threads do that at the same time they collide, so while one thread is pushing a new point into hg_keys_, the vector has to be locked so that no other thread can touch it at that moment. That part of the code looks like this:
mtx.lock();//avoid push_back conflict problem
hg_keys_.push_back(lmgp);//lmgp is local maximum gradient point
mtx.unlock();
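Locking on every single push is not the only option, by the way. A common alternative, shown here only as a sketch and not what my code below does, is to let each thread collect its points in a local vector and append them to hg_keys_ once at the end, so the mutex is taken only once per thread:
std::vector<cv::KeyPoint> local_keys;//one private vector per thread
//... inside the thread's loops: local_keys.push_back(lmgp); ...
mtx.lock();//single locked merge when the thread is done
hg_keys_.insert(hg_keys_.end(), local_keys.begin(), local_keys.end());
mtx.unlock();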
The full code is below. Note that after launching the threads you must call .join() on each of them before using the results.
#include <opencv2/opencv.hpp>
#include <Eigen/Core>
#include <thread>
#include <mutex>
#include <iostream>
#include <vector>
int cell_ = 4;
int pattern_gradient_ = 3;
std::vector<cv::KeyPoint> hg_keys_;
static const int num_threads = 8;
int repeat_ = cell_ * cell_ / num_threads;
std::mutex mtx;
void ExtractHighGradient(int flag, int cell_part, const cv::Mat &im){
//cell row and cell columns
int rows = im.rows;
int cols = im.cols;
int c_row = im.rows/cell_;
int c_col = im.cols/cell_;
if(flag == 0){
//no consider monocular case for now
}
else{
int k = cell_part/cell_, l = cell_part%cell_;
//divide the image into cell_ by cell_ parts; in each part we extract high gradient points in a 7 by 7 pattern and calculate one NID
while(k<cell_){
for(int i = c_row*k; i < c_row*(k+1); i += (2*pattern_gradient_+1)){
if(i == 0 || i >= (rows - (2*pattern_gradient_ + 1)))
continue;
for(int j = c_col*l; j < c_col*(l+1); j += (2*pattern_gradient_+1)){
if(j == 0 || j>=(cols - (2*pattern_gradient_ + 1)) )
continue;
//find the pixel that has max gradient in the `2*pattern_gradient_+1` by `2*pattern_gradient_+1` pattern
float max_gradient_norm = 0.0;
float gradient_norm = 0.0;
cv::KeyPoint lmgp; //local_max_gradient_point
for(int y = i; y < i+2*pattern_gradient_+1; y++){
for(int x = j; x< j+2*pattern_gradient_+1; x++){
Eigen::Vector2d gradient(
im.ptr<uchar>(y+1)[x] - im.ptr<uchar>(y-1)[x],
im.ptr<uchar>(y)[x+1] - im.ptr<uchar>(y)[x-1]
);
gradient_norm = gradient.norm();
if(gradient_norm > max_gradient_norm){
lmgp.pt.x = x;
lmgp.pt.y = y;
max_gradient_norm = gradient_norm;
}
}
}
mtx.lock();//avoid push_back conflict problem
hg_keys_.push_back(lmgp);
mtx.unlock();
}
}
k += repeat_;
}
}
}
//I use 8 threads and the speed is only 2 or 3 times faster = = .....
int main(int argc, char* argv[]){
cv::Mat test;
std::string add("../demo.png");
if(argc != 2)
std::cout<<"use default address "<<add.c_str()<<std::endl;
else{
add.clear();
add.append(argv[1]);
}
test = cv::imread(add, 0);//0 for gray and 1 for rgb
cv::Mat im_with_keypoints;
std::thread t[num_threads];
double extract_time = (double)cv::getTickCount();
for (int i = 0; i < num_threads; ++i) {
t[i] = std::thread(ExtractHighGradient, 1, i, test);
}
for (int i = 0; i < num_threads; ++i) {
t[i].join();
}
//ExtractHighGradient(1, test);
extract_time = ((double)cv::getTickCount() - extract_time)/cv::getTickFrequency();
std::cout<<"extract time is "<<extract_time<<std::endl;
std::cout<<"point size "<<hg_keys_.size()<<std::endl;
cv::drawKeypoints(test, hg_keys_, im_with_keypoints);
cv::namedWindow("show key points", cv::WINDOW_AUTOSIZE);
cv::imshow("show key points", im_with_keypoints);
cv::waitKey(0);
}
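Building this version is the same as before except that the thread library is needed, e.g. adding -pthread to the g++ command sketched earlier. One small aside: std::thread copies its arguments, so test is copied into each thread; for a cv::Mat that only copies the header (the pixel data is reference-counted), but if you prefer to pass an explicit reference you could write t[i] = std::thread(ExtractHighGradient, 1, i, std::cref(test)); instead (std::cref comes from <functional>).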
The program output is as follows:
use default address ../demo.png
extract time is 0.0548764
point size 6996
With 8 threads the extraction is only about twice as fast as the single-threaded version (0.055 s versus 0.101 s). Part of the reason is that while one thread is pushing a point into hg_keys_, the other threads have to wait on the mutex, which costs some time (the per-thread buffering sketched earlier is one way to reduce this). The final image is, of course, essentially the same.
CUDA
The CUDA program looks like it comes from a different world than the single-threaded and multi-threaded versions above; the content is very different. We can picture the GPU as having thousands upon thousands of threads running at the same time, so we simply let each GPU thread handle one 7x7 pixel block on its own. Since the GPU and the CPU are two separate pieces of hardware, whenever we want to use the GPU we first have to copy the data from the CPU to the GPU, and once the GPU is done, copy the result back. After using the GPU we also have to free its memory manually. A few smaller details are written directly in the code; steps 1 to 8 in the code are fairly standard.
#include <opencv2/opencv.hpp>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
__global__ void ExtractHighGradient(uchar* im, cv::KeyPoint* kp,int rows, int cols, int pattern_row, int pattern_col){
//NOTE THIS: if the image contains more 7x7 blocks in x or y than the number of launched threads in that dimension, the extra blocks will never be processed
int thread_idx = threadIdx.x + blockIdx.x * blockDim.x;
int thread_idy = threadIdx.y + blockIdx.y * blockDim.y;
//int pattern = 3;
int p = 7;//2*3+1
int index;
float gradient_x, gradient_y, norm;
float max_norm = 0.0;
if(thread_idx>=pattern_col || thread_idy>=pattern_row)//>= also keeps the kp index below inside the allocated array
return;
// if(thread_idy == 0)
// printf("id x %d and pattern col is %d \n", thread_idx, pattern_col);
for(int i = 0; i < p; i++)
for(int j = 0; j< p; j++){
index = i + p * (thread_idx+1) + (p*(thread_idy+1) + j)*cols;
gradient_x = std::abs(im[index + 1] - im[index - 1]);//right and left
gradient_y = std::abs(im[index + cols] - im[index - cols]);//down and up
norm = sqrt(gradient_x*gradient_x + gradient_y*gradient_y);
//push_to_key points
if(norm>max_norm){
kp[thread_idx + thread_idy*pattern_col].pt.y = p*(thread_idy+1) + j;
kp[thread_idx + thread_idy*pattern_col].pt.x = p*(thread_idx+1) + i;
max_norm = norm;
}
}
};
int main(int argc, char* argv[]){
cv::Mat test;
std::string add("../demo.png");
if(argc != 2)
std::cout<<"use default address "<<add.c_str()<<std::endl;
else{
add.clear();
add.append(argv[1]);
}
//See https://stackoverflow.com/questions/25346395/why-is-the-first-cudamalloc-the-only-bottleneck
//The first CUDA-related call initializes the CUDA context and may cost about 100 ms. Once initialization is done, copying data and other operations are fast, so we simply call cudaFree(0) here to initialize the device.
cudaFree(0);
test = cv::imread(add, 0);//0 for gray and 1 for rgb
cv::Mat im_with_keypoints;
int rows = test.rows;
int cols = test.cols;
int p = 7;
int row_pattern_num = rows / p + 1;
int col_pattern_num = cols / p + 1;
int keypoint_num = row_pattern_num * col_pattern_num;
//std::cout<<"keypoint num is "<<keypoint_num<<std::endl;
int image_size = sizeof(uchar) * rows * cols;
int keypoint_size = sizeof(cv::KeyPoint) * keypoint_num;
//printf("size of keypoint %ld \n", sizeof(cv::KeyPoint));
double extract_time = (double)cv::getTickCount();
//1 allocate memory in the host (CPU) for the output keypoints
cv::KeyPoint* host_output = (cv::KeyPoint*)malloc(keypoint_size);
//2 point the input at the image buffer owned by the cv::Mat (in reality, you get your data in a thousand ways)
uchar* host_input = test.data;
//3 allocate memory in the device (GPU)
uchar* device_input;
cv::KeyPoint* device_output;
//double alloc_time = (double)cv::getTickCount();
cudaMalloc((void**)&device_output, keypoint_size);
cudaMalloc((void**)&device_input, image_size);
//4 copy the data from the host to the device
cudaMemcpy(device_input, host_input, image_size, cudaMemcpyHostToDevice);
cudaMemcpy(device_output, host_output, keypoint_size, cudaMemcpyHostToDevice);
//5 assign block number and thread number in the block
// On my lenovo computer
// Maximum number of threads per multiprocessor: 2048
// Maximum number of threads per block: 1024
// 6 multi processor
dim3 block_number = dim3(4, 4);
dim3 thread_number = dim3(32, 32);
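// note (added remark, not part of the measured run): 4x4 blocks of 32x32 threads give a 128x128 thread grid, which covers the roughly 106x66 pattern grid of this image
// a size-independent choice would be, for example:
// dim3 block_number((col_pattern_num + 31) / 32, (row_pattern_num + 31) / 32);
// dim3 thread_number(32, 32);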
//6 call function in the device. During this step, results should be ready.
ExtractHighGradient<<<block_number, thread_number>>>(device_input, device_output, rows, cols, row_pattern_num, col_pattern_num);//the <<< >>> launch syntax is only understood by nvcc, so this must live in a .cu file
cudaDeviceSynchronize();
//7 copy memory from device to host
cudaMemcpy(host_output, device_output, keypoint_size, cudaMemcpyDeviceToHost);
extract_time = ((double)cv::getTickCount() - extract_time)/cv::getTickFrequency();
std::cout<<"extract time is "<<extract_time<<std::endl;
std::vector<cv::KeyPoint> hg_points(keypoint_num);
for(int i = 0; i <keypoint_num; i++){
hg_points[i] = host_output[i];
}
std::cout<<"point size "<<hg_points.size()<<std::endl;
cv::drawKeypoints(test, hg_points, im_with_keypoints);
cv::namedWindow("show key points", cv::WINDOW_AUTOSIZE);
cv::imshow("show key points", im_with_keypoints);
cv::waitKey(0);
//8 free the memory on the device and the host
//host_input points at test.data, which is owned by the cv::Mat, so it must not be freed here
free(host_output);
cudaFree(device_input);
cudaFree(device_output);
return 0;
}
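For completeness, one possible way to compile the CUDA version (a sketch only, assuming the file is saved as cuda_extract.cu, nvcc from the CUDA toolkit is on the PATH, and OpenCV is again found through pkg-config as opencv4):
nvcc -O2 cuda_extract.cu `pkg-config --cflags --libs opencv4` -o cuda_extract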
Running the program gives the following result:
use default address ../demo.png
extract time is 0.000418617
point size 6996
The resulting image is essentially the same.
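To recap the measured extraction times on the same image: single-threaded 0.101 s, 8 threads 0.055 s, CUDA 0.00042 s. Note that the CUDA figure includes the allocations and both memory copies, but not the one-off context initialization that is hidden behind the early cudaFree(0) call.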
Some notes on processing images with CUDA
1: As mentioned earlier, one cumbersome part of CUDA programming is that you have to set up the resources yourself. You can see that my CUDA code roughly contains the following:
...
cv::KeyPoint* host_output = (cv::KeyPoint*)malloc(keypoint_size);
cudaFree(0);
2: To get the gray value of the pixel at coordinate (x, y), with OpenCV we only need to call
im.ptr<uchar>(y)[x]
to get the pixel value. If you write Python, fetching a pixel value with PIL or the like works much the same way: you just pass in the coordinates. Inside a GPU function, however, we can only work with a lower-level storage layout. An OpenCV matrix, for example, is ultimately backed by a one-dimensional array: cv::Mat has a member uchar* data, a pointer to the array in which the raw data is stored. For a grayscale image, data[0] returns the gray value of pixel (0, 0); to get the gray value of pixel (1, 0), the corresponding index is data[1 + img_width * 0]. In other words, the gray value of pixel (x, y),
intensity = im.ptr<uchar>(y)[x]
is equivalent to
uchar* pointer;
pointer = im.data;
intensity = pointer[x + y * img_width]
where img_width is the image width. If the image is not grayscale (for example a 16-bit depth image), each element of the Mat is a ushort, and we cannot use im.data to read a pixel, because data returns a uchar pointer; instead we need something like the following:
ushort* pointer;
pointer = im.ptr<ushort>(0);
depth = pointer[x + y * img_width]
where depth is the returned depth value. The GPU cannot work with a cv::Mat directly, so when we process an image on the GPU we pull the pointer to this one-dimensional array out of the Mat and pass it to the GPU. That corresponds to the following part of my CUDA code above:
//test is the image; test.data is a pointer to its first element; image_size = rows * cols elements, one uchar (1 byte) per pixel
uchar* host_input = test.data;
...
//allocate the same amount of space on the GPU
cudaMalloc((void**)&device_input, image_size);
...
//copy the data from the CPU to the GPU
cudaMemcpy(device_input, host_input, image_size, cudaMemcpyHostToDevice);
Finally, inside the __global__ function the way we read a pixel is:
__global__ void ExtractHighGradient(uchar* im, cv::KeyPoint* kp,int rows, int cols, int pattern_row, int pattern_col){
...
//here im is not a cv::Mat, just a uchar pointer
// index = x + y*cols;
//im[index] gives the gray value of the pixel at (x, y)
...
}
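One last practical remark that is not reflected in the code above: none of the CUDA calls there check their return values, which makes debugging painful when something goes wrong. A minimal error-checking sketch, reusing the same kernel and buffers as above, would look roughly like this:
cudaError_t err = cudaMalloc((void**)&device_input, image_size);
if(err != cudaSuccess)
std::cout<<"cudaMalloc failed: "<<cudaGetErrorString(err)<<std::endl;
ExtractHighGradient<<<block_number, thread_number>>>(device_input, device_output, rows, cols, row_pattern_num, col_pattern_num);
err = cudaGetLastError();//catches bad launch configurations
if(err != cudaSuccess)
std::cout<<"kernel launch failed: "<<cudaGetErrorString(err)<<std::endl;
err = cudaDeviceSynchronize();//catches errors that occur while the kernel runs
if(err != cudaSuccess)
std::cout<<"kernel execution failed: "<<cudaGetErrorString(err)<<std::endl;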
Summary
Those of you who play with deep learning may point out that in PyTorch a single line of code like
img = img.to(device='cuda', dtype=torch.float32)