Chapter43Fast Clustering of Radar Reflectivity Dataon GPUsWei Zhou,Hong An,Hongping Yang,Gu Liu,Mu Xuand Xiaoqiang LiAbstract In short-term weather analysis,we use clustering algorithm as a fun-damental tool to analyze and display the radar reflectivity data.Different from ordinary parallel k-means clustering algorithms using compute unified device architecture,in our clustering of radar reflectivity data,we face the dataset of large scale and the high dimension of texture feature vector we used as clustering space. Therefore,the memory access latency becomes a new bottleneck in our application of short-term weather analysis which requests real time.We propose a novel parallel k-means method on graphics processing units which utilizes on-chip registers and shared memory to cut the dependency of off-chip global memory. The experimental results show that our optimization approach can achieve409 performance improvement compared to the serial code.It sharply reduces the algorithm’s running time and makes it satisfy the request of real time in appli-cations of short-term weather analysis.Keywords Clustering algorithmÁReal timeÁShort-term weather forecastÁGPUÁCUDAW.Zhou(&)ÁH.AnÁG.LiuÁM.XuÁX.LiDepartment of Computer Science and Technology,University of Science and Technology of China,230027Hefei,Chinae-mail:greatzv@H.YangChina Meteorological Administration,Meteorological Observation Centre,100081Beijing,ChinaW.ZhouHefei New Star Research Institute of Applied Technology,230031Hefei,China487 J.J.Park et al.(eds.),Proceedings of the International Conference on Human-centric Computing2011and Embedded and Multimedia Computing2011,Lecture Notes in Electrical Engineering102,DOI:10.1007/978-94-007-2105-0_43,ÓSpringer Science+Business Media B.V.2011488W.Zhou et al.43.1IntroductionThere are many existing research efforts conducted in the radar data analysis for numerical weather prediction[1].Some of them take the clustering of radar reflectivity data as a fundamental tool to depict the storm structure.The large volume and high updating frequency(*every5min)of the radar reflectivity data poses a strict requirement of real timing on the data processing.Lakshmanan[2]conducted a comparing of several segmentation algorithms used to analyze weather images,and also proposed a nested segmentation method using k-means clustering[3].However,the execution time required by k-means algorithm is significantly long and scales up with the size and dimensionality of the dataset.Such feature hinders its application into the short-term weather analysis,which emphasizes the real-timing a lot.On the other hand,as a specialized single-chip massively parallel architecture, graphics processing units(GPUs)are becoming increasingly attractive for general purpose parallel computation beyond their traditional uses in graphics rendering. For the parallel codes that required costly computational clusters to achieve rea-sonable speedup,they could be ported to GPUs to gain the equivalent perfor-mance.The NIVDIA compute unified device architecture(CUDA)makes the programming easier.GPU also brings convenience and lower cost.It has shown throughput and performance per dollar that is higher than traditional CPU by orders of magnitude. Consequently,parallel clustering algorithms for the traditional computational clusters could be ported to desktop machines or laptops now.Thus,GPU are providing weather researchers great convenience to conduct fast weather analysis and forecast.The intrinsic feature of clustering algorithm makes it suitable to be imple-mented as multithreaded SIMD program in CUDA.Many researches aimed at porting the clustering algorithm to CUDA-enabled GPU[4–7];however,most of them only port part of clustering works to the GPU,not the whole process of clustering algorithm including k centroids re-calculation.Therefore,the rich memory hierarchy of GPU has not been fully utilized yet.Meanwhile,the bot-tleneck in our real time system actually derives from the long memory access latency.In this paper,we propose a novel method using on-chip registers and shared memory to reduce the global memory latency,and in turn fully utilize the feature of many-core architecture and the rich memory hierarchy to optimize the whole process of the clustering of radar reflectivity data.The result shows that our approach reduces the algorithm’s execution time within4s with1million points input,and thus enable its satisfaction of the real time requirement of short-term weather analysis.Moreover,our approach could be applied in the cases of mul-timedia computing in which clustering is adopted in the process of images.43Fast Clustering of Radar Reflectivity Data on GPUs489 The paper is organized as follows.Section43.2presents the related works of clustering algorithm using CUDA.Section43.3presents the clustering of radar reflectivity data based on CPU.Section43.4gives the details of our fast clustering of radar reflectivity data.It shows the strategy of parallelizing and the complexity of our algorithm.The experiments of our approach are reported in Sect.43.5. Finally,conclusions are drawn in Sect.43.6.43.2Related WorkAs the clustering algorithm is widely used in many areas,many researches have been made to port the clustering algorithm onto GPU[4–7].Farivar et al.[4]parallelized the k-means algorithm and achieved a satisfying speed up.But the algorithm only dealt with1D dataset and only parallelized the first phase of algorithm.Chen et al.[7]propose a mean shift clustering algorithm and accelerate the k means clustering using CUDA.But after each of iteration,they use a single thread to solve the means updating and threads synchronization which reduce the performance.The GPU’s rich memory hierarchy was not well utilized in this step.Walters et al.[6]used clustering in liver segmentation.They ignored the threads synchronization between block and made the difference ratio in control. Achieve great speed up at the cost of sacrificing the precision.Most of the work only parallelized the partial steps of clustering[4,5,7]. The rich memory hierarchy on GPU was seldom fully utilized to combine with the algorithm improvement.In order to meet the request of real time,our fast clus-tering of radar reflectivity data adopts a new method and achieves higher speed up.43.3Clustering of Radar Reflectivity DataClustering algorithm is widely used in automated weather analysis area. Lakshmanan[3]proposed a method to segment radar reflectivity data using tex-ture.In the implementation of this method,k-means clustering algorithm was adopted to cluster the radar reflectivity data.The k-means algorithm is one of the most popular clustering methods.In our implementation,a set of n data points R={r1,r2,…,r n}in a d-dimensional space is given,where each data point represents1km*1km and the d-dimensional space is composed by the texture feature vectors extracted from radar composition reflectivity.The task of k-means is to partition R into k clusters S={S1,S2,…, S k}such that each cluster has maximal similarity as defined by a cost function related to weather analysis[8].In our clustering of radar data,we preprocess the input data by labeling the areas requiring computation.Then,we adopt k-means algorithm iteratively to partition the dataset into k clusters.It first divides up the measurement space into k equal intervals and each data point was initially assigned to the interval in which its reflectivity data value lies,and so the initial k centroids are calculated.Then,the algorithm iterates as follows:(1)Compute the cost function between each pair of data point and the centroid,and then assign each data point to the centroid with the minimum cost function.The cost function incorporates two measures,the euclidean distance and contiguity measure [8];(2)Recalculate new centroids by taking the means of all the data points in each cluster.The iteration terminates when the changes in the centroids are less than some threshold or the number of iterations reaches a predefined threshold.The whole process is shown in Algorithm 1.We point out that there’s an interesting distribution of radar reflectivity.The points with the same value are always in the same area.We only need to compute the area we care about.The areas where reflectivity data value below a threshold can be ignored in our meteorology application.So we preprocess the input data and label the areas needing computation before performing k-means.The areas needing computation are only the rectangle areas in Fig 43.1b.the feature of the reflectivity distribution sharply reduces the computational requirement.Our clustering of radar reflectivity data includes the following steps1.Preprocess the input data.2.Partition the input data using k-means with the costfunction.Fig.43.1a The distribution of radar reflectivity data.b The value of the area outside the rectangle is bellow 5dbz,thus,we can ignore it in our weather analysis,and sharply reduce the computation.c The result after clustering490W.Zhou et al.43Fast Clustering of Radar Reflectivity Data on GPUs491 Algorithm1Clustering of radar data based on CPU43.4Design of Fast Clustering of Radar Reflectivity Dataon GPUsThe computational complexity of algorithm1is O[n?m(nk?n?k)].Lines 1–2preprocess the data once and get complexity O(n);Lines3–28contain a series of2-phase steps.Lines3–12compute the cost function and assign each point to the cluster whose centroid has minimum cost function values with it.This phase has a complexity of O(nk);Lines13–23recalculate the centroids and has a complexity O[(n?k)d],where d is dimension of the data point.492W.Zhou et al.Many works have been done to parallelize thefirst phase:data points assign-ment.But after parallelizing thefirst phase,the problem we faced is that the second phase recalculating the centroids becomes the new bottleneck for our short-term weather forecast application which has strict real time requirements.To see this,observe that with p cores we can accelerate thefirst phase to O(nk/p),so the ratio between the two is O[nk/p(n?k)]&O(k/p)(for n)k)when d=1.That means when k is a few orders of magnitude larger than the amount of cores,the algorithm is bound by thefirst phase and not by the second.For example,in the Farivar’s[4]implementation,there were about k=4,000clusters and32 streaming multiprocessors and the performance is bound by thefirst phase only. But in our clustering of radar reflectivity data,k is3–16[1],and there are16or30 multiprocessors(p=16or30).So the second phase becomes the new bottleneck for us.We show it with the experimental results in Sect.43.5.part A.The high-dimensional of dataset causes larger memory access and makes things worse. Therefore,we adopt a new method including parallelizing the second phase on GPUs utilizing the shared memory and registers to reduce the memory access latency.The problems in thefirst phase are the large scale dataset and the high dimension of the data point which causes long memory latency.The on-chip register resource must be utilized skillfully to decrease the reading latency.The strategy is simply as follow:(1)keep the multiprocessors busy;(2)keep register usage high and reduce the usage of local memory;besides,coalesced access to global memory also decreases the reading latency.We discuss specific design decisions to accelerate k-means for the CUDA architecture in the following two subsections.Section43.4.1introduces parallel algorithm for assignment of data points.Section43.4.2illustrates our novel par-allel algorithm for k centroids recalculation.43.4.1Data Points AssignmentThe CPU-based algorithm of assignment of data points is shown in algorithm1 lines3–12.There are two strategies to parallel the algorithm.Thefirst is the centroid-oriented,in which the cost function value from each centroid to all data points are calculated and then,each point get its own centroid with minimum cost function value.Another is the data points-oriented.It dispatching one data point to one thread and then each thread calculates the cost function from one data point to all the centroids,and maintains the minimum cost function value and the corre-sponding centroid.The former strategy has disadvantage that each point which is stored in off-chip global memory has to be read several times and causes long latency.Another disadvantage in our clustering of radar data is that our k is small (k=3–16),resulting in making the number of threads too small for GPU sche-duler to hide the latency well in this strategy.Therefore,the latter strategy is adopted in our application.The parallel algorithm of data point assignment is43Fast Clustering of Radar Reflectivity Data on GPUs493 shown in Algorithm2.Lines1–2show the design of block size and grid;line3–5 calculate the position of the corresponding data points for each thread in global memory;line6loads the data point into the register;line7–13compute the cost function and maintain the minimum one.Algorithm2Data points assignment based on GPUAlgorithm2only has one loop instead of two loops in Algorithm1.The loop for n data point has been dispatched to n threads.If the number of processing elements were equal to the number of data points,this pass could befinish in one step.However,the number of processing elements is limited in our GPU and with p cores we can accelerate thefirst phase to O(nk/p).The key point of achieving high efficiency is reducing the global memory access latency utilizing the on-chip register and hiding the latency with GPU scheduler.We accomplish as follows.To fully utilize the on-chip register,firstly,we load the data points into the on-chip registers and ensure that reading data points from global memory happens only once when calculating the cost function in the thread.Reading from the register is much faster than reading from global memory which largely reduces the latency.Secondly,we adjust the block size and optimize the kernel program to fully use the register and reduce the usage of local memory which stored in global memory.Because of the total number of registers in stream multiprocessor is limited,the kernel program have to be adjusted properly and the block size have to be adjusted to utilize the SM’s limited registers resources.Our experiments in Sect.43.5show that a block size of128results better performance than block size of32,64and256.Besides,coalesced access to the global memory also decreases494W.Zhou et al. the reading latency.In our design of the thread block and grid,the access of global memory can be coalesced well.Hiding the latency is mainly done by GPU scheduler automatically.The mul-tiprocessor SIMT unit creates,manages,schedules,and executes threads in groups of32parallel threads called warps.So the block size should be a multiple of32. The number of blocks in our application is large enough to be scheduled. 43.4.2K Centroids RecalculationIn order to achieve the new centroids,the points have to be added in variable centros_temp in Algorithm1,line16.The second phase of k centroids recalcu-lation has a relatively low computational complexity of O(nd)and is difficult to be fully parallelized due to the write conflict.Though it has a low computational complexity,it is a time consuming pro-cessing in our clustering of radar reflectivity data due to the long memory access latency.In order to compute the new centroids on GPU,we have to accumulate all data points in k variables in centros_tmp array.It should be done in atomic operations.Thus,the process was turned into a serial process in fact.What makes things worse,because of the variables of centros_tmp should be shared by all threads in all the blocks,they have to be stored in global memory suffering long global memory access latency.We give a new method to solve the problem of the serial accumulation process and the long latency of global memory.Our method includes two steps as follows. Figure43.2shows the two steps.Firstly,we use‘‘divide and conquer’’strategy to turn the serial process into a parallel one.We divide the dataset and accumulate the partial sum of each sub dataset in different multiprocessor simultaneously.Each part of sum would be dispatched to one block.The algorithm is shown in algorithm3.In line1,we use shared memory instead of global memory to store the variables of centroid_temp because of the shared memory can be shared in one block.This reduces the latency caused by atomic operations on global memory.With p cores we can get the complexity of this step to O(n/p)when d=1.Algorithm3Data points assignment based on GPUSecondly,we accumulate all the partial sums using parallel reduction algorithm which has a complexity of O(log n).We make the partial sums be accessed from global memory only once and accessed coalesced(algorithm4,lines4).The496W.Zhou et al. computation process is accomplished in shared memory(algorithm4,lines6–13). The algorithm is shown in algorithm4.The variable of count can also be calcu-lated in this way.After that,we get the centroids by dividing the total sum variables by count variables.Thus,we can accelerate the whole process of the second phase to O(n/p?k log n).However the number of process elements is limited.In fact,because the shared memory is used in the two step of centroids calculation,the latency has been sharply reduced.Meanwhile,the serial process of accumulation is parallel-ized to be done in several multiprocessors and the accumulation of partial sums is calculated in parallel reduction.Therefore,the whole process of centroids was largely parallelized.The performance is shown in our experiments in Sect.43.5.Algorithm4Parallel reduction on GPU43.5ExperimentsWe have implemented our fast clustering of radar data using CUDA version2.3. Our experiments were conducted on a PC with an NVIDIA Geforce GTX275and an Intel(R)Core(TM)Q8200CPU.GTX275has30multiprocessors,and per-forms at1.40GHz.The memory of GPU is1GB with the peak bandwidth of 127GB/s.The CPU has four cores running at2.33GHz.The main memory is 4GB with the peak bandwidth of5.6GB/s.Our data set consist of a d-dimen-sional texture feature vector extract from radar reflectivity data.There are several million data points to be clustered,and the number of clusters k is smaller than32 according to our application demands.43.5.1Time Consuming AnalysisThe radar reflectivity data generated by multiple radars at different times in one day was used as our input data:CREF_20100717_010,CREF_20100717_020,and CREF_20100717_030.We show the time consuming proportion of the second phase in the algorithm based on CPU in Table 43.1.We show the proportion of the second phase after only parallelizing the first phase in Table 43.2.It shows that in the CPU based algorithm,the first phase is the bottleneck to accelerate.But after parallelizing the first phase only,the second phase of k centroids recalculation becomes the new bottleneck in our real time system which takes more than 57%of the total con-suming time.And the proportion doesn’t change with the scale of input data.43.5.2Speed up of Fast Clustering of Radar Reflectivity DataFigure 43.3present the speedups gained by using our method on GPU for the second phase.It has a 2X speed improvement over the serial one.The data sets with 1,2,3million points were created.Figure 43.4shows the speed up of our fast clustering method for the whole process.We experienced a 36*40X speed improvement over a baseline application running on host machine.The speed up almost doesn’t change with the input data scale.Table 43.1The proportion in CPU based algorithm Input radar data The first phase (s)The secondphase (s)Proportion of the second phase (%)CREF_20100717_010152.203 3.201 2.06CREF_20100717_020152.882 3.093 1.98CREF_20100717_030153.2933.1622.02Table 43.2The proportion after parallelizing the first phase only Input radar data The first phase (s)The secondphase (s)Proportion of the second phase CREF_20100717_0102.2043.20159.2%CREF_20100717_020 2.187 3.09358.5%CREF_20100717_0302.3773.16257.1%24123T i m e (s )cpubased gpubasedFig.43.3The speed up of the second phase using our method43Fast Clustering of Radar Reflectivity Data on GPUs 49743.6Conclusion and Future WorkIn this paper,we proposed the fast clustering of radar reflectivity data algorithm.It accelerates two phase of k-means clustering algorithm.The first phase mainly utilizes the on-chip register and adjusts the execution configuration to reduce the global memory latency and hide the latency with scheduler.The second phase adopts a novel algorithm.It firstly accumulates the partial sums of centroids in shared memory of different blocks in parallel.And then,it uses parallel reduction to get the total sum of partial sums and get the centroids eventually.In this way,our clustering algorithm show over a 409performance improvement.It meets the request of real time in application of short-time weather analysis and forecast.Acknowledgment This work is supported financially by the National Basic Research Program of China under contract 2011CB302501,the National Natural Science Foundation of China grants 60633040and 60970023,the National Hi-tech Research and Development Program of China under contracts 2009AA01Z106,the National Science &Technology Major Projects 2009ZX01036-001-002and 2011ZX01028-001-002-3.References1.Yang HP,Zhang J,Langston C (2009)Synchronization of radar observations with multi-scale storm tracking.J Adv atmos sci 26kshmanan V,Rabin R,DeBrunner V (2003)Multiscale storm identification and forecast.J Atmos Res 67–68:367–380kshmanan V,Rabin R,DeBrunner V (2001)Segmenting radar reflectivity data using texture.30th international conference on radar meteorology,Zurich,Switzerland4.Farivar R,Rebolledo D,Chan E,Campbell R (2008)A parallel implementation of K-means clustering on GPUs.International conference on parallel and distributed processing techniques and applications (PDPTA 2008),pp.340–3455.Miao Q,Fu ZL,Zhao XH (2009)A new approach for color character extraction based on parallel clustering.International conference on computational intelligence and software engineering,IEEE Computer Society16111621263136410501001502002503003504004505001million 2million 3million s p e e d u pT i m e (s )cpu basedour GPU method speedupFig.43.4The speed up of our parallel clustering method of whole process498W.Zhou et al.43Fast Clustering of Radar Reflectivity Data on GPUs499 6.Walters JP,Balu V,Kompalli S(2009)Evaluating the use of GPUs in liver imagesegmentation and HMMER database searches.IEEE International Parallel And Distributed Processing Symposium,IEEE Computer Society7.Chen J.,Wu XJ,Cai R(2010)Parallel processing for accelerated mean shift algorithm withGPU.J Comput-Aided Des Comput Gr38.Wang GL(2007)The Development on a multiscale identifying algorithm for heavy rainfalland methods of extracting evolvement information.Ph.D Thesis,Nanjing University of Information Science and Technology。