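Efficient Rendering of Multi-Fragment Effects on the GPU
Document type: Dissertation
Author | Liu Fang (刘芳) |
Degree | Doctoral |
Defense date | 2010-05-28 |
Degree-granting institution | Graduate University of Chinese Academy of Sciences |
Place of conferral | Beijing |
Advisor | Wu Enhua (吴恩华) |
Keywords | graphics processing unit (GPU); multi-fragment effects; order-independent transparency; depth peeling; max/min blending; bucket sort; multiple render targets; Compute Unified Device Architecture (CUDA); atomic operations |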
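Chinese abstract | The graphics pipeline of modern GPUs is generally based on scan-line conversion. After rasterization, the triangles of a model are projected onto the screen, and at each pixel in the projected region a basic processing unit called a fragment is generated. During rasterization, when objects in the scene occlude one another, several fragments are generated for the same pixel. Most graphics applications only need to process the surfaces directly visible from the viewpoint, that is, only the fragment nearest to or farthest from the viewpoint has to be kept for each pixel. Some rendering effects, however, must process several fragments at the same pixel location at once. These are usually called multi-fragment effects and include order-independent transparency, translucency, volume rendering, and refraction. Current GPUs are optimized in hardware only for rendering opaque objects, so after rasterization each pixel keeps only the nearest or farthest fragment and the rest are discarded when the final image is generated. Rendering multi-fragment effects therefore currently requires rasterizing the whole scene repeatedly, in multiple passes, to collect all fragments of every pixel. Performance is good for small scenes, but for large, complex scenes the repeated vertex transformation of the model becomes the rendering bottleneck and efficiency drops, which makes such algorithms hard to use widely in interactive applications. To address this problem, this thesis proposes an efficient depth peeling algorithm based on bucket sort. It uses the multiple render targets (MRT) on the GPU as a bucket array and applies the bucket-sort principle together with max/min blending to collect the fragments projected onto the same pixel and sort them by depth; deferred shading is then applied to the scene in post-processing to render the multi-fragment effect. When fragments collide within a bucket, multi-pass rendering or adaptive subdivision can be used to reduce the collision probability and so improve rendering accuracy. For rendering transparency, an improved algorithm based on dynamic blending within buckets is proposed: it uses concurrent reads and writes to blend, one by one, all fragments falling into the same bucket, and in post-processing blends the color values of the buckets in front-to-back order. Because the probability that a bucket collision and a read/write conflict occur at the same time is very small, rendering accuracy improves further. Experimental results show that the bucket-sort-based depth peeling algorithm renders multi-fragment effects for large scenes efficiently while producing results very close to the ground truth. To overcome the shortcomings of the traditional graphics pipeline, this thesis further designs and implements the CUDA Renderer, the first fully programmable graphics pipeline system that runs on current graphics hardware, and on top of this framework designs two new efficient single-pass strategies for rendering transparency. The first, the multi-level depth test strategy, uses the CUDA atomic operation atomicMin to collect and sort all fragments dynamically in a single pass. The second, the fixed-size array buffer strategy, uses the CUDA atomic operation atomicInc to collect all fragments in rasterization order in a single pass and sort them in post-processing. Experimental results show that both fragment collection strategies based on the CUDA Renderer render multi-fragment effects efficiently in a single scene traversal while producing results very close to the ground truth. Future work lies in further improving the bucket-sort-based depth peeling algorithm by designing better depth-interval subdivision schemes, so that buckets correspond one-to-one with fragments and bucket collisions are eliminated entirely. In addition, the CUDA Renderer can be further improved to handle other classic graphics problems, such as occlusion culling and anti-aliasing, more efficiently. |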
English abstract | In the modern rasterization-based graphics pipeline, scenes are projected onto the screen and rasterized by a scan-line algorithm, which generates multiple fragments for a single pixel when geometry overlaps. Most applications only need to render the surface nearest to the viewer. In other words, only the nearest or farthest fragment is needed per pixel, and the rest are discarded before the final result is generated. In contrast, multi-fragment effects require operations on more than one fragment per pixel location. They play an important role in many graphics applications, such as order-independent transparency, translucency, volume rendering, and refraction. However, modern GPUs are only optimized to capture the nearest or farthest layer in each pass, so multiple passes are required for multi-fragment effects. In that case performance is acceptable for small scenes, but for large, complex scenes vertex transformation becomes the bottleneck, making the approach unsuitable for real-time applications. This thesis presents a new algorithm for efficient multi-fragment effects via bucket sort on the GPU. The algorithm uses multiple render targets as a per-pixel bucket array. In the fragment shader, the fragments of each pixel are captured and sorted via bucket sort using max/min blending, for deferred shading in post-processing. When a bucket collision happens, i.e., more than one fragment is routed to the same bucket, we can resort to a multi-pass approach or an adaptive scheme for better results. We can also use dynamic blending within each bucket for better results when rendering order-independent transparency. Experimental results show that our algorithm achieves a large speedup over classical depth peeling, especially for large-scale scenes. To overcome the limitations of the traditional graphics pipeline, we propose a CUDA Renderer system for full programmability. Within this framework, we present two highly efficient schemes for multi-fragment effects in a single geometry pass via the atomic operations in CUDA. The first, the multi-depth test scheme, stores the depth values of the fragments in the array of the corresponding pixel and sorts them on the fly using the 32-bit atomicMin operation in CUDA. A subsequent CUDA kernel blends the fragments of each pixel in depth order. The second, the fixed-size A-buffer scheme, captures the fragments in rasterization order using the atomicInc operation in CUDA. In post-processing, the fragments in each per-pixel array are sorted by depth before blending. Experimental results show that both schemes achieve a significant speedup over classical depth peeling while producing faithful results. In the future, we are interested in finding better depth division schemes that ensure a one-to-one correspondence between fragments and buckets, to further reduce collisions. In addition, the CUDA Renderer system can be improved to better solve other classical problems in computer graphics, such as occlusion culling and anti-aliasing. |
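Subject | Computer graphics |
Language | Chinese |
Date available | 2010-06-04 |
Source URL | [http://124.16.136.157/handle/311060/2301] |
Collection | Institute of Software_State Key Laboratory of Computer Science_Dissertations |
Recommended citation (GB/T 7714) | Liu Fang (刘芳). Efficient Rendering of Multi-Fragment Effects on the GPU [D]. Beijing. Graduate University of Chinese Academy of Sciences. 2010. |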
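The two fragment-collection ideas named in the abstract can be illustrated with a small CPU-side sketch for a single pixel. This is not the thesis code: the function names (`bucket_index`, `collect_fragments`, `sorted_layers`, `atomic_min`, `multi_depth_insert`), the uniform depth division, and the one-fragment-per-bucket layout are assumptions made for illustration. Sequential Python loops stand in for max blending into render targets and for the CUDA `atomicMin` operation, and depths are kept as floats rather than 32-bit integers.

```python
from typing import List, Optional, Tuple

EMPTY = float("inf")  # assumed marker for an unused slot in the per-pixel depth array


def bucket_index(depth: float, z_near: float, z_far: float, num_buckets: int) -> int:
    """Map a depth in [z_near, z_far] to one of num_buckets uniform intervals."""
    t = (depth - z_near) / (z_far - z_near)
    return min(int(t * num_buckets), num_buckets - 1)  # clamp depth == z_far


def collect_fragments(
    depths: List[float], z_near: float, z_far: float, num_buckets: int = 32
) -> Tuple[List[Optional[float]], int]:
    """Bucket-sort collection: each bucket keeps one depth, merged by max."""
    buckets: List[Optional[float]] = [None] * num_buckets
    collisions = 0
    for d in depths:  # fragments arrive in arbitrary rasterization order
        k = bucket_index(d, z_near, z_far, num_buckets)
        if buckets[k] is None:
            buckets[k] = d
        else:
            collisions += 1  # bucket collision: one fragment is lost
            buckets[k] = max(buckets[k], d)  # stand-in for max blending
    return buckets, collisions


def sorted_layers(buckets: List[Optional[float]]) -> List[float]:
    """Reading non-empty buckets in index order yields front-to-back depths."""
    return [d for d in buckets if d is not None]


def atomic_min(array: List[float], i: int, value: float) -> float:
    """Sequential stand-in for an atomic min: store the minimum, return the old value."""
    old = array[i]
    array[i] = min(old, value)
    return old


def multi_depth_insert(pixel_depths: List[float], depth: float) -> bool:
    """Insert a depth into a sorted fixed-size array, one min-exchange per slot."""
    for i in range(len(pixel_depths)):
        old = atomic_min(pixel_depths, i, depth)
        if old == EMPTY:
            return True  # the carried value landed in an empty slot
        depth = max(old, depth)  # carry the larger value on to the next slot
    return False  # array full: the farthest value was dropped


if __name__ == "__main__":
    frags = [0.72, 0.10, 0.45, 0.47]
    buckets, collisions = collect_fragments(frags, 0.0, 1.0, num_buckets=8)
    print(sorted_layers(buckets), collisions)
    pixel = [EMPTY] * 4
    for d in frags:
        multi_depth_insert(pixel, d)
    print(pixel)
```

With 8 buckets over the depth range [0, 1], the fragments at 0.45 and 0.47 fall into the same bucket and the nearer one is lost, which is the collision case that the multi-pass and adaptive schemes are meant to reduce. With 32 buckets all four fragments land in separate buckets. The min-exchange insertion keeps the per-pixel array sorted after every fragment, so no separate sorting step is needed before blending.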
Ingest method: OAI harvesting
Source: Institute of Software