From Wikipedia, the free encyclopedia

Explicit data graph execution, or EDGE, is a type of instruction set architecture (ISA) which intends to improve computing performance compared to common processors like the Intel x86 line. EDGE combines many individual instructions into a larger group known as a "hyperblock". Hyperblocks are designed so that they can easily be run in parallel.

While the parallelism of modern CPU designs generally plateaus at about eight internal units and from one to four "cores", EDGE designs intend to support hundreds of internal units and offer processing speeds hundreds of times greater than existing designs. Major development of the EDGE concept had been led by the University of Texas at Austin under DARPA's Polymorphous Computing Architectures program, with the stated goal of producing a single-chip CPU design with 1 TFLOPS performance by 2012, a goal which had yet to be realized as of 2018.[1]

Traditional designs

Almost all computer programs consist of a series of instructions that convert data from one form to another. Most instructions require several internal steps to complete an operation. Over time, the relative performance and cost of the different steps have changed dramatically, resulting in several major shifts in ISA design.

CISC to RISC

In the 1960s memory was relatively expensive, and CPU designers produced instruction sets that densely encoded instructions and data in order to better utilize this resource. For instance, the add A to B to produce C instruction would be provided in many different forms that would gather A and B from different places: main memory, indexes, or registers. Providing these different instructions allowed the programmer to select the instruction that took up the least possible room in memory, reducing the program's memory needs and leaving more room for data. For instance, the MOS 6502 has eight instructions (opcodes) for performing addition, differing only in where they collect their operands.[2]

Actually making these instructions work required circuitry in the CPU, which was a significant limitation in early designs and required designers to select just those instructions that were really needed. In 1964, IBM introduced its System/360 series which used microcode to allow a single expansive instruction set architecture (ISA) to run across a wide variety of machines by implementing more or less instructions in hardware depending on the need.[3] This allowed the 360's ISA to be expansive, and this became the paragon of computer design in the 1960s and 70s, the so-called orthogonal design. This style of memory access with a wide variety of modes led to instruction sets with hundreds of different instructions, a style known today as CISC (Complex Instruction Set Computing).

In 1975 IBM started a project to develop a telephone switch that required performance about three times that of their fastest contemporary computers. To reach this goal, the development team began to study the massive amount of performance data IBM had collected over the last decade. This study demonstrated that the complex ISA was in fact a significant problem; because only the most basic instructions were guaranteed to be implemented in hardware, compilers ignored the more complex ones that only ran in hardware on certain machines. As a result, the vast majority of a program's time was being spent in only five instructions. Further, even when the program called one of those five instructions, the microcode required a finite time to decode it, even if it was just to call the internal hardware. On faster machines, this overhead was considerable.[4]

Their work, known at the time as the IBM 801, eventually led to the RISC (Reduced Instruction Set Computing) concept. Microcode was removed, and only the most basic versions of any given instruction were put into the CPU. Any more complex code was left to the compiler. The removal of so much circuitry, about one-third of the transistors in the Motorola 68000 for instance, allowed the CPU to include more registers, which had a direct impact on performance. By the mid-1980s, further developed versions of these basic concepts were delivering performance as much as 10 times that of the fastest CISC designs, in spite of using less-developed fabrication.[4]

Internal parallelism

In the 1990s the chip design and fabrication process grew to the point where it was possible to build a commodity processor with every potential feature built into it. Units that were previously on separate chips, like floating point units and memory management units, were now able to be combined onto the same die, producing all-in-one designs. This allows different types of instructions to be executed at the same time, improving overall system performance. In the late 1990s, single instruction, multiple data (SIMD) units were also added, and more recently, AI accelerators.

While these additions improve overall system performance, they do not improve the performance of programs that primarily operate on basic logic and integer math, which make up the majority of programs (one of the outcomes of Amdahl's law). To improve performance on these tasks, CPU designs started adding internal parallelism, becoming "superscalar". In any program there are instructions that work on unrelated data, so by adding more functional units these instructions can be run at the same time. A new portion of the CPU, the scheduler, looks for these independent instructions and feeds them into the units, taking their outputs and re-ordering them so that externally they appear to have run in succession.
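
To illustrate the idea, the following Python sketch mimics what a superscalar scheduler does in hardware: it inspects register reads and writes to find instructions that do not depend on one another and groups them so they could issue in the same cycle. The instruction format, register names, and single-cycle latency are all assumptions invented for this example; real schedulers perform the same test in hardware with scoreboards or reservation stations.

```python
# Toy dependency analysis in the spirit of a superscalar scheduler.
from dataclasses import dataclass

@dataclass
class Instr:
    op: str
    dst: str          # register written
    srcs: tuple       # registers read

def independent(a: Instr, b: Instr) -> bool:
    """True if the two instructions share no register dependency."""
    no_raw = a.dst not in b.srcs      # read-after-write
    no_waw = a.dst != b.dst           # write-after-write
    no_war = b.dst not in a.srcs      # write-after-read
    return no_raw and no_waw and no_war

program = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("mul", "r4", ("r5", "r6")),   # independent of the add
    Instr("sub", "r7", ("r1", "r4")),   # needs both results above
]

# Assign each instruction the earliest cycle in which its inputs are ready,
# assuming every instruction takes one cycle.
cycle_of = {}
for i, ins in enumerate(program):
    deps = [j for j in range(i) if not independent(program[j], ins)]
    cycle_of[i] = 1 + max((cycle_of[j] for j in deps), default=-1)

for c in range(max(cycle_of.values()) + 1):
    ready = [program[i].op for i in cycle_of if cycle_of[i] == c]
    print(f"cycle {c}: {', '.join(ready)}")
```

Here the add and mul can issue together while the sub must wait a cycle; a hardware scheduler performs essentially this test over a sliding window of the instruction stream.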

The amount of parallelism that can be extracted in superscalar designs is limited by the number of instructions that the scheduler can examine for interdependencies. Examining a greater number of instructions can improve the chance of finding an instruction that can be run in parallel, but only at the cost of increasing the complexity of the scheduler itself. Despite massive efforts, CPU designs using classic RISC or CISC ISAs plateaued by the late 2000s. Intel's Haswell designs of 2013 have a total of eight dispatch units,[5] and adding more significantly complicates the design and increases power demands.[6]

Additional performance can be wrung from systems by examining the instructions to find ones that operate on different types of data and adding units dedicated to that sort of data; this led to the introduction of on-board floating point units in the 1980s and 90s and, more recently, single instruction, multiple data (SIMD) units. The drawback to this approach is that it makes the CPU less generic; feeding the CPU with a program that uses almost all floating point instructions, for instance, will bog down the FPUs while the other units sit idle.

A more recent problem in modern CPU designs is the delay in talking to the registers. In general terms the size of the CPU die has remained largely the same over time, while the size of the units within the CPU has grown much smaller as more and more units were added. That means that the relative distance between any one functional unit and the global register file has grown over time. Once introduced in order to avoid delays in talking to main memory, the global register file has itself become a delay that is worth avoiding.

A new ISA?

Just as growing delays in talking to memory, even as its price fell, suggested a radical change in ISA (Instruction Set Architecture) from CISC to RISC, designers are considering whether the problems in scaling parallelism and the increasing delays in talking to registers demand another switch in basic ISA.

Among the ways to introduce a new ISA are the very long instruction word (VLIW) architectures, typified by the Itanium. VLIW moves the scheduler logic out of the CPU and into the compiler, where it has much more memory and longer timelines to examine the instruction stream. This static placement, static issue execution model works well when all delays are known, but in the presence of cache latencies, filling instruction words has proven to be a difficult challenge for the compiler.[7] An instruction that might take five cycles if the data is in the cache could take hundreds if it is not, but the compiler has no way to know whether that data will be in the cache at runtime – that's determined by overall system load and other factors that have nothing to do with the program being compiled.
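
The following sketch, with an invented three-slot instruction word and made-up operations, shows the kind of static packing a VLIW compiler performs and why an unexpectedly slow load undermines it; it is an illustration of the general idea rather than of any particular VLIW compiler.

```python
# Toy VLIW bundling: pack independent operations into fixed-width
# instruction words at compile time, assuming every operation (including
# the loads) completes in a single cycle.
ISSUE_WIDTH = 3

# (text, destination register, source registers)
ops = [
    ("load r1, [a]",    "r1", set()),
    ("load r2, [b]",    "r2", set()),
    ("add  r3, r1, r2", "r3", {"r1", "r2"}),
    ("mul  r4, r3, r3", "r4", {"r3"}),
]

words, current, written = [], [], set()
for text, dst, srcs in ops:
    # start a new word if this op consumes a value produced in the current
    # word, or if the current word is already full
    if srcs & written or len(current) == ISSUE_WIDTH:
        words.append(current)
        current, written = [], set()
    current.append(text)
    written.add(dst)
words.append(current)

for n, word in enumerate(words):
    padding = " | nop" * (ISSUE_WIDTH - len(word))
    print(f"word {n}: " + " | ".join(word) + padding)

# If either load misses in the cache at runtime, the whole machine stalls:
# the compiler baked a one-cycle latency into the schedule and cannot react.
```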

The key performance bottleneck in traditional designs is that the data and the instructions that operate on them are theoretically scattered about memory. Memory performance dominates overall performance, and classic dynamic placement, dynamic issue designs seem to have reached the limit of their performance capabilities. VLIW uses a static placement, static issue model, but has proven difficult to master because the runtime behavior of programs is difficult to predict and properly schedule in advance.

EDGE

Theory

EDGE architectures are a new class of ISAs based on a static placement, dynamic issue design. EDGE systems compile source code into a form consisting of statically allocated hyperblocks containing many individual instructions, often hundreds or thousands. These hyperblocks are then scheduled dynamically by the CPU. EDGE thus combines the advantages of the VLIW concept of looking for independent data at compile time with the superscalar RISC concept of executing the instructions when the data for them becomes available.

In the vast majority of real-world programs, the linkage of data and instructions is both obvious and explicit. Programs are divided into small blocks referred to as subroutines, procedures or methods (depending on the era and the programming language being used) which generally have well-defined entrance and exit points where data is passed in or out. This information is lost as the high-level language is converted into the processor's much simpler ISA. But this information is so useful that modern compilers have generalized the concept as the "basic block", attempting to identify them within programs while they optimize memory access through the registers. A block of instructions does not have control statements but can have predicated instructions. The dataflow graph is encoded using these blocks, by specifying the flow of data from one block of instructions to another, or to some storage area.
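
As a rough illustration of these ideas, the sketch below folds a small if/else into a single block of predicated instructions and records the dataflow edge that carries the result to a consuming block. The block encoding, instruction tuples, and names are invented for this example and do not correspond to any particular EDGE ISA.

```python
# Source fragment being encoded:
#     if a > b:  c = a
#     else:      c = b
#     d = c * 2

block_max = {
    "name": "max_block",
    "instructions": [
        # (predicate, operation, destination, sources)
        (None,  "tgt", "p0", ("a", "b")),   # p0 = (a > b)
        ("p0",  "mov", "c",  ("a",)),       # runs only if p0 is true
        ("!p0", "mov", "c",  ("b",)),       # runs only if p0 is false
    ],
    # dataflow edge: c is forwarded straight to its consumer block rather
    # than being written to a global register file
    "outputs": {"c": "double_block"},
}

block_double = {
    "name": "double_block",
    "instructions": [
        (None, "mul", "d", ("c", "2")),
    ],
    "outputs": {"d": "memory"},
}

# print the inter-block dataflow graph
for blk in (block_max, block_double):
    for value, consumer in blk["outputs"].items():
        print(f"{blk['name']} --{value}--> {consumer}")
```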

The basic idea of EDGE is to directly support and operate on these blocks at the ISA level. Since basic blocks access memory in well-defined ways, the processor can load up related blocks and schedule them so that the output of one block feeds directly into the one that will consume its data. This eliminates the need for a global register file, and simplifies the compiler's task in scheduling access to the registers by the program as a whole – instead, each basic block is given its own local registers and the compiler optimizes access within the block, a much simpler task.

EDGE systems bear a strong resemblance to the dataflow languages of the 1960s–1970s, which saw renewed interest in the 1990s. Dataflow computers execute programs according to the "dataflow firing rule", which stipulates that an instruction may execute at any time after its operands are available. Due to the isolation of data, similar to EDGE, dataflow languages are inherently parallel, and interest in them followed the more general interest in massive parallelism as a solution to general computing problems. Studies based on existing CPU technology at the time demonstrated that it would be difficult for a dataflow machine to keep enough data near the CPU to be widely parallel, and it is precisely this bottleneck that modern fabrication techniques can solve by placing hundreds of CPUs and their memory on a single die.
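
The firing rule can be illustrated with a few lines of Python. In this sketch, which uses an invented three-instruction program, each instruction fires as soon as all of its named operands exist, regardless of the order in which the instructions are listed; a real dataflow or EDGE machine would fire all ready instructions in parallel rather than scanning a list.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}

# (destination, operation, sources) -- deliberately listed out of order
instructions = [
    ("d", "sub", ("b", "c")),
    ("b", "add", ("a", "a")),
    ("c", "mul", ("a", "b")),
]

values = {"a": 3}                 # operands available at the start
pending = list(instructions)

while pending:
    fired = False
    for instr in list(pending):
        dst, op, srcs = instr
        if all(s in values for s in srcs):        # the dataflow firing rule
            values[dst] = OPS[op](*(values[s] for s in srcs))
            print(f"fired {op}: {dst} = {values[dst]}")
            pending.remove(instr)
            fired = True
    if not fired:
        raise RuntimeError("remaining instructions never receive their operands")

# With a = 3 this prints b = 6, then c = 18, then d = -12.
```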

Another reason that dataflow systems never became popular is that compilers of the era found it difficult to work with common imperative languages like C++. Instead, most dataflow systems used dedicated languages like Prograph, which limited their commercial interest. A decade of compiler research has eliminated many of these problems, and a key difference between dataflow and EDGE approaches is that EDGE designs intend to work with commonly used languages.

CPUs

An EDGE-based CPU would consist of one or more small block engines with their own local registers; realistic designs might have hundreds of these units. The units are interconnected to each other using dedicated inter-block communication links. Due to the information encoded into the block by the compiler, the scheduler can examine an entire block to see if its inputs are available and send it into an engine for execution – there is no need to examine the individual instructions within.

With a small increase in complexity, the scheduler can examine multiple blocks to see if the outputs of one are fed in as the inputs of another, and place these blocks on units that reduce their inter-unit communications delays. If a modern CPU examines a thousand instructions for potential parallelism, the same complexity in EDGE allows it to examine a thousand hyperblocks, each one consisting of hundreds of instructions. This gives the scheduler considerably better scope at no additional cost. It is this pattern of operation that gives the concept its name; the "graph" is the string of blocks connected by the data flowing between them.
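
A hypothetical scheduler of this kind might behave like the Python sketch below, which places blocks that exchange data on nearby engines of a small grid. The 4x4 grid, the block names, and the greedy nearest-free-engine heuristic are all assumptions made for the illustration, not features of any published EDGE design.

```python
from itertools import product

GRID_ENGINES = list(product(range(4), range(4)))   # a 4x4 grid of block engines

# dataflow edges between hyperblocks: producer -> list of consumers
edges = {
    "load_block":   ["filter_block"],
    "filter_block": ["sum_block"],
    "sum_block":    [],
}

def hops(a, b):
    """Manhattan distance, standing in for on-chip communication delay."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

placement, free = {}, set(GRID_ENGINES)

for block, consumers in edges.items():
    if block not in placement:                     # place the producer anywhere
        spot = min(free)
        placement[block] = spot
        free.discard(spot)
    for consumer in consumers:                     # pull consumers close to it
        if consumer in placement:
            continue
        spot = min(free, key=lambda s: hops(placement[block], s))
        placement[consumer] = spot
        free.discard(spot)

for block, (x, y) in placement.items():
    print(f"{block:>13} -> engine ({x}, {y})")
```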

Another advantage of the EDGE concept is that it is massively scalable. A low-end design could consist of a single block engine with a stub scheduler that simply sends in blocks as they are called by the program. An EDGE processor intended for desktop use would instead include hundreds of block engines. Critically, all that changes between these designs is the physical layout of the chip and private information that is known only by the scheduler; a program written for the single-unit machine would run without any changes on the desktop version, albeit thousands of times faster. Power scaling is likewise dramatically improved and simplified; block engines can be turned on or off as required with a linear effect on power consumption.

Perhaps the greatest advantage to the EDGE concept is that it is suitable for running any sort of data load. Unlike modern CPU designs where different portions of the CPU are dedicated to different sorts of data, an EDGE CPU would normally consist of a single type of ALU-like unit. A desktop user running several different programs at the same time would get just as much parallelism as a scientific user feeding in a single program using floating point only; in both cases the scheduler would simply load every block it could into the units. At a low level the performance of the individual block engines would not match that of a dedicated FPU, for instance, but it would attempt to overwhelm any such advantage through massive parallelism.

Implementations

TRIPS

The University of Texas at Austin was developing an EDGE ISA known as TRIPS. In order to simplify the microarchitecture of a CPU designed to run it, the TRIPS ISA imposes several well-defined constraints on each TRIPS hyperblock. Each hyperblock must (see the sketch after this list):

  • have at most 128 instructions,
  • issue at most 32 loads and/or stores,
  • issue at most 32 register bank reads and/or writes,
  • have one branch decision, used to indicate the end of a block.
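
A minimal sketch of a checker for these limits might look like the following; the block representation, in which every instruction is a tuple beginning with its opcode, is invented for the example, while the numeric limits are the ones listed above.

```python
def check_trips_block(block):
    """Return a list of constraint violations for one hyperblock."""
    errors = []
    instrs = block["instructions"]

    if len(instrs) > 128:
        errors.append(f"{len(instrs)} instructions (max 128)")

    mem_ops = sum(1 for i in instrs if i[0] in ("load", "store"))
    if mem_ops > 32:
        errors.append(f"{mem_ops} loads/stores (max 32)")

    reg_ops = sum(1 for i in instrs if i[0] in ("read", "write"))
    if reg_ops > 32:
        errors.append(f"{reg_ops} register bank accesses (max 32)")

    branches = sum(1 for i in instrs if i[0] == "branch")
    if branches != 1:
        errors.append(f"{branches} branch decisions (exactly one expected)")

    return errors

example = {"instructions": [("read", "r1"), ("load", "r2"),
                            ("add", "r3", "r1", "r2"), ("branch", "exit")]}
print(check_trips_block(example) or "block satisfies the TRIPS constraints")
```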

The TRIPS compiler statically bundles instructions into hyperblocks, but also statically compiles these blocks to run on particular ALUs. This means that TRIPS programs have some dependency on the precise implementation they are compiled for.

In 2003 they produced a sample TRIPS prototype with sixteen block engines in a 4 by 4 grid, along with a megabyte of local cache and transfer memory. A single chip version of TRIPS, fabbed by IBM in Canada using a 130 nm process, contains two such "grid engines" along with shared level-2 cache and various support systems. Four such chips and a gigabyte of RAM are placed together on a daughter-card for experimentation.

The TRIPS team had set an ultimate goal of producing a single-chip implementation capable of running at a sustained performance of 1 TFLOPS, about 50 times the performance of high-end commodity CPUs available in 2008 (the dual-core Xeon 5160 provides about 17 GFLOPS).

CASH

CMU's CASH is a compiler that produces an intermediate code called "Pegasus".[8] CASH and TRIPS are very similar in concept, but CASH is not targeted to produce output for a specific architecture, and therefore has no hard limits on the block layout.

WaveScalar

The University of Washington's WaveScalar architecture is substantially similar to EDGE, but does not statically place instructions within its "waves". Instead, special instructions (phi and rho) mark the boundaries of the waves and allow scheduling.[9]

References

Citations

  1. ^ University of Texas at Austin, "TRIPS: One Trillion Calculations per Second by 2012"
  2. ^ Pickens, John (17 October 2020). "NMOS 6502 Opcodes".
  3. ^ Shirriff, Ken. "Simulating the IBM 360/50 mainframe from its microcode".
  4. ^ a b Cocke, John; Markstein, Victoria (January 1990). "The evolution of RISC technology at IBM" (PDF). IBM Journal of Research and Development. 34 (1): 4–11. doi:10.1147/rd.341.0004.
  5. ^ Shimpi, Anand Lal (5 October 2012). "Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel". AnandTech. Archived from the original on April 24, 2013.
  6. ^ Tseng, Francis; Patt, Yale (June 2008). "Achieving Out-of-Order Performance with Almost In-Order Complexity". ACM SIGARCH Computer Architecture News. 36 (3): 3–12. doi:10.1145/1394608.1382169.
  7. ^ W. Havanki, S. Banerjia, and T. Conte. "Treegion scheduling for wide-issue processors", in Proceedings of the Fourth International Symposium on High-Performance Computer Architectures, January 1998, pg. 266–276
  8. ^ "Phoenix Project"
  9. ^ "The WaveScalar ISA"
