vhdl - Finding the maximum delay of an FPGA design from VHDL code written in Xilinx software

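I am working on AES code, and my goal is to create an architecture that delivers the fastest performance. To do that I need to determine the latency, i.e. the time from applying an input to obtaining the final output. The design will be implemented on an FPGA. I need to find the delay using Xilinx simulation and the design summary, but I cannot make sense of the various reports.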

For model one, I have included three reports from the design summary.

  1. Synthesis report
  2. Place and route report
  3. Static timing report


Release 9.2i Trace
Copyright (c) 1995-2007 Xilinx, Inc. All rights reserved.

C:\Xilinx92i\bin\nt\trce.exe -ise C:/Xilinx92i/sbox/sbox.ise -intstyle ise -e 3
-s 5 -xml dynamic5stage dynamic5stage.ncd -o dynamic5stage.twr

Design file: dynamic5stage.ncd
Physical constraint file: dynamic5stage.pcf
Device,package,speed: xc3s200,pq208,-5 (PRODUCTION 1.39 2007-04-13)
Report level: error report

Environment Variable Effect
-------------------- ------
NONE No environment variables were set

INFO:Timing:2698 - No timing constraints found, doing default enumeration.
INFO:Timing:2752 - To get complete path coverage, use the unconstrained paths
option. All paths that are not constrained will be reported in the
unconstrained paths section(s) of the report.
INFO:Timing:3339 - The clock-to-out numbers in this timing report are based on
a 50 Ohm transmission line loading model. For the details of this model,
and for more information on accounting for different loading conditions,
please see the device datasheet.

Data Sheet report:
All values displayed in nanoseconds (ns)

Setup/Hold to clock SYS_CLK
| Setup to | Hold to | | Clock |
Source | clk (edge) | clk (edge) |Internal Clock(s) | Phase |
BYTE_IN<0> | 2.659(R)| 0.515(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<1> | 3.216(R)| 0.381(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<2> | 3.373(R)| 0.453(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<3> | 3.155(R)| 0.001(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<4> | 3.419(R)| 0.663(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<5> | 4.055(R)| 0.118(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<6> | 3.389(R)| 0.545(R)|SYS_CLK_BUFGP | 0.000|
BYTE_IN<7> | 3.151(R)| 0.389(R)|SYS_CLK_BUFGP | 0.000|
RST | 2.750(R)| 0.970(R)|SYS_CLK_BUFGP | 0.000|
s | 3.140(R)| 0.344(R)|SYS_CLK_BUFGP | 0.000|

Clock SYS_CLK to Pad
| clk (edge) | | Clock |
Destination | to PAD |Internal Clock(s) | Phase |
SUB_BYTE_OUT<0>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<1>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<2>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<3>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<4>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<5>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<6>| 6.404(R)|SYS_CLK_BUFGP | 0.000|
SUB_BYTE_OUT<7>| 6.403(R)|SYS_CLK_BUFGP | 0.000|

Clock to Setup on destination clock SYS_CLK
| Src:Rise| Src:Fall| Src:Rise| Src:Fall|
Source Clock |Dest:Rise|Dest:Rise|Dest:Fall|Dest:Fall|
SYS_CLK | 3.612| | | |

Analysis completed Sat Nov 29 11:39:23 2014

Trace Settings:
Trace Settings

Peak Memory Usage: 93 MB


Release 9.2i par J.36
Copyright (c) 1995-2007 Xilinx, Inc. All rights reserved.

ACER-PC:: Sat Nov 29 11:38:52 2014

par -w -intstyle ise -ol std -t 1 dynamic5stage_map.ncd dynamic5stage.ncd

Constraints file: dynamic5stage.pcf.
Loading device for application Rf_Device from file '3s200.nph' in environment C:\Xilinx92i.
"dynamic5stage" is an NCD, version 3.1, device xc3s200, package pq208, speed -5

Initializing temperature to 85.000 Celsius. (default - Range: 0.000 to 85.000 Celsius)
Initializing voltage to 1.140 Volts. (default - Range: 1.140 to 1.260 Volts)

INFO:Par:282 - No user timing constraints were detected or you have set the option to ignore timing constraints ("par
-x"). Place and Route will run in "Performance Evaluation Mode" to automatically improve the performance of all
internal clocks in this design. The PAR timing summary will list the performance achieved for each clock. Note: For
the fastest runtime, set the effort level to "std". For best performance, set the effort level to "high". For a
balance between the fastest runtime and best performance, set the effort level to "med".

Device speed data version: "PRODUCTION 1.39 2007-04-13".

Device Utilization Summary:

Number of BUFGMUXs 1 out of 8 12%
Number of External IOBs 19 out of 141 13%
Number of LOCed IOBs 0 out of 19 0%

Number of Slices 62 out of 1920 3%
Number of SLICEMs 0 out of 960 0%

Overall effort level (-ol): Standard
Placer effort level (-pl): High
Placer cost table entry (-t): 1
Router effort level (-rl): Standard

REAL time consumed by placer: 16 secs
CPU time consumed by placer: 10 secs
Writing design to file dynamic5stage.ncd

Total REAL time to Placer completion: 17 secs
Total CPU time to Placer completion: 11 secs

Starting Router

Phase 1: 482 unrouted; REAL time: 18 secs

Phase 2: 436 unrouted; REAL time: 18 secs

Phase 3: 178 unrouted; REAL time: 18 secs

Phase 4: 178 unrouted; (0) REAL time: 18 secs

Phase 5: 180 unrouted; (0) REAL time: 18 secs

Phase 6: 0 unrouted; (87) REAL time: 19 secs

Phase 7: 0 unrouted; (87) REAL time: 19 secs

Updating file: dynamic5stage.ncd with current fully routed design.

Phase 8: 0 unrouted; (0) REAL time: 20 secs

Phase 9: 0 unrouted; (0) REAL time: 20 secs

Total REAL time to Router completion: 20 secs
Total CPU time to Router completion: 13 secs

Partition Implementation Status

No Partitions were found in this design.


Generating "PAR" statistics.

Generating Clock Report

| Clock Net | Resource |Locked|Fanout|Net Skew(ns)|Max Delay(ns)|
| SYS_CLK_BUFGP | BUFGMUX6| No | 45 | 0.036 | 0.916 |

* Net Skew is the difference between the minimum and maximum routing
only delays for the net. Note this is different from Clock Skew which
is reported in TRCE timing report. Clock Skew is the difference between
the minimum and maximum path delays which includes logic delays.

The Delay Summary Report


The AVERAGE CONNECTION DELAY for this design is: 0.832

Listing Pin Delays by value: (nsec)

d < 1.00 < d < 2.00 < d < 3.00 < d < 4.00 < d < 5.00 d >= 5.00
--------- --------- --------- --------- --------- ---------
337 142 2 0 0 0

Timing Score: 0

Asterisk (*) preceding a constraint indicates it was not met.
This may be due to a setup or hold violation.

Constraint | Check | Worst Case | Best Case | Timing | Timing
| | Slack | Achievable | Errors | Score
Autotimespec constraint for clock net SYS | SETUP | N/A| 3.612ns| N/A| 0
_CLK_BUFGP | HOLD | 0.702ns| | 0| 0

All constraints were met.
INFO:Timing:2761 - N/A entries in the Constraints list may indicate that the
constraint does not cover any paths or that it has no requested value.

Generating Pad Report.

All signals are completely routed.

Total REAL time to PAR completion: 21 secs
Total CPU time to PAR completion: 15 secs

Peak Memory Usage: 136 MB

Placement: Completed - No errors found.
Routing: Completed - No errors found.

Number of error messages: 0
Number of warning messages: 0
Number of info messages: 1

Writing design to file dynamic5stage.ncd

PAR done!


Release 9.2i - xst J.36
Copyright (c) 1995-2007 Xilinx, Inc. All rights reserved.
--> Parameter TMPDIR set to ./xst/projnav.tmp
CPU : 0.00 / 4.04 s | Elapsed : 0.00 / 4.00 s

--> Parameter xsthdpdir set to ./xst
CPU : 0.00 / 4.04 s | Elapsed : 0.00 / 4.00 s

--> Reading design: dynamic5stage.prj

* Synthesis Options Summary *
---- Source Parameters
Input File Name : "dynamic5stage.prj"
Input Format : mixed
Ignore Synthesis Constraint File : NO

---- Target Parameters
Output File Name : "dynamic5stage"
Output Format : NGC
Target Device : xc3s200-5-pq208

---- Source Options
Top Module Name : dynamic5stage
Automatic FSM Extraction : YES
FSM Encoding Algorithm : Auto
Safe Implementation : No
FSM Style : lut
RAM Extraction : Yes
RAM Style : Auto
ROM Extraction : Yes
Mux Style : Auto
Decoder Extraction : YES
Priority Encoder Extraction : YES
Shift Register Extraction : YES
Logical Shifter Extraction : YES
XOR Collapsing : YES
ROM Style : Auto
Mux Extraction : YES
Resource Sharing : YES
Asynchronous To Synchronous : NO
Multiplier Style : auto
Automatic Register Balancing : No

---- Target Options
Add IO Buffers : YES
Global Maximum Fanout : 500
Add Generic Clock Buffer(BUFG) : 8
Register Duplication : YES
Slice Packing : YES
Optimize Instantiated Primitives : NO
Use Clock Enable : Yes
Use Synchronous Set : Yes
Use Synchronous Reset : Yes
Pack IO Registers into IOBs : auto
Equivalent register Removal : YES

---- General Options
Optimization Goal : Speed
Optimization Effort : 1
Library Search Order : dynamic5stage.lso
Keep Hierarchy : NO
RTL Output : Yes
Global Optimization : AllClockNets
Read Cores : YES
Write Timing Constraints : NO
Cross Clock Analysis : NO
Hierarchy Separator : /
Bus Delimiter : <>
Case Specifier : maintain
Slice Utilization Ratio : 100
BRAM Utilization Ratio : 100
Verilog 2001 : YES
Auto BRAM Packing : NO
Slice Utilization Ratio Delta : 5


* HDL Compilation *
Compiling vhdl file "C:/Xilinx92i/sbox/dynamic5stage.vhd" in Library work.
Entity <dynamic5stage> compiled.
Entity <dynamic5stage> (Architecture <Behavioral>) compiled.

* Design Hierarchy Analysis *
Analyzing hierarchy for entity <dynamic5stage> in library <work> (architecture <Behavioral>).

* HDL Analysis *
Analyzing Entity <dynamic5stage> in library <work> (Architecture <Behavioral>).
INFO:Xst:1561 - "C:/Xilinx92i/sbox/dynamic5stage.vhd" line 278: Mux is complete : default of case is discarded
Entity <dynamic5stage> analyzed. Unit <dynamic5stage> generated.

HDL Synthesis Report

Macro Statistics
# ROMs : 1
16x4-bit ROM : 1
# Registers : 13
4-bit register : 12
8-bit register : 1
# Xors : 89
1-bit xor2 : 56
1-bit xor3 : 24
1-bit xor4 : 1
2-bit xor2 : 6
4-bit xor2 : 2


* Advanced HDL Synthesis *

Loading device for application Rf_Device from file '3s200.nph' in environment C:\Xilinx92i.
INFO:Xst:2506 - Unit <dynamic5stage> : In order to maximize performance and save block RAM resources, the small ROM <Mrom_GALOIS_MUL_INV> will be implemented on LUT. If you want to force its implementation on block, use option/constraint rom_style.
INFO:Xst:2261 - The FF/Latch <STAGE2_1_3> in Unit <dynamic5stage> is equivalent to the following FF/Latch, which will be removed : <STAGE2_2_1>

Advanced HDL Synthesis Report

Macro Statistics
# ROMs : 1
16x4-bit ROM : 1
# Registers : 55
Flip-Flops : 55
# Xors : 89
1-bit xor2 : 56
1-bit xor3 : 24
1-bit xor4 : 1
2-bit xor2 : 6
4-bit xor2 : 2


* Low Level Synthesis *

Optimizing unit <dynamic5stage> ...

Mapping all equations...
Building and optimizing final netlist ...
Found area constraint ratio of 100 (+ 5) on block dynamic5stage, actual ratio is 3.

Final Macro Processing ...

Final Register Report

Macro Statistics
# Registers : 55
Flip-Flops : 55


* Partition Report *

Partition Implementation Status

No Partitions were found in this design.


* Final Report *
Final Results
RTL Top Level Output File Name : dynamic5stage.ngr
Top Level Output File Name : dynamic5stage
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO

Design Statistics
# IOs : 19

Cell Usage :
# BELS : 114
# LUT2 : 22
# LUT2_D : 4
# LUT2_L : 1
# LUT3 : 14
# LUT3_L : 2
# LUT4 : 49
# LUT4_D : 3
# LUT4_L : 12
# MUXF5 : 7
# FlipFlops/Latches : 55
# FDR : 54
# FDRS : 1
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 18
# IBUF : 10
# OBUF : 8

Device utilization summary:

Selected Device : 3s200pq208-5

Number of Slices: 61 out of 1920 3%
Number of Slice Flip Flops: 55 out of 3840 1%
Number of 4 input LUTs: 107 out of 3840 2%
Number of IOs: 19
Number of bonded IOBs: 19 out of 141 13%
Number of GCLKs: 1 out of 8 12%

Partition Resource Summary:

No Partitions were found in this design.




Clock Information:
Clock Signal | Clock buffer(FF name) | Load |
SYS_CLK | BUFGP | 55 |

Asynchronous Control Signals Information:
No asynchronous control signals found in this design

Timing Summary:
Speed Grade: -5

Minimum period: 4.822ns (Maximum Frequency: 207.394MHz)
Minimum input arrival time before clock: 6.639ns
Maximum output required time after clock: 6.216ns
Maximum combinational path delay: No path found

Timing Detail:
All values displayed in nanoseconds (ns)

Timing constraint: Default period analysis for Clock 'SYS_CLK'
Clock period: 4.822ns (frequency: 207.394MHz)
Total number of paths / destination ports: 242 / 43
Delay: 4.822ns (Levels of Logic = 3)
Source: STAGE3_3_0 (FF)
Destination: STAGE4_2_3 (FF)
Source Clock: SYS_CLK rising
Destination Clock: SYS_CLK rising

Data Path: STAGE3_3_0 to STAGE4_2_3
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
FDR:C->Q 4 0.626 1.074 STAGE3_3_0 (STAGE3_3_0)
LUT4_D:I0->O 2 0.479 0.768 Mxor_GAL2_MUL_31_xor0000_xo<1>1 (GAL2_MUL_31_xor0000)
LUT4:I3->O 1 0.479 0.740 Mxor_OUTPUT1_xor0000_Result<1>11 (N211)
LUT4:I2->O 1 0.479 0.000 Mxor_OUTPUT1_xor0000_Result<1> (GALOIS_MUL_3<3>)
FDR:D 0.176 STAGE4_2_3
Total 4.822ns (2.239ns logic, 2.583ns route)
(46.4% logic, 53.6% route)

Timing constraint: Default OFFSET IN BEFORE for Clock 'SYS_CLK'
Total number of paths / destination ports: 168 / 76
Offset: 6.639ns (Levels of Logic = 5)
Source: BYTE_IN<4> (PAD)
Destination: STAGE1_2_1 (FF)
Destination Clock: SYS_CLK rising

Data Path: BYTE_IN<4> to STAGE1_2_1
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
IBUF:I->O 7 0.715 1.201 BYTE_IN_4_IBUF (BYTE_IN_4_IBUF)
LUT2:I0->O 2 0.479 0.804 GALOIS_ADD_1<0>31 (GALOIS_ADD_1<0>_bdd5)
LUT4:I2->O 1 0.479 0.976 GALOIS_ADD_1<0>11 (GALOIS_ADD_1<0>_bdd0)
LUT3:I0->O 1 0.479 0.851 GALOIS_ADD_1<1>_SW0 (N25)
LUT4:I1->O 1 0.479 0.000 GALOIS_ADD_1<1> (GALOIS_ADD_1<1>)
FDR:D 0.176 STAGE1_2_1
Total 6.639ns (2.807ns logic, 3.832ns route)
(42.3% logic, 57.7% route)

Timing constraint: Default OFFSET OUT AFTER for Clock 'SYS_CLK'
Total number of paths / destination ports: 8 / 8
Offset: 6.216ns (Levels of Logic = 1)
Destination: SUB_BYTE_OUT<7> (PAD)
Source Clock: SYS_CLK rising

Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
FDR:C->Q 1 0.626 0.681 OUTPUT_LATCH_7 (OUTPUT_LATCH_7)
Total 6.216ns (5.535ns logic, 0.681ns route)
(89.0% logic, 11.0% route)

CPU : 29.56 / 34.76 s | Elapsed : 29.00 / 34.00 s


Total memory usage is 205164 kilobytes

Number of errors : 0 ( 0 filtered)
Number of warnings : 0 ( 0 filtered)
Number of infos : 3 ( 0 filtered)


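To measure the performance of your AES block, you can multiply the autotimespec value of 3.612 ns at the bottom of the place and route report by the number of pipeline stages in your system. You wrote that you currently have 5 pipeline stages, so the total time through the system is 5 * 3.612 ns = 18.060 ns. If you add another pipeline stage hoping to make the system faster, the clock would have to run with a period shorter than 18.060 ns / 6 = 3.010 ns for the added stage to improve the input-to-output latency.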
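The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `pipeline_latency_ns` and `break_even_period_ns` are made up for this example, and it assumes the design produces its output exactly one clock period per pipeline stage after the input is registered.

```python
def pipeline_latency_ns(stages: int, clock_period_ns: float) -> float:
    """Input-to-output latency, assuming one clock period per pipeline stage."""
    return stages * clock_period_ns


def break_even_period_ns(current_latency_ns: float, new_stages: int) -> float:
    """Clock period at which a deeper pipeline has the same latency as before."""
    return current_latency_ns / new_stages


# Values taken from the place and route report and the question.
PERIOD_NS = 3.612   # "Best Case Achievable" for SYS_CLK
STAGES = 5          # pipeline depth stated in the question

latency = pipeline_latency_ns(STAGES, PERIOD_NS)
target = break_even_period_ns(latency, STAGES + 1)

print(f"latency with {STAGES} stages: {latency:.3f} ns")
print(f"break-even period with {STAGES + 1} stages: {target:.3f} ns")
```

A six-stage version only lowers the latency if its achieved period comes out below the break-even value printed on the second line.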

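The minimum clock period the tools calculated is 3.612 ns (about 276.9 MHz). The reports show that no timing constraints were set, so if you constrain SYS_CLK to a shorter period than that, the tools may be able to make the design faster.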
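Converting between a minimum period and a maximum clock frequency is a one-line calculation. A minimal sketch, with `max_frequency_mhz` being a hypothetical helper name chosen for this example:

```python
def max_frequency_mhz(min_period_ns: float) -> float:
    """Convert a minimum clock period in ns to a maximum frequency in MHz."""
    if min_period_ns <= 0:
        raise ValueError("period must be positive")
    return 1000.0 / min_period_ns


# Post place-and-route period reported for SYS_CLK.
f_post_par = max_frequency_mhz(3.612)

# Frequency needed if a sixth stage must run at the 3.010 ns break-even period.
f_six_stage = max_frequency_mhz(3.010)

print(f"post-PAR maximum frequency: {f_post_par:.1f} MHz")
print(f"break-even frequency with 6 stages: {f_six_stage:.1f} MHz")
```

The second figure gives a rough idea of the clock rate a six-stage pipeline would need to reach on this device before it pays off in latency.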

