.\" @(#)schedulers	1.4	11/29/92
.H1 "Schedulers"
.pp
.Id "schedulers, CG domain"
Given a Universe of functional blocks to be scheduled and a Target 
describing the topology and characteristics of the single- or 
multiple-processor system for which code is generated, it 
is the responsibility of the Scheduler object to perform some or 
all of the following functions:
.ip
Determine which processor a given invocation of a given Block is 
executed on (for multiprocessor systems);
.ip
Determine the order in which actors are to be executed on a processor;
.ip
Arrange the execution of actors into standard control structures, 
like nested loops.
.pp
In this section, we explain different scheduling options and their
effect on the generated code.
.H2 "Single-Processor Schedulers"
.pp
For targets consisting of a single processor, we provide three different
scheduling techniques. The user can select the most appropriate scheduler
for a given application by setting the
.c loopingLevel
.Ir "loopingLevel, target parameter"
target parameter.
.pp
In the first approach (loopingLevel = 0), which is the default SDF scheduler,
we conceptually construct the acyclic precedence graph (APG) corresponding 
to the system, and generate a schedule that is consistent with that 
precedence graph. There are many possible 
schedules for all but the most trivial graphs; the schedule chosen 
takes resource costs, such as the necessity of flushing registers and 
the amount of buffering required, into account. The Target then 
generates code by executing the actors in the sequence defined by 
this schedule. This is a quick and efficient approach unless 
there are large sample rate changes, in which case it corresponds to 
completely unrolling all loops, so results in huge code size. 
.pp
The second approach we call \fIJoe's\fR scheduling. In this approach
.Ir "Joe's scheduling"
(loopingLevel = 1), actors that have the same sample rate are merged 
(wherever this will not cause deadlock) and loops are introduced to 
match the sample rates. The result is a hierarchical clustering; 
within each cluster, the techniques described above can be used to 
generate a schedule. The code then contains nested loop constructs 
together with sequences of code from the actors. 
.pp
Since the second approach uses a conservative approach, there are some
cases, if rare, in which some looping possibilities are undetected. 
By setting the loopingLevel to 2, we can choose the third approach,
called \fISJS\fR (Shuvra-Joe-Soonhoi) scheduling after the inventor's
.Ir "SJS scheduling"
first names. After performing Joe's scheduling at the front end,
it attacks the remaining graph with more complicated approach.
.pp
In the second or third approach is taken, the code size is drastically
reduced when there are large sample rate changes in the application.
On the other hand, we sacrifice some efficient buffer management schemes.
For example, suppose that star A produces 5 samples to star B which consumes
1 sample at a time. If we take the first approach, we schedule this
graph as ABBBBB and assign a buffer of size 5 between star A and B.
Since all invocations of star B knows which sample to consume from the
allocated buffer, each B invocation can read the sample directly for the
buffer. If we choose the second or third approach, the scheduling result
will be A5(B). Since the body of star B is included inside a loop of
factor 5, we have to use an indirect addressing for star B to read 
a sample from the buffer. Therefore, we need an additional buffer
pointer for star B (memory overhead), and one more level of memory
access (runtime overhead) for indirect addressing. 

.H2 "Multiple-Processor Schedulers"
.pp
.Id "multiple-processor schedulers"
.Id "parallel schedulers"
A key idea in Ptolemy is that there is no single scheduler that 
is expected to handle all situations. Users can write schedulers and 
can use them in conjunction with schedulers we have written. 
As with the rest of Ptolemy, schedulers are written following 
object-oriented design principals. Thus a user would never have to 
write a scheduler from ground up, and in fact the user 
is free to derive the new scheduler from even our most 
advanced schedulers. We have designed a suite of specialized schedulers 
that can be mixed and matched for specific applications. After the 
scheduling is performed, each processor is assigned 
a set of blocks to be executed in a scheduler-determined order. 
.pp
The first step for multiple-processor scheduling, or parallel scheduling,
is to translate a given SDF graph to a acyclic precedence expanded graph 
(shortly, APEG graph). The APEG graph describes the dependency between
invocations of blocks in the SDF graph during execution of one iteration.
Refer to the SDF domain documentation for the meaning of one iteration.
Hence, a block in an SDF graph may correspond to 
several APEG nodes in a multi-rate example. Parallel schedulers are
schedule the APEG nodes, not SDF stars, onto processors. 
.pp
We have implemented three scheduling techniques that map 
SDF graphs onto multiple-processors with various interconnection 
topologies: Hu's level-based list scheduling, Sih's dynamic level 
scheduling [1], and Sih's declustering scheduling [2]. 
The target architecture is described by its Target object, derived from 
.c CGMultiTarget . 
The Target class provides the scheduler with the necessary information 
on interprocessor communication to enable both scheduling and code 
synthesis. 
.pp
Targets also have parameters that allow the user to select the 
type of schedule, and (where appropriate) to experiment with the 
effect of altering the topology or the communication 
costs. The parameters are
.c ignoreIPC ,
.Ir "ignoreIPC, target parameter"
.c overlapComm ,
.Ir "overlapComm, target parameter"
and 
.c useCluster .
.Ir "useCluster, target parameter"
There is priority between these parameter. If the
.c ignoreIPC
is set to 
.c YES ,
we use Hu's level-based list scheduler. Otherwise, we check whether
the
.c overlapComm
parameter is set or not. This parameter is set when the target
processor has different modules for communication and computation,
so can execute both at the same time. We haven`t yet tested any target 
that has this feature. After the 
.c overlapComm
parameter is proved to be
.c NO ,
the third parameter,
.c useCluster ,
is examined. If that parameter is set, the Sih's declustering scheduling
scheme is used. Otherwise, the modified version of Sih's dynamic level
scheduling is used.
.pp 
Whichever scheduler is used, we schedule communication nodes
in the generated code. For example, we use the Hu's level-based list
scheduler, we ignore communication overhead when assigning stars
to processors. Hence, it is likely to have a code that contains
more communication stars than what the other schedulers would result in.
.pp
There are another set of target parameter to direct the scheduling
procedure. If parameter
.c manualAssignment
.Ir "manualAssignment, target parameter"
is set to 
.c YES ,
the default parallel scheduler does not perform star assignment. Instead,
it checks the processor assignment of all stars (
.c procId
.Ir "procId, CG star state"
state of CG stars). By default, the procId state is set to -1, which is
an illegal assignment since the child target is numbered from 0.
If there exists any star, except
.c Fork
star, that has an illegal
.c procId
state, an error is generated saying that manual scheduling is failed.
Otherwise, we invoke a list scheduling based on the manual assignment
to determine the order of execution of blocks on each processor.
We do not support the case in which a block requires more than
one processors to be assigned on. This option automatically sets
the
.c oneStarOneProc
.Ir "oneStarOneProc, target parameter"
state to be discussed next.
.pp
If there are sample rate changes, a star in the program graph
may be invoked multiple times for each iteration. These invocations
may be assigned to multiple processors by default. We can prevent it
by setting the
.c oneStarOneProc
star to 
.c YES .
Then, all invocations of a star are enforced to be assigned to the same
processor regardless of whether they are parallelizable or not.
The advantage of doing this is its simplicity of code generation since we
do not need 
.c Spread/Collect
stars, which will be discussed later.
Also, it provides us another possibility of scheduling option:
.c adjustSchedule .
.Ir "adjustSchedule, target parameter"
The main disadvantage is performance loss of not exploiting parallelism.
It is most severe if Sih's declustering algorithm is used. Therefore,
it is not recommended to use Sih's declustering algorithm with this option.
.pp
In this paragraph, we describe a future scheduling option which
this release does not support yet.
Once an automatic scheduling (with oneStarOneProc option set) is performed,
the processor assignment of each star is determined. After examining
the
assignment, the user may want to override the scheduling decision
manually. It can be done by setting the
.c adjustSchedule
.Ir "adjustSchedule, target parameter"
parameter. If that parameter is set, after the automatic scheduling
is performed, the 
.c procId
state of each star is automatically updated with the assigned processor. 
The programmer override the scheduling decision by setting that state.
The
.c adjustSchedule
can not be YES before any scheduling decision is made previously.
Again, this option is not supported in this release.
.pp
Different scheduling options result in different assignments of APEG
nodes. Therefore, one block in a problem SDF graph can be assigned to
more than one processors.
Regardless of which scheduling options are chosen, however,
The final stage of the scheduling is to decide the execution order
of stars including send/receive stars by a simple list scheduling
algorithm in each child target. 
The final scheduling results are displayed in Gantt chart.
The multiple-processor scheduler produces a list of single 
processor schedules, copying them to the child targets. 
The schedules include send/receive stars for interprocessor communication.
.pp
To produce codes for child targets, we create a sub-galaxy
.Id "sub-galaxy"
.Id "sub-universe"
for each child target, which consists of the stars scheduled on that
target and some extra stars to be discussed below if necessary. 
A child target follows the same step to generate code as a single
processor target except that the schedule is not computed again since
the scheduling result is inherited from the parent target.
.H3 "Send/Receive Stars"
.pp
.Id "send/receive stars"
.Id "send star"
.Id "receive star"
After the assignment of APEG nodes is finished, the interprocessor
communication requirements between blocks are determined in sub-galaxies. 
Suppose, star A is connected to star B, and  there is no sample rate change.
By assigning star A and star B to different processors (1 and 2 respectively),
the parallel scheduler introduces interprocessor communication.
Then, processor 1 should generate code for star A and a "send" star, while
processor 2 should generate code for a "receive" star and star B. Therefore,
These "send" and "receive" stars are automatically inserted by the Ptolemy
kernel when determining the execution order of blocks in each child target
and also when creating sub-galaxies. The parallel scheduler asks the
multi-processor target to create "send" and "receive" stars.
.pp
Once the generated code is loaded, processors run autonomously. 
The synchronization protocol between processors is hardwired into the 
"send" and "receive" stars. One common approach in shared-memory 
architectures is the use of semaphores. Thus a typical synchronization 
protocol used is to have the send star set a flag signaling the completion 
of the data transfer; the receive star would then wait for the 
proper semaphores to be set. When the semaphores are 
set, the receive star will read the data and clear the 
semaphores. In a message passing architecture, 
the send star may form a message header to specify the source and 
destination processors. In this case, the receive star would decode 
the message by examining the message header.
.H3 "Spread/Collect stars"
.pp
.Id "spread/collect stars"
.Id "spread star"
.Id "collect star"
Consider an multi-rate example in which star A produces two
tokens and star B consumes one token each time. And, suppose that
the first invocation of star B is assigned to the same processor
as the star A (processor 1), but the second invocation  is assigned to
processor 2. After star A fires in processor 1, the first token produced
is routed to the star B assigned to the processor while the second token
produced should be shipped to processor 2; interprocessor communication
is required! Since star A has one output port and that port
should be connected two different destination (one is to star B, the other
is to a "send" star), we insert a "spread" star after the star A.
As a result, the sub-galaxy created for processor 1 contains 4 blocks:
star A is connected to a "spread" star, which has two outputs connected
to star B and a "send" star. The role of a "spread" star is to spread
tokens from a single output porthole to multiple destinations.
.pp
On the other hand, we may need to "collect" tokens from
multiple sources to a single input porthole. Suppose we reverse the
connection in the above example: star B produces one token and
star A consumes two tokens. We have to insert a "collect" star
at the input porthole of star A to collect tokens from star B
and a "receive" star that receives a token from processor 2.
.pp
The "spread" and "collect" stars are automatically inserted by the
Ptolemy kernel, so invisible from the user. Moreover, these stars
can not be scheduled. They are added to sub-galaxies only for
the allocation of memory and other resources before generating code.
The "spread" and "collect" stars themselves do not require
extra memory in most cases by overlaying memory buffers. 
For example, in the first example, a buffer of size 2 is assigned for
the output of star A. Via the "spread" star, star B obtains the
information that it needs to fetch a token from the first location
of the buffer while the "send" star knows that it will fetch a token
from the second location. Thus, the buffers for the outputs of 
the "spread" star are overlaid with the output buffer of star A.
.pp
In cases there are delays or past tokens are used on the connection
between two blocks that should be connected through "spread" or
"collect" stars, we need to copy data explicitly, thus need extra
memory for these stars. Then, the user can see the existence of
"spread/collect" stars in the generated code.
.H3 "The Gantt Chart Display"
.pp
Demos that use targets derived from
.c CGMultiTarget
can produce an interactive Gantt chart display for viewing the parallel
schedule.
.pp
The Gantt chart display involves two windows \- one for displaying
part of the Gantt chart (the display window), and one for controlling
how much of the Gantt chart is shown in the display window (the
control window).
.pp
The control window represents the entire Gantt chart, and
it contains a shaded region which represents the part of the
Gantt chart that is currently shown in the display window.
This shaded region can be moved around inside the control window
by dragging the mouse with the middle button depressed and
its size can be changed with the left or right mouse button. Moving the
shaded region horizontally changes the time from which the
Gantt chart display begins ; moving it vertically changes the
topmost processor displayed. The fraction of the control window's
width spanned by the shaded region is approximately
the fraction of the total duration of the schedule that is
currently being displayed, and the fraction of the control window's
height spanned by the shaded region is approximately
the fraction of the total number of processors which are currently
displayed.
.pp
The display window represents each star or "nops" in the schedule
as a box drawn through the time interval it is scheduled over.
If the name of a star can fit in its box, it is printed inside.
A vertical bar inside the window is available for identifying
stars which cannot be labeled. The names of the stars which
this bar passes through are printed alongside their respective
processors. The bar can be moved horizontally by pressing
the middle mouse button in the vicinity of the bar and then dragging
the mouse with the middle button down.
The stars which the bar passes through are identified by having
their icons highlighted.
.pp
The vertical bar used for identifying stars is sometimes
hard to find \- for example, it might be coincident with a
vertical line in the Gantt chart. Pressing the right button
in the display window locates the bar by moving the mouse on top
of it. Pressing the left button in the display window moves the
bar to the mouse and identifies the stars touched by the bar
in its new position.
.pp
Here is a summary of commands that can be used while the Gantt chart
display is active:
.pp
\fBTo change the size of shaded area inside the control window:\fR
.ip
Depress the left or right mouse button inside the shaded area and
drag the mouse with the left or right button depressed.
.pp
\fBTo move the shaded area around inside the control window:\fR
.ip
Depress the middle mouse button inside the shaded area and
drag the mouse with the middle button depressed.
.pp
\fBTo move the vertical bar inside the display window:\fR
.ip
Depress the middle mouse button in the vicinity of the bar and
drag the mouse with the middle button depressed.
.pp
\fBTo locate the vertical bar used for identifying stars:\fR
.ip
Depress the right mouse button inside the display window.
.pp
\fBTo move the vertical bar to the mouse inside the display window:\fR
.ip
Depress the left mouse button inside the display window.
.pp
\fBTo redraw the Gantt chart:\fR
.ip
Type control-L inside the display window.
.pp
\fBTo exit the Gantt chart display program:\fR
.ip
Type control-D inside the display window.
.pp
There are a fixed (hard-coded) number of colors available for coloring
processors and highlighting icons. If the number of processors
exceeds the number of available colors, the Gantt chart will be
displayed in black and white.
.pp
On monochrome displays, highlighting of stars does not work.

