ARM Linux Process Scheduling (2.4.x)

Date: 2011-02-25  Source: Internet


Original author: FireAngel

I have been studying ARM Linux for a while and would like to share what I have learned about process management. Corrections are welcome.

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

Process Creation and Termination
Process Scheduling and Dispatching
Process Switching
Process Synchronization and support for interprocess communication
Management of process control block

------- from <Operating Systems: Internals and Design Principles>

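Process Scheduling

Linux 2.4.x is a time-sharing, multitasking operating system with a non-preemptible kernel. User processes are scheduled preemptively, but code running in kernel mode is not: it keeps the CPU until it blocks, yields, or returns to user mode. If a kernel-mode thread hogs the CPU and never releases it, the system cannot recover, so real-time behaviour is weak. Linux 2.6 improves on this by making the kernel itself preemptible.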

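The kernel provides three scheduling policies, chosen according to how real-time a task is:

1. SCHED_OTHER is for non-real-time tasks and uses conventional time-sharing scheduling;

2. SCHED_FIFO is for short real-time tasks and uses first-in, first-out scheduling: unless a higher-priority process becomes runnable, the process keeps the CPU until it blocks, yields, or exits;

3. SCHED_RR is for longer real-time tasks. Because such tasks run for a long time, FIFO is unsuitable, so round-robin scheduling is used instead: when the process is scheduled out after using up its time slice, it is placed at the tail of the run queue so that other real-time processes get a chance to run.

Note that SCHED_FIFO and SCHED_RR do not differ in priority; the main difference is the length of the tasks they suit. In addition, the policy field of task_struct carries a SCHED_YIELD bit; when set, it indicates that the process has voluntarily given up the CPU.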
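To make the SCHED_YIELD remark concrete, here is a small user-space sketch of how a flag bit can share the policy field with the policy value. The constant values are what I remember from the 2.4 headers, so treat them as assumptions; struct task_model and the helper functions are hypothetical names invented for this illustration and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Values assumed to match 2.4's <linux/sched.h>; check your own tree. */
#define SCHED_OTHER 0
#define SCHED_FIFO  1
#define SCHED_RR    2
#define SCHED_YIELD 0x10    /* flag bit OR-ed into policy, not a policy itself */

/* Hypothetical stand-in for the one task_struct field we care about. */
struct task_model {
    unsigned long policy;
};

/* Mark the task as having given up the CPU voluntarily. */
void task_yield(struct task_model *t)
{
    assert(t != NULL);
    t->policy |= SCHED_YIELD;
}

/* Clear the flag again, as schedule() does with "policy &= ~SCHED_YIELD". */
void task_clear_yield(struct task_model *t)
{
    assert(t != NULL);
    t->policy &= ~(unsigned long)SCHED_YIELD;
}

/* Scheduling class with the yield flag masked off. */
unsigned long task_class(const struct task_model *t)
{
    assert(t != NULL);
    return t->policy & ~(unsigned long)SCHED_YIELD;
}

/* Non-zero if the yield flag is currently set. */
int task_has_yielded(const struct task_model *t)
{
    assert(t != NULL);
    return (t->policy & SCHED_YIELD) != 0;
}
```

For example, after task_yield() on a SCHED_RR task, task_class() still reports SCHED_RR while task_has_yielded() is true; task_clear_yield() restores the plain policy value.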

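On top of these three policies, processes are scheduled according to their priority. A priority is a simple integer: a weight assigned to each process to make it easy to decide which process should get the CPU. The higher the priority, the better the process's chance of getting CPU time.

In Linux a non-real-time process has two priorities, a static priority and a dynamic priority. A real-time process adds a third, the real-time priority.

1. Static priority (priority): called "static" because it does not change over time and can only be modified by the user. It gives the maximum time slice the process is allowed before it is forced to compete with other processes for the CPU (20). In the 2.4 code shown in this article the corresponding field is nice, and the slice is derived from it with NICE_TO_TICKS().

2. Dynamic priority (counter): counter is the time slice the system allots to the process, and Linux also uses it as the process's dynamic priority. While the process holds the CPU, counter decreases with time; when it reaches 0 the process is marked for rescheduling. It gives the time remaining in the current slice (initially 20).

3. Real-time priority (rt_priority): Linux adds 1000 to rt_priority to obtain the weight of a real-time process. A process with a higher weight always runs before one with a lower weight; the weight of a non-real-time process is far below 1000, so real-time processes always go first.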
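The way these three values combine into one weight can be sketched in user space as follows. This is a simplified model of what goodness() does as far as I understand it; it leaves out the SMP and same-address-space bonuses. struct task_model and goodness_model() are hypothetical names, and the constant values are assumed to match the 2.4 headers.

```c
#include <assert.h>
#include <stddef.h>

/* Values assumed to match 2.4's <linux/sched.h>. */
#define SCHED_OTHER 0
#define SCHED_FIFO  1
#define SCHED_RR    2
#define SCHED_YIELD 0x10

/* Hypothetical, reduced view of the scheduling fields of task_struct. */
struct task_model {
    unsigned long policy;       /* SCHED_* value, possibly with SCHED_YIELD */
    long counter;               /* ticks left in the current time slice */
    long nice;                  /* -20 (favoured) .. 19 (least favoured) */
    unsigned long rt_priority;  /* only meaningful for real-time policies */
};

/* Simplified weight calculation: larger means "run me first". */
int goodness_model(const struct task_model *p)
{
    assert(p != NULL);
    if (p->policy & SCHED_YIELD)
        return -1;                          /* yielded: prefer any other task */
    if (p->policy == SCHED_OTHER) {
        if (p->counter == 0)
            return 0;                       /* time slice used up */
        return (int)(p->counter + 20 - p->nice);
    }
    return (int)(1000 + p->rt_priority);    /* real-time: always above 1000 */
}
```

With these rules an ordinary task with counter 6 and nice 0 gets weight 26, while even the lowest real-time priority yields at least 1000, which is why a runnable real-time process always wins.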

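On every tick (that is, every timer interrupt) the kernel decrements the counter of the process currently holding the CPU. If counter reaches 0, need_resched is set to 1 and scheduling takes place on the way back from the interrupt. update_process_times() is a helper called from the timer interrupt handler: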

void update_process_times(int user_tick)
{
    struct task_struct *p = current;
    int cpu = smp_processor_id(), system = user_tick ^ 1;

    update_one_process(p, user_tick, system, cpu);
    if (p->pid) {
        /* charge one tick; when the slice is used up, ask for a reschedule */
        if (--p->counter <= 0) {
            p->counter = 0;
            p->need_resched = 1;
        }
        if (p->nice > 0)
            kstat.per_cpu_nice[cpu] += user_tick;
        else
            kstat.per_cpu_user[cpu] += user_tick;
        kstat.per_cpu_system[cpu] += system;
    } else if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
        kstat.per_cpu_system[cpu] += system;
}
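The counter bookkeeping above can be tried out in user space with a tiny model. It only mirrors the "decrement, clamp at 0, set need_resched" step; struct task_model, tick_model() and ticks_until_resched() are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced task descriptor. */
struct task_model {
    int  pid;           /* 0 is the idle task, which is never charged */
    long counter;       /* ticks left in the current time slice */
    int  need_resched;  /* set once the slice has run out */
};

/* Charge one timer tick to the running task; returns need_resched. */
int tick_model(struct task_model *p)
{
    assert(p != NULL);
    if (p->pid != 0 && --p->counter <= 0) {
        p->counter = 0;
        p->need_resched = 1;
    }
    return p->need_resched;
}

/* Deliver ticks until a reschedule is requested or max_ticks is reached. */
int ticks_until_resched(struct task_model *p, int max_ticks)
{
    int n = 0;

    assert(p != NULL);
    while (n < max_ticks && !p->need_resched) {
        tick_model(p);
        n++;
    }
    return n;
}
```

A task that starts with counter = 6 asks for a reschedule after exactly 6 ticks, which is 60 ms with the 10 ms tick mentioned in this article; the idle task (pid 0) never does.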

Process scheduling in Linux is implemented in schedule(). It is called from the following ARM assembly fragment:

/*
 * This is the fast syscall return path.  We do as little as
 * possible here, and this includes saving r0 back into the SVC
 * stack.
 */
ret_fast_syscall:
        ldr     r1, [tsk, #TSK_NEED_RESCHED]
        ldr     r2, [tsk, #TSK_SIGPENDING]
        teq     r1, #0                          @ need_resched || sigpending
        teqeq   r2, #0
        bne     slow
        fast_restore_user_regs

/*
 * Ok, we need to do extra processing, enter the slow path.
 */
slow:   str     r0, [sp, #S_R0+S_OFF]!          @ returned r0
        b       1f

/*
 * "slow" syscall return path.  "why" tells us if this was a real syscall.
 */
reschedule:
        bl      SYMBOL_NAME(schedule)
ENTRY(ret_to_user)
ret_slow_syscall:
        ldr     r1, [tsk, #TSK_NEED_RESCHED]
        ldr     r2, [tsk, #TSK_SIGPENDING]
1:      teq     r1, #0                          @ need_resched => schedule()
        bne     reschedule                      @ call schedule() if a reschedule is needed
        teq     r2, #0                          @ sigpending => do_signal()
        blne    __do_signal
        restore_user_regs
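For readers less used to ARM assembly, the control flow of the return path above can be written out in C roughly like this. It is only a model: struct task_model, the hook type and the demo hooks are hypothetical, and the hooks stand in for schedule() and do_signal().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced task descriptor. */
struct task_model {
    int need_resched;
    int sigpending;
    int schedule_calls;   /* bookkeeping for the demo hooks below */
    int signal_calls;
};

typedef void (*hook_fn)(struct task_model *);

/* ret_fast_syscall: the fast path is taken only if both flags are clear. */
int can_take_fast_path(const struct task_model *tsk)
{
    assert(tsk != NULL);
    return !tsk->need_resched && !tsk->sigpending;
}

/*
 * ret_slow_syscall: call schedule() for as long as need_resched is set
 * ("bne reschedule" falls back into the same test), then handle pending
 * signals ("blne __do_signal"), then return to user mode.
 */
void ret_to_user_model(struct task_model *tsk, hook_fn schedule, hook_fn do_signal)
{
    assert(tsk != NULL && schedule != NULL && do_signal != NULL);
    while (tsk->need_resched)
        schedule(tsk);
    if (tsk->sigpending)
        do_signal(tsk);
    /* restore_user_regs would follow here */
}

/* Demo hooks: clear the flag they serve and count the call. */
void demo_schedule(struct task_model *t)  { t->need_resched = 0; t->schedule_calls++; }
void demo_do_signal(struct task_model *t) { t->sigpending = 0;   t->signal_calls++; }
```

Calling ret_to_user_model() on a task with both flags set runs each demo hook once; with both flags clear, can_take_fast_path() is true and neither hook would be needed.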

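This code is executed on every return from an interrupt or a system call. Overall, schedule() is invoked in the following situations:

1. On a process state change, such as termination or sleep: when a process calls sleep(), exit(), or a similar function that changes its state, that function itself calls schedule().

2. When a new process is added to the run queue: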

ENTRY(ret_from_fork)
        bl      SYMBOL_NAME(schedule_tail)
        get_current_task tsk
        ldr     ip, [tsk, #TSK_PTRACE]          @ check for syscall tracing
        mov     why, #1
        tst     ip, #PT_TRACESYS                @ are we tracing syscalls?
        beq     ret_slow_syscall
        mov     r1, sp
        mov     r0, #1                          @ trace exit [IP = 1]
        bl      SYMBOL_NAME(syscall_trace)
        b       ret_slow_syscall                @ jump to the fragment shown above

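3. After a timer interrupt: at initialization Linux sets the system timer period to 10 ms. When the timer interrupt fires, the service routine timer_interrupt immediately calls do_timer(). do_timer() decrements the current process's counter by 1 and sets need_resched if counter reaches 0; schedule() is then called on the way back from the interrupt.

4. When a process returns from a system call to user mode: the need_resched flag is checked and, if it is set, schedule() is run. System calls are implemented with a software interrupt (SWI); the ARM handler for it is shown below.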

        .align  5
ENTRY(vector_swi)
        save_user_regs
        zero_fp
        get_scno

        enable_irqs ip

        str     r4, [sp, #-S_OFF]!              @ push fifth arg

        get_current_task tsk
        ldr     ip, [tsk, #TSK_PTRACE]          @ check for syscall tracing
        bic     scno, scno, #0xff000000         @ mask off SWI op-code
        eor     scno, scno, #OS_NUMBER << 20    @ check OS number
        adr     tbl, sys_call_table             @ load syscall table pointer
        tst     ip, #PT_TRACESYS                @ are we tracing syscalls?
        bne     __sys_trace

        adrsvc  al, lr, ret_fast_syscall        @ load the return address so that the sys_* routine
                                                @ returns to ret_fast_syscall in the fragment above
        cmp     scno, #NR_syscalls              @ check upper syscall limit
        ldrcc   pc, [tbl, scno, lsl #2]         @ call sys_* routine

        add     r1, sp, #S_OFF
2:      mov     why, #0                         @ no longer a real syscall
        cmp     scno, #ARMSWI_OFFSET
        eor     r0, scno, #OS_NUMBER << 20      @ put OS number back
        bcs     SYMBOL_NAME(arm_syscall)
        b       SYMBOL_NAME(sys_ni_syscall)     @ not private func

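5. When the kernel has finished handling an interrupt and the process returns to user mode.

6. When a process calls schedule() itself to request rescheduling.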
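As a side note to item 4, the decoding and table dispatch done by vector_swi (bic, eor, cmp, ldrcc) can be modelled in C as below. The handlers, table and function names are hypothetical. The value 9 for OS_NUMBER is what I recall for the ARM port, so take it as an assumption, and the out-of-range case is simplified to always end in the "not implemented" handler.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_OS_NUMBER 9UL      /* assumed value of OS_NUMBER */

typedef long (*syscall_fn)(long arg);

/* Hypothetical handlers; the real table holds the kernel's sys_* routines. */
long demo_sys_double(long arg) { return arg * 2; }
long demo_sys_negate(long arg) { return -arg; }
long demo_sys_ni(long arg)     { (void)arg; return -38; }  /* -ENOSYS on Linux */

static const syscall_fn demo_table[] = { demo_sys_double, demo_sys_negate };
#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/* "bic scno, scno, #0xff000000" followed by "eor scno, scno, #OS_NUMBER << 20". */
unsigned long decode_scno(unsigned long swi_insn)
{
    unsigned long scno = swi_insn & ~0xff000000UL;  /* drop condition and opcode bits */
    return scno ^ (DEMO_OS_NUMBER << 20);           /* strip the OS number */
}

/*
 * "cmp scno, #NR_syscalls; ldrcc pc, [tbl, scno, lsl #2]": an unsigned
 * compare, so anything outside the table falls through.
 */
long dispatch_model(unsigned long scno, long arg)
{
    if (scno < DEMO_NR_SYSCALLS) {
        assert(demo_table[scno] != NULL);
        return demo_table[scno](arg);
    }
    return demo_sys_ni(arg);
}
```

Under these assumptions the instruction word 0xef900004 decodes to system call number 4, and dispatch_model(0, 21) ends up in demo_sys_double().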

----------------------------------------------

Analysis of schedule():

/*
 * 'schedule()' is the scheduler function. It's a very simple and nice
 * scheduler: it's not perfect, but certainly works for most things.
 *
 * The goto is "interesting".
 *
 * NOTE!! Task 0 is the 'idle' task, which gets called when no other
 * tasks can run. It can not be killed, and it cannot sleep. The 'state'
 * information in task[0] is never used.
 */
asmlinkage void schedule(void)
{
    struct schedule_data * sched_data;
    struct task_struct *prev, *next, *p;
    struct list_head *tmp;
    int this_cpu, c;

    spin_lock_prefetch(&runqueue_lock);

    if (!current->active_mm) BUG();
need_resched_back:
    prev = current;
    this_cpu = prev->processor;

    if (unlikely(in_interrupt())) {
        printk("Scheduling in interrupt\n");
        BUG();
    }

    release_kernel_lock(prev, this_cpu);

    /*
     * 'sched_data' is protected by the fact that we can run
     * only one process per CPU.
     */
    sched_data = &aligned_data[this_cpu].schedule_data;

    spin_lock_irq(&runqueue_lock);

    /* move an exhausted RR process to be last.. */
    if (unlikely(prev->policy == SCHED_RR))
        /*
         * Round-robin policy: re-check whether counter is 0 and, if so,
         * refill it and move the process to the tail of the run queue.
         */
        if (!prev->counter) {
            prev->counter = NICE_TO_TICKS(prev->nice);
            move_last_runqueue(prev);
        }

    switch (prev->state) {
        case TASK_INTERRUPTIBLE:
            /*
             * If the state is TASK_INTERRUPTIBLE and a signal that can
             * wake the process is already pending, set the state back
             * to TASK_RUNNING.
             */
            if (signal_pending(prev)) {
                prev->state = TASK_RUNNING;
                break;
            }
        default:
            del_from_runqueue(prev);
        case TASK_RUNNING:;
    }
    prev->need_resched = 0;

    /*
     * this is the scheduler proper:
     */

repeat_schedule:
    /*
     * Default process to select..
     */
    next = idle_task(this_cpu);
    c = -1000;
    list_for_each(tmp, &runqueue_head) {
        /*
         * Walk the run queue looking for the process with the highest
         * weight; that process gets the CPU.
         */
        p = list_entry(tmp, struct task_struct, run_list);
        if (can_schedule(p, this_cpu)) {
            /*
             * In goodness(), a real-time process gets
             * weight = 1000 + p->rt_priority, so it always ranks
             * above any non-real-time process.
             */
            int weight = goodness(p, this_cpu, prev->active_mm);
            if (weight > c) /* note ">" rather than ">=": on equal weight the earlier entry wins */
                c = weight, next = p;
        }
    }

    /* Do we need to re-calculate counters? */
    if (unlikely(!c)) {
        /*
         * The best weight found is 0, so the counter of every process
         * (for_each_task) is recalculated.
         */
        struct task_struct *p;

        spin_unlock_irq(&runqueue_lock);
        read_lock(&tasklist_lock);
        for_each_task(p)
            p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
        read_unlock(&tasklist_lock);
        spin_lock_irq(&runqueue_lock);
        goto repeat_schedule;
    }

    /*
     * from this point on nothing can prevent us from
     * switching to the next task, save this fact in sched_data.
     */
    sched_data->curr = next;
    task_set_cpu(next, this_cpu);
    spin_unlock_irq(&runqueue_lock);

    if (unlikely(prev == next)) {
        /* We won't go through the normal tail, so do this by hand */
        prev->policy &= ~SCHED_YIELD;
        goto same_process;
    }

    kstat.context_swtch++;
    /*
     * there are 3 processes which are affected by a context switch:
     *
     * prev == .... ==> (last => next)
     *
     * It's the 'much more previous' 'prev' that is on next's stack,
     * but prev is set to (the just run) 'last' process by switch_to().
     * This might sound slightly confusing but makes tons of sense.
     */
    prepare_to_switch();
    {
        struct mm_struct *mm = next->mm;
        struct mm_struct *oldmm = prev->active_mm;
        if (!mm) { /* switching to a kernel thread: no page-table switch */
            if (next->active_mm) BUG();
            next->active_mm = oldmm;
            atomic_inc(&oldmm->mm_count);
            enter_lazy_tlb(oldmm, next, this_cpu);
        } else {
            if (next->active_mm != mm) BUG();
            switch_mm(oldmm, mm, next, this_cpu); /* user process: switch page tables */
        }

        if (!prev->mm) {
            prev->active_mm = NULL;
            mmdrop(oldmm);
        }
    }

    /*
     * This just switches the register state and the stack.
     */
    switch_to(prev, next, prev);
    __schedule_tail(prev);

same_process:
    reacquire_kernel_lock(current);
    if (current->need_resched)
        goto need_resched_back;
    return;
}
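The selection loop and the counter recalculation are easier to follow in isolation, so here is a user-space sketch of just those two steps for SCHED_OTHER tasks. It is a model under assumptions: the NICE_TO_TICKS definition is the one I believe 2.4 uses when HZ is below 200, the recalculation here touches only the array passed in (the kernel walks every task), and struct task_model, weight_of() and pick_next_model() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match 2.4 for HZ < 200, where TICK_SCALE(x) is (x) >> 2. */
#define NICE_TO_TICKS(nice) (((20 - (nice)) >> 2) + 1)

/* Hypothetical, reduced task descriptor. */
struct task_model {
    long counter;   /* ticks left in the current time slice */
    long nice;      /* -20 .. 19 */
};

/* Simplified weight for a SCHED_OTHER task (no SMP or mm bonus). */
static int weight_of(const struct task_model *p)
{
    return p->counter ? (int)(p->counter + 20 - p->nice) : 0;
}

/*
 * Return the index of the task with the highest weight, or -1 when the
 * queue is empty (the idle task would run). If the best weight is 0,
 * every counter is refreshed and the scan repeated, as at repeat_schedule.
 */
int pick_next_model(struct task_model *rq, size_t n)
{
    assert(rq != NULL || n == 0);
    for (;;) {
        int c = -1000, next = -1;

        for (size_t i = 0; i < n; i++) {
            int w = weight_of(&rq[i]);
            if (w > c) {            /* strict ">": the earlier entry wins ties */
                c = w;
                next = (int)i;
            }
        }
        if (c != 0)
            return next;
        /* all runnable tasks have used up their slices: hand out new ones */
        for (size_t i = 0; i < n; i++)
            rq[i].counter = (rq[i].counter >> 1) + NICE_TO_TICKS(rq[i].nice);
    }
}
```

With two exhausted tasks at nice 0 and nice 10, the first scan finds a best weight of 0, the counters become 6 and 3, and the second scan picks the nice 0 task.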

----------------------------------------------
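ARM Linux Process Scheduling (3)

switch_mm switches the page tables: it loads the physical start address of the next process's pgd into register c2 of CP15 (the translation table base). The virtual address of the pgd is held in the pgd pointer of the process's mm_struct, and the __virt_to_phys macro converts it to a physical address.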

static inline void
switch_mm(struct mm_struct *prev, struct mm_struct *next,
          struct task_struct *tsk, unsigned int cpu)
{
    if (prev != next)
        cpu_switch_mm(next->pgd, tsk);
}

#define cpu_switch_mm(pgd,tsk) cpu_set_pgd(__virt_to_phys((unsigned long)(pgd)))

#define cpu_get_pgd()                           \
    ({                                          \
        unsigned long pg;                       \
        __asm__("mrc p15, 0, %0, c2, c0, 0"     \
                : "=r" (pg));                   \
        pg &= ~0x3fff;                          \
        (pgd_t *)phys_to_virt(pg);              \
    })
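The masking with ~0x3fff in cpu_get_pgd() works because the ARM first-level page table has 4096 four-byte entries, that is 16 KB, and is aligned on a 16 KB boundary. The arithmetic can be checked in user space with the sketch below. The two offsets are made-up example values for a linear kernel mapping and do not come from any particular board; all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 16 KB first-level table: the low 14 bits of its address are always 0. */
#define PGD_ALIGN_MASK 0x3fffu

/* Made-up example values for a linear kernel mapping (assumptions). */
#define DEMO_PHYS_OFFSET 0xa0000000u
#define DEMO_PAGE_OFFSET 0xc0000000u

/* Strip the low 14 bits of a value read from CP15 c2. */
uint32_t ttb_to_pgd_phys(uint32_t ttb)
{
    return ttb & ~(uint32_t)PGD_ALIGN_MASK;
}

/* Linear mapping: virtual = physical - PHYS_OFFSET + PAGE_OFFSET. */
uint32_t demo_phys_to_virt(uint32_t phys)
{
    assert(phys >= DEMO_PHYS_OFFSET);
    return phys - DEMO_PHYS_OFFSET + DEMO_PAGE_OFFSET;
}

/* Inverse of the mapping above, as used when loading a new pgd. */
uint32_t demo_virt_to_phys(uint32_t virt)
{
    assert(virt >= DEMO_PAGE_OFFSET);
    return virt - DEMO_PAGE_OFFSET + DEMO_PHYS_OFFSET;
}
```

For instance, a register value of 0xa0004018 gives a table base of 0xa0004000, which this example mapping places at virtual address 0xc0004000.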

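switch_to() performs the context switch by calling the assembly routine __switch_to. The implementation is fairly simple: it saves the context of prev, which is described by struct context_save_struct and covers the main registers, and then loads the context of next. The context is recorded through thread.save in task_struct; TSS_SAVE is the offset of thread.save within task_struct.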
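To show what an offset constant such as TSS_SAVE stands for, here is a user-space sketch built on offsetof(). The structure layouts are heavily reduced and hypothetical (the real task_struct and thread_struct have many more fields), and the DEMO_ names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frame: cpsr plus the callee-saved registers and return pc. */
struct context_save_model {
    unsigned long cpsr;
    unsigned long r4, r5, r6, r7, r8, r9, sl, fp;
    unsigned long pc;
};

/* Hypothetical, reduced thread state. */
struct thread_model {
    unsigned long domain;
    struct context_save_model *save;    /* saved frame on the kernel stack */
};

/* Hypothetical, reduced task descriptor. */
struct task_model {
    long state;
    long counter;
    struct thread_model thread;
};

/* Byte offsets that assembly code can use with a task pointer in a register. */
#define DEMO_TSS_SAVE   offsetof(struct task_model, thread.save)
#define DEMO_TSS_DOMAIN offsetof(struct task_model, thread.domain)

/* What an "ldr sp, [r1, #TSS_SAVE]" style load computes: the word at task + offset. */
struct context_save_model *load_saved_sp(const struct task_model *next)
{
    const char *base = (const char *)next;

    assert(next != NULL);
    return *(struct context_save_model * const *)(base + DEMO_TSS_SAVE);
}
```

load_saved_sp() returns exactly what next->thread.save holds, but reaches it the way assembly has to: by adding a precomputed byte offset to the task pointer.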

/*
* Register switch for ARMv3 and ARMv4 processors
* r0 = previous, r1 = next, return previous.
* previous and next are guaranteed not to be the same.
*/
ENTRY(__switch_to)
        stmfd   sp!, {r4 - sl, fp, lr}          @ Store most regs on stack
        mrs     ip, cpsr
        str     ip, [sp, #-4]!                  @ Save cpsr_SVC
        str     sp, [r0, #TSS_SAVE]             @ Save sp_SVC
        ldr     sp, [r1, #TSS_SAVE]             @ Get saved sp_SVC
        ldr     r2, [r1, #TSS_DOMAIN]

/*
 * Returns amount of memory which needs to be reserved.
 */

long ed_init(long mem_start, int mem_end)
{
int i,
ep;

short tshort,
version,
length,
s_ofs;

if (register_blkdev(EPROM_MAJOR,"ed",&ed_fops)) {
printk("EPROMDISK: Unable to get major %d.\n", EPROM_MAJOR);
return 0;
}
blk_dev[EPROM_MAJOR].request_fn = DEVICE_REQUEST;

for(i=0;i 4) {
printk("EPROMDISK: Length (%d) Too short.\n", length);
return 0;
}

ed_length = length * 512;
sector_map = ep + 6;
sector_offset = ep + s_ofs;

printk("EPROMDISK: Version %d installed, %d bytes\n", (int)version, ed_length);
return 0;
}

int get_edisk(unsigned char *buf, int sect, int num_sect)
{
short ss, /* Sector start */
tshort;
int s; /* Sector offset */

for(s=0;s0;) {
sock = bp / EPROM_SIZE;
page = (bp % EPROM_SIZE) / EPAGE_SIZE;
offset = bp % EPAGE_SIZE;

nb = (len + offset) > EPAGE_SIZE ? EPAGE_SIZE - (offset % EPAGE_SIZE) : len;

cr1 = socket[sock] | ((page << 4) & 0x30) | 0x40; /* no board select for now */
cr2 = (page >> 2) & 0x03;
outb((char)cr1,CONTROL_REG1);
outb((char)cr2,CONTROL_REG2);

memcpy(buf + bofs, (char *)(EPROM_WINDOW + offset), nb);

len -= nb;
bp += nb;
bofs += nb;
}
return 0;
}

The code of med.c is as follows:
/* med.c - make eprom disk image from ramdisk image */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define DISK_SIZE (6291456)
#define NUM_SECT (DISK_SIZE/512)

void write_eprom_image(FILE *fi, FILE *fo);

int main(int ac, char **av)
{
FILE *fi,
*fo;

char fin[44],
fon[44];

if (ac > 1) {
strcpy(fin,av[1]);
} else {
strcpy(fin,"hda3.ram");
}

if (ac > 2) {
strcpy(fon,av[2]);
} else {
strcpy(fon,"hda3.eprom");
}

fi = fopen(fin,"r");
fo = fopen(fon,"w");

if (fi == 0 || fo == 0) {
printf("Can't open files\n");
exit(0);
}

write_eprom_image(fi,fo);

fclose(fi);
fclose(fo);
}

void write_eprom_image(FILE *fi, FILE *fo)
{
char *ini;
char *outi; /* In and out images */
short *smap; /* Sector map */
char *sp;
char c = 0;

struct {
unsigned short version;
unsigned short blocks;
unsigned short sect_ofs;
} hdr;

int ns,
s,
i,
fs;

ini = (char *)malloc(DISK_SIZE); /* Max disk size is currently 6M */
outi = (char *)malloc(DISK_SIZE); /* Max disk size is currently 6M */
smap = (short *)malloc(NUM_SECT*sizeof(short));

if (ini == NULL || outi == NULL || smap == NULL) {
printf("Can't allocate memory :(\n");
exit(0);
}

if (DISK_SIZE != fread(ini,1,DISK_SIZE,fi)) {
printf("Can't read input file :(\n");
exit(0);
}

memcpy(outi,ini,512); /* Copy in first sector */
smap[0] = 0;
ns = 1; /* Number of sectors in outi */
