CmSc 250 Intro to Algorithms
External Sorting
Usually several passes are needed, creating larger sorted blocks
until the whole file is sorted
Assumptions:
four tapes:
N records on Ta1
M records can fit in the memory
Step 1: Break the file into blocks of size M, [N/M]+1 blocks
Step 2: Sorting the blocks:
Each sorted block is called a run.
Each output tape will contain half of the runs
Step 3: Merge:
Read two records in main memory, compare, store the smaller
on Ta1
Read the next record (from Tb1 or Tb2 - the tape that contained the
record
stored on Ta1) compare, store on Ta1, etc.
Now Ta1 and Ta2 will contain sorted runs twice the size of the previous runs on Tb1 and Tb2
Example of a basic external sorting
The algorithm requires [log(N/M)] passes plus the initial run-constructing pass.
Each pass merges runs of length r to obtain runs of length 2*r.
The first runs are of length M. The last run would be of length N.
Let's assume that N is a multiple of M.
Initial situation:
1st tape contains N records = M records * N/M runs
After storing the runs on two tapes, each contains half of the runs:
2 tapes * M records_per_run * (1/2)(N/M) runs = N records
After merge 1st pass - double the length of the runs, halve the number of the runs:
2 tapes * 2M records_per_run * (1/2)(1/2)(N/M) runs = N records
After merge 2nd pass :
2 tapes * 4M records_per_run * (1/4)(1/2) (N/M) runs = N records
......
After merge s-th pass:
2 tapes * 2s M records_per_run * (1/2s)(1/2)(N/M) runs = N records
......
Thus the length of the runs after the s-th merge is 2sM.
After the last merge there is only one run equal to the whole file:
2sM = N
2s = N/M
s = log(N/M)
if s is the last merge, s = log(N/M)
At each pass we process N records, so the complexity is O(Nlog(N/M))
The basic algorithm is 2-way merge - we use 2 output tapes.
Assume that we have k tapes - then the number of passes will be reduced -
logk(N/M)
At a given merge step we merge the first k runs,
then the second k runs, etc.
The task here: to find the smallest element out of k elements
Solution: priority queues
Idea: Take the smallest elements from the first k
runs,
store them into main memory
in a heap tree.
Then repeatedly output the smallest element from the heap.
The smallest element is replaced with the next element
from the run from which it came.
Learning Goals