Coder must write an application that rapidly redistributes the *content* of a set of up to 30,000 input files into a new set of about 9,000 output files. This is a simple text-sorting job, but it operates at very large scale, and its performance is limited by the large number of output files required.
## Deliverables
File Reprocessing Project
1. DESCRIPTION. Coder must write an application that rapidly redistributes the *content* of a set of up to 30,000 input files into a new set of about 9,000 output files. See the attachment "Simplified Concept of Operation" to understand the basics of this job request.
2. GENERAL SCHEME. The Input Directory has numerous Input Files, all with the same structure. Each row of an Input File begins with a YYYY-MM-DD HH:NN:SS timestamp. The timestamp determines the Output File: the application either creates a new Output File named [login to view URL] or appends to an existing [login to view URL] file. The contents of the Input File row starting at Col. 27 are placed into the appropriate Output File (see example below), and the row is terminated with an equal sign (=). Times as given in HH:NN:SS must be rounded to the nearest hour (rolling over to the next calendar date when 23:30-23:59 rounds to 00:00).
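As an illustration of the rounding and column rules above, here is a minimal sketch in Python (used only for brevity; the deliverable itself must be Delphi). The function names are hypothetical. It assumes each row begins with a 19-character timestamp and that a time of exactly HH:30:00 rounds up, as the 23:30 case implies.

```python
from datetime import datetime, timedelta
from typing import Tuple


def round_to_nearest_hour(stamp: str) -> datetime:
    """Round a 'YYYY-MM-DD HH:NN:SS' stamp to the nearest hour.

    Assumption: minute 30 and later rounds up. Adding a timedelta handles
    the rollover into the next day, month, or year automatically.
    """
    t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    floor = t.replace(minute=0, second=0)
    return floor + timedelta(hours=1) if t.minute >= 30 else floor


def convert_row(row: str) -> Tuple[datetime, str]:
    """Return (rounded hour, output text) for one input row."""
    hour = round_to_nearest_hour(row[:19])
    # Col. 27 is 1-based, so the payload starts at 0-based index 26.
    return hour, row[26:].rstrip("\r\n") + "="
```

With the sample SPECI row from the Example of Input File, the 07:35:00 timestamp rounds to 08:00 and the output text begins at `KFMN`.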
3. THIS PROJECT REQUIRES I/O EFFICIENCY. The Input Directory will contain as many as 30,000 files with as many as 1000 rows in each file, and the Output Directory will likely comprise about 8,760 files (one for each hour of a given year). Coder will have to use strategies that reduce the I/O bottlenecks caused by dealing with multiple files. See "Project Acceptance Requirements" below.
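One possible strategy for reducing the file-handling overhead, sketched here in Python for illustration only (the deliverable must be Delphi): hold converted rows in memory, grouped by output path, and append them in large batches so that each Output File is opened once per batch rather than once per row. The class name and the 64 MB default budget are assumptions, not part of the specification.

```python
import os
from collections import defaultdict


class BufferedAppender:
    """Collect lines per output path and append them in large batches."""

    def __init__(self, max_buffered_chars=64 * 1024 * 1024):
        # Assumed memory budget; tune it to the RAM of the target machine.
        self.max_buffered_chars = max_buffered_chars
        self.pending = defaultdict(list)
        self.buffered_chars = 0

    def add(self, path, line):
        """Queue one already-terminated line for the given output path."""
        self.pending[path].append(line)
        self.buffered_chars += len(line)
        if self.buffered_chars >= self.max_buffered_chars:
            self.flush()

    def flush(self):
        """Append all queued lines, opening each file once per flush."""
        for path, lines in self.pending.items():
            # newline="" writes the lines exactly as queued.
            with open(path, "a", newline="") as f:
                f.write("".join(lines))
        self.pending.clear()
        self.buffered_chars = 0
```

The caller is expected to call `flush()` once more after the last Input File has been read. A larger budget means fewer open/close cycles per Output File, at the cost of memory.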
4. INPUT SCHEME. All Input Files will reside in a single directory. The user must be allowed to choose the input directory location. Linefeeds will be in LF Linux form.
5. OUTPUT SCHEME. Output Files may be written either to a single directory or to a directory with subfolders named by month. The user must be allowed to choose the output directory location. Each Output File must have a text header as shown in the Example Code below ("METAR ARCHIVE", etc). Linefeeds must be in Windows CR+LF form.
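The output rules above (header on a newly created file, CR+LF line endings, optional month subfolder) could look roughly like the following Python sketch, given for illustration only since the deliverable must be Delphi. The function name is hypothetical. The header lines, the file name, and the subfolder name are supplied by the caller, because their exact form is defined by the Example Code and is not reproduced here.

```python
import os


def append_with_header(directory, file_name, header_lines, rows):
    """Append rows to directory/file_name using CR+LF line endings.

    The header is written only when the file is being created, so later
    appends to an existing Output File do not repeat it.
    """
    # 'directory' may be the output root or a month subfolder beneath it.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, file_name)
    is_new = not os.path.exists(path)
    # newline="" disables newline translation, so "\r\n" is written as-is.
    with open(path, "a", newline="") as f:
        if is_new:
            for line in header_lines:
                f.write(line + "\r\n")
        for row in rows:
            f.write(row + "\r\n")
    return path
```

Because the Input Files use LF line endings, any trailing LF must be stripped from each row before it is passed in; otherwise the output would contain mixed line endings.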
6. CODING. Application will be written in Delphi with source & Win XP/2000 executable delivered. A 25% bonus will be awarded if the source compiles in Delphi 5 without any third-party components.
7. EXAMPLE CODE. Example Code can be found below ("Example Code"). This was written by me in about 30 minutes and fulfills the basic requirements of this project but at an unacceptably slow speed. The Coder is welcome to improve this code or rewrite it from scratch.
8. EXAMPLE INPUT DATASET. An Example Input Dataset containing 7600 files can be found here: [login to view URL] (67 MB) and should be used by the Coder to test their code's performance.
9. EXAMPLE OF INPUT FILE. Here are a couple of lines from a random Input File.
FILE: [login to view URL]
1982-08-12 07:00:00|METAR KFMN 120700Z 35008KT 15SM -TSRA BKN060 19/08 A3019 RMK LTGCGCC ALQDS SLPNO T01890083 54039
1982-08-12 07:35:00|SPECI KFMN 120735Z 35008KT 15SM BKN060 A3016 RMK TE15 RE05 WSHFT SLPNO
1982-08-12 08:00:00|METAR KFMN 120800Z 35008KT 15SM BKN/// 18/14 A//// RMK SLPNO T01830139
1982-08-12 09:00:00|METAR KFMN 120900Z 10006KT 15SM BKN060 17/11 A3019 RMK SLPNO T01720106
10. EXAMPLE OF OUTPUT FILE. Here are a couple of lines from a random Output File.
FILE: [login to view URL]
K28T 311100Z 22008KT 10SM SCT025 28/24 A3007 RMK SLPNO T02830244=
K2C2 311100Z 00000KT 20SM -SHRA BKN060 OVC100 21/18 A3022 RMK INTMT RW- SLP167 T02110178=
K3B1 311100Z 19001KT 6SM CLR 14/13 A3000 RMK SLPNO T01440128=
11. PROJECT ACCEPTANCE REQUIREMENTS. The Example Code takes 12 HOURS to fully process the Example Input Dataset on an ordinary Windows machine (2.8 GHz Pentium). The project will be accepted only on condition that the Example Input Dataset can be fully processed in less than 1 hour (i.e. a speedup of at least 12x) on this machine or a similar one. We will also test the code in-house to ensure that it scales up acceptably to our full-size sets (30,000 files per set). The project will not be accepted if errors (stack overflows, inadequate array sizes, etc.) appear or if performance is seriously degraded by the larger number of files. Coder will be allowed to download one of these larger Input File sets if needed (200 MB).
12. DELIVERABLES: Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
## Platform
Delphi, Win32.