Mfsort quick how-to
Summary: If you use Net Express and/or Server Express you may come across moments when you need to use mfsort. I do, so this is my quick how-to on running mfsort stand-alone. It's written primarily from a unix perspective but should be applicable to the windows platform as well.
Introduction
Mfsort is part of the Micro Focus IDEs Net Express (windows) and Server Express (unix). It allows you to sort and merge data files and almost completely emulates IBM's Dfsort product.
The problem with the Micro Focus documentation on mfsort is that it is too brief. The examples are few and the syntax rules are sketchy at best. They even point to IBM's dfsort documentation at times. The problem with that is that it is too mainframe oriented and deals a lot with stuff that is of absolutely no help to you.
The documentation problems mentioned above are the reason for this quick how-to. It makes the following assumptions:
- You're running mfsort stand-alone from the unix prompt or from a shell script.
- The mfsort instructions are always contained in a file.
- You never use mfsort to create reports.
Running mfsort
It is possible to run mfsort and specify instructions directly on the command line. This can cause problems with the shell's quoting rules. In order to avoid this we always put our sort instructions in a file.
$ mfsort take filename
C:\> mfsort take filename
Mfsort logging
If you do nothing at all, mfsort creates a file called SYSOUT in your working directory to collect the sort log output. This is undesirable in most cases. To get around it, do the following:
$ export dd_SYSOUT=/var/tmp/mysort.log
$ mfsort take mysort
C:\> set dd_SYSOUT=%TEMP%\mysort.log
C:\> mfsort take mysort
This places the mfsort log output into the named file instead of into the generic SYSOUT file.
Overview of a mfsort program
This is our preferred way of writing mfsort instructions. It is almost the same as the output you get if you run mfsort on its own from the command prompt. A few changes have been made to fix errors, clarify usage and enhance readability.
- Text in italics is user input, i.e. start
- Text in <angle brackets|with pipes> shows possible choices, i.e. a field can be sorted <a|d>, ascending or descending
- ... means that the declaration can be repeated
- Text in [square brackets] means that the declaration is optional or only used in certain cases, i.e. the key definition is only used for indexed files
mfsort take pathname
sort|merge fields (start,length,type,<a|d>, ...)| option copy
[record <f|v>,min,max]
use pathname org <ls|sq|rl|ix>
record <f|v>,min,max
[key (start,length,<p|a|ad|c>,...)]
give (as use)
include|omit cond=(condition) [format=type]
inrec fields=(field-spec...)
outrec fields=(field-spec...)
outfil
give
startrec start-record
endrec end-record
<save|<include|omit> (condition) [format=type]>
split
outrec
Writing comments
An asterisk * somewhere on the line means that the rest of that line is a comment.
Line continuation
As long as the last character of the operand field on a line is a comma or semicolon followed by a blank, the program assumes that the next line is a continuation line.
Specifying sort fields
Syntax of the sort field instruction:
fields(start,length,type,<a|d>,...)
Where, start is the starting position of the field in the record, counting bytes from 1, length is the length of the field in bytes, type is the type of data in the field, and finally you specify ordering of output, either ascending or descending.
Field types (short list) | |
Type | Cobol declaration |
bi | pic 9 comp |
ch | pic X display |
ls | pic S9 leading separate |
nu | pic 9 display |
pd | pic S9 comp-3 |
sb | pic S9 comp |
ts | pic S9 trailing separate |
You can specify up to 16 fields by repeating the parameter sets, separated by commas.
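To make the meaning of start, length and ordering concrete, here is a minimal Python sketch (not mfsort itself) that sorts text records on several fields. It only handles character (ch) fields, and the function name sort_records and the sample records are invented for illustration.

```python
from functools import cmp_to_key

def sort_records(records, fields):
    """Sort records on (start, length, order) tuples; start counts from 1.

    order is 'a' (ascending) or 'd' (descending). Only plain character
    comparison is sketched here; binary and packed types are not handled.
    """
    def compare(left, right):
        for start, length, order in fields:
            a = left[start - 1:start - 1 + length]
            b = right[start - 1:start - 1 + length]
            if a != b:
                result = -1 if a < b else 1
                return result if order == "a" else -result
        return 0  # all sort fields are equal

    return sorted(records, key=cmp_to_key(compare))

records = ["SMITH 030", "JONES 020", "SMITH 010", "ADAMS 020"]
# Comparable to: sort fields=(1,5,ch,a,7,3,ch,d)
for line in sort_records(records, [(1, 5, "a"), (7, 3, "d")]):
    print(line)
```

The second field only decides the order when the first field is equal, which is why the two SMITH records come out with 030 before 010.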
Sort deleting duplicates
How to sort and delete duplicates isn't mentioned anywhere in the documentation for mfsort. The solution was found in the IBM manual Getting started with DFSORT R14. As can be seen, it isn't exactly obvious how to do it.
Apart from summing values, you can also use sum to delete records with duplicate control fields.
By specifying fields=none on the sum statement, as shown below, mfsort writes only one record per key/sort field:
sort fields=(106,4,ch,a)
sum fields=none
The code above sorts the file in ascending order on a four-character alphanumeric field starting at position 106. Because of sum fields=none, each key value occurs only once in the output file; all duplicates are deleted.
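The effect can be pictured with a small Python sketch. It is only an illustration of "one record per key": the name sort_unique and the sample data are made up, and the sketch keeps the first record it meets for each key, whereas which of the duplicates mfsort keeps is not something the documentation tells us.

```python
def sort_unique(records, start, length):
    """Sort on one character field and keep one record per key value."""
    def key(record):
        return record[start - 1:start - 1 + length]

    result = []
    seen = set()
    for record in sorted(records, key=key):  # sorted() is stable
        if key(record) not in seen:
            seen.add(key(record))
            result.append(record)
    return result

records = ["B002 second", "A001 first", "B002 duplicate", "A001 again"]
# Comparable to: sort fields=(1,4,ch,a) followed by sum fields=none
print(sort_unique(records, 1, 4))
```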
Processing order of instructions
See below for how mfsort processes instructions during a sort/merge operation.
Start ––> include/omit ––> inrec ––> sort/merge/option copy/sum ––> outrec ––> outfil
Notice that
- include/omit is processed before inrec
- sum and sort instructions are processed after inrec
This means that when specifying fields for these instructions, include/omit must refer to the original records and sort, sum and outrec must refer to the reformatted records.
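The consequence for field positions is easy to get wrong, so here is a minimal Python sketch of the order described above. The record layout, the helper names run and field, and the sample data are all invented; the point is only that the include condition uses original positions while the sort key uses positions in the reformatted record.

```python
def run(records, include, inrec, sort_key):
    """Apply include, then inrec, then sort, in that order."""
    kept = [r for r in records if include(r)]    # sees the original layout
    reformatted = [inrec(r) for r in kept]       # builds the new layout
    return sorted(reformatted, key=sort_key)     # sees the new layout

def field(record, start, length):
    """Return a field using a 1-based start, as in the sort instructions."""
    return record[start - 1:start - 1 + length]

# Original layout: name in 1-5, status in 6, code in 7-9.
records = ["ALPHAY300", "BRAVON100", "DELTAY200"]

result = run(
    records,
    include=lambda r: field(r, 6, 1) == "Y",           # original position 6
    inrec=lambda r: field(r, 7, 3) + field(r, 1, 5),   # code first, then name
    sort_key=lambda r: field(r, 1, 3),                 # code is now at 1-3
)
print(result)
```

Had the sort key used the original code position (7-9), it would have picked up the tail of the name in the reformatted record instead.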
Defining input and output files
input
use pathname org organization record definition [key structure]
output
give pathname [org organization] [record definition] [key structure]
If you omit any of the values, the last specified values are used. Therefore you only need to specify values for the first file if the input and output files are all of the same type and format.
Defining file organization and record format
File organization determines the manner in which the file(s) will be read or written to. The most common are probably sequential which is the default, and line sequential also known as data sensitive.
org <ls|sq|rl|ix>
Type | Organization |
ls | line sequential |
sq | sequential (default) |
rl | relative |
ix | indexed |
It may be worth noticing that a line sequential file transferred between unix and windows hosts may require additional conversion since the two systems use different line ending schemes, i.e. unix uses line feeds and windows uses carriage return, line feed pairs. If this is not done, strange things may happen when you sort.
NB! A line sequential file has an undocumented max record length of 256 characters. If you have lines longer than that you have to tell mfsort to use a bigger record buffer by specifying a max length for your lines.
record <f|v>,rec-len[,max-len]
Format | Definition |
f | fixed length of rec-len |
v | variable with min length of rec-len and max of max-len |
NB! To use line sequential files with records longer than 256 characters you must increase the record buffer using the syntax described below.
org ls record f,rec-len
This may look a little self contradictory, i.e. a line sequential file with fixed length records, but what you specify is the buffer size you read records into. The problem is, it has to be big enough to fit your longest record, but using a big number is a waste of resources. If pressed to give a recommendation I would say 2048, an acceptable trade-off between size and ease-of-use.
The problem with the line sequential buffer size is that mfsort reads a line only up to the buffer size, so a very long line will silently be cut into two or more records.
The only time that forgetting the hidden max length of a line sequential file will give you a syntax error is if you have inrec or outrec statements in your sort that specify record positions beyond the 256 character limit.
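Rather than guessing a buffer size, you can measure the longest line before writing the record instruction. The Python sketch below does that and also shows what cutting a long line at the buffer size looks like. The function names are made up, and the cut is modelled as plain slicing, which is an assumption about how mfsort behaves based on the description above.

```python
def longest_line(lines):
    """Return the length of the longest line, ignoring line terminators."""
    return max((len(line.rstrip("\r\n")) for line in lines), default=0)

def cut_into_records(line, buffer_size):
    """Show how a line longer than the buffer ends up as several records."""
    pieces = [line[i:i + buffer_size] for i in range(0, len(line), buffer_size)]
    return pieces or [""]

lines = ["short\n", "x" * 300 + "\n"]
print(longest_line(lines))
print([len(piece) for piece in cut_into_records("x" * 300, 256)])
```

For a real file, pass the open file object instead of a list, e.g. longest_line(open(path)), and use the result (or something a bit larger) as rec-len.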
If an output file is indexed and its key structure is not the same as that of the indexed input file, you need to specify a key instruction in addition to organization and record format.
key(start,length,<p|a|ad|c>,...)
Where, start is the starting position of the key field in the record, counting bytes from 1, length is the length of the field in bytes, and finally you specify the type of key you are defining:
- p = primary key (must always be defined first)
- a = alternate key
- ad = alternate key with duplicates
- c = component of the last specified primary or alternate key
If you do not specify a record format for the sort workfile, the format defaults to fixed record format, with the record size equal to the largest record specified with a give or use instruction.
Reformatting records
You can reformat records in your data sets by using the outrec and inrec instructions. With them you can:
- Delete fields
- Reorder fields
- Insert separators (blanks, zeros, or constants)
Below are the most basic inrec and outrec definitions. Separators can be inserted anywhere in the record, see next section for syntax and examples.
inrec fields=(start,length,...)
outrec fields=(start,length,...)
outfil
outrec (start,length,...)
NB! Strange but true: when reformatting in an outfil instruction you get a syntax error if you use fields= after the outrec keyword. In a stand-alone outrec, however, it is required!
inrec and outrec perform the same functions. When deciding which to use, remember their processing order. In general
- if you are deleting fields, try to use inrec because shorter records take less time to sort, merge, or copy (inrec reformats the records before they are processed).
- if you are going to insert separators, use outrec because outrec inserts the separators into the records after they are processed.
- if you are reordering fields, you can use either control statement because reordering fields does not affect the record length.
Numbering records
If you need to add a sequence number to your records use inrec or outrec with the specified syntax.
inrec fields=(seqnum, length, type, start=n, incr=i)
outrec fields=(seqnum, length, type, start=n, incr=i)
Type can be one of the following.
- bi - binary
- pd - packed decimal (comp-3)
- zd - zoned decimal
Here are two examples. The first uses inrec to add a six-digit number, starting at one and incremented by one. The second uses outrec and adds an eight-digit number starting at 1000 and incremented by 50 for each record.
inrec fields=(seqnum, 6, zd, start=1, incr=1)
outrec fields=(seqnum, 8, zd, start=1000, incr=50)
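As a rough picture of what the two examples generate, the following Python sketch produces the sequence numbers as zero-padded digit strings. The function name seqnums is invented, unsigned zoned decimal is assumed to look like plain digits, and raising an error when the number no longer fits is my choice; what mfsort does in that case is not documented here.

```python
def seqnums(count, length, start=1, incr=1):
    """Return count sequence numbers as zero-padded digit strings."""
    numbers = []
    for i in range(count):
        value = start + i * incr
        text = str(value).zfill(length)
        if len(text) > length:
            raise ValueError(f"{value} does not fit in {length} digits")
        numbers.append(text)
    return numbers

print(seqnums(3, 6))                        # like seqnum, 6, zd, start=1, incr=1
print(seqnums(3, 8, start=1000, incr=50))   # like seqnum, 8, zd, start=1000, incr=50
```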
Defining separators and constants
When using inrec and/or outrec to reformat records it is also possible to insert separators and other constants in the file by adding these to the instruction.
Constant | Usage |
nZ | Insert n binary zeros (low-value) |
nX | Insert n blanks |
nC'x..x' | Insert the string 'x..x' n number of times |
nX'yy..yy' | Insert the hexadecimal value(s) yy n number of times |
In all cases, if n is omitted the constant is inserted once. |
Some examples
- Insert four low-values at the end of the reformatted record.
outrec fields=(106,4,166,4,162,4,4Z)
- Insert twenty blanks as a left margin and ten more blanks between the first and second field.
outrec fields=(20X,106,4,10X,1,75)
- Insert twelve asterisks after the second field.
inrec fields=(10,5,30,8,12C'*',1,4)
- Insert a carriage return-line feed pair at the end of the record.
outrec fields=(5,36,42,15,80,5,X'0d0a')
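To see how fields and constants are stitched together into an output record, here is a small Python sketch. It does not parse mfsort syntax; the field specification is written as Python tuples, and the function name build_record and the sample record are made up for the illustration.

```python
def build_record(record, spec):
    """Build an output record from a list of field and constant items.

    Items are (start, length) for input fields with a 1-based start,
    ("X", n) for n blanks, ("Z", n) for n binary zeros, and
    ("C", n, text) for text repeated n times.
    """
    out = []
    for item in spec:
        if item[0] == "X":
            out.append(" " * item[1])
        elif item[0] == "Z":
            out.append("\x00" * item[1])
        elif item[0] == "C":
            out.append(item[2] * item[1])
        else:
            start, length = item
            out.append(record[start - 1:start - 1 + length])
    return "".join(out)

record = "ABCDEFGHIJ"
# Comparable to: outrec fields=(2X,1,3,1X,8,3,2C'*')
spec = [("X", 2), (1, 3), ("X", 1), (8, 3), ("C", 2, "*")]
print(repr(build_record(record, spec)))
```

Note that the output record is longer or shorter than the input depending on what you select and insert, which is why record lengths on the give side may need adjusting.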
Replacing characters and fields
Using inrec and outrec it is possible to replace characters and fields by doing a table lookup in the manner described below:
outrec(start,length,change=(length,find,set,...),[nomatch=(set|start,length)])
When doing search and replace, which can be specified anywhere in a normal inrec or outrec instruction, you begin as usual by giving the starting position and the length of the field. Next you specify the length of the field to appear in the reformatted record. These don't have to have the same length! After this you specify a number of find and set pairs, i.e. if the find value matches the field in the current record it is replaced by the set value.
If none of your find strings match, the value in nomatch is used instead. This can be either a constant or a section of the input record. If the nomatch string is shorter than the length of the change string it will be padded to the right with blanks or low-value. If nomatch isn't used and a match is not found mfsort will terminate.
NB! No limitations are mentioned in the Micro Focus documentation, but according to the IBM Dfsort Application Programming Guide, the max length of a field used in a search and replace operation is 64.
Here is a short example that fixes some Y2K related problems in a small data file.
outrec fields=(1,1,change=(3,
C'6',C'196',
C'7',C'197',
C'8',C'198',
C'9',C'199',
C'0',C'200'),
nomatch=(C'190'),
2,5)
We check the first byte of a date field (pic 9(6)). If it is 6, 7, 8 or 9 we prefix the year with 19, and if it is 0 we prefix it with 20. Any other decade digit is replaced by 190, i.e. the date is moved to the first decade of the 1900s. Finally we add the rest of the date and end up with a new date field (pic 9(8)). Pretty error prone, but for this particular data file it was all that was needed :)
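The lookup in the Y2K example can be traced with a short Python sketch. The function names change and expand_date are invented, padding is done with blanks only, and the error raised when nothing matches stands in for mfsort terminating.

```python
def change(value, out_length, table, nomatch=None):
    """Look value up in table and pad the result to out_length with blanks."""
    if value in table:
        result = table[value]
    elif nomatch is not None:
        result = nomatch
    else:
        raise ValueError(f"no match for {value!r} and no nomatch given")
    return result.ljust(out_length)

DECADES = {"6": "196", "7": "197", "8": "198", "9": "199", "0": "200"}

def expand_date(yymmdd):
    """Turn a six-digit date into eight digits using the lookup above."""
    # Position 1 goes through the table; positions 2-6 are copied as they are.
    return change(yymmdd[0:1], 3, DECADES, nomatch="190") + yymmdd[1:6]

for date in ["691231", "050101", "450615"]:
    print(date, "->", expand_date(date))
```

The last sample date shows the error-prone part: a date from the forties has no entry in the table and ends up in 1905.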
Include and omit instructions
One of the more common operations when sorting a file is the desire to limit the number of output records based on some condition in the input records. To achieve this you use the include/omit instructions. They can either be used stand-alone, i.e. apply to all input records, or as part of an outfil instruction, which makes it possible to create several output files during one sort operation.
include cond=(condition) [format=type]
omit cond=(condition) [format=type]
outfil
<include|omit> (condition) [format=type]
NB! Strangely enough, when using include/omit in an outfil instruction you get a syntax error if you use cond= before the actual condition. In a stand-alone include/omit, however, it is required!
If all fields in a condition are of the same type you can use the format=type instruction which enables you to leave out the type instructions from the logical expression.
(start,length,[<type|ss>],<eq|ne|gt|ge|lt|le>,<start,length,[<type>]|constant>,[<and|or>],...)
By now start, length, and type should be obvious. The instruction ss stands for substring and it's used to match a substring somewhere in a field.
The comparison operators are straightforward (eq, ne, gt, ge, lt, le), and so are the logical operators (and, or). What needs clarification is the use of substrings. The exact syntax for substring comparison is:
(start,length,ss,<eq|ne>,constant)
If the constant string is shorter than the field specified, the match can occur anywhere in the field, e.g. the two fields '*OK***' and '****OK' would both be matched by the following instruction:
(11,6,ss,eq,C'OK')
If the values in a constant are separated by commas, the field is compared against each of the included substrings. The following example matches three different strings, i.e. the comma acts as a kind of or instruction:
(21,3,ss,eq,C'J69,L92,J82')
Here is another example using this technique:
include cond=(11,6,ss,eq,C'HAMMER,CHISEL,SAW   ,WRENCH')
Records with HAMMER, CHISEL, SAW or WRENCH in positions 11-16 will be included in the output data set. Note that the comma is used within the constant to separate the valid 6-character values; any character that will not appear in the field value can be used as a separator in the constant. SAW must be padded with three blanks on the right to make it a 6-character value so that it will be properly compared to the 6-character field in positions 11-16.
The single condition above can replace the following four conditions:
include cond=(11,6,ch,eq,
C'HAMMER',or,
11,6,ch,eq,
C'CHISEL',or,
11,6,ch,eq,
C'SAW',or,
11,6,ch,eq,
C'WRENCH')
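Both uses of ss come down to searching the shorter of the two strings inside the longer one. The Python sketch below assumes that mfsort follows the behaviour described above for Dfsort; the function name ss_eq is made up.

```python
def ss_eq(field, constant):
    """Substring comparison: search the shorter operand within the longer."""
    if len(constant) <= len(field):
        return constant in field   # constant may occur anywhere in the field
    return field in constant       # field must occur somewhere in the constant

TOOLS = "HAMMER,CHISEL,SAW   ,WRENCH"

print(ss_eq("*OK***", "OK"))       # short constant, found inside the field
print(ss_eq("SAW   ", TOOLS))      # padded value is found in the list
print(ss_eq("PLIERS", TOOLS))      # not in the list
print(ss_eq("SAW   ", "HAMMER,CHISEL,SAW,WRENCH"))  # unpadded list misses it
```

The last call shows why SAW has to be padded in the constant: the six-character field value, trailing blanks included, must appear in the list exactly as it is.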
Writing to multiple output files
By using one or several outfil instructions it is possible to write to multiple output files in the same sort. By using include/omit and other instructions, records from the sort process can be redirected to several output files according to different criteria. Here's how to do it:
outfil
give...
startrec start-record
endrec end-record
<save|include|omit>...
split
outrec
The following figure illustrates the order in which one outfil group's records and instructions are processed.
outfil input records ––> startrec/endrec ––> include/omit/save ––> outrec ––> split ––> outfil give files
Notes
- startrec and endrec affect this outfil group only.
- A record which matches several include statements will end up in more than one output file. Processing does not stop on the first match (see also save below).
- save means that records not included by include or omit for any other outfil group are to be included in the output records for this outfil group.
- split splits the output records in rotation among the output files in this outfil group. The first output file gets the first record, the second record is written to the second file, and so on until all output files have one record. Then the processing starts with the first output file again.
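The notes on include, save and split can be sketched in a few lines of Python. This is an illustration of the routing rules as described here, not of mfsort internals; the function names split and route and the sample records are invented.

```python
def split(records, file_count):
    """Deal records out to file_count output lists in rotation."""
    files = [[] for _ in range(file_count)]
    for index, record in enumerate(records):
        files[index % file_count].append(record)
    return files

def route(records, conditions):
    """Send each record to every group whose condition matches.

    Records that no group picks up go to a final 'save' group.
    """
    groups = [[] for _ in conditions]
    saved = []
    for record in records:
        matched = False
        for group, condition in zip(groups, conditions):
            if condition(record):
                group.append(record)   # no break: later groups are tried too
                matched = True
        if not matched:
            saved.append(record)
    return groups, saved

print(split(["r1", "r2", "r3", "r4", "r5"], 2))
groups, saved = route(["A1", "B2", "AB", "C3"],
                      [lambda r: "A" in r, lambda r: "B" in r])
print(groups, saved)
```

The record AB ends up in both groups, and C3, which nobody asked for, is the only one that reaches the save group.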
Exit codes and error codes
Exit codes (known so far) | |
0 | OK |
16 | error |
Unix syntax: $ mfsort take mysort; print $? |
Windows syntax: C:\> mfsort take mysort |
I/O error codes (status) | |
These have the form 9/nnn. The nine seems to indicate that it is an extended status code (from the OS?). What follows is a short list of possible run-time errors. | |
004 | Illegal file name |
007 | Disk space exhausted |
009 | No room in directory (it may be full, but more likely it doesn't exist) |
013 | File not found |
018 | Read part record error: EOF before EOR or file open in wrong mode |
021 | File is a directory |
028 | No space on device |
031 | Not owner of file |
035 | Attempt to access a file with incorrect permission |
037 | File access denied |
Mfsort workfile
During a sort or merge operation, mfsort uses a temporary workfile. This workfile is paged to disk in the current directory or, if it is set, in the directory specified by the TMP environment variable.
Mfsort copies all the records from each of the input files to the temporary workfile, truncated or padded as appropriate. The workfile is then sorted or merged according to its key description. After being sorted or merged in the workfile, the records are copied to each of the output files and truncated or padded as appropriate.
During this operation:
- If you do not need any of the records to be truncated, you must ensure that the record length of the workfile is sufficient for the longest record to be sorted
- If your input files are of variable-length and the sort workfile is not, all concept of variable-length is lost
- If your input file is fixed-length and the workfile is variable-length, the record length of the record in the workfile is either the fixed length of the input file or the maximum record length of the workfile, whichever is the smaller
Miscellaneous
It is possible to pipe data from mfsort. If there is no give instruction in the sort, output goes to standard output. This may be undesirable if the output file's organization is anything other than line sequential. It does make it possible to sort a file and then do a lower to upper case conversion with tr, all in one step.
A problem with piping is that you cannot test the exit code from mfsort to catch possible errors.
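One possible way around this, sketched in Python rather than in the shell, is to start both processes yourself and collect each exit code separately. I have not tried this with mfsort; the sketch uses small Python stand-ins for the two commands so that it runs anywhere, and the function name run_pipeline is made up.

```python
import subprocess
import sys

def run_pipeline(producer_cmd, consumer_cmd):
    """Run producer | consumer and return (producer_rc, consumer_rc, output)."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                                stdout=subprocess.PIPE)
    producer.stdout.close()  # the consumer now holds the only read end
    output, _ = consumer.communicate()
    return producer.wait(), consumer.returncode, output

# Stand-ins for the real commands. In real use they would be something like
# ["mfsort", "take", "mysort"] and ["tr", "a-z", "A-Z"] (an assumption).
producer_cmd = [sys.executable, "-c", "print('abc'); raise SystemExit(16)"]
consumer_cmd = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"]

sort_rc, tr_rc, output = run_pipeline(producer_cmd, consumer_cmd)
print(sort_rc, tr_rc, output)
```

Here the first exit code plays the role of the mfsort exit code, so a 16 from the sort is still visible even though the last command in the pipe finished without errors.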
What remains to be tested is piping to mfsort and if it's possible to redirect some output to standard out and some to an output file.