genscores - Generate scoring tables from a set of plaintext files.


genscores -type value [-verbose] [-elemsize nchars] [-output outfilename] [-validchars chars] file1 ?file2 ...?

All data is normalized after being generated. Multiple input files may be used to create a large sample of plaintext.


-type value
The scoring method to use in the generated scoring table. This must be one of the builtin types returned by the score types command.

-verbose
Print a little more information as the table is being generated.

-elemsize nchars
The size of the elements for ngram based score types.

-output outfilename
The name of the file where the results should be written. Use '-' for stdout, which is the default.

-validchars chars
The set of valid characters for the scoring table elements. Defaults to 'abcdefghijklmnopqrstuvwxyz'. Make sure to shell-escape any questionable characters such as '*' and '?'.
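
As a rough illustration of how -elemsize and -validchars interact, the following Python sketch counts n-grams over a filtered text. It is not the genscores implementation: the function names ngram_counts and log_table are hypothetical, and the sketch assumes that input is lowercased, that characters outside the valid set are simply dropped, and that the "log" score types store the log of each relative frequency.

```python
import math
from collections import Counter

DEFAULT_VALID = "abcdefghijklmnopqrstuvwxyz"

def ngram_counts(text, elemsize, validchars=DEFAULT_VALID):
    """Count overlapping n-grams of length elemsize.

    Assumption: text is lowercased and any character outside
    validchars is dropped before counting.
    """
    valid = set(validchars)
    cleaned = "".join(ch for ch in text.lower() if ch in valid)
    return Counter(
        cleaned[i:i + elemsize]
        for i in range(len(cleaned) - elemsize + 1)
    )

def log_table(counts):
    """Convert raw counts to log relative frequencies.

    Assumption: this stands in for whatever normalization
    genscores really applies.
    """
    total = sum(counts.values())
    return {gram: math.log(n / total) for gram, n in counts.items()}

sample = "The cat sat."
digrams = ngram_counts(sample, 2)                        # default valid set
with_spaces = ngram_counts(sample, 2, DEFAULT_VALID + " ")  # space kept
```

With the default valid set the space is dropped, so "the cat" contributes the digram "ec". When the space is added to the valid set, that digram disappears and "e " and " c" are counted instead, which is how a table comes to reflect word boundaries.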


genscores -type digramlog -output myDigramTable.tcl frank14.txt
Generate and save a sum-of-logs-of-digram-frequencies scoring table based on the standard Frankenstein text.

genscores -type ngramcount -verbose -elemsize 5 -output my5gramTable.tcl file1.txt file2.txt file3.txt file4.txt
Generate and save a 5-gram frequency scoring table based on the combined text of 4 input files. Extra status information is printed while the program runs.

genscores -type ngramlog -elemsize 4 -output my4gramTable.tcl -validchars "abcdefghijklmnopqrstuvwxyz " file1.txt
Generate and save a 4-gram log-frequency scoring table. Because the space character is included in the valid characters, the table also captures word boundaries.

Created on Wed Mar 31 08:18:24 PST 2004