Welcome to Menehune's Page for Dr. M.W. Berry (CS391A/691AG)

  • Dinosaur classification data (Excel) for decision tree induction and evaluation.

  • ASRS (Aviation Safety Reporting System) documents (NASA Ames):

    Narratives (raw) (31MB)
    Narratives (cleaned) (30MB)
    Perl script (clean.pl)
    Document IDs (27,596 documents parsed by GTP (General Text Parser)); the Matlab statement

    docs = textread('Documents.txt', '%s', n, 'delimiter', '\n');

    will create an n × 1 vector of document titles.
    Dictionary_LE.txt (20,889 terms parsed by GTP (General Text Parser))
    Harwell-Boeing matrices with Log-Entropy term weighting and Term Frequency weighting (both generated by GTP; each is a gzip'ed text file for the 27,596 by 20,889 matrix, about 45MB uncompressed)
    Dictionary_TF.txt (similar to Dictionary_LE.txt but with simple term frequency weighting); the Matlab statement

    [words,id,freq] = textread('Dictionary_TF.txt', '%s%s%f');

    will create the appropriate string and float arrays (vectors) within Matlab.
    ASRS_TF.dat (Matlab formatted document-by-term matrix with simple term frequency weights: 54MB)
    The simple Matlab script ASRS_TF.m can be used to load and create the sparse doc-by-term matrix (in Matlab). ASRS_LE.dat is an equivalent Matlab-formatted document-by-term matrix based on log-entropy term weighting (54MB).
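The loading steps described above can be tried on small stand-in files before downloading the full data. The sketch below is not the ASRS_TF.m script. It assumes the dictionary holds three whitespace-separated columns (term, ID, frequency), as the textread format string suggests, and it assumes the .dat file stores one "row column value" triplet per line, which this page does not confirm. The sample file names and values are made up. It uses textscan, which MathWorks recommends over textread in current Matlab releases.

```matlab
% Hypothetical sample files standing in for Dictionary_TF.txt and ASRS_TF.dat.
% Assumed layouts: "term id frequency" per line, and "row col value" triplets.
dictFile = fullfile(tempdir, 'sample_dictionary.txt');
fid = fopen(dictFile, 'w');
fprintf(fid, 'runway 1 3\naltitude 2 5\nclearance 3 2\n');
fclose(fid);

% Read the three columns; textscan returns one cell per format specifier.
fid = fopen(dictFile, 'r');
cols = textscan(fid, '%s %s %f');
fclose(fid);
words = cols{1};   % cell array of term strings
id    = cols{2};   % cell array of term IDs, kept as strings
freq  = cols{3};   % numeric column vector of frequencies

% Triplets for a 2-document by 3-term matrix (made-up values).
tripFile = fullfile(tempdir, 'sample_tf.dat');
fid = fopen(tripFile, 'w');
fprintf(fid, '1 1 2\n1 3 1\n2 2 4\n');
fclose(fid);

T = load(tripFile, '-ascii');   % 3-column numeric array of triplets
A = spconvert(T);               % sparse matrix with A(T(k,1), T(k,2)) = T(k,3)
```

spconvert sizes the matrix from the largest row and column indices it sees, so if the real file follows this layout and its last document or term has no nonzero entries, a final triplet of the form "m n 0" would be needed to get the full 27,596 by 20,889 shape.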
  • SDM07 Contest (ASRS2 Dataset):
     
Michael W. Berry

NMF Models for Course Project

Model Number   Term Weighting   Parameters                Matlab Binaries (by rank)
1              lex              β=0, sp=0, It=5           10, 20, 40, 60, 80
2              txx              β=0, sp=0, It=5           10, 20, 40, 60, 80
3              lex              β=10⁻⁵, sp=0.75, It=5     10, 20, 40, 60, 80

NMF Models for NASA Ames Contest

Model Number   Term Weighting   Parameters                Matlab Binaries (by rank)
4              lex              β=0, sp=0, It=5           10, 20, 40, 60, 80
5              txx              β=0, sp=0, It=5           10, 20, 40, 60, 80
6              lex              β=10⁻⁵, sp=0.75, It=5     10, 20, 40, 60, 80
7              lex              β=10⁻⁵, sp=0.50, It=5     10, 20, 40, 60, 80
8              lex              β=10⁻⁵, sp=0.25, It=5     10, 20, 40, 60, 80
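
For readers new to NMF, a rank-k factorization of the kind tabulated above can be sketched with the standard Lee-Seung multiplicative updates. This is a generic illustration, not the code that produced the binaries listed here. It assumes that It is the number of iterations and that the rank column gives k. It applies no penalty terms, because this page does not define how β and sp enter the model. The matrix A is random stand-in data.

```matlab
% Lee-Seung multiplicative updates for A ~ W*H with nonnegative factors.
rng(0);                      % fixed seed so the run is repeatable
A = rand(6, 8);              % stand-in nonnegative matrix (not ASRS data)
k = 2;                       % factorization rank (assumed meaning of "rank")
numIter = 5;                 % assumed meaning of It=5
W = rand(size(A, 1), k);     % random nonnegative starting factors
H = rand(k, size(A, 2));
err0 = norm(A - W*H, 'fro'); % Frobenius error before any update

for it = 1:numIter
    % Elementwise updates keep W and H nonnegative; eps guards against 0/0.
    H = H .* (W' * A) ./ (W' * W * H + eps);
    W = W .* (A * H') ./ (W * (H * H') + eps);
end

err = norm(A - W*H, 'fro');  % Frobenius error after numIter updates
```

Comparing err with err0 gives a quick check that the updates reduce the approximation error. On the real matrices, A would be the sparse doc-by-term matrix and k one of 10, 20, 40, 60, or 80.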