Welcome to Menehune's Page for Dr. M.W. Berry (CS391A/691AG)

- Dinosaur classification data (Excel) for decision tree induction and evaluation.
- ASRS (Aviation Safety Reporting System) documents (NASA Ames):

Narratives (raw) (31MB)

Narratives (cleaned) (30MB)

Perl script (clean.pl)

Document IDs (27,596 documents parsed by GTP, the General Text Parser); the Matlab statement

`docs = textread('Documents.txt', '%s', n, 'delimiter', '\n');`

will create an n × 1 vector of document titles, where n is the number of documents to read.

Dictionary_LE.txt (20,889 terms parsed by GTP)

Harwell-Boeing matrices with Log-Entropy term weighting and with Term Frequency weighting (both generated by GTP; each is a gzip'ed text file for the 27,596 by 20,889 matrix; both are about 45MB uncompressed)
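Harwell-Boeing files store a sparse matrix column by column, as a column-pointer array, a row-index array, and the nonzero values. The following is a minimal sketch of how those three arrays expand into a Matlab sparse matrix. The tiny arrays and the names `colptr`, `rowind`, and `vals` are hypothetical stand-ins, not data from the ASRS files, and parsing the file header is not shown.

```matlab
% Hypothetical 3-by-2 matrix in compressed-column form (1-based indices).
colptr = [1 3 4];          % column j owns entries colptr(j) : colptr(j+1)-1
rowind = [1 3 2];          % row index of each stored nonzero
vals   = [2.0 1.0 4.0];    % value of each stored nonzero
nrows  = 3;
ncols  = numel(colptr) - 1;

% Expand the column pointers into one explicit column index per nonzero.
colind = repelem(1:ncols, diff(colptr));

% Assemble the sparse matrix from (row, column, value) triplets.
A = sparse(rowind, colind, vals, nrows, ncols);
disp(full(A));
```

The same expansion applies to a full-size file once its pointer, index, and value sections have been read into vectors.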

Dictionary_TF.txt (similar to Dictionary_LE.txt but with simple term frequency weighting); the Matlab statement below will create the appropriate string and float arrays (vectors) within Matlab.

`[words,id,freq] = textread('Dictionary_TF.txt', '%s%s%f');`
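Once the three arrays are in memory they can be sorted and searched like any other Matlab vectors. The sketch below assumes a three-column layout matching the `'%s%s%f'` format above and uses a tiny in-memory stand-in (hypothetical contents) instead of the real file; it reads with `textscan`, which accepts a character vector directly.

```matlab
% Hypothetical stand-in for a few lines of Dictionary_TF.txt.
txt = sprintf('%s\n', 'aircraft 1 3', 'runway 2 5', 'tower 3 1');

% Parse the same three columns as the textread call: string, string, float.
C = textscan(txt, '%s %s %f');
words = C{1};
id    = C{2};
freq  = C{3};

% List the terms from largest to smallest value in the third column.
[~, order] = sort(freq, 'descend');
ranked = words(order);

% Find the position of a given term in the dictionary.
k = find(strcmp(words, 'runway'));
fprintf('%s is entry %d\n', words{k}, k);
```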

ASRS_TF.dat (Matlab formatted document-by-term matrix with simple term frequency weights: 54MB)

The simple Matlab script ASRS_TF.m can be used to load and create the sparse doc-by-term matrix (in Matlab). ASRS_LE.dat is an equivalent Matlab-formatted document-by-term matrix based on log-entropy term weighting (54MB).

- SDM07 Contest (ASRS2 Dataset):

- Contest Description and Rules
- TrainingData.txt (25MB file; 21,519 reports with 1 report per record); (cleaned TrainingData.txt) (25MB) using Perl script clean2.pl; Document IDs (21,519 documents parsed by GTP - General Text Parser)
- TrainCategoryMatrix.csv (925KB file; 21,519 report × 22 category matrix)
- Dictionary2_TF.txt (15,722 words parsed by GTP with simple term frequency weighting; max token length is 200 chars)
- Dictionary_LE2.txt is the comparable dictionary file with global entropy weights (from GTP parsing using log-entropy term weighting)
- ASRS2_TF.dat (Matlab-formatted document-by-term matrix with simple term frequency weights: 38MB); ASRS2_LE.dat is an equivalent Matlab-formatted document-by-term matrix based on log-entropy term weighting (38MB).
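To make the relationship between the term-frequency and log-entropy matrices concrete, here is a minimal sketch under two loud assumptions: that a .dat file holds one `row column value` triplet per line (the layout `spconvert` expects; the actual loading scripts may differ), and that the weighting follows the textbook log-entropy formulation, which may not match GTP's variant exactly. The triplets below are hypothetical.

```matlab
% Hypothetical (document, term, frequency) triplets for 3 docs and 3 terms.
T = [1 1 2; 3 1 1; 2 2 4; 1 3 1; 2 3 1; 3 3 1];
A = spconvert(T);                    % sparse doc-by-term frequency matrix
[n, m] = size(A);

[r, c, f] = find(A);                 % nonzero entries as column vectors
gf = full(sum(A, 1))';               % total frequency of each term (m-by-1)
p  = f ./ gf(c);                     % share of each term's count per document

% Global weight per term: 1 + sum_i p_ij*log2(p_ij) / log2(n).
ent = accumarray(c, p .* log2(p), [m 1]);
g   = 1 + ent / log2(n);

% Local weight log2(1 + f_ij), scaled by the term's global weight.
W = sparse(r, c, log2(1 + f) .* g(c), n, m);
disp(g');
```

In this toy matrix the term confined to a single document receives a global weight of 1, while the term spread evenly over all documents receives a weight of approximately 0.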

Michael W. Berry

NMF Models for Course Project

| Model Number | Term Weighting | Parms | Matlab Binaries (by rank) |
|---|---|---|---|
| 1 | lex | β=0, s_{p}=0, It=5 | 10, 20, 40, 60, 80 |
| 2 | txx | β=0, s_{p}=0, It=5 | 10, 20, 40, 60, 80 |
| 3 | lex | β=10^{-5}, s_{p}=0.75, It=5 | 10, 20, 40, 60, 80 |

NMF Models for NASA Ames Contest

| Model Number | Term Weighting | Parms | Matlab Binaries (by rank) |
|---|---|---|---|
| 4 | lex | β=0, s_{p}=0, It=5 | 10, 20, 40, 60, 80 |
| 5 | txx | β=0, s_{p}=0, It=5 | 10, 20, 40, 60, 80 |
| 6 | lex | β=10^{-5}, s_{p}=0.75, It=5 | 10, 20, 40, 60, 80 |
| 7 | lex | β=10^{-5}, s_{p}=0.50, It=5 | 10, 20, 40, 60, 80 |
| 8 | lex | β=10^{-5}, s_{p}=0.25, It=5 | 10, 20, 40, 60, 80 |
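For readers unfamiliar with what a rank-k NMF model contains, the sketch below factors a small nonnegative matrix A into W (n-by-k) and H (k-by-m) using the plain Lee and Seung multiplicative updates. This is an illustration only: it is not necessarily the algorithm used to produce the binaries above, it does not model the β or s_{p} parameters, and the matrix, rank, and iteration count are hypothetical choices.

```matlab
rng(0);                                  % reproducible random start
A = sparse([2 0 1; 0 4 1; 1 0 1]);       % hypothetical nonnegative matrix
k = 2;                                   % rank of the factorization
iters = 50;                              % illustrative iteration count
[n, m] = size(A);

W = rand(n, k);                          % nonnegative initial factors
H = rand(k, m);
tiny = 1e-9;                             % guards against division by zero

for t = 1:iters
    % Multiplicative updates keep every entry of W and H nonnegative.
    H = H .* (W' * A) ./ (W' * W * H + tiny);
    W = W .* (A * H') ./ (W * (H * H') + tiny);
end

relerr = norm(A - W * H, 'fro') / norm(A, 'fro');
fprintf('relative Frobenius error: %g\n', relerr);
```

Larger k gives the factors more freedom to fit A, which is why models are typically compared across several ranks.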