2005 Undergraduate Research Project Log

List of Scripts/Programs

July 21st, 2005

Final Presentation

July 19th, 2005

The link below summarizes BBR runs on ten authors with seven variations of the vectors: BBR Test Results. The two spreadsheets below analyze those ROC results to pull out the vectors that performed best. The first table is a scattergram of the number of occurrences of particular vector representations, sorted in descending ROC order. The representations that pull ahead of the others appear to be All, keywords, and addressWords. The author attribute, surprisingly, performed poorly.
The second table is the same graph, but includes only the top two representations for each author.
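
The counting behind the second table could be reproduced with something like the sketch below. It assumes the ROC areas are available as a dictionary per author; the function name, the author keys, and every number in the example are made up for illustration and are not results from our runs.

```python
from collections import Counter

def top_k_counts(roc_by_author, k=2):
    """Count how often each representation lands in an author's top k by ROC area."""
    counts = Counter()
    for author, scores in roc_by_author.items():
        # Rank this author's representations from highest to lowest ROC area.
        ranked = sorted(scores, key=scores.get, reverse=True)
        counts.update(ranked[:k])
    return counts

# Made-up ROC areas, for illustration only (NOT results from the BBR runs).
example = {
    "authorA": {"All": 0.90, "keywords": 0.85, "addressWords": 0.80, "author": 0.55},
    "authorB": {"All": 0.88, "keywords": 0.70, "addressWords": 0.92, "author": 0.60},
}
counts = top_k_counts(example)
print(counts.most_common())
```

Setting k to the number of representations gives the counts for the full first table instead.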

July 17th, 2005

The makeinitlabels.pl script was run on the XML with the 20 authors. It created lists of document IDs, each with a tag indicating whether the document was part of the class, not part of the class, or ambiguous. We manually checked the ambiguous ones and assigned a state to each. A script was then run to combine the lists with the vector files. Now that the vectors have been created for all seven cases, they must be split into training and test files; this will be done by year, with the files split in half. After this step, BBR tests can be run.
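
A minimal sketch of the split by year, assuming each document is available as a (document ID, year) pair and that the earlier half becomes the training set. The log does not say which half trains, so that direction is an assumption, as are the function name and the sample IDs.

```python
def split_by_year(docs):
    """Split (doc_id, year) pairs into an earlier half and a later half.

    Assumption: the earlier half is used for training, the later half for testing.
    """
    ordered = sorted(docs, key=lambda d: d[1])  # stable sort by publication year
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

# Hypothetical document IDs and years, for illustration only.
docs = [("d1", 2001), ("d2", 1998), ("d3", 2004), ("d4", 1999)]
train, test = split_by_year(docs)
print(train)
print(test)
```

Documents from the same year can end up on both sides of the cut with this approach; cutting at a year boundary instead would give uneven halves.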

July 15th, 2005

A meeting was held today to lay out the plan for the remainder of the project. It was decided that the REU students (Mitchel and I) will focus on determining which vectors will be important in the real data to be provided by the KDD Challenge.

This will be conducted in a multi-tiered approach. A flow chart of the work process can be found here. We used a website of high-impact authors to create a list of prolific authors with very unique last names. We believe the names are unique enough to use the last name as the sole criterion for distinguishing between papers. After the files are parsed from the 20GB of Medline files, we will run several scripts. The end result will be training and test documents for each author, with the complete document set split in half in chronological order to create the training and test sets. Negative examples will be papers by authors with the same last name but different initials. Seven vector sets will be created for each author.
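
The labeling rule can be sketched roughly as follows. This is not the makeinitlabels.pl script; the function name, the tuple layout, and especially the rule for what counts as ambiguous (initials where one string is a prefix of the other) are assumptions made for illustration.

```python
def label_document(author_names, last_name, initials):
    """Label one document for a target author.

    author_names: list of (last_name, initials) tuples found on the document.
    Returns 'positive', 'negative', 'ambiguous', or None if the surname is absent.
    """
    found = [ini for (ln, ini) in author_names if ln.lower() == last_name.lower()]
    if not found:
        return None  # surname does not appear, so the document is out of scope
    if any(ini == initials for ini in found):
        return "positive"
    # Assumption: partially matching initials (e.g. "T" vs "TK") need a manual check.
    if any(initials.startswith(ini) or ini.startswith(initials) for ini in found):
        return "ambiguous"
    return "negative"

# Hypothetical author lists, for illustration only.
print(label_document([("Suzuki", "T"), ("Aoki", "Y")], "Suzuki", "T"))
print(label_document([("Suzuki", "K")], "Suzuki", "T"))
print(label_document([("Suzuki", "TK")], "Suzuki", "T"))
```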

Vector Sets for Testing

~ 1) Coauthors -> leaving out the target author as this confuses BBR
~ 2) Address words -> Addresses will be parsed into individual words
~ 3) Address comma delimited -> Addresses are split up into comma-delimited fields. If half of the fields in one address match half of the fields in another, we count it as a match.
~ 4) Mesh leaf categories -> i.e., classifications and qualifiers
~ 5) Title words -> parsed into individual words
~ 6) Abstract words -> parsed into individual words
~ 7) All -> all of the above
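
The half-match rule for comma-delimited addresses (item 3) could look something like this sketch. It reads the rule as "the shared fields make up at least half of each address", which is one interpretation; the function names, the lowercasing, and the sample addresses are assumptions for illustration.

```python
def address_fields(address):
    """Split an address on commas into a set of trimmed, lowercased fields."""
    return {f.strip().lower() for f in address.split(",") if f.strip()}

def addresses_match(addr_a, addr_b):
    """True if the shared fields are at least half of the fields in each address."""
    a, b = address_fields(addr_a), address_fields(addr_b)
    if not a or not b:
        return False  # an empty address never matches
    shared = len(a & b)
    return 2 * shared >= len(a) and 2 * shared >= len(b)

# Hypothetical addresses, for illustration only.
same_site = addresses_match(
    "Dept of Biology, Example University, Springfield, USA",
    "Dept of Chemistry, Example University, Springfield, USA",
)
different_site = addresses_match(
    "Dept of Biology, Example University",
    "Dept of Physics, Other Institute, Riverton, USA",
)
print(same_site, different_site)
```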

Test:
1) BBR will be run on each of the above seven vector files for all 20 authors parsed from Medline.
Summarize Data & Review
The tests will be compiled (via a script written by D. Fradkin) into a table with the accuracy percentage and the document IDs of the two most wrong and the two most right documents, so that we can evaluate the performance of BBR. With each author's tests split by the seven vectors above, we should be able to see easily which turn out to be the most important for identifying authors. We hope to see a consistent trend across all of the authors so that we can apply a rule to the KDD Challenge data.
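
The kind of summary described above might be computed along these lines. This is not D. Fradkin's script; it is a sketch that assumes each test result is a (document ID, true label, predicted probability) triple, with labels of +1 or -1 and a 0.5 decision threshold. The function name and the sample values are made up.

```python
def summarize(results, n=2):
    """Summarize one test run.

    results: list of (doc_id, true_label, prob), where true_label is +1 or -1
    and prob is the predicted probability of belonging to the class (assumed).
    """
    def error(r):
        _, label, prob = r
        target = 1.0 if label == 1 else 0.0
        return abs(target - prob)  # distance between prediction and truth

    correct = sum(1 for _, label, prob in results if (prob >= 0.5) == (label == 1))
    ranked = sorted(results, key=error)  # smallest error first
    return {
        "accuracy": 100.0 * correct / len(results),
        "most_right": [r[0] for r in ranked[:n]],
        "most_wrong": [r[0] for r in ranked[-n:][::-1]],
    }

# Made-up predictions, for illustration only.
results = [("d1", 1, 0.95), ("d2", -1, 0.10), ("d3", 1, 0.20),
           ("d4", -1, 0.60), ("d5", -1, 0.02)]
summary = summarize(results)
print(summary)
```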

July 14th, 2005

BBR Run 104
Description: Identical to Run 101, except the test file includes approx. 130 abstracts in addition to the two we are searching for. Unlike Run 101, which had only 20 documents in the test file, this run was not successful in finding the documents. Some false positives were found, however.
Command Line: BBRtrain -r trainResults -p 1 RUNDefaultLargeTest/vectorsDefault.vec model
Logs:
README
Training Log
Testing Log
Model
Classification Results
Final Notes/Results: This test was not a success because it did not identify the relevant document.
Beta: Same as Run 101

July 13th, 2005

BBR Run 103
Description: This was a default run with field attributes that include authors, keywords, and abstract words, using a Laplace prior. The training data set included 10 documents, which were randomly chosen, and the last two documents in the set were from the same author. The test data set included the rest of the Suzuki documents (roughly 175); its last document was the remaining document that belonged with the last two documents in the training data.
Command Line: BBRtrain -r trainResults -p 1 RUNDefault3/vectorsDefault.vec model
Logs:
README
Training Log
Testing Log
Model
Classification Results
Final Notes/Results: This test was not a success because it did not identify the relevant document.
Beta:
-1.27703 1010041 1010041 --> TTSuzuki
-0.00289216 3030086 3030086 --> Support, Non-U.S. Gov't
-0.00112804 3030038 3030038 --> Female
0.00606987 3030094 3030094 --> Topotecan
0.0285598 3030070 3030070 --> Prodrugs
0.152761 3030034 3030034 --> Dextrans
0.170806 1010033 1010033 --> SSOkuno
1.1231 3030011 3030011 --> Antineoplastic Agents, Phytogenic
1.57349 1010022 1010022 --> MMHarada
-0.672773

BBR Run 102
Description: Default run with field attributes that included authors, keywords, and abstract words, using a Laplace prior. This test used a smaller target class: only two documents were from the same author. One was put in the training file and the other in the test file, each alongside randomly chosen documents. All documents in both files share the feature that at least one author is named TTSuzuki.
Command Line:
Logs:
README
Training Log
Testing Log
Model
Classification Results
Final Notes/Results:
Beta:
-1.2316 1010050 1010050 --> TTSuzuki
-0.949862 3030069 3030069 --> Human
2.38993e-17 3030029 3030029 --> Cell Adhesion
8.07019e-17 3030023 3030023 --> Carcinogenicity Tests
1.27972e-16 3030022 3030022 --> Cancer Vaccines
1.28239e-16 3030000 3030000 --> Adenoviridae
1.37479e-16 3030024 3030024 --> Carcinoma
3.55365e-16 3030063 3030063 --> Gene Therapy
6.06549e-14 2020084 2020084 --> subcutaneously
1.35239e-11 2020083 2020083 --> subcutaneous
3.03169e-09 2020073 2020073 --> pre-existing
6.79659e-07 2020064 2020064 --> mechanically
0.000152553 2020017 2020017 --> Matrigel-coated
0.00077942 3030060 3030060 --> Gelatin
0.000933213 1010005 1010005 --> EENagai
0.0508965 2020010 2020010 --> GM-CSF-producing
0.589841 1010001 1010001 --> AAIkubo
1.18055 1010051 1010051 --> YYAoki
Missy also indicated that she distinguished between these two documents using location (which we haven't yet included in the vectors), the fact that they shared two authors (which BBR ignored in its beta), and some of the keywords listed above.
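
To make the beta listings above easier to compare, a small sketch like the following could read lines of the form "weight id id --> name" and rank features by the size of their weight. The function name is hypothetical and the line format is assumed from the listings in this log; the sample lines are copied from Run 102.

```python
def parse_beta(lines):
    """Parse 'weight id id --> name' lines into (weight, feature_id, name) tuples."""
    rows = []
    for line in lines:
        left, sep, name = line.partition("-->")
        parts = left.split()
        if not sep or len(parts) != 3:
            continue  # skip lines that do not follow the assumed format
        rows.append((float(parts[0]), int(parts[1]), name.strip()))
    return rows

# Lines copied from the Run 102 beta listing.
sample = [
    "-1.2316 1010050 1010050 --> TTSuzuki",
    "2.38993e-17 3030029 3030029 --> Cell Adhesion",
    "0.589841 1010001 1010001 --> AAIkubo",
    "1.18055 1010051 1010051 --> YYAoki",
]
rows = parse_beta(sample)
# Rank by absolute weight so near-zero features sink to the bottom.
strongest = sorted(rows, key=lambda r: abs(r[0]), reverse=True)
for weight, feature_id, name in strongest:
    print(f"{weight:+.6g}  {feature_id}  {name}")
```

Ranking by absolute value separates the handful of features that carry real weight from the many whose coefficients are effectively zero.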