Alexander Genkin, Ph.D.
Center for Discrete Mathematics and Theoretical Computer Science (DIMACS)
Rutgers University
agenkin@dimacs.rutgers.edu
718 891 4802

Data and Text Mining specialist, software developer

* Data analysis and modeling experience in diverse fields: text categorization, author identification, direct marketing, customer behavior, credit card fraud detection, clinical medicine, chemistry, machine engineering
* Original scientific results in data mining, machine learning, and algorithms
* Building software for text and data mining, statistical analysis, and business intelligence; full cycle from research prototype to mature product
* Work effectively as a team leader, as part of a team, or independently
* Excellent written and verbal communication skills in both English and Russian

Software skills:
Programming Languages: C/C++; Java; Python; Prolog; PL/I
Statistical Software: SPSS; R; SAS; BMDP
Database technologies: SQL; OLAP; ODBC; OLE DB; MS Analysis Services; Oracle

Experience:

2002 - current: DIMACS (Center for Discrete Mathematics and Theoretical Computer Science), Rutgers University, Piscataway, NJ
o R&D in Text Mining
o Co-authored and implemented software for large-scale supervised learning: Bayesian logistic and multinomial regression, with simultaneous automatic feature selection and regularization. The software has since been used successfully in several text mining and other data mining applications, including a win in a TREC competition
o Research in large-scale topic categorization. Achieved cutting-edge effectiveness with compact models
o Research in large-scale author identification; studied the usefulness and topic independence of stylistic feature sets and their combinations

2001 - 2002: IntelDM Inc, East Brunswick, NJ; Data Mining Consultant
o Conducted a data mining study in retail business; problems included supervised classification and clustering.
Methodologies used: linear discriminant analysis; decision trees; optimization clustering; ad-hoc data transformations.
o Designed special automation needed to solve multiple analytic problems in similar settings, involving SPSS and original programs developed in C++

1998 - 2001: QueryObject Systems Corp., Roslyn Heights, NY; Software Development Manager
o Developed Data Mining support for OLAP database design based on original ideas and algorithms. Source data is scanned with on-the-fly sampling and statistics are collected. After the statistics are analyzed, recommendations are presented to the user on how to use the columns in the OLAP cube: as measures, dimensions, subject to segmentation, etc. Data hygiene is performed at the same time, and outliers are detected. Implementation in C++, Windows GUI front-end, NT or UNIX back-end.
o Implemented an OLE DB for OLAP provider in C++. Optimized performance with a proprietary back end. Resolved interoperability issues with different front-ends: Cognos, Brio, Business Objects, Excel.
o Designed and implemented an original client/server architecture: same code on NT and Unix; convenience for the C++ developer; high performance. Object-oriented design with three basic components: network transport layer; object serialization and replication layer; application layer. This component design was later reused in several projects within the company.
o Led a team of developers and coordinated team efforts; delivered quality software on a tight schedule

1992 - 1998: QueryObject Systems Corp. (Cross/Z Software); Overseas Contractor and R&D Group Manager
o Built models for database marketing, credit card fraud detection, and customer retention. Applied and compared the behavior of different data mining methods: linear regression, logistic regression, original implementations of the CHAID and CART algorithms, and the original "Fragment-Potential" method. Obtained competitive results.
o Designed and architected software for data mining and statistical analysis.
Led the development of a multi-modeling environment for interactive analysis of mass business data. This software is oriented toward business users and does not require statistical knowledge on the user's side. Based on data analysis, the software makes many "guesses" about how to proceed in order to arrive at the best model. Features: automated data hygiene; automated data transformation; competitive multi-modeling; specialized business-oriented delivery of results. Implementation in C, C++, Object Pascal.
o Built up and managed a group of researchers and programmers who created industrial-strength software based on original and state-of-the-art science

1987 - 1998: Institute for Information Transmission, Russian Academy of Sciences, Moscow; researcher
o Conducted data analysis and model building in clinical medicine, chemistry, machine engineering. Applied supervised and unsupervised classification, logical methods, and contingency table analysis, as well as original methods of data mining.
o Developed original methods of Data Mining: search for clusters with given properties; selection of predictors for voting algorithms.
o Suggested a framework for machine learning from data under the supervision of an a priori knowledge base. Designed a computer program in C and Prolog.
o Conducted R&D in high-performance algorithms: developed a provably optimal algorithm for submodular function maximization

Education:
Ph.D. in Computer Science, Institute for Information Transmission, Russian Academy of Sciences, Moscow, Russia.
M.S. in Automated Management Systems, Moscow Management Institute, Moscow, Russia.

Software:
BBR: Bayesian Logistic Regression http://stat.rutgers.edu/~madigan/BBR/
BMR: Bayesian Multinomial Regression http://stat.rutgers.edu/~madigan/BMR/

Selected list of publications:
1. D. Madigan, A. Genkin, D. D. Lewis, and D. Fradkin. Bayesian Multinomial Logistic Regression for Author Identification.
25th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering. San Jose, Aug 7-12, 2005. http://stat.rutgers.edu/~madigan/PAPERS/authorID-me05-fixed.pdf
2. D. Madigan, A. Genkin, D. D. Lewis, S. Argamon, D. Fradkin, and Li Ye. Author Identification on the Large Scale. Joint Annual Meeting of the Interface and the Classification Society of North America. St. Louis, June 8-12, 2005. http://stat.rutgers.edu/~madigan/PAPERS/authorid-csna05.pdf
3. A. Genkin, D. D. Lewis, D. Madigan. Large-Scale Bayesian Logistic Regression for Text Categorization. Submitted, 2005. http://stat.rutgers.edu/~madigan/PAPERS/shortFat-v13.ps
4. A. Dayanik, D. Fradkin, A. Genkin, P. Kantor, D. D. Lewis, D. Madigan, V. Menkov. DIMACS at the TREC 2004 Genomics Track. The Thirteenth Text Retrieval Conference (TREC 2004).
5. A. Genkin, I. Muchnik, C. Kulikowski. Set covering submodular maximization: an optimal algorithm for data mining in bioinformatics and medical informatics. Journal of Intelligent and Fuzzy Systems, 12:5-17, 2002.
6. A. V. Genkin, V. A. Mikheev. The synthesis of voting collectives of regression trees. Pattern Recognition and Image Analysis, Vol. 7, No. 2, 1997, pp. 184-191.
7. V. Yancher, A. Genkin. Multidimensional visualization using rectangles for business applications. DIMACS MiniWorkshop "Exploring Large Data Sets Using Classification, Consensus, and Pattern Recognition Techniques", 1996. Abstract at: http://dimacs.rutgers.edu/Workshops/Classification/abstracts.html
8. A. Genkin, I. Muchnik, C. Kulikowski. Causal Coverage for Diagnostic Hypothesis Generation as Submodular Set Function Maximization. Proc. of the International Multidisciplinary Conference "Intelligent Systems: A Semiotic Perspective", 1996.
9. A. Genkin, I. Muchnik. Fixed points approach to clustering. Journal of Classification, Vol. 10, No. 2, 1993. http://datalaundering.com/download/fixed.pdf
10. A. Genkin, I. Muchnik. Optimal search for the maximum of submodular function.
Automation and Remote Control, 1990.
11. A. V. Genkin, P. N. Dubner. Aggregation algorithm for the problem of search for informative attributes. Automation and Remote Control, 1988.

References
Available upon request