Bayesian Logistic Regression Software
Alexander Genkin, David D. Lewis, and David Madigan
User Guide - how to install and use the software
Developer Guide - how to build the software from the source code
This software implements Bayesian logistic regression with two choices of priors: Gaussian and Laplace. A general binary regression classifier takes the form:
$$p(y = +1 \mid \beta, x) = \psi(\beta^T x)$$
where y is the class label, 1 or -1; x is the predictor vector; β is the vector of parameters; ψ is the link function. We consider the logistic link function:
$$\psi(r) = \frac{\exp(r)}{1 + \exp(r)}$$
Our software finds the maximum a posteriori parameter estimates with two choices for prior: Gaussian or Laplace, the latter given by the formula:
$$p(\beta_j \mid \lambda) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|)$$
where βj is a component of the vector of parameters. All prior components are assumed to be independent, equally scaled, and zero centered, though we plan to extend this in future versions of the program. The Laplace prior corresponds to Tibshirani's LASSO algorithm.
To find the parameter estimates, the software implements a coordinate descent algorithm that draws on the ideas of Zhang and Oles (2001).
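The quantity that the maximum a posteriori estimate maximizes can be sketched as follows. This is a minimal illustration, not the program's code: it assumes the Laplace density has the common form (λ/2)·exp(-λ|βj|), uses the fact that the logistic link gives p(y | x) = ψ(y·βᵀx) for labels 1 and -1, and drops constants that do not depend on β. The function names are hypothetical.

```python
import math


def logistic_link(r):
    """Logistic link psi(r) = exp(r) / (1 + exp(r)), evaluated without overflow."""
    if r >= 0:
        return 1.0 / (1.0 + math.exp(-r))
    e = math.exp(r)
    return e / (1.0 + e)


def log_posterior_laplace(beta, rows, labels, lam):
    """Log-likelihood plus log Laplace prior, up to a constant (assumed prior form)."""
    total = 0.0
    for x, y in zip(rows, labels):
        r = sum(b * v for b, v in zip(beta, x))
        # log psi(y * r) = -log(1 + exp(-y * r)), computed stably
        z = -y * r
        total -= max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    # Independent, zero-centered, equally scaled Laplace components
    total -= lam * sum(abs(b) for b in beta)
    return total
```

A larger λ penalizes nonzero parameters more heavily, which is what drives some estimates exactly to zero under the Laplace prior.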
The user chooses one of two prior types: Laplace or Gaussian. Then the user has to specify the hyperparameter value λ defining the scale of the prior in the Laplace case and variance in the Gaussian case. There are three ways for the user to define the hyperparameter value.
The first way is to specify the hyperparameter value explicitly. The second way is to omit any specification and allow the program to set the value by default. The program sets the default prior variance equal to the inverse average squared value of all data elements in training. The third way is to select a value from a list using (“fractional”) cross-validation.
For the cross-validation option, the user specifies the list of hyperparameter values to select from. Since these are scale-type parameters, it makes sense to use part of a geometric progression for that purpose, e.g. 0.001, 0.01, 0.1, 1, 10, 100, 1000. Each member of the list is tried in the cross-validation loop, and the one that maximizes the average log-likelihood on the validation subsets is selected for final modeling on all training data. The user also specifies the number of folds for cross-validation. With a large data set it might be preferable to perform “fractional” cross-validation, i.e. do fewer training and validation runs than there are folds, so the user has the option to specify the number of runs. The default is 10 folds and 10 runs.
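The following sketch shows one way to produce such a list and to picture fractional cross-validation. It is only an illustration: the helper names are hypothetical, and the assignment of cases to folds (case i goes to fold i mod folds) is an assumption, not a description of how the program splits the data.

```python
def geometric_grid(start, ratio, count):
    """Return `count` values start, start*ratio, start*ratio**2, ..."""
    return [start * ratio ** i for i in range(count)]


def format_search_list(values):
    """Join values with commas and no spaces, as a list of floats."""
    return ",".join(f"{v:g}" for v in values)


def fractional_cv_splits(n_cases, folds=10, runs=10):
    """Yield (train, validation) index lists for the first `runs` folds.

    Assumption: case i belongs to fold i % folds.
    """
    for held_out in range(min(runs, folds)):
        train = [i for i in range(n_cases) if i % folds != held_out]
        valid = [i for i in range(n_cases) if i % folds == held_out]
        yield train, valid


def pick_best(avg_loglik_by_value):
    """Select the hyperparameter with the highest average validation log-likelihood."""
    return max(avg_loglik_by_value, key=avg_loglik_by_value.get)
```

For example, format_search_list(geometric_grid(0.001, 10, 7)) gives the string 0.001,0.01,0.1,1,10,100,1000, and fractional_cv_splits(n, folds=10, runs=3) performs only 3 of the 10 possible training and validation runs.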
Logistic regression estimates the probability that a data vector belongs to the class with label 1. Classification requires a threshold: the model assigns a case to class 1 if and only if the probability estimate is greater than or equal to the threshold value.
The software determines the threshold value after all the training samples have probability estimates assigned. Several criteria for threshold tuning are available. The confusion table below defines the necessary notation:
                   predicted 1    predicted -1
true label 1            a              b
true label -1           c              d
The program offers the following choices for threshold tuning criteria:
no tuning, threshold is equal to 0.5
sum of errors = b+c
balanced error rate = (b/(a+b) + c/(c+d))/2
T11U = 2*a - c
F1 = (2*a)/(2*a + b + c)
The latter two measures, T11U and F1, are popular in text classification.
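The criteria above can be written out as a short sketch. It assumes that a counts cases with true label 1 predicted as 1, b cases with true label 1 predicted as -1, c cases with true label -1 predicted as 1, and d cases with true label -1 predicted as -1; this reading is the one under which the formulas match the usual definitions of these measures. The search over candidate thresholds in tune_threshold is a hypothetical illustration, since the way the program searches is not described here, and the sketch assumes both classes occur in the data.

```python
def confusion_counts(probabilities, labels, threshold):
    """Count a, b, c, d for a given threshold (predict 1 iff p >= threshold)."""
    a = b = c = d = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else -1
        if y == 1 and predicted == 1:
            a += 1
        elif y == 1:
            b += 1
        elif predicted == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def sum_of_errors(a, b, c, d):
    return b + c


def balanced_error_rate(a, b, c, d):
    return (b / (a + b) + c / (c + d)) / 2


def t11u(a, b, c, d):
    return 2 * a - c


def f1(a, b, c, d):
    return (2 * a) / (2 * a + b + c)


def tune_threshold(probabilities, labels, criterion, maximize):
    """Try each distinct probability estimate as a threshold (assumed search)."""
    scored = [
        (criterion(*confusion_counts(probabilities, labels, t)), t)
        for t in sorted(set(probabilities))
    ]
    best = max(scored) if maximize else min(scored)
    return best[1]
```

Error-type criteria (sum of errors, balanced error rate) are minimized, while T11U and F1 are maximized, hence the maximize argument.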
The program allows the user to select the features that will be used in modeling. Features are ranked by their utility values. Four choices of utility function are provided: Pearson's correlation coefficient, chi-square, Yule's Q, and bi-normal separation (BNS) (Forman, 2002). The user selects a utility function and the number of features (say, k) that should be used in modeling. The program then calculates the utility value for every feature in the data and selects the k best features for modeling.
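As an illustration of ranking by utility, the sketch below scores each feature by its Pearson correlation with the label and keeps the k best. Ranking by the absolute value of the correlation and treating constant features as having zero utility are assumptions made for this example; the function names are hypothetical and the data are given as dense rows for simplicity.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either sequence is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_top_k(rows, labels, k):
    """Return the column indices of the k features with the highest utility."""
    n_features = len(rows[0])
    # Assumption: utility is the absolute correlation with the label
    utilities = [
        abs(pearson([row[j] for row in rows], labels)) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: utilities[j], reverse=True)
    return sorted(ranked[:k])
```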
Optionally, all features may be standardized, i.e. transformed to have zero mean and unit standard deviation. This not only has a computational effect, similarly to the case with regular regression, but it also affects the relative amount of shrinkage applied to parameter values corresponding to different features.
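A minimal sketch of standardization on dense rows is given below. Using the population standard deviation (dividing by n) and mapping constant features to zero are assumptions made for the example; the program may handle these details differently.

```python
import math


def standardize(rows):
    """Transform each column to zero mean and unit standard deviation."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_features)]
    # Assumption: population standard deviation (divide by n)
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n)
        for j in range(n_features)
    ]
    return [
        [
            (row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
            for j in range(n_features)
        ]
        for row in rows
    ]
```

After this transformation all features are on the same scale, so a prior with a single shared hyperparameter shrinks their parameters comparably.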
Cosine normalization is an optional data transformation that centrally projects each data vector onto the unit Euclidean sphere. After that, the dot product of any two vectors is equal to the cosine of the angle between those vectors, hence the name. Cosine normalization is popular in text classification because it helps to compensate for document length.
If both feature selection and cosine normalization are enabled, cosine normalization is applied first.
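Cosine normalization itself amounts to dividing each vector by its Euclidean length. The sketch below illustrates this on sparse vectors stored as dictionaries from feature id to value; the representation and the function names are choices made for this example, and all-zero vectors are left unchanged.

```python
import math


def cosine_normalize(vector):
    """Scale a sparse vector {feature_id: value} to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector.values()))
    if norm == 0:
        return dict(vector)
    return {fid: v / norm for fid, v in vector.items()}


def dot(u, v):
    """Dot product of two sparse vectors."""
    return sum(value * v[fid] for fid, value in u.items() if fid in v)
```

After normalization, dot(u, v) is the cosine of the angle between u and v, so a long document and a short one with the same proportions of terms become identical vectors.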
A technical report describing the details of the algorithm will be available shortly at http://www.stat.rutgers.edu/~madigan/mms/
This software consists of two executable modules: the training module and the classification module. The training module uses a training data file as input and generates a model file. The classification module uses a data file with new data and the model file to generate a results file with predicted probabilities and labels.
The software requires the gcc runtime, version 3.3.1 or higher, to be installed on your system. The two executable modules are BBRtrain and BBRclassify.
To complete the installation, untar the executables with a command like
tar xf bin.tar
then move BBRtrain and BBRclassify to the folder where you can execute them.
Sample data and results are also provided.
The sample data may be conveniently used to test the installation. For that purpose, run commands:
cd sample
mv ../BBRtrain .
mv ../BBRclassify .
mv L10corr100.res hold.res
./run-sample.bat
diff hold.res L10corr100.res
There should be no differences, or only minor ones.
Training module
Here is how to use the training module:
BBRtrain [options] training_data_file model_file
where the options are:
-p <[1,2]>, Type of prior, 1-Laplace 2-Gaussian (default is 2)
-H <float>, Hyperparameter, depends on the type of prior (optional)
-S <list of floats, comma-separated, no spaces> Search for hyperparameter value: list of values to try in cross-validation
-C <integer[,integer]> Cross-validation: number of folds, number of runs. If the number of runs is not given, it's assumed equal to the number of folds. Default is 10,10
-s <[0,1]>, Standardize variables (default is 0)
-c <[0,1]>, Cosine normalization (default is 0)
-u <[0..3]>, Feature selection utility, 0-Corr 1-Yule's Q 2-Chi-square 3-BNS (default is 0)
-f <integer>, Number of features to select (default is 0)
-t <[0..4]>, Threshold tuning, 0-no 1-sum error 2-T11U 3-F1 4-balanced error rate (default is 0)
-r <file_name>, Results file (optional)
-l <[0..2]>, Program log verbosity level (default is 0)
-v Displays version information and exits
-h Displays usage information and exits
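When BBRtrain is driven from a script, the argument list can be assembled programmatically. The sketch below uses only the options listed above; the helper name and its parameters are hypothetical, and it assumes that each option value is passed as a separate argument following its flag.

```python
def bbrtrain_command(training_data_file, model_file, prior=2, hyperparameter=None,
                     search_values=None, folds=None, runs=None,
                     threshold_tuning=0, results_file=None):
    """Build an argument list for BBRtrain (assumes 'flag value' syntax)."""
    cmd = ["BBRtrain", "-p", str(prior)]
    if hyperparameter is not None:
        cmd += ["-H", f"{hyperparameter:g}"]
    if search_values:
        # Comma-separated, no spaces
        cmd += ["-S", ",".join(f"{v:g}" for v in search_values)]
    if folds is not None:
        cmd += ["-C", str(folds) if runs is None else f"{folds},{runs}"]
    if threshold_tuning:
        cmd += ["-t", str(threshold_tuning)]
    if results_file is not None:
        cmd += ["-r", results_file]
    return cmd + [training_data_file, model_file]
```

The resulting list can be handed to subprocess.run(cmd, check=True), provided BBRtrain is on the search path; the file names used in any call are up to the user.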
The training data file format follows popular analogs for sparse data like SVMlight. Each row represents a case. Row format is:
<label> { <feature_id>:<value>}*
Here label may take the value 1 or -1; feature_id is a positive integer; value is a double-precision floating-point number.
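A row in this format can be read with a few lines of code. The parser below is a sketch for illustration, with a hypothetical function name; it stores the features of a case as a dictionary from feature id to value and rejects labels or feature ids outside the ranges stated above.

```python
def parse_row(line):
    """Parse '<label> <feature_id>:<value> ...' into (label, {feature_id: value})."""
    fields = line.split()
    label = int(fields[0])
    if label not in (1, -1):
        raise ValueError(f"label must be 1 or -1, got {label}")
    features = {}
    for item in fields[1:]:
        feature_id_text, value_text = item.split(":", 1)
        feature_id = int(feature_id_text)
        if feature_id <= 0:
            raise ValueError(f"feature_id must be positive, got {feature_id}")
        features[feature_id] = float(value_text)
    return label, features
```

For example, the row "1 3:0.5 10:2" describes a case with label 1 whose features 3 and 10 have values 0.5 and 2, and all other features are zero.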
Result file rows also correspond to cases in the same order as in the training data file. Each row has two fields: score (any number) and label predicted by the model (1 or -1). Score is the model estimate of the probability for the case to have label 1. Unless threshold tuning is used, label is 1 if and only if score is greater than or equal to 0.
Classification module
Here is how to use the classification module:
BBRclassify [options] new_data_file model_file result_file
where the options are:
-l <[0..2]>, Program log verbosity level (optional, default is 0)
-v Displays version information and exits.
-h Displays usage information and exits
The new data file format is the same as the training data file format described above, and the result file format is also the same as described there. Note that the result file is mandatory in this case.
Building the software from the source code requires the gcc compiler, version 3.3.1 or above. Here are the necessary steps:
Copy the source code and untar it.
Download the TCLAP software from https://sourceforge.net/projects/tclap/, then gunzip and untar the files (we tested TCLAP versions 0.9.5 and 0.9.6).
Make the src directory current; update the paths in the Makefile according to the directory layout on your system; update the definition of the compiler variable in the Makefile if needed.
Run the make utility.
Source code used:
Peter J. Acklam's "An algorithm for computing the inverse normal cumulative distribution function" http://home.online.no/~pjacklam/notes/invnorm/
MATV package, (C) Mark Von Tress 1996
Infoscope, (C) Pavel Dubner
See individual source code files for license statements.