and data mining

**A. Azzalini - B. Scarpa**

**Oxford University Press 2012**

ISBN 978-0-19-976710-6

Table of Contents

- Preface
- Preface to the English Edition
- Introduction
- New problems and new opportunities
- Data, more data, and data mines
- Problems in mining
- SQL, OLTP, OLAP, DWH and KDD
- Complications

- All models are wrong
- What is a model?
- From data to model

- A matter of style
- Press the button?
- Tools for computation and graphics

- A-B-C
- Old friends: Linear models
- Basic concepts
- Variable transformations
- Multivariate responses

- Computational aspects
- Least squares estimation by successive orthogonalization
- When n is large
- Recursive estimation

- Likelihood
- General concepts
- Linear models with Gaussian error terms
- Binary variables with binomial distribution

- Logistic regression and GLM
- Exercises

- Optimism, Conflicts, and Trade-offs
- Matching the conceptual frame and real life
- A simple prototype problem
- If we knew *f*(x)...
- But as we do not know *f*(x)...
- Methods for model selection
- Training sets and test sets
- Cross-validation
- Criteria based on information

- Reduction of dimensions and selection of most appropriate model
- Automatic selection of variables
- Principal component analysis
- Methods of regularization

- Exercises

- Prediction of Quantitative Variables
- Nonparametric estimation: Why?
- Local regression
- Basic formulation
- Choice of smoothing parameters
- Variability bands
- Variable bandwidths and loess
- Extension to several dimensions

- The curse of dimensionality
- Splines
- Spline functions
- Regression splines
- Smoothing splines
- Multidimensional splines
- MARS

- Additive models and GAM
- Projection pursuit
- Inferential aspects
- Effective degrees of freedom
- Analysis of variance

- Regression trees
- Approximations via step functions
- Regression trees: growth
- Regression trees: pruning
- Discussion

- Neural networks
- Case studies
- Traffic prediction in telecommunications
- Insurance pricing

- Exercises

- Methods of Classification
- Prediction of categorical variables
- An introduction based on a marketing problem
- Prediction via logistic regression
- Misclassification tables and adequacy measures
- ROC curve
- Lift curve

- Extension to several categories
- Multivariate logit and multinomial regression
- Ordinal categorical variables and cumulative logit models

- Classification via linear regression
- Case with two categories
- Case with several categories
- Discussion

- Discriminant analysis
- General remarks
- Linear discriminant analysis
- Quadratic discriminant analysis
- Discussion

- Some nonparametric methods
- Classification trees
- Some other topics
- Neural networks
- Support vector machines

- Combination of classifiers
- Bagging
- Boosting
- Random forests

- Case studies
- The traffic of a telephone company
- Churn analysis
- Customer satisfaction
- Web usage mining

- Exercises

- Methods of Internal Analysis
- Cluster analysis
- General remarks
- Distances and dissimilarities
- Non-hierarchical methods
- Hierarchical methods

- Associations among variables
- Elementary notions of graphical models
- Association rules

- Case study: Web usage mining
- Profiling website visitors
- Sequence rules and usage behaviour

- Appendix A Complements of Mathematics and Statistics
- Concepts on linear algebra
- Concepts of probability theory
- Concepts of linear models

- Appendix B Data Sets
- Simulated data
- Car data
- Brazilian bank data
- Data for telephone company customers
- Insurance data
- Choice of fruit juice data
- Customer satisfaction
- Web usage data

- Appendix C Symbols and Acronyms

- References
- Author Index
- Subject Index
