Training, Validation and Test Data
Example:
(A) We have data on 16 items, together with their attributes and class labels. RANDOMLY divide them into 8 items for training, 4 for validation and 4 for testing, as shown in the table below (a short code sketch of the split follows the table).
             Item No.   d Attributes          Class
Training        1                               0
                2                               0
                3       KNOWN FOR ALL           1
                4       DATA ITEMS              1
                5                               1
                6                               1
                7                               0
                8                               0
Validation      9                               0
               10                               0
               11                               1
               12                               0
Test           13                               0
               14                               0
               15                               1
               16                               1
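The random division can be sketched in a few lines of Python. This is a minimal sketch, not part of the original notes: it assumes the split is done by shuffling the item numbers, and the labels are copied from the table above.

import random

# class labels from the table above, keyed by item number 1..16
labels = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 0, 8: 0,
          9: 0, 10: 0, 11: 1, 12: 0, 13: 0, 14: 0, 15: 1, 16: 1}

items = list(labels)            # item numbers 1..16
random.shuffle(items)           # RANDOM division, as described above

train_items = items[:8]         # 8 items for training
val_items   = items[8:12]       # 4 items for validation
test_items  = items[12:]        # 4 items for testing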
(B). Next, suppose we develop three classification models A, B, C from the training data. Let the training errors of these models be as shown below (recall that the models do not necessarily provide perfect results on the training data, nor are they required to). A short code sketch of the error computation follows the table.
                                       Classification results from
Item No.   d Attributes   True Class   Model A   Model B   Model C
   1                          0           0         1         1
   2       ALL KNOWN          0           0         0         0
   3                          1           0         1         0
   4                          1           1         0         1
   5                          1           0         0         0
   6                          1           1         1         1
   7                          0           0         0         0
   8                          0           0         0         0
Classification Error                     2/8       3/8       3/8
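The training errors in the last row can be reproduced with a short sketch. The helper name error_rate is ours, not from the notes; the labels and predictions are copied from the table above (items 1-8).

def error_rate(true_labels, predicted_labels):
    # fraction of items whose predicted class differs from the true class
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)

true_train = [0, 0, 1, 1, 1, 1, 0, 0]
pred_A     = [0, 0, 0, 1, 0, 1, 0, 0]
pred_B     = [1, 0, 1, 0, 0, 1, 0, 0]
pred_C     = [1, 0, 0, 1, 0, 1, 0, 0]

print(error_rate(true_train, pred_A))   # 0.25  (2/8)
print(error_rate(true_train, pred_B))   # 0.375 (3/8)
print(error_rate(true_train, pred_C))   # 0.375 (3/8)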
(C). Next, use the three models A, B, C to classify each item in the validation set based on its attribute values. Recall that we know their true labels as well. Suppose we get the following results:
                                       Classification results from
Item No.   d Attributes   True Class   Model A   Model B   Model C
   9                          0           1         0         0
  10                          0           0         1         0
  11                          1           0         1         0
  12                          0           0         1         0
Classification Error                     2/4       2/4       1/4
If we use minimum validation error as the model selection criterion, we would select model C.
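In code, this selection rule is just a minimum over the validation errors. A sketch, with the error values taken from the table above:

validation_error = {"A": 2/4, "B": 2/4, "C": 1/4}
selected_model = min(validation_error, key=validation_error.get)
print(selected_model)   # prints "C", the model with minimum validation error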
(D). Now use model C to determine the class value for each data point in the test set. We do so by substituting the (known) attribute values into classification model C. Again, recall that we know the true label of each of these data items, so we can compare the values obtained from the classification model with the true labels to determine the classification error on the test set. Suppose we get the following results.
                                      Classification results from
Item No.   d Attributes   True Class   Model C
  13                          0           0
  14       ALL KNOWN          0           0
  15                          1           0
  16                          1           1
Classification Error                     1/4
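The test-error step can be sketched by applying the error_rate helper from the earlier sketch to the test items; the labels and predictions are those in the table above.

true_test   = [0, 0, 1, 1]   # items 13-16
pred_C_test = [0, 0, 0, 1]   # model C's predictions on the test set
print(error_rate(true_test, pred_C_test))   # 0.25 (1/4)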
(E). Based on the above, an estimate of the generalization error is 25%. This means that if we use Model C to classify future items for which only the attributes are known, not the class labels, we are likely to make incorrect classifications about 25% of the time.
(F). A summary of the above (classification errors in %) is as follows:

Model   Training   Validation   Test
  A        25          50        ---
  B        37.5        50        ---
  C        37.5        25         25
Cross Validation
If the available data are limited, we employ Cross Validation (CV). In this approach, the data are randomly divided into k (almost) equal sets. Training is done on (k-1) of the sets and the remaining set is used for testing. This process is repeated k times, once with each set held out for testing (k-fold CV). The average error over the k repetitions is used as a measure of the test error.
For the special case in which k equals the number of data items, so that each test set contains exactly one item, the above is called Leave-One-Out Cross-Validation (LOO-CV).
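A minimal sketch of the k-fold procedure is given below. It assumes a generic classifier supplied through the placeholder functions train_model and predict; these names are ours and are not part of the notes.

import random

def k_fold_cv(items, labels, k, train_model, predict):
    order = list(range(len(items)))
    random.shuffle(order)                       # random division of the data
    folds = [order[i::k] for i in range(k)]     # k (almost) equal sets
    errors = []
    for i in range(k):
        test_idx  = set(folds[i])               # the i-th set is held out for testing
        train_idx = [j for j in order if j not in test_idx]
        model = train_model([items[j] for j in train_idx],
                            [labels[j] for j in train_idx])
        wrong = sum(predict(model, items[j]) != labels[j] for j in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k                      # average error over the k folds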
EXAMPLE:
Consider the above data consisting of 16 items.
(A). Let k = 4, i.e., 4-fold Cross Validation.
Divide the data into four sets of 4 items each.
Suppose the following set up occurs and the errors obtained are as shown.
                              Set 1          Set 2               Set 3              Set 4
Training                      Items 1-12     Items 1-8, 13-16    Items 1-4, 9-16    Items 5-16
Test                          Items 13-16    Items 9-12          Items 5-8          Items 1-4
Error on test set (assume)    25%            35%                 28%                32%
Estimated Classification Error (CE) = (25 + 35 + 28 + 32) / 4 = 30%
(B). LOO-CV
For this, the data are divided into 16 sets, each consisting of 15 training items and one test item.
                              Set 1         Set 2            ...   Set 15          Set 16
Training                      Items 1-15    Items 1-14, 16   ...   Items 1, 3-16   Items 2-16
Test                          Item 16       Item 15          ...   Item 2          Item 1
Error on test set (assume)    0%            100%             ...   100%            100%
Suppose the average Classification Error based on the values in the last row is CE = 32%.
Then the estimate of the test error is 32%.
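As a sketch, LOO-CV can be run with the same k_fold_cv function from the earlier sketch by setting k to the number of data items, so that each test set holds exactly one item (train_model and predict are the same placeholders as before):

loo_error = k_fold_cv(items, labels, k=len(items),
                      train_model=train_model, predict=predict)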