Text Data

Download Text Data (college1) Download Text Data (college2) Download Text Data (college3) Download Text Data (politics1) Download Text Data (politics2) Download Text Data (politics3)

Python Part

Linear with Cost=0.1

Linear with Cost=10

Polynomial with Cost=0.1

Polynomial with Cost=10

From the graphs and confusion matrics above, SVM with linear kernel is more suitble to classify text data than SVM with polynomial kernel is. And polynomial kernel SVM with 0.1 cost has the worst prediction out of four SVM. It predicted two variables in text data, and it gets two wrong predictions.

R Part

For this part, there are visualizations of Polynomial with Cost=0.1, Polynomail with Cost=5, Linear with Cost=0.1, Linear with Cost=10, Radial with Cost=0.1, and Radial with Cost=10. And all of these SVM visualizations and confusion matrics look same to each other. From the graph, the word 'tuition' is frequent in articles that talk about college tuitions and word 'law' is not mentioned so many times compared to the word 'tuition' in those articles. On the other hand, in politics articles, 'law' is mentioned for several times. Download Python Code ( for text data) Download R Code ( for text data)

Record Data

Download Record Data

Python Part

Linear with Cost=0.1

Linear with Cost=10

Polynomial with Cost=0.1

Polynomial with Cost=10

From the confusion matrics that has been shown above, the SVM that has linear kernel is the most suitable to predict the label of test record data. And linear kernel SVM with 0.1 cost is better than linear kernel SVM with 10 cost, although the difference is so slight that can be ignored. Polynomial kernel SVM with 10 cost has the best prediction from four graphs above. But when the cost becomes 0.1, the performance is less accurate compared to Polynomial kernel SVM with 0.1 cost.

R Part

Polynomial with Cost=0.1

Polynomail with Cost=5

Linear with Cost=0.1

Linear with Cost=0.5

Radial with Cost=0.1

Radial with Cost=2

Radial kernel SVM is less suitable compare to linear kernel SVM and polynomial kernel SVM. All labels of test data are predicted as 'high' by radial kernel SVM. Only about half of them are predicted correctly as their true labels.

Download Python Code ( for record data) Download R Code ( for record data)

Conclusion

For text data, the frequency of the word 'tuition' is useful to predict if this article talks about college tuition. There are other words that can be used to predict and classify the article. For example, if a random article has a high frequency of the word 'tuition', it is more likely to be predicted as an article that talks about college tuition. If it has high frequency of word 'law', or 'policy', it is more likely to be predicted as an article that talks about politics.

For record data, the majority of universities are predicted as expensive colleges for most people in the United States. However, students who graduated from expensive universities have more probability of getting high incomes at work. Therefore, although many universities have expensive tuition, it is still worh studying at these universities. The investment in a college education will pay-back once students start their careers.