This post describes an analysis of an online news dataset. We perform data cleaning, data transformation, and dimensionality reduction, and then fit supervised and unsupervised models (decision trees, clustering and logistic models) to evaluate how well they predict the popularity of the news.
Introduction
For this analysis I have chosen a dataset with features about news published on the web www.mashable.com. This dataset can be found at the following address: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
The dataset contains parameters collected from the published news; its usefulness lies in building predictive models of the potential popularity of other news based on these parameters.
Popularity is based on the number of times the page is shared, indicated in the "shares" column of the dataset.
The reasons for choosing this dataset are several: it has enough variables to make dimensionality reduction worthwhile, and it contains many continuous variables, which allows discretization. It also supports supervised models, since a target variable is available, as well as unsupervised models, by ignoring that variable.
The variables contained in the dataset are the following (in summary):
- url: url of the news.
- timedelta: days between the article's publication and the collection of the dataset.
- number of words, unique words, stop words (words without meaning, such as prepositions, pronouns or articles) and unique stop words.
- number of references and references to the same page.
- number of images and videos.
- average word length.
- number of keywords.
- type of channel where the news is published.
- keyword rankings (best, worst and average).
- maximum number of keywords.
- minimum, maximum and average number of shares of the articles referenced from the same site.
- day of the week on which the article was published.
- metrics of the LDA topic model of the article.
- other sentiment analysis metrics such as positive or negative polarity, or subjectivity.
- number of times the article has been shared (this will be the target variable for determining popularity).
We start by reading the dataset and displaying a summary of the data. All the variables except the url are numeric. Some are binary, indicating whether the article was published on a certain day of the week or belongs to a particular type of channel. Most are continuous; some, like subjectivity and polarity, have a defined range (between 0 and 1, or between -1 and 1), while others, like the number of words or keywords, take values over a wider range.
data<-read.csv('../Datos/OnlineNewsPopularity.csv')
summary(data)
## url
## http://mashable.com/2013/01/07/amazon-instant-video-browser/ : 1
## http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ : 1
## http://mashable.com/2013/01/07/apple-40-billion-app-downloads/: 1
## http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/ : 1
## http://mashable.com/2013/01/07/att-u-verse-apps/ : 1
## http://mashable.com/2013/01/07/beewi-smart-toys/ : 1
## (Other) :39638
## timedelta n_tokens_title n_tokens_content n_unique_tokens
## Min. : 8.0 Min. : 2.0 Min. : 0.0 Min. : 0.0000
## 1st Qu.:164.0 1st Qu.: 9.0 1st Qu.: 246.0 1st Qu.: 0.4709
## Median :339.0 Median :10.0 Median : 409.0 Median : 0.5392
## Mean :354.5 Mean :10.4 Mean : 546.5 Mean : 0.5482
## 3rd Qu.:542.0 3rd Qu.:12.0 3rd Qu.: 716.0 3rd Qu.: 0.6087
## Max. :731.0 Max. :23.0 Max. :8474.0 Max. :701.0000
##
## n_non_stop_words n_non_stop_unique_tokens num_hrefs
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 1.0000 1st Qu.: 0.6257 1st Qu.: 4.00
## Median : 1.0000 Median : 0.6905 Median : 8.00
## Mean : 0.9965 Mean : 0.6892 Mean : 10.88
## 3rd Qu.: 1.0000 3rd Qu.: 0.7546 3rd Qu.: 14.00
## Max. :1042.0000 Max. :650.0000 Max. :304.00
##
## num_self_hrefs num_imgs num_videos average_token_length
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. :0.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.:4.478
## Median : 3.000 Median : 1.000 Median : 0.00 Median :4.664
## Mean : 3.294 Mean : 4.544 Mean : 1.25 Mean :4.548
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 1.00 3rd Qu.:4.855
## Max. :116.000 Max. :128.000 Max. :91.00 Max. :8.042
##
## num_keywords data_channel_is_lifestyle data_channel_is_entertainment
## Min. : 1.000 Min. :0.00000 Min. :0.000
## 1st Qu.: 6.000 1st Qu.:0.00000 1st Qu.:0.000
## Median : 7.000 Median :0.00000 Median :0.000
## Mean : 7.224 Mean :0.05295 Mean :0.178
## 3rd Qu.: 9.000 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :10.000 Max. :1.00000 Max. :1.000
##
## data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1579 Mean :0.0586 Mean :0.1853
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## data_channel_is_world kw_min_min kw_max_min kw_avg_min
## Min. :0.0000 Min. : -1.00 Min. : 0 Min. : -1.0
## 1st Qu.:0.0000 1st Qu.: -1.00 1st Qu.: 445 1st Qu.: 141.8
## Median :0.0000 Median : -1.00 Median : 660 Median : 235.5
## Mean :0.2126 Mean : 26.11 Mean : 1154 Mean : 312.4
## 3rd Qu.:0.0000 3rd Qu.: 4.00 3rd Qu.: 1000 3rd Qu.: 357.0
## Max. :1.0000 Max. :377.00 Max. :298400 Max. :42827.9
##
## kw_min_max kw_max_max kw_avg_max kw_min_avg
## Min. : 0 Min. : 0 Min. : 0 Min. : -1
## 1st Qu.: 0 1st Qu.:843300 1st Qu.:172847 1st Qu.: 0
## Median : 1400 Median :843300 Median :244572 Median :1024
## Mean : 13612 Mean :752324 Mean :259282 Mean :1117
## 3rd Qu.: 7900 3rd Qu.:843300 3rd Qu.:330980 3rd Qu.:2057
## Max. :843300 Max. :843300 Max. :843300 Max. :3613
##
## kw_max_avg kw_avg_avg self_reference_min_shares
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3562 1st Qu.: 2382 1st Qu.: 639
## Median : 4356 Median : 2870 Median : 1200
## Mean : 5657 Mean : 3136 Mean : 3999
## 3rd Qu.: 6020 3rd Qu.: 3600 3rd Qu.: 2600
## Max. :298400 Max. :43568 Max. :843300
##
## self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## Min. : 0 Min. : 0.0 Min. :0.000
## 1st Qu.: 1100 1st Qu.: 981.2 1st Qu.:0.000
## Median : 2800 Median : 2200.0 Median :0.000
## Mean : 10329 Mean : 6401.7 Mean :0.168
## 3rd Qu.: 8000 3rd Qu.: 5200.0 3rd Qu.:0.000
## Max. :843300 Max. :843300.0 Max. :1.000
##
## weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1864 Mean :0.1875 Mean :0.1833
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.1438 Mean :0.06188 Mean :0.06904 Mean :0.1309
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## LDA_00 LDA_01 LDA_02 LDA_03
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.02505 1st Qu.:0.02501 1st Qu.:0.02857 1st Qu.:0.02857
## Median :0.03339 Median :0.03334 Median :0.04000 Median :0.04000
## Mean :0.18460 Mean :0.14126 Mean :0.21632 Mean :0.22377
## 3rd Qu.:0.24096 3rd Qu.:0.15083 3rd Qu.:0.33422 3rd Qu.:0.37576
## Max. :0.92699 Max. :0.92595 Max. :0.92000 Max. :0.92653
##
## LDA_04 global_subjectivity global_sentiment_polarity
## Min. :0.00000 Min. :0.0000 Min. :-0.39375
## 1st Qu.:0.02857 1st Qu.:0.3962 1st Qu.: 0.05776
## Median :0.04073 Median :0.4535 Median : 0.11912
## Mean :0.23403 Mean :0.4434 Mean : 0.11931
## 3rd Qu.:0.39999 3rd Qu.:0.5083 3rd Qu.: 0.17783
## Max. :0.92719 Max. :1.0000 Max. : 0.72784
##
## global_rate_positive_words global_rate_negative_words rate_positive_words
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.02838 1st Qu.:0.009615 1st Qu.:0.6000
## Median :0.03902 Median :0.015337 Median :0.7105
## Mean :0.03962 Mean :0.016612 Mean :0.6822
## 3rd Qu.:0.05028 3rd Qu.:0.021739 3rd Qu.:0.8000
## Max. :0.15549 Max. :0.184932 Max. :1.0000
##
## rate_negative_words avg_positive_polarity min_positive_polarity
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1852 1st Qu.:0.3062 1st Qu.:0.05000
## Median :0.2800 Median :0.3588 Median :0.10000
## Mean :0.2879 Mean :0.3538 Mean :0.09545
## 3rd Qu.:0.3846 3rd Qu.:0.4114 3rd Qu.:0.10000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## max_positive_polarity avg_negative_polarity min_negative_polarity
## Min. :0.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:0.6000 1st Qu.:-0.3284 1st Qu.:-0.7000
## Median :0.8000 Median :-0.2533 Median :-0.5000
## Mean :0.7567 Mean :-0.2595 Mean :-0.5219
## 3rd Qu.:1.0000 3rd Qu.:-0.1869 3rd Qu.:-0.3000
## Max. :1.0000 Max. : 0.0000 Max. : 0.0000
##
## max_negative_polarity title_subjectivity title_sentiment_polarity
## Min. :-1.0000 Min. :0.0000 Min. :-1.00000
## 1st Qu.:-0.1250 1st Qu.:0.0000 1st Qu.: 0.00000
## Median :-0.1000 Median :0.1500 Median : 0.00000
## Mean :-0.1075 Mean :0.2824 Mean : 0.07143
## 3rd Qu.:-0.0500 3rd Qu.:0.5000 3rd Qu.: 0.15000
## Max. : 0.0000 Max. :1.0000 Max. : 1.00000
##
## abs_title_subjectivity abs_title_sentiment_polarity shares
## Min. :0.0000 Min. :0.0000 Min. : 1
## 1st Qu.:0.1667 1st Qu.:0.0000 1st Qu.: 946
## Median :0.5000 Median :0.0000 Median : 1400
## Mean :0.3418 Mean :0.1561 Mean : 3395
## 3rd Qu.:0.5000 3rd Qu.:0.2500 3rd Qu.: 2800
## Max. :0.5000 Max. :1.0000 Max. :843300
##
str(data)
## 'data.frame': 39644 obs. of 61 variables:
## $ url : Factor w/ 39644 levels "http://mashable.com/2013/01/07/amazon-instant-video-browser/",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ timedelta : num 731 731 731 731 731 731 731 731 731 731 ...
## $ n_tokens_title : num 12 9 9 9 13 10 8 12 11 10 ...
## $ n_tokens_content : num 219 255 211 531 1072 ...
## $ n_unique_tokens : num 0.664 0.605 0.575 0.504 0.416 ...
## $ n_non_stop_words : num 1 1 1 1 1 ...
## $ n_non_stop_unique_tokens : num 0.815 0.792 0.664 0.666 0.541 ...
## $ num_hrefs : num 4 3 3 9 19 2 21 20 2 4 ...
## $ num_self_hrefs : num 2 1 1 0 19 2 20 20 0 1 ...
## $ num_imgs : num 1 1 1 1 20 0 20 20 0 1 ...
## $ num_videos : num 0 0 0 0 0 0 0 0 0 1 ...
## $ average_token_length : num 4.68 4.91 4.39 4.4 4.68 ...
## $ num_keywords : num 5 4 6 7 7 9 10 9 7 5 ...
## $ data_channel_is_lifestyle : num 0 0 0 0 0 0 1 0 0 0 ...
## $ data_channel_is_entertainment: num 1 0 0 1 0 0 0 0 0 0 ...
## $ data_channel_is_bus : num 0 1 1 0 0 0 0 0 0 0 ...
## $ data_channel_is_socmed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ data_channel_is_tech : num 0 0 0 0 1 1 0 1 1 0 ...
## $ data_channel_is_world : num 0 0 0 0 0 0 0 0 0 1 ...
## $ kw_min_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_min_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_min_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ self_reference_min_shares : num 496 0 918 0 545 8500 545 545 0 0 ...
## $ self_reference_max_shares : num 496 0 918 0 16000 8500 16000 16000 0 0 ...
## $ self_reference_avg_sharess : num 496 0 918 0 3151 ...
## $ weekday_is_monday : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_tuesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_wednesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_thursday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_friday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_saturday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_sunday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_weekend : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LDA_00 : num 0.5003 0.7998 0.2178 0.0286 0.0286 ...
## $ LDA_01 : num 0.3783 0.05 0.0333 0.4193 0.0288 ...
## $ LDA_02 : num 0.04 0.0501 0.0334 0.4947 0.0286 ...
## $ LDA_03 : num 0.0413 0.0501 0.0333 0.0289 0.0286 ...
## $ LDA_04 : num 0.0401 0.05 0.6822 0.0286 0.8854 ...
## $ global_subjectivity : num 0.522 0.341 0.702 0.43 0.514 ...
## $ global_sentiment_polarity : num 0.0926 0.1489 0.3233 0.1007 0.281 ...
## $ global_rate_positive_words : num 0.0457 0.0431 0.0569 0.0414 0.0746 ...
## $ global_rate_negative_words : num 0.0137 0.01569 0.00948 0.02072 0.01213 ...
## $ rate_positive_words : num 0.769 0.733 0.857 0.667 0.86 ...
## $ rate_negative_words : num 0.231 0.267 0.143 0.333 0.14 ...
## $ avg_positive_polarity : num 0.379 0.287 0.496 0.386 0.411 ...
## $ min_positive_polarity : num 0.1 0.0333 0.1 0.1364 0.0333 ...
## $ max_positive_polarity : num 0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
## $ avg_negative_polarity : num -0.35 -0.119 -0.467 -0.37 -0.22 ...
## $ min_negative_polarity : num -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
## $ max_negative_polarity : num -0.2 -0.1 -0.133 -0.167 -0.05 ...
## $ title_subjectivity : num 0.5 0 0 0 0.455 ...
## $ title_sentiment_polarity : num -0.188 0 0 0 0.136 ...
## $ abs_title_subjectivity : num 0 0.5 0.5 0.5 0.0455 ...
## $ abs_title_sentiment_polarity : num 0.188 0 0 0 0.136 ...
## $ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
Data cleaning
We start by saving a copy of the original dataset, in case we need it later.
dataorig<-data
Let's check whether there are null or empty values in the dataset. None are found, so the dataset is ready for the analysis to continue.
colSums(is.na(data))
## url timedelta
## 0 0
## n_tokens_title n_tokens_content
## 0 0
## n_unique_tokens n_non_stop_words
## 0 0
## n_non_stop_unique_tokens num_hrefs
## 0 0
## num_self_hrefs num_imgs
## 0 0
## num_videos average_token_length
## 0 0
## num_keywords data_channel_is_lifestyle
## 0 0
## data_channel_is_entertainment data_channel_is_bus
## 0 0
## data_channel_is_socmed data_channel_is_tech
## 0 0
## data_channel_is_world kw_min_min
## 0 0
## kw_max_min kw_avg_min
## 0 0
## kw_min_max kw_max_max
## 0 0
## kw_avg_max kw_min_avg
## 0 0
## kw_max_avg kw_avg_avg
## 0 0
## self_reference_min_shares self_reference_max_shares
## 0 0
## self_reference_avg_sharess weekday_is_monday
## 0 0
## weekday_is_tuesday weekday_is_wednesday
## 0 0
## weekday_is_thursday weekday_is_friday
## 0 0
## weekday_is_saturday weekday_is_sunday
## 0 0
## is_weekend LDA_00
## 0 0
## LDA_01 LDA_02
## 0 0
## LDA_03 LDA_04
## 0 0
## global_subjectivity global_sentiment_polarity
## 0 0
## global_rate_positive_words global_rate_negative_words
## 0 0
## rate_positive_words rate_negative_words
## 0 0
## avg_positive_polarity min_positive_polarity
## 0 0
## max_positive_polarity avg_negative_polarity
## 0 0
## min_negative_polarity max_negative_polarity
## 0 0
## title_subjectivity title_sentiment_polarity
## 0 0
## abs_title_subjectivity abs_title_sentiment_polarity
## 0 0
## shares
## 0
colSums(data=="")
## url timedelta
## 0 0
## n_tokens_title n_tokens_content
## 0 0
## n_unique_tokens n_non_stop_words
## 0 0
## n_non_stop_unique_tokens num_hrefs
## 0 0
## num_self_hrefs num_imgs
## 0 0
## num_videos average_token_length
## 0 0
## num_keywords data_channel_is_lifestyle
## 0 0
## data_channel_is_entertainment data_channel_is_bus
## 0 0
## data_channel_is_socmed data_channel_is_tech
## 0 0
## data_channel_is_world kw_min_min
## 0 0
## kw_max_min kw_avg_min
## 0 0
## kw_min_max kw_max_max
## 0 0
## kw_avg_max kw_min_avg
## 0 0
## kw_max_avg kw_avg_avg
## 0 0
## self_reference_min_shares self_reference_max_shares
## 0 0
## self_reference_avg_sharess weekday_is_monday
## 0 0
## weekday_is_tuesday weekday_is_wednesday
## 0 0
## weekday_is_thursday weekday_is_friday
## 0 0
## weekday_is_saturday weekday_is_sunday
## 0 0
## is_weekend LDA_00
## 0 0
## LDA_01 LDA_02
## 0 0
## LDA_03 LDA_04
## 0 0
## global_subjectivity global_sentiment_polarity
## 0 0
## global_rate_positive_words global_rate_negative_words
## 0 0
## rate_positive_words rate_negative_words
## 0 0
## avg_positive_polarity min_positive_polarity
## 0 0
## max_positive_polarity avg_negative_polarity
## 0 0
## min_negative_polarity max_negative_polarity
## 0 0
## title_subjectivity title_sentiment_polarity
## 0 0
## abs_title_subjectivity abs_title_sentiment_polarity
## 0 0
## shares
## 0
We check the variables that must be binary to verify that they take no value outside their domain (0 or 1); in this case everything is correct.
apply(data[,c(14:19,32:39)],2,function(x) levels(as.factor(x)))
## data_channel_is_lifestyle data_channel_is_entertainment
## [1,] "0" "0"
## [2,] "1" "1"
## data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## [1,] "0" "0" "0"
## [2,] "1" "1" "1"
## data_channel_is_world weekday_is_monday weekday_is_tuesday
## [1,] "0" "0" "0"
## [2,] "1" "1" "1"
## weekday_is_wednesday weekday_is_thursday weekday_is_friday
## [1,] "0" "0" "0"
## [2,] "1" "1" "1"
## weekday_is_saturday weekday_is_sunday is_weekend
## [1,] "0" "0" "0"
## [2,] "1" "1" "1"
The variable is_weekend is redundant, since it conveys the same information as weekday_is_saturday and weekday_is_sunday combined. Let's remove it.
data$is_weekend<-NULL
Let's now check that the binary variables that must be mutually exclusive do not have several "1" values in the same record, nor all values at 0.
#Day of the week
nrow(data[rowSums(data[,c(32:38)])>1,])
## [1] 0
nrow(data[rowSums(data[,c(32:38)])<1,])
## [1] 0
#Channel type
nrow(data[rowSums(data[,c(14:19)])>1,])
## [1] 0
nrow(data[rowSums(data[,c(14:19)])<1,])
## [1] 6134
We see that 6134 rows have no channel type assigned. Rather than delete them, we create a new variable called data_channel_is_other with value 1 in these rows and 0 in the rest, so that these records are also classified.
#Create column with value 0
data$data_channel_is_other<-0
#Assign value 1 to rows without an assigned channel type
data[rowSums(data[,c(14:19)])<1,]$data_channel_is_other<-1
Let's look at the correlations between the continuous variables. Given the number of variables, we replace the row and column names with numbers so that the plot is somewhat clearer.
library(corrplot)
M<-cor(data[,c(3:13,20:31,40:60)])
#Save the variable names
coln<-colnames(M)
rown<-rownames(M)
colnames(M)<-1:44
rownames(M)<-1:44
corrplot(M, type = "upper", method = "circle", tl.cex = 0.6)
We see a high correlation between some variables; we redraw the plot with only the rows and columns of interest to see it better.
#We return the variable names to the matrix to make it clearer.
colnames(M)<-coln
rownames(M)<-rown
corrplot(M[c(3:5,12,13,19,29),c(3:5,14,16,20,32)], type = "upper", method = "number", tl.cex = 0.9)
The variables n_unique_tokens, n_non_stop_words and n_non_stop_unique_tokens show a correlation of 1. The first refers to the total number of distinct words; the other two refer to stop words, i.e. words without meaning (prepositions, articles or pronouns). The latter appear to add no information beyond the total number of distinct words, so we remove them. The remaining variables, although highly correlated, do not reach the maximum correlation (1), so we keep them, since we will apply the SVD algorithm to them later.
data$n_non_stop_words<-NULL
data$n_non_stop_unique_tokens<-NULL
Descriptive Analysis
We are going to perform a descriptive analysis of the dataset.
First let's look at the distributions of the variables. Given how many there are, for clarity we show only the most representative ones; the rest have distributions similar to one of these.
We can observe that the distributions are quite heterogeneous: none is normal, although some come close. Several variables concentrate most of their density in a narrow range, with much less density over the remaining values.
library(dplyr)
library(ggplot2)
dens <- lapply(colnames(data[,c(3,4,10,11,23,40,42,46,47,49,52)]),function(cn) ggplot(data,aes(x=data[,cn]))+geom_density(color="darkblue", fill="lightblue")+labs(x=cn))
gridExtra::marrangeGrob(grobs=dens, nrow=2, ncol=2)
Next we normalize the data to the range (0,1) and obtain the density plots again for comparison.
library("BBmisc")
datanorm<-data[,c(3,4,10,11,23,40,42,46,47,49,52)]
datanorm<-normalize(datanorm,method="range")
We obtain the plots with the normalized data.
We see that the distributions of the variables do not change: the plots are identical except that the range is now 0 to 1, which allows comparisons between variables, since they are now of the same order of magnitude.
library(dplyr)
library(ggplot2)
dens <- lapply(colnames(datanorm[,c(1:ncol(datanorm))]),function(cn) ggplot(datanorm,aes(x=datanorm[,cn]))+geom_density(color="darkblue", fill="lightblue")+labs(x=cn))
gridExtra::marrangeGrob(grobs=dens, nrow=2, ncol=2)
Discretization
Let's see how each of these variables behaves by applying different types of discretization.
To perform this discretization we will use the discretize function of the arules package.
First we perform discretization with equal-width intervals.
Width
library(arules)
disc <- lapply(c(3,4,10,11,23,40,42,46,47,49,52),function(cnum) ggplot(data,aes(discretize(data[,cnum],breaks=5,method="interval")))+geom_bar(color="darkblue",fill="lightblue")+
xlab(colnames(data)[cnum])+theme(axis.text.x = element_text(angle = 45, hjust = 1)))
gridExtra::marrangeGrob(grobs=disc, nrow=2, ncol=2)
We can observe that in most cases this type of discretization groups the values quite irregularly: the shape of the bar plot of the intervals closely resembles the density plot of each variable, with more observations in the intervals where the variable is denser, which is to be expected with this type of discretization.
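To see exactly where these boundaries fall, discretize can also return just the cut points instead of the discretized vector (a quick check using the onlycuts argument of arules, shown here for n_tokens_title as an example):
#Equal-width cut points for one of the plotted variables
discretize(data$n_tokens_title, method="interval", breaks=5, onlycuts=TRUE)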
Frequency
Next, we perform discretization with equal-frequency intervals.
library(arules)
disc <- lapply(c(3,4,10,11,23,40,42,46,47,49,52),function(cnum) ggplot(data,aes(discretize(data[,cnum],breaks=5,method="frequency")))+geom_bar(color="darkblue",fill="lightblue")+
xlab(colnames(data)[cnum])+theme(axis.text.x = element_text(angle = 45, hjust = 1)))
gridExtra::marrangeGrob(grobs=disc, nrow=2, ncol=2)
In this case most of the variables yield more uniform bar plots in terms of counts, and their shape no longer resembles the density plot. However, depending on the variable, we get very "wide" intervals containing very different values. For example, in LDA_03 the last interval is quite wide, since there is little diversity of values in that range. The case of min_positive_polarity is striking: given the irregular density of its values, it cannot be discretized into equal frequencies.
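We can verify this directly: equal-frequency discretization takes the quantiles as cut points, and for min_positive_polarity several of them coincide (a quick check in base R):
#Quantiles that would serve as cut points for 5 equal-frequency bins
quantile(data$min_positive_polarity, probs=seq(0,1,0.2))
#Repeated quantiles mean that 5 distinct bins cannot be formed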
K-means
library(arules)
disc <- lapply(c(3,4,10,11,23,40,42,46,47,49,52),function(cnum) ggplot(data,aes(discretize(data[,cnum],breaks = 5,method="cluster")))+geom_bar(color="darkblue",fill="lightblue")+
xlab(colnames(data)[cnum])+theme(axis.text.x = element_text(angle = 45, hjust = 1)))
gridExtra::marrangeGrob(grobs=disc, nrow=2, ncol=2)
Here the algorithm tries to minimize the variation of values within each cluster. This makes the real distribution of the values more visible, with higher frequencies where the variable is denser; in this regard the plots resemble the equal-width discretization, although in some cases more regularity is achieved.
Normal function
We are now going to discretize according to the normal distribution; although none of the variables seems to follow it, we will see how they respond. We create a function that computes the probabilities corresponding to the chosen z values (-1.25, -0.5, 0.5, 1.25) and builds the intervals according to these proportions.
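As a quick sanity check before defining the function, the five proportions derived from these z values add up to 1:
#Proportions of a standard normal between the chosen z cut points
probs<-c(pnorm(-1.25),pnorm(-0.5)-pnorm(-1.25),pnorm(0.5)-pnorm(-0.5),pnorm(1.25)-pnorm(0.5),1-pnorm(1.25))
round(probs,3) #0.106 0.203 0.383 0.203 0.106
sum(probs) #1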
#Function that creates the intervals
zdiscr<-function(x)
{
#Save the probabilities of the chosen z values (-1.25,-0.5,0.5,1.25)
probs<-c(pnorm(-1.25),pnorm(-0.5)-pnorm(-1.25),pnorm(0.5)-pnorm(-0.5),pnorm(1.25)-pnorm(0.5),1-pnorm(1.25))
#Sort the values
col<-sort(x,decreasing=FALSE)
#Define the intervals
#Get an array with the unique values
uniq<-unique(col)
#Index of the first value in the current interval
cut1<-1
#Lower bound of the current interval
lowint<-col[1]
intervals<-vector() #Interval vector (return value)
interval_levels<-vector() #Levels vector; saved so they can be applied at the end without reordering the intervals
preuniq<-0 #Tracks the index into uniq so that values accumulate in proportion to the z probabilities
for (i in 1:length(probs))
{
#Cutoff: number of unique values * probability for this z interval
uniqindex<-round(length(uniq)*probs[i])
#Value at the selected position
highint<-uniq[preuniq+uniqindex]
#Check whether the last interval is possible: only the last value may remain and we may have run past the end of the array.
#In that case highint is NA; we assign it the value of lowint and the last bin will contain only that value.
if(is.na(highint))
{
highint=lowint
cut2<-length(col)
}
else
{
#If there are few values, highint may be the smallest value; we adjust and take the next one,
#since a first interval containing a single value is not possible.
if(highint==min(uniq))
{
uniqindex<-uniqindex+1
highint<-uniq[preuniq+uniqindex]
}
#Find the position of the last value that will go into the interval.
if(i<length(probs))
cut2<-max(which(col<highint))
else
#In the last interval the value itself is included, since it is closed on the right side.
cut2<-max(which(col<=highint))
}
preuniq<-preuniq+uniqindex
#It is only possible to create the interval if there is cut2. In cases
#where the largest value of the interval coincides with the minimum value of the column, such an interval is not possible.
if(is.finite(cut1) && is.finite(cut2))
{
#Create as many entries in the interval array as there are values in that interval.
if(i<length(probs))
{
#Save the levels to prevent R from reordering them.
interval_levels<-c(interval_levels,paste0("[",round(lowint,3),",",round(highint,3),")"))
#Copy the interval into the array as many times as there are values of the interval in the original array.
intervals<-c(intervals,rep(paste0("[",round(lowint,3),",",round(highint,3),")"),cut2-cut1+1))
}
else
{
#The last interval will be closed from the right.
interval_levels<-c(interval_levels,paste0("[",round(lowint,3),",",round(highint,3),"]"))
intervals<-c(intervals,rep(paste0("[",round(lowint,3),",",round(highint,3),"]"),cut2-cut1+1))
}
#Move the cutoff for the following interval
cut1<-cut2+1
}
#The lower bound of the next interval is the upper bound of the previous one.
lowint<-highint
}
#Convert the result to a data frame, reassign the levels
#so they keep the original order, and return the result.
intervals<-as.data.frame(intervals)
intervals$intervals<-factor(intervals$intervals,levels=interval_levels)
return(intervals)
}
#We apply the function to all the columns and obtain the graph.
disc <- lapply(c(3,4,10,11,23,40,42,46,47,49,52),function(cnum) {
intervals<-zdiscr(data[,cnum])
ggplot(intervals,aes(x=intervals))+geom_bar(stat="count",color="darkblue",fill="lightblue")+
xlab(colnames(data)[cnum])+theme(axis.text.x = element_text(angle = 45, hjust = 1))
})
gridExtra::marrangeGrob(grobs=disc, nrow=2, ncol=2)
Although none of the variables follows a normal distribution, we can see that some of them accommodate this type of discretization well.
The highest frequency tends to be concentrated in the central intervals, which is expected, since that is where the normal distribution places the largest proportion of values. In several variables the resulting plot resembles a normal distribution, although the interval widths are quite uneven when the density is concentrated at one end, as in LDA_03. We also observe that, when there are few distinct values, we obtain intervals containing a single value, especially the last one; this occurs in num_keywords and kw_avg_max.
SVD dimensionality reduction
We are going to reduce the dimensionality of the dataset by applying SVD (singular value decomposition) to its continuous numerical variables.
SVD decomposes a matrix into singular values and singular vectors: a vector (d) of singular values and two matrices of singular vectors (u and v), the first relating to the rows of the matrix (the principal components) and the second to its columns. The length of the vector "d" equals the number of columns of the matrix, the matrix "u" has the same size as the original matrix, and the matrix "v" is square, with as many rows and columns as the dataset has columns.
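As a minimal illustration of these dimensions, we can apply svd to a small toy matrix (purely illustrative):
#SVD of a 4x3 matrix: one singular value per column
A<-matrix(rnorm(12),nrow=4,ncol=3)
s<-svd(A)
length(s$d) #3
dim(s$u) #4 3
dim(s$v) #3 3
#The original matrix is recovered as u %*% diag(d) %*% t(v)
max(abs(A-s$u%*%diag(s$d)%*%t(s$v))) #practically 0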
Next we apply svd on the continuous variables and display the values of the vectors of our dataset. To simplify the presentation, only the first 5 rows of u and v are displayed.
#Create a dataset with the continuous variables
datanum <- (data[,c(2:10,17:28,36:56)])
#Scaling the data
datanum<-scale(datanum)
#We apply SVD and display the data.
datanum.svd<-svd(datanum)
"d"
## [1] "d"
datanum.svd$d
## [1] 435.86495908 392.97389437 373.00211802 329.50144643 300.54098598
## [6] 288.33208123 259.87679145 246.96165444 240.03495172 238.71929170
## [11] 233.14862176 220.20153443 210.79590732 207.06389298 199.54850472
## [16] 197.60486053 193.48271139 187.50067045 179.32521247 166.75167621
## [21] 160.72846361 154.20653132 152.64297871 148.43576736 140.48371299
## [26] 134.89712031 134.31326141 128.75125574 126.60830703 116.45830674
## [31] 108.21799033 105.02227950 85.88362175 83.93061932 73.04281379
## [36] 62.37521050 56.29744729 55.59221505 48.63510408 39.93920605
## [41] 32.11331643 0.02854457
"u"
## [1] "u"
datanum.svd$u[1:5,]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -3.385812e-03 0.006482267 -0.006124425 0.003765417 -0.007649380
## [2,] 7.160552e-05 0.010893263 -0.004993432 0.003369068 -0.003279016
## [3,] -8.940032e-03 0.008645131 -0.005460680 0.001766840 -0.004323186
## [4,] -1.973601e-03 0.005191147 -0.009384980 0.004877264 -0.004584371
## [5,] -9.191780e-03 0.008729192 -0.004061428 -0.001309566 0.004055826
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.0008022925 0.005712196 -5.810998e-03 -0.0023044690 -0.0001962938
## [2,] 0.0015301889 -0.004711192 -6.837987e-03 -0.0002711066 -0.0038951485
## [3,] -0.0023831448 -0.003891976 -1.620292e-03 0.0004742731 0.0009639064
## [4,] -0.0004544707 -0.001793962 -7.970181e-05 0.0009011354 -0.0017215056
## [5,] 0.0132565096 0.002553895 3.154651e-03 -0.0029947192 0.0065904581
## [,11] [,12] [,13] [,14] [,15]
## [1,] -0.004417958 0.0013223745 0.004017870 -0.0003510615 0.0018983328
## [2,] 0.002964934 0.0033843555 0.004391979 -0.0055342393 0.0002399426
## [3,] -0.003068118 -0.0006437289 -0.006344687 0.0013045685 -0.0009672438
## [4,] -0.003810086 0.0040769780 0.002817938 0.0042052519 0.0005383038
## [5,] -0.001261883 -0.0054531252 -0.004692972 0.0019186179 0.0028051413
## [,16] [,17] [,18] [,19] [,20]
## [1,] -0.003207650 0.0054956882 0.0017243019 0.0059939961 -0.0058545529
## [2,] -0.000585199 0.0009942088 0.0013589702 0.0008725356 -0.0001222914
## [3,] 0.001199733 -0.0031858479 0.0057170035 -0.0044088114 -0.0043207743
## [4,] -0.002522068 -0.0027396626 -0.0012290974 -0.0039576367 -0.0030495272
## [5,] -0.003332712 0.0091605835 0.0007881861 0.0068524002 0.0011518221
## [,21] [,22] [,23] [,24] [,25]
## [1,] 0.0009219086 0.0008116234 0.0006152438 0.0009257692 0.0029916862
## [2,] -0.0053754431 0.0025061767 -0.0014188080 0.0019004582 -0.0007882620
## [3,] 0.0014927664 -0.0024788368 0.0065281555 -0.0026304395 -0.0002210594
## [4,] -0.0006189323 -0.0005332388 0.0026593720 -0.0010316101 0.0029397696
## [5,] -0.0075918668 -0.0012860026 0.0103827916 0.0053658913 0.0025345687
## [,26] [,27] [,28] [,29] [,30]
## [1,] -2.419917e-03 -0.001785278 -0.0018219301 -0.005497054 -0.004750886
## [2,] 5.550875e-03 -0.003660889 -0.0004846979 -0.002622815 0.001913659
## [3,] -3.168172e-03 -0.001859394 -0.0055794480 -0.003508295 -0.014145093
## [4,] 7.931757e-05 -0.002427204 -0.0062993395 -0.003865837 0.003513539
## [5,] 5.583558e-04 0.003984597 -0.0006616196 -0.001520384 -0.005504363
## [,31] [,32] [,33] [,34] [,35]
## [1,] 0.0030965488 -0.010320805 -0.0005335837 0.001892414 -0.03463568
## [2,] -0.0019229863 -0.012759956 -0.0020608487 0.001078080 -0.03443611
## [3,] 0.0063091260 -0.008155755 -0.0042873231 0.001326227 -0.03550967
## [4,] -0.0004069632 -0.005223992 -0.0033778086 0.014733847 -0.03473534
## [5,] 0.0009561588 -0.005076812 -0.0047335245 -0.001599254 -0.03613900
## [,36] [,37] [,38] [,39] [,40]
## [1,] -0.003873997 0.0007563766 -0.009059118 -0.008668327 0.0019382965
## [2,] 0.001449192 -0.0010396272 -0.009097761 0.002476414 0.0003082738
## [3,] 0.001506646 -0.0068628751 -0.009921063 0.008989540 0.0028175504
## [4,] -0.003497095 -0.0023530477 -0.008894878 0.002151546 0.0023306861
## [5,] 0.001489364 -0.0077657183 -0.006413718 -0.001870882 0.0023400405
## [,41] [,42]
## [1,] 0.0032175328 -0.0038690460
## [2,] 0.0019817553 -0.0016703504
## [3,] 0.0004969286 -0.0006988299
## [4,] -0.0041086339 0.0050197330
## [5,] 0.0055072330 0.0027956778
"v"
## [1] "v"
datanum.svd$v[1:5,]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.160176528 0.212655760 -0.087917290 0.3006809405 -0.127347683
## [2,] 0.057624242 -0.072963877 0.015681959 -0.0839603606 0.002522261
## [3,] -0.138756231 -0.120186226 -0.118620240 -0.1098440427 0.323525008
## [4,] -0.001886699 0.001056695 0.000614258 0.0000989348 -0.005865528
## [5,] -0.151918979 -0.139421275 -0.008222312 -0.0630964546 0.215453543
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.117861359 -0.079799426 0.018431306 -0.0509609438 -0.12012821
## [2,] -0.001433014 0.249892639 -0.005786841 -0.0134710244 0.08172337
## [3,] 0.322829291 -0.025924881 -0.084098736 0.0166977660 -0.03816705
## [4,] -0.003166784 -0.007444727 0.015356609 0.0005415314 0.01111672
## [5,] 0.281549104 -0.110350543 0.174827337 -0.0046773757 -0.11503890
## [,11] [,12] [,13] [,14] [,15]
## [1,] 0.058916882 -0.07596585 0.03692845 0.004615699 0.001669931
## [2,] -0.082433542 0.19048019 -0.12721847 0.150026085 0.097274270
## [3,] 0.012894253 0.01190671 0.02189248 0.052507880 -0.001656651
## [4,] -0.004631027 -0.04208872 0.11083770 -0.031435153 0.889656752
## [5,] -0.098173031 -0.05976804 0.11049366 -0.181561340 0.013260949
## [,16] [,17] [,18] [,19] [,20]
## [1,] -0.01189324 -0.035835667 -0.06846093 0.07692079 0.014744969
## [2,] -0.07750860 0.719846877 0.42223127 -0.01179334 0.238552938
## [3,] -0.01955257 0.008153615 0.12450141 -0.19118297 -0.195908668
## [4,] 0.42276868 -0.067618755 0.04806171 -0.05294360 -0.002468255
## [5,] -0.07532025 0.096635603 -0.16154077 0.11702986 -0.017647491
## [,21] [,22] [,23] [,24] [,25]
## [1,] -0.08091506 -0.041017076 -0.00530985 0.08481055 1.042435e-01
## [2,] -0.03235169 0.003388664 -0.03830150 -0.06659852 -5.409947e-02
## [3,] 0.16466115 0.013542924 -0.06230209 0.11593981 -3.239975e-02
## [4,] 0.03424717 -0.028228721 0.02885058 -0.03467925 -8.355276e-06
## [5,] 0.10974381 -0.045298941 -0.28812328 -0.49280635 2.411568e-01
## [,26] [,27] [,28] [,29] [,30]
## [1,] 0.12961870 0.040784050 -0.22389909 -0.765407997 0.0110983098
## [2,] 0.08309065 0.029003632 -0.15065757 -0.150774013 0.0015689583
## [3,] -0.17315350 0.131474721 0.08361562 -0.066177029 0.6204194884
## [4,] 0.00751278 -0.001757247 -0.01010677 -0.003471738 -0.0005628256
## [5,] 0.13278299 -0.138708900 -0.42363307 0.158593423 -0.0138376222
## [,31] [,32] [,33] [,34] [,35]
## [1,] 0.079271564 0.2402076559 -0.011142287 -0.0759035859 0.0373670293
## [2,] -0.004399013 0.0101972524 -0.009848919 -0.0046400713 0.0070068610
## [3,] 0.340866487 0.0186629140 -0.047343141 -0.0587009917 -0.0054278059
## [4,] 0.005633113 -0.0001072322 0.006400669 0.0003668092 0.0005151918
## [5,] -0.073535012 0.0758025349 0.034274521 0.0092647868 -0.0221446993
## [,36] [,37] [,38] [,39] [,40]
## [1,] 0.017364096 -0.0166638004 -0.0270754730 -0.0011494122 -0.027729431
## [2,] 0.003595495 -0.0002643839 0.0144764347 0.0003837793 -0.010483728
## [3,] -0.085136156 -0.0605387853 0.0323258233 0.0499971586 -0.014185718
## [4,] -0.001710734 0.0030371435 0.0001139472 -0.0019293816 0.001175496
## [5,] -0.031068643 0.0131640133 -0.0375182867 -0.0136114318 0.026783131
## [,41] [,42]
## [1,] 0.005351511 1.889274e-05
## [2,] 0.016655576 6.602757e-08
## [3,] 0.035034272 -1.350970e-04
## [4,] -0.014758312 -8.276680e-03
## [5,] -0.043744018 -2.845891e-06
In the following plot we can see how the singular values of the vector "d" are ordered; the smaller ones are less important in explaining the total variation of the dataset.
plot( datanum.svd$d, xlab="Column", ylab="Singular value", pch=19 )
The following plot shows the proportion of variance explained by each value of "d", again in decreasing order. From around the 30th singular value onwards the explained variability is very close to 0, so these values contribute little. This means we can drop them without fear of losing too much information when applying a model.
plot(prop.table(datanum.svd$d^2),ylab="Percent variability explained")
We see that with the first 30 columns we reach an explained variability of 96.5%, so we can dispense with the remaining singular values without losing much information.
sum(prop.table(datanum.svd$d^2)[1:30])
## [1] 0.9653823
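Equivalently, we can compute the smallest number of singular values needed to reach a given threshold of explained variability:
#Cumulative explained variability and components needed for at least 95%
cumvar<-cumsum(prop.table(datanum.svd$d^2))
min(which(cumvar>=0.95))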
The decomposition of the matrix into singular values and vectors allows us to recover the original data with the following multiplication:
X = u %*% diag(d) %*% t(v)
We show the first 5 rows of the resulting matrix, which is identical to the original (scaled) data.
X=datanum.svd$u %*% diag(datanum.svd$d) %*% t(datanum.svd$v)
X[1:5,]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.757858 0.7574377 -0.69520168 0.032771459 -0.6074549 -0.3355619
## [2,] 1.757858 -0.6616483 -0.61878600 0.016055679 -0.6957005 -0.5949557
## [3,] 1.757858 -0.6616483 -0.71218294 0.007644432 -0.6957005 -0.5949557
## [4,] 1.757858 -0.6616483 -0.03293246 -0.012619006 -0.1662272 -0.8543496
## [5,] 1.757858 1.2304663 1.11542538 -0.037654383 0.7162282 4.0741340
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] -0.4265204 -0.3042644 0.1564722 -0.5195597 -0.3749188 -0.2991069
## [2,] -0.4265204 -0.3042644 0.4328325 -0.5195597 -0.3749188 -0.2991069
## [3,] -0.4265204 -0.3042644 -0.1834123 -0.5195597 -0.3749188 -0.2991069
## [4,] -0.4265204 -0.3042644 -0.1697560 -0.5195597 -0.3749188 -0.2991069
## [5,] 1.8600374 -0.3042644 0.1593979 -0.5195597 -0.3749188 -0.2991069
## [,13] [,14] [,15] [,16] [,17] [,18]
## [1,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
## [2,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
## [3,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
## [4,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
## [5,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
## [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] -2.378984 -0.1774565 -0.2396733 -0.2723188 1.2006179 1.0788132
## [2,] -2.378984 -0.2025848 -0.2517627 -0.2723188 2.3392236 -0.4151391
## [3,] -2.378984 -0.1560771 -0.2293875 -0.2723188 0.1262222 -0.4912049
## [4,] -2.378984 -0.2025848 -0.2517627 -0.2723188 -0.5933112 1.2655193
## [5,] -2.378984 -0.1749741 0.1382189 -0.2723188 -0.5930846 -0.5118729
## [,25] [,26] [,27] [,28] [,29] [,30]
## [1,] -0.6249133 -0.6182679 -0.6705322 0.6705848 -0.2759425 0.3463989
## [2,] -0.5891460 -0.5883279 -0.6363734 -0.8752172 0.3057706 0.2015314
## [3,] -0.6484942 -0.6451289 1.5497390 2.2183898 2.1048455 0.9895888
## [4,] 0.9864774 -0.6601322 -0.7104755 -0.1158723 -0.1919372 0.1036470
## [5,] -0.6654226 -0.6612604 2.2525402 0.6010382 1.6681430 2.0083034
## [,31] [,32] [,33] [,34] [,35] [,36]
## [1,] -0.26907511 0.45782152 -0.3660720 0.2373340 0.06386407 -0.2289379
## [2,] -0.08550635 0.26909259 -0.1361899 -0.6400320 -0.87095652 -0.2289379
## [3,] -0.65880896 0.92001481 -0.9290484 1.3583836 0.06386407 0.9817856
## [4,] 0.37897951 -0.08140399 0.2907339 0.3074379 0.57376621 0.1746366
## [5,] -0.41423531 0.93616673 -0.9487223 0.5481279 -0.87095652 0.9817856
## [,37] [,38] [,39] [,40] [,41] [,42]
## [1,] -0.7083605 -0.26889113 -0.96987381 0.6712369 -0.9754199 -1.8106960
## [2,] 1.1021597 1.36740642 0.07864114 -0.8707956 -0.2690728 0.8377381
## [3,] -1.6217761 -0.95785852 -0.27086384 -0.8707956 -0.2690728 0.8377381
## [4,] -0.8625735 -0.26889113 -0.62036882 -0.8707956 -0.2690728 0.8377381
## [5,] 0.3079398 0.07559256 0.60289861 0.5310522 0.2446342 -1.5699293
Since we are trying to reduce dimensionality, we want to work with a reduced version of the data, so that the appropriate models can be applied more easily and the algorithms run more efficiently. To obtain the relevant information it is enough to multiply the reduced versions of "u" and "d", keeping only the columns of interest, so the models can work with the most relevant information.
datareduced=datanum.svd$u[,1:30] %*% diag(datanum.svd$d[1:30])
datareduced[1:5,]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -1.47575683 2.547362 -2.284423 1.2407105 -2.2989522 0.2313267
## [2,] 0.03121033 4.280768 -1.862561 1.1101126 -0.9854788 0.4412026
## [3,] -3.89664683 3.397311 -2.036845 0.5821763 -1.2992944 -0.6871371
## [4,] -0.86022337 2.039985 -3.500617 1.6070654 -1.3777914 -0.1310385
## [5,] -4.00637470 3.430344 -1.514921 -0.4315038 1.2189420 3.8222770
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 1.484467 -1.43509356 -0.55315311 -0.04685913 -1.0300407 0.2911889
## [2,] -1.224329 -1.68872056 -0.06507506 -0.92984709 0.6912702 0.7452403
## [3,] -1.011434 -0.40014993 0.11384211 0.23010305 -0.7153275 -0.1417501
## [4,] -0.466209 -0.01968329 0.21630400 -0.41095659 -0.8883163 0.8977568
## [5,] 0.663698 0.77907774 -0.71883727 1.57326948 -0.2942063 -1.2007865
## [,13] [,14] [,15] [,16] [,17] [,18]
## [1,] 0.8469506 -0.07269215 0.37880947 -0.6338472 1.0633207 0.3233078
## [2,] 0.9258112 -1.14594114 0.04788018 -0.1156382 0.1923622 0.2548078
## [3,] -1.3374341 0.27012904 -0.19301206 0.2370730 -0.6164065 1.0719420
## [4,] 0.5940097 0.87075582 0.10741771 -0.4983728 -0.5300773 -0.2304566
## [5,] -0.9892594 0.39727649 0.55976174 -0.6585601 1.7724145 0.1477854
## [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] 1.0748746 -0.9762565 0.14817696 0.12515763 0.09391265 0.1374173
## [2,] 0.1564676 -0.0203923 -0.86398671 0.38646882 -0.21657108 0.2820960
## [3,] -0.7906110 -0.7204964 0.23993006 -0.38225283 0.99647710 -0.3904513
## [4,] -0.7097040 -0.5085138 -0.09948003 -0.08222891 0.40593446 -0.1531278
## [5,] 1.2288081 0.1920683 -1.22022908 -0.19831000 1.58486024 0.7964902
## [,25] [,26] [,27] [,28] [,29] [,30]
## [1,] 0.42028318 -0.32643986 -0.2397866 -0.23457578 -0.6959727 -0.5532801
## [2,] -0.11073797 0.74879709 -0.4917060 -0.06240546 -0.3320702 0.2228615
## [3,] -0.03105525 -0.42737721 -0.2497413 -0.71836093 -0.4441793 -1.6473136
## [4,] 0.41298974 0.01069971 -0.3260057 -0.81104787 -0.4894471 0.4091807
## [5,] 0.35606562 0.07532059 0.5351842 -0.08518436 -0.1924933 -0.6410288
Data conversion and creation of train and test sets
To optimize results, we convert the binary variables to factors and normalize the numerical ones, except shares, the target variable, which we will remove for the models once we create another column with the popularity thresholds. We apply the scale function, as we did before applying the SVD algorithm.
library(BBmisc)
datascale<-data
datascale[,c(12:17,30:36,59)]<-lapply(c(12:17,30:36,59), function(x) as.factor(datascale[,x]))
datascale[,c(3:11,18:29,37:57)]<-lapply(c(3:11,18:29,37:57), function(x) as.numeric(scale(datascale[,x],center=TRUE,scale=TRUE)))
str(datascale)
## 'data.frame': 39644 obs. of 59 variables:
## $ url : Factor w/ 39644 levels "http://mashable.com/2013/01/07/amazon-instant-video-browser/",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ timedelta : num 731 731 731 731 731 731 731 731 731 731 ...
## $ n_tokens_title : num 0.757 -0.662 -0.662 -0.662 1.23 ...
## $ n_tokens_content : num -0.6952 -0.6188 -0.7122 -0.0329 1.1154 ...
## $ n_unique_tokens : num 0.03277 0.01606 0.00764 -0.01262 -0.03765 ...
## $ num_hrefs : num -0.607 -0.696 -0.696 -0.166 0.716 ...
## $ num_self_hrefs : num -0.336 -0.595 -0.595 -0.854 4.074 ...
## $ num_imgs : num -0.427 -0.427 -0.427 -0.427 1.86 ...
## $ num_videos : num -0.304 -0.304 -0.304 -0.304 -0.304 ...
## $ average_token_length : num 0.156 0.433 -0.183 -0.17 0.159 ...
## $ num_keywords : num -1.165 -1.689 -0.641 -0.117 -0.117 ...
## $ data_channel_is_lifestyle : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ data_channel_is_entertainment: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 1 1 ...
## $ data_channel_is_bus : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
## $ data_channel_is_socmed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ data_channel_is_tech : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1 ...
## $ data_channel_is_world : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ kw_min_min : num -0.375 -0.375 -0.375 -0.375 -0.375 ...
## $ kw_max_min : num -0.299 -0.299 -0.299 -0.299 -0.299 ...
## $ kw_avg_min : num -0.503 -0.503 -0.503 -0.503 -0.503 ...
## $ kw_min_max : num -0.235 -0.235 -0.235 -0.235 -0.235 ...
## $ kw_max_max : num -3.51 -3.51 -3.51 -3.51 -3.51 ...
## $ kw_avg_max : num -1.92 -1.92 -1.92 -1.92 -1.92 ...
## $ kw_min_avg : num -0.982 -0.982 -0.982 -0.982 -0.982 ...
## $ kw_max_avg : num -0.928 -0.928 -0.928 -0.928 -0.928 ...
## $ kw_avg_avg : num -2.38 -2.38 -2.38 -2.38 -2.38 ...
## $ self_reference_min_shares : num -0.177 -0.203 -0.156 -0.203 -0.175 ...
## $ self_reference_max_shares : num -0.24 -0.252 -0.229 -0.252 0.138 ...
## $ self_reference_avg_sharess : num -0.244 -0.264 -0.226 -0.264 -0.134 ...
## $ weekday_is_monday : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ weekday_is_tuesday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_wednesday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_thursday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_friday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_saturday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_sunday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ LDA_00 : num 1.201 2.339 0.126 -0.593 -0.593 ...
## $ LDA_01 : num 1.079 -0.415 -0.491 1.266 -0.512 ...
## $ LDA_02 : num -0.625 -0.589 -0.648 0.986 -0.665 ...
## $ LDA_03 : num -0.618 -0.588 -0.645 -0.66 -0.661 ...
## $ LDA_04 : num -0.671 -0.636 1.55 -0.71 2.253 ...
## $ global_subjectivity : num 0.671 -0.875 2.218 -0.116 0.601 ...
## $ global_sentiment_polarity : num -0.276 0.306 2.105 -0.192 1.668 ...
## $ global_rate_positive_words : num 0.346 0.202 0.99 0.104 2.008 ...
## $ global_rate_negative_words : num -0.2691 -0.0855 -0.6588 0.379 -0.4142 ...
## $ rate_positive_words : num 0.4578 0.2691 0.92 -0.0814 0.9362 ...
## $ rate_negative_words : num -0.366 -0.136 -0.929 0.291 -0.949 ...
## $ avg_positive_polarity : num 0.237 -0.64 1.358 0.307 0.548 ...
## $ min_positive_polarity : num 0.0639 -0.871 0.0639 0.5738 -0.871 ...
## $ max_positive_polarity : num -0.229 -0.229 0.982 0.175 0.982 ...
## $ avg_negative_polarity : num -0.708 1.102 -1.622 -0.863 0.308 ...
## $ min_negative_polarity : num -0.2689 1.3674 -0.9579 -0.2689 0.0756 ...
## $ max_negative_polarity : num -0.9699 0.0786 -0.2709 -0.6204 0.6029 ...
## $ title_subjectivity : num 0.671 -0.871 -0.871 -0.871 0.531 ...
## $ title_sentiment_polarity : num -0.975 -0.269 -0.269 -0.269 0.245 ...
## $ abs_title_subjectivity : num -1.811 0.838 0.838 0.838 -1.57 ...
## $ abs_title_sentiment_polarity : num 0.1389 -0.6896 -0.6896 -0.6896 -0.0871 ...
## $ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
## $ data_channel_is_other : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
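Note that shares has been left unscaled because it will later be converted into the class label. As a hedged sketch of that step (the actual threshold column is created when we build the models; the median cutoff here is only an illustrative assumption):
#Hypothetical sketch: binary popularity label using the median of shares as cutoff
popular<-as.factor(ifelse(datascale$shares>median(datascale$shares),"high","low"))
table(popular)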
Given the large size of the dataset, to speed up processing and to be able to run functions that would otherwise exhaust memory, we use the sample.split function to carry out the rest of the exercise with 35% of the data.
set.seed(277)
library(caTools)
datascale$division=sample.split(datascale$shares,SplitRatio=.35)
summary(datascale)
## url
## http://mashable.com/2013/01/07/amazon-instant-video-browser/ : 1
## http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ : 1
## http://mashable.com/2013/01/07/apple-40-billion-app-downloads/: 1
## http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/ : 1
## http://mashable.com/2013/01/07/att-u-verse-apps/ : 1
## http://mashable.com/2013/01/07/beewi-smart-toys/ : 1
## (Other) :39638
## timedelta n_tokens_title n_tokens_content n_unique_tokens
## Min. : 8.0 Min. :-3.9728 Min. :-1.1601 Min. : -0.15571
## 1st Qu.:164.0 1st Qu.:-0.6616 1st Qu.:-0.6379 1st Qu.: -0.02197
## Median :339.0 Median :-0.1886 Median :-0.2919 Median : -0.00255
## Mean :354.5 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.:542.0 3rd Qu.: 0.7574 3rd Qu.: 0.3598 3rd Qu.: 0.01718
## Max. :731.0 Max. : 5.9608 Max. :16.8273 Max. :198.95195
##
## num_hrefs num_self_hrefs num_imgs
## Min. :-0.9604 Min. :-0.85435 Min. :-0.54687
## 1st Qu.:-0.6075 1st Qu.:-0.59496 1st Qu.:-0.42652
## Median :-0.2545 Median :-0.07617 Median :-0.42652
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.2750 3rd Qu.: 0.18323 3rd Qu.:-0.06549
## Max. :25.8662 Max. :29.23534 Max. :14.85731
##
## num_videos average_token_length num_keywords
## Min. :-0.30426 Min. :-5.3863 Min. :-3.2600
## 1st Qu.:-0.30426 1st Qu.:-0.0827 1st Qu.:-0.6410
## Median :-0.30426 Median : 0.1372 Median :-0.1172
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.06083 3rd Qu.: 0.3631 3rd Qu.: 0.9304
## Max. :21.84842 Max. : 4.1370 Max. : 1.4542
##
## data_channel_is_lifestyle data_channel_is_entertainment
## 0:37545 0:32587
## 1: 2099 1: 7057
##
##
##
##
##
## data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## 0:33386 0:37321 0:32298
## 1: 6258 1: 2323 1: 7346
##
##
##
##
##
## data_channel_is_world kw_min_min kw_max_min
## 0:31217 Min. :-0.3893 Min. :-0.2991
## 1: 8427 1st Qu.:-0.3893 1st Qu.:-0.1838
## Median :-0.3893 Median :-0.1280
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.3175 3rd Qu.:-0.0399
## Max. : 5.0392 Max. :77.0469
##
## kw_avg_min kw_min_max kw_max_max kw_avg_max
## Min. :-0.5048 Min. :-0.23475 Min. :-3.5073 Min. :-1.9192
## 1st Qu.:-0.2748 1st Qu.:-0.23475 1st Qu.: 0.4241 1st Qu.:-0.6398
## Median :-0.1238 Median :-0.21061 Median : 0.4241 Median :-0.1089
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.0719 3rd Qu.:-0.09851 3rd Qu.: 0.4241 3rd Qu.: 0.5307
## Max. :68.4868 Max. :14.30841 Max. : 0.4241 Max. : 4.3228
##
## kw_min_avg kw_max_avg kw_avg_avg
## Min. :-0.98302 Min. :-0.92758 Min. :-2.3790
## 1st Qu.:-0.98214 1st Qu.:-0.34352 1st Qu.:-0.5716
## Median :-0.08221 Median :-0.21340 Median :-0.2016
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.82608 3rd Qu.: 0.05948 3rd Qu.: 0.3523
## Max. : 2.19427 Max. :47.99950 Max. :30.6731
##
## self_reference_min_shares self_reference_max_shares
## Min. :-0.20258 Min. :-0.25176
## 1st Qu.:-0.17021 1st Qu.:-0.22495
## Median :-0.14179 Median :-0.18352
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.07086 3rd Qu.:-0.05677
## Max. :42.52066 Max. :20.30270
##
## self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
## Min. :-0.26441 0:32983 0:32254
## 1st Qu.:-0.22388 1: 6661 1: 7390
## Median :-0.17354
## Mean : 0.00000
## 3rd Qu.:-0.04963
## Max. :34.56639
##
## weekday_is_wednesday weekday_is_thursday weekday_is_friday
## 0:32209 0:32377 0:33943
## 1: 7435 1: 7267 1: 5701
##
##
##
##
##
## weekday_is_saturday weekday_is_sunday LDA_00
## 0:37191 0:36907 Min. :-0.7020
## 1: 2453 1: 2737 1st Qu.:-0.6067
## Median :-0.5750
## Mean : 0.0000
## 3rd Qu.: 0.2143
## Max. : 2.8231
##
## LDA_01 LDA_02 LDA_03 LDA_04
## Min. :-0.64293 Min. :-0.7667 Min. :-0.7581 Min. :-0.8093
## 1st Qu.:-0.52908 1st Qu.:-0.6654 1st Qu.:-0.6613 1st Qu.:-0.7105
## Median :-0.49116 Median :-0.6249 Median :-0.6225 Median :-0.6684
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04358 3rd Qu.: 0.4179 3rd Qu.: 0.5149 3rd Qu.: 0.5739
## Max. : 3.57153 Max. : 2.4940 Max. : 2.3807 Max. : 2.3970
##
## global_subjectivity global_sentiment_polarity global_rate_positive_words
## Min. :-3.79973 Min. :-5.293054 Min. :-2.27354
## 1st Qu.:-0.40454 1st Qu.:-0.635010 1st Qu.:-0.64495
## Median : 0.08645 Median :-0.001986 Median :-0.03454
## Mean : 0.00000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.55674 3rd Qu.: 0.603761 3rd Qu.: 0.61132
## Max. : 4.77038 Max. : 6.278010 Max. : 6.64784
##
## global_rate_negative_words rate_positive_words rate_negative_words
## Min. :-1.5342 Min. :-3.5864 Min. :-1.84388
## 1st Qu.:-0.6462 1st Qu.:-0.4319 1st Qu.:-0.65799
## Median :-0.1177 Median : 0.1492 Median :-0.05081
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4735 3rd Qu.: 0.6196 3rd Qu.: 0.61914
## Max. :15.5451 Max. : 1.6711 Max. : 4.55997
##
## avg_positive_polarity min_positive_polarity max_positive_polarity
## Min. :-3.38452 Min. :-1.33837 Min. :-3.0540
## 1st Qu.:-0.45514 1st Qu.:-0.63725 1st Qu.:-0.6325
## Median : 0.04716 Median : 0.06386 Median : 0.1746
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.55100 3rd Qu.: 0.06386 3rd Qu.: 0.9818
## Max. : 6.18100 Max. :12.68394 Max. : 0.9818
##
## avg_negative_polarity min_negative_polarity max_negative_polarity
## Min. :-5.79739 Min. :-1.64683 Min. :-9.35799
## 1st Qu.:-0.53911 1st Qu.:-0.61337 1st Qu.:-0.18349
## Median : 0.04847 Median : 0.07559 Median : 0.07864
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.56856 3rd Qu.: 0.76456 3rd Qu.: 0.60290
## Max. : 2.03189 Max. : 1.79801 Max. : 1.12716
##
## title_subjectivity title_sentiment_polarity abs_title_subjectivity
## Min. :-0.8708 Min. :-4.0363 Min. :-1.8107
## 1st Qu.:-0.8708 1st Qu.:-0.2691 1st Qu.:-0.9279
## Median :-0.4082 Median :-0.2691 Median : 0.8377
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6712 3rd Qu.: 0.2960 3rd Qu.: 0.8377
## Max. : 2.2133 Max. : 3.4981 Max. : 0.8377
##
## abs_title_sentiment_polarity shares data_channel_is_other
## Min. :-0.6896 Min. : 1 0:37545
## 1st Qu.:-0.6896 1st Qu.: 946 1: 2099
## Median :-0.6896 Median : 1400
## Mean : 0.0000 Mean : 3395
## 3rd Qu.: 0.4151 3rd Qu.: 2800
## Max. : 3.7294 Max. :843300
##
## division
## Mode :logical
## FALSE:25499
## TRUE :14145
##
##
##
##
Let's keep the 14145 rows marked TRUE and compare their values with those of the original dataset.
newdatascale<-datascale[datascale$division==TRUE,]
summary(newdatascale)
## url
## http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ : 1
## http://mashable.com/2013/01/07/chuck-hagel-website/ : 1
## http://mashable.com/2013/01/07/cosmic-events-doomsday/ : 1
## http://mashable.com/2013/01/07/earth-size-planets-milky-way/: 1
## http://mashable.com/2013/01/07/felt-audio-pulse-speaker/ : 1
## http://mashable.com/2013/01/07/ford-glympse/ : 1
## (Other) :14139
## timedelta n_tokens_title n_tokens_content
## Min. : 8.0 Min. :-3.972849 Min. :-1.160064
## 1st Qu.:163.0 1st Qu.:-0.661648 1st Qu.:-0.637890
## Median :342.0 Median :-0.188620 Median :-0.294019
## Mean :354.5 Mean : 0.002766 Mean :-0.003818
## 3rd Qu.:545.0 3rd Qu.: 0.757438 3rd Qu.: 0.357637
## Max. :731.0 Max. : 4.068638 Max. :15.320251
##
## n_unique_tokens num_hrefs num_self_hrefs
## Min. :-0.155712 Min. :-0.960437 Min. :-0.854350
## 1st Qu.:-0.021780 1st Qu.:-0.607455 1st Qu.:-0.594956
## Median :-0.002379 Median :-0.254473 Median :-0.335562
## Mean :-0.004774 Mean : 0.006377 Mean : 0.000027
## 3rd Qu.: 0.017442 3rd Qu.: 0.275000 3rd Qu.: 0.183226
## Max. : 0.128322 Max. :25.866207 Max. :18.340797
##
## num_imgs num_videos average_token_length
## Min. :-0.546866 Min. :-0.304264 Min. :-5.386321
## 1st Qu.:-0.426520 1st Qu.:-0.304264 1st Qu.:-0.081199
## Median :-0.426520 Median :-0.304264 Median : 0.137250
## Mean : 0.001753 Mean :-0.002848 Mean : 0.000181
## 3rd Qu.:-0.065485 3rd Qu.:-0.060828 3rd Qu.: 0.363095
## Max. :12.811446 Max. :17.953439 Max. : 3.162214
##
## num_keywords data_channel_is_lifestyle
## Min. :-3.260001 0:13374
## 1st Qu.:-0.641007 1: 771
## Median :-0.117209
## Mean : 0.005252
## 3rd Qu.: 0.930389
## Max. : 1.454187
##
## data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
## 0:11627 0:11942 0:13347
## 1: 2518 1: 2203 1: 798
##
##
##
##
##
## data_channel_is_tech data_channel_is_world kw_min_min
## 0:11550 0:11104 Min. :-0.389280
## 1: 2595 1: 3041 1st Qu.:-0.389280
## Median :-0.389280
## Mean :-0.004804
## 3rd Qu.:-0.317475
## Max. : 3.847204
##
## kw_max_min kw_avg_min kw_min_max
## Min. :-0.29911 Min. :-0.50479 Min. :-0.234752
## 1st Qu.:-0.18454 1st Qu.:-0.27766 1st Qu.:-0.234752
## Median :-0.12855 Median :-0.12391 Median :-0.210609
## Mean : 0.01112 Mean : 0.01002 Mean : 0.005899
## 3rd Qu.:-0.03990 3rd Qu.: 0.07109 3rd Qu.:-0.098513
## Max. :77.04685 Max. :68.48678 Max. :14.308406
##
## kw_max_max kw_avg_max kw_min_avg
## Min. :-3.507303 Min. :-1.91915 Min. :-0.9830232
## 1st Qu.: 0.424126 1st Qu.:-0.64013 1st Qu.:-0.9821441
## Median : 0.424126 Median :-0.11348 Median :-0.0858464
## Mean : 0.000872 Mean :-0.00275 Mean :-0.0007414
## 3rd Qu.: 0.424126 3rd Qu.: 0.53249 3rd Qu.: 0.8282670
## Max. : 0.424126 Max. : 4.32279 Max. : 2.1917123
##
## kw_max_avg kw_avg_avg self_reference_min_shares
## Min. :-0.92758 Min. :-2.378984 Min. :-0.20258
## 1st Qu.:-0.34260 1st Qu.:-0.575329 1st Qu.:-0.17097
## Median :-0.21287 Median :-0.197807 Median :-0.14179
## Mean : 0.00423 Mean : 0.005212 Mean : 0.00193
## 3rd Qu.: 0.06113 3rd Qu.: 0.353306 3rd Qu.:-0.06580
## Max. :47.99950 Max. :30.673132 Max. :34.77444
##
## self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## Min. :-0.251763 Min. :-0.264409 0:11752
## 1st Qu.:-0.227389 1st Qu.:-0.224816 1: 2393
## Median :-0.183516 Median :-0.173543
## Mean : 0.008241 Mean : 0.006025
## 3rd Qu.:-0.056772 3rd Qu.:-0.049634
## Max. :20.302705 Max. :28.251163
##
## weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## 0:11468 0:11526 0:11596
## 1: 2677 1: 2619 1: 2549
##
##
##
##
##
## weekday_is_friday weekday_is_saturday weekday_is_sunday
## 0:12123 0:13276 0:13129
## 1: 2022 1: 869 1: 1016
##
##
##
##
##
## LDA_00 LDA_01 LDA_02
## Min. :-0.632822 Min. :-0.560156 Min. :-0.702245
## 1st Qu.:-0.606776 1st Qu.:-0.529085 1st Qu.:-0.673544
## Median :-0.575079 Median :-0.491170 Median :-0.624917
## Mean :-0.008645 Mean :-0.006069 Mean : 0.002803
## 3rd Qu.: 0.204995 3rd Qu.: 0.040190 3rd Qu.: 0.420783
## Max. : 2.796469 Max. : 3.544347 Max. : 2.494028
##
## LDA_03 LDA_04 global_subjectivity
## Min. :-0.69645 Min. :-0.746401 Min. :-3.799731
## 1st Qu.:-0.66126 1st Qu.:-0.710468 1st Qu.:-0.399456
## Median :-0.62254 Median :-0.669101 Median : 0.092228
## Mean : 0.01211 Mean :-0.002534 Mean : 0.007891
## 3rd Qu.: 0.56121 3rd Qu.: 0.573797 3rd Qu.: 0.564676
## Max. : 2.35851 Max. : 2.394678 Max. : 4.770378
##
## global_sentiment_polarity global_rate_positive_words
## Min. :-5.15335 Min. :-2.27354
## 1st Qu.:-0.64523 1st Qu.:-0.66369
## Median :-0.02027 Median :-0.04814
## Mean :-0.01105 Mean :-0.00753
## 3rd Qu.: 0.60796 3rd Qu.: 0.61188
## Max. : 5.21704 Max. : 5.55056
##
## global_rate_negative_words rate_positive_words rate_negative_words
## Min. :-1.534211 Min. :-3.58637 Min. :-1.84388
## 1st Qu.:-0.637561 1st Qu.:-0.43190 1st Qu.:-0.64316
## Median :-0.108981 Median : 0.12637 Median :-0.01421
## Mean : 0.007059 Mean :-0.01122 Mean : 0.01374
## 3rd Qu.: 0.480242 3rd Qu.: 0.61959 3rd Qu.: 0.61914
## Max. :11.379826 Max. : 1.67108 Max. : 4.55997
##
## avg_positive_polarity min_positive_polarity max_positive_polarity
## Min. :-3.384518 Min. :-1.338367 Min. :-3.053959
## 1st Qu.:-0.454255 1st Qu.:-0.637251 1st Qu.:-0.632512
## Median : 0.043393 Median : 0.063864 Median : 0.174637
## Mean : 0.001005 Mean : 0.003956 Mean :-0.008165
## 3rd Qu.: 0.561257 3rd Qu.: 0.063864 3rd Qu.: 0.981786
## Max. : 4.963569 Max. : 9.879480 Max. : 0.981786
##
## avg_negative_polarity min_negative_polarity max_negative_polarity
## Min. :-5.79739 Min. :-1.646826 Min. :-9.357993
## 1st Qu.:-0.54558 1st Qu.:-0.613375 1st Qu.:-0.183488
## Median : 0.04722 Median : 0.075593 Median : 0.078641
## Mean :-0.00382 Mean : 0.003193 Mean :-0.005856
## 3rd Qu.: 0.56390 3rd Qu.: 0.764560 3rd Qu.: 0.602899
## Max. : 2.03189 Max. : 1.798011 Max. : 1.127156
##
## title_subjectivity title_sentiment_polarity abs_title_subjectivity
## Min. :-0.870796 Min. :-4.036257 Min. :-1.81070
## 1st Qu.:-0.870796 1st Qu.:-0.269073 1st Qu.:-0.92788
## Median :-0.485287 Median :-0.269073 Median : 0.83774
## Mean :-0.001223 Mean :-0.002584 Mean : 0.00646
## 3rd Qu.: 0.671237 3rd Qu.: 0.248915 3rd Qu.: 0.83774
## Max. : 2.213270 Max. : 3.498112 Max. : 0.83774
##
## abs_title_sentiment_polarity shares data_channel_is_other
## Min. :-0.689649 Min. : 1 0:13374
## 1st Qu.:-0.689649 1st Qu.: 942 1: 771
## Median :-0.689649 Median : 1400
## Mean : 0.002646 Mean : 4404
## 3rd Qu.: 0.415107 3rd Qu.: 2900
## Max. : 3.729377 Max. :843300
##
## division
## Mode:logical
## TRUE:14145
##
##
##
##
##
The summary statistics of the reduced dataset are broadly similar to those of the original, so we will carry out the rest of the analysis with these 14145 rows.
datascale<-newdatascale
To build the predictive models we will dichotomize shares at a threshold of 1400: a higher value indicates a popular page and a lower value an unpopular one. We create a new factor column with the value 'NP' (not popular) for pages with fewer than 1400 shares and 'P' (popular) for pages with 1400 or more.
datascale$popularity<-ifelse(datascale$shares<1400,'NP','P')
datascale$popularity<- factor(datascale$popularity)
To perform the modeling we will split the dataset in two, training and test records, stratifying on the new popularity variable, which will be the target.
set.seed(277)
library(caTools)
datascale$division=sample.split(datascale$popularity,SplitRatio=0.7)
summary(datascale)
## url
## http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ : 1
## http://mashable.com/2013/01/07/chuck-hagel-website/ : 1
## http://mashable.com/2013/01/07/cosmic-events-doomsday/ : 1
## http://mashable.com/2013/01/07/earth-size-planets-milky-way/: 1
## http://mashable.com/2013/01/07/felt-audio-pulse-speaker/ : 1
## http://mashable.com/2013/01/07/ford-glympse/ : 1
## (Other) :14139
## timedelta n_tokens_title n_tokens_content
## Min. : 8.0 Min. :-3.972849 Min. :-1.160064
## 1st Qu.:163.0 1st Qu.:-0.661648 1st Qu.:-0.637890
## Median :342.0 Median :-0.188620 Median :-0.294019
## Mean :354.5 Mean : 0.002766 Mean :-0.003818
## 3rd Qu.:545.0 3rd Qu.: 0.757438 3rd Qu.: 0.357637
## Max. :731.0 Max. : 4.068638 Max. :15.320251
##
## n_unique_tokens num_hrefs num_self_hrefs
## Min. :-0.155712 Min. :-0.960437 Min. :-0.854350
## 1st Qu.:-0.021780 1st Qu.:-0.607455 1st Qu.:-0.594956
## Median :-0.002379 Median :-0.254473 Median :-0.335562
## Mean :-0.004774 Mean : 0.006377 Mean : 0.000027
## 3rd Qu.: 0.017442 3rd Qu.: 0.275000 3rd Qu.: 0.183226
## Max. : 0.128322 Max. :25.866207 Max. :18.340797
##
## num_imgs num_videos average_token_length
## Min. :-0.546866 Min. :-0.304264 Min. :-5.386321
## 1st Qu.:-0.426520 1st Qu.:-0.304264 1st Qu.:-0.081199
## Median :-0.426520 Median :-0.304264 Median : 0.137250
## Mean : 0.001753 Mean :-0.002848 Mean : 0.000181
## 3rd Qu.:-0.065485 3rd Qu.:-0.060828 3rd Qu.: 0.363095
## Max. :12.811446 Max. :17.953439 Max. : 3.162214
##
## num_keywords data_channel_is_lifestyle
## Min. :-3.260001 0:13374
## 1st Qu.:-0.641007 1: 771
## Median :-0.117209
## Mean : 0.005252
## 3rd Qu.: 0.930389
## Max. : 1.454187
##
## data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
## 0:11627 0:11942 0:13347
## 1: 2518 1: 2203 1: 798
##
##
##
##
##
## data_channel_is_tech data_channel_is_world kw_min_min
## 0:11550 0:11104 Min. :-0.389280
## 1: 2595 1: 3041 1st Qu.:-0.389280
## Median :-0.389280
## Mean :-0.004804
## 3rd Qu.:-0.317475
## Max. : 3.847204
##
## kw_max_min kw_avg_min kw_min_max
## Min. :-0.29911 Min. :-0.50479 Min. :-0.234752
## 1st Qu.:-0.18454 1st Qu.:-0.27766 1st Qu.:-0.234752
## Median :-0.12855 Median :-0.12391 Median :-0.210609
## Mean : 0.01112 Mean : 0.01002 Mean : 0.005899
## 3rd Qu.:-0.03990 3rd Qu.: 0.07109 3rd Qu.:-0.098513
## Max. :77.04685 Max. :68.48678 Max. :14.308406
##
## kw_max_max kw_avg_max kw_min_avg
## Min. :-3.507303 Min. :-1.91915 Min. :-0.9830232
## 1st Qu.: 0.424126 1st Qu.:-0.64013 1st Qu.:-0.9821441
## Median : 0.424126 Median :-0.11348 Median :-0.0858464
## Mean : 0.000872 Mean :-0.00275 Mean :-0.0007414
## 3rd Qu.: 0.424126 3rd Qu.: 0.53249 3rd Qu.: 0.8282670
## Max. : 0.424126 Max. : 4.32279 Max. : 2.1917123
##
## kw_max_avg kw_avg_avg self_reference_min_shares
## Min. :-0.92758 Min. :-2.378984 Min. :-0.20258
## 1st Qu.:-0.34260 1st Qu.:-0.575329 1st Qu.:-0.17097
## Median :-0.21287 Median :-0.197807 Median :-0.14179
## Mean : 0.00423 Mean : 0.005212 Mean : 0.00193
## 3rd Qu.: 0.06113 3rd Qu.: 0.353306 3rd Qu.:-0.06580
## Max. :47.99950 Max. :30.673132 Max. :34.77444
##
## self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## Min. :-0.251763 Min. :-0.264409 0:11752
## 1st Qu.:-0.227389 1st Qu.:-0.224816 1: 2393
## Median :-0.183516 Median :-0.173543
## Mean : 0.008241 Mean : 0.006025
## 3rd Qu.:-0.056772 3rd Qu.:-0.049634
## Max. :20.302705 Max. :28.251163
##
## weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## 0:11468 0:11526 0:11596
## 1: 2677 1: 2619 1: 2549
##
##
##
##
##
## weekday_is_friday weekday_is_saturday weekday_is_sunday
## 0:12123 0:13276 0:13129
## 1: 2022 1: 869 1: 1016
##
##
##
##
##
## LDA_00 LDA_01 LDA_02
## Min. :-0.632822 Min. :-0.560156 Min. :-0.702245
## 1st Qu.:-0.606776 1st Qu.:-0.529085 1st Qu.:-0.673544
## Median :-0.575079 Median :-0.491170 Median :-0.624917
## Mean :-0.008645 Mean :-0.006069 Mean : 0.002803
## 3rd Qu.: 0.204995 3rd Qu.: 0.040190 3rd Qu.: 0.420783
## Max. : 2.796469 Max. : 3.544347 Max. : 2.494028
##
## LDA_03 LDA_04 global_subjectivity
## Min. :-0.69645 Min. :-0.746401 Min. :-3.799731
## 1st Qu.:-0.66126 1st Qu.:-0.710468 1st Qu.:-0.399456
## Median :-0.62254 Median :-0.669101 Median : 0.092228
## Mean : 0.01211 Mean :-0.002534 Mean : 0.007891
## 3rd Qu.: 0.56121 3rd Qu.: 0.573797 3rd Qu.: 0.564676
## Max. : 2.35851 Max. : 2.394678 Max. : 4.770378
##
## global_sentiment_polarity global_rate_positive_words
## Min. :-5.15335 Min. :-2.27354
## 1st Qu.:-0.64523 1st Qu.:-0.66369
## Median :-0.02027 Median :-0.04814
## Mean :-0.01105 Mean :-0.00753
## 3rd Qu.: 0.60796 3rd Qu.: 0.61188
## Max. : 5.21704 Max. : 5.55056
##
## global_rate_negative_words rate_positive_words rate_negative_words
## Min. :-1.534211 Min. :-3.58637 Min. :-1.84388
## 1st Qu.:-0.637561 1st Qu.:-0.43190 1st Qu.:-0.64316
## Median :-0.108981 Median : 0.12637 Median :-0.01421
## Mean : 0.007059 Mean :-0.01122 Mean : 0.01374
## 3rd Qu.: 0.480242 3rd Qu.: 0.61959 3rd Qu.: 0.61914
## Max. :11.379826 Max. : 1.67108 Max. : 4.55997
##
## avg_positive_polarity min_positive_polarity max_positive_polarity
## Min. :-3.384518 Min. :-1.338367 Min. :-3.053959
## 1st Qu.:-0.454255 1st Qu.:-0.637251 1st Qu.:-0.632512
## Median : 0.043393 Median : 0.063864 Median : 0.174637
## Mean : 0.001005 Mean : 0.003956 Mean :-0.008165
## 3rd Qu.: 0.561257 3rd Qu.: 0.063864 3rd Qu.: 0.981786
## Max. : 4.963569 Max. : 9.879480 Max. : 0.981786
##
## avg_negative_polarity min_negative_polarity max_negative_polarity
## Min. :-5.79739 Min. :-1.646826 Min. :-9.357993
## 1st Qu.:-0.54558 1st Qu.:-0.613375 1st Qu.:-0.183488
## Median : 0.04722 Median : 0.075593 Median : 0.078641
## Mean :-0.00382 Mean : 0.003193 Mean :-0.005856
## 3rd Qu.: 0.56390 3rd Qu.: 0.764560 3rd Qu.: 0.602899
## Max. : 2.03189 Max. : 1.798011 Max. : 1.127156
##
## title_subjectivity title_sentiment_polarity abs_title_subjectivity
## Min. :-0.870796 Min. :-4.036257 Min. :-1.81070
## 1st Qu.:-0.870796 1st Qu.:-0.269073 1st Qu.:-0.92788
## Median :-0.485287 Median :-0.269073 Median : 0.83774
## Mean :-0.001223 Mean :-0.002584 Mean : 0.00646
## 3rd Qu.: 0.671237 3rd Qu.: 0.248915 3rd Qu.: 0.83774
## Max. : 2.213270 Max. : 3.498112 Max. : 0.83774
##
## abs_title_sentiment_polarity shares data_channel_is_other
## Min. :-0.689649 Min. : 1 0:13374
## 1st Qu.:-0.689649 1st Qu.: 942 1: 771
## Median :-0.689649 Median : 1400
## Mean : 0.002646 Mean : 4404
## 3rd Qu.: 0.415107 3rd Qu.: 2900
## Max. : 3.729377 Max. :843300
##
## division popularity
## Mode :logical NP:6566
## FALSE:4244 P :7579
## TRUE :9901
##
##
##
##
Let's check that the split is adequate, i.e. that both subsets resemble each other and the original dataset.
"Percentage of total P and NP:"
## [1] "Percentage of total P and NP:"
tab<-table(datascale$popularity)
prop.table(tab)
##
## NP P
## 0.4641923 0.5358077
"Percentage of P and NP in training set:"
## [1] "Percentage of P and NP in training set:"
tab<-table(datascale$popularity[datascale$division==TRUE])
prop.table(tab)
##
## NP P
## 0.4641955 0.5358045
"Percentage of P and NP in test set:"
## [1] "Percentage of P and NP in test set:"
tab<-table(datascale$popularity[datascale$division==FALSE])
prop.table(tab)
##
## NP P
## 0.4641847 0.5358153
We observe that the class percentages are very consistent across the three sets, so we can use this split to build our models.
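As an additional check (a minimal sketch, not part of the original analysis), a two-sample proportion test can confirm that the share of popular pages does not differ significantly between the two splits:
#Two-sample test on the proportion of 'P' pages in each split
n_train<-sum(datascale$division==TRUE)
n_test<-sum(datascale$division==FALSE)
p_train<-sum(datascale$popularity=='P' & datascale$division==TRUE)
p_test<-sum(datascale$popularity=='P' & datascale$division==FALSE)
prop.test(c(p_train,p_test),c(n_train,n_test))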
Let's separate the training and test datasets and then remove the division variable from both. We also drop url and timedelta, which provide no predictive information, and shares, which as the original target variable would interfere with the result.
datascale$url<-NULL
datascale$timedelta<-NULL
train<-datascale[datascale$division == TRUE,]
test <- datascale[datascale$division == FALSE,]
train$division<-NULL
test$division<-NULL
train$shares<-NULL
test$shares<-NULL
Decision Tree
We start with a decision tree, using the rpart (CART) algorithm with cost-complexity pruning.
library('caret')
set.seed(277)
model_prune <- train(
  x = train[, -ncol(train)], y = train$popularity, method = "rpart",
  trControl = trainControl("cv", number = 10, classProbs = FALSE),
  tuneLength = 10
)
#Show the graph
plot(model_prune)
We show the best complexity parameter achieved by the model.
model_prune$bestTune
## cp
## 1 0.002828547
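As a side note, the pruned tree can be refit directly with rpart at the cp value selected by caret; a minimal sketch (single_tree is a hypothetical name, not part of the original code) that should reproduce essentially the same final model:
library(rpart)
#Refit a single CART tree at the selected complexity parameter
single_tree<-rpart(popularity~.,data=train,method="class",
                   control=rpart.control(cp=model_prune$bestTune$cp))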
We show the graph and the tree rules.
library(rpart.plot)
par(xpd = NA)
prp(model_prune$finalModel)
model_prune$finalModel
## n= 9901
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 9901 4596 P (0.4641955 0.5358045)
## 2) kw_avg_avg< -0.1696958 5104 2199 NP (0.5691614 0.4308386)
## 4) data_channel_is_tech=0 3924 1524 NP (0.6116208 0.3883792)
## 8) data_channel_is_socmed=0 3725 1381 NP (0.6292617 0.3707383)
## 16) kw_max_max>=-0.4576834 3015 1021 NP (0.6613599 0.3386401) *
## 17) kw_max_max< -0.4576834 710 350 P (0.4929577 0.5070423)
## 34) weekday_is_saturday=0 692 342 NP (0.5057803 0.4942197)
## 68) num_hrefs< 0.8485965 636 300 NP (0.5283019 0.4716981) *
## 69) num_hrefs>=0.8485965 56 14 P (0.2500000 0.7500000) *
## 35) weekday_is_saturday=1 18 0 P (0.0000000 1.0000000) *
## 9) data_channel_is_socmed=1 199 56 P (0.2814070 0.7185930) *
## 5) data_channel_is_tech=1 1180 505 P (0.4279661 0.5720339)
## 10) kw_max_avg< -0.3832603 270 121 NP (0.5518519 0.4481481)
## 20) kw_min_avg< 0.04304663 195 74 NP (0.6205128 0.3794872) *
## 21) kw_min_avg>=0.04304663 75 28 P (0.3733333 0.6266667) *
## 11) kw_max_avg>=-0.3832603 910 356 P (0.3912088 0.6087912) *
## 3) kw_avg_avg>=-0.1696958 4797 1691 P (0.3525120 0.6474880)
## 6) data_channel_is_entertainment=1 948 473 NP (0.5010549 0.4989451)
## 12) self_reference_min_shares< -0.1189926 616 269 NP (0.5633117 0.4366883)
## 24) weekday_is_sunday=0 555 227 NP (0.5909910 0.4090090) *
## 25) weekday_is_sunday=1 61 19 P (0.3114754 0.6885246) *
## 13) self_reference_min_shares>=-0.1189926 332 128 P (0.3855422 0.6144578) *
## 7) data_channel_is_entertainment=0 3849 1216 P (0.3159262 0.6840738)
## 14) self_reference_avg_sharess< -0.1697771 1489 582 P (0.3908664 0.6091336)
## 28) LDA_02>=0.8527931 155 63 NP (0.5935484 0.4064516)
## 56) num_imgs< -0.1256575 127 41 NP (0.6771654 0.3228346) *
## 57) num_imgs>=-0.1256575 28 6 P (0.2142857 0.7857143) *
## 29) LDA_02< 0.8527931 1334 490 P (0.3673163 0.6326837) *
## 15) self_reference_avg_sharess>=-0.1697771 2360 634 P (0.2686441 0.7313559) *
Looking at the tree graph and the generated rules, we see that the first split is on kw_avg_avg (average keyword, average shares). At the next level the tree checks the channel where the news was published: one branch tests whether it is technology, the other whether it is entertainment, and from there it keeps discriminating on further variables.
For example, one of the paths would be: if kw_avg_avg >= -0.17, the news was not published in an entertainment channel, and the average number of shares of the articles it references is >= -0.17, the page is classified as popular (node 15, with probability 0.73).
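If reading paths off the printed tree is cumbersome, recent versions of rpart.plot include rpart.rules(), which prints one readable rule per terminal node; a minimal sketch:
#One readable rule per leaf, with the fraction of cases it covers
rpart.rules(model_prune$finalModel,cover=TRUE)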
Let's run the prediction with the test dataset using this model.
predicted.classes <- predict(model_prune, test[, -ncol(test)])
print(sprintf("The accuracy of the tree is: %.4f %%",100*sum(predicted.classes == test$popularity) / length(predicted.classes)))
## [1] "The accuracy of the tree is:: 64.1376 %"
We see that the accuracy of the tree on the test set is 64.14%.
We obtain the confusion matrix:
table(test$popularity,predicted.classes)
## predicted.classes
## NP P
## NP 1210 760
## P 762 1512
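The same table, together with sensitivity, specificity and other per-class metrics, can be obtained with caret's confusionMatrix(); a minimal sketch taking 'P' as the positive class:
confusionMatrix(predicted.classes,test$popularity,positive="P")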
Unsupervised Models
We are going to use a partitioning algorithm in the style of k-means: the pam function (k-medoids, Partitioning Around Medoids), which allows us to use different distance metrics.
We start by running the algorithm with a Euclidean distance.
library(cluster)
pameucl<-pam(train[, -ncol(train)], 2, metric="euclidean")
We will evaluate the algorithm on the same training dataset: clustering algorithms are not designed for predictive tasks, so they have no predict function.
conf_matrix<-table(pameucl$clustering,train$popularity)
conf_matrix
##
## NP P
## 1 3145 3442
## 2 1451 1863
porcentaje_correcto<-100 * sum(diag(conf_matrix)) / sum(conf_matrix)
porcentaje_correcto
## [1] 50.58075
We see that it stays at 50.58% effectiveness, well below the tree.
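One caveat: cluster numbers are arbitrary, so when reading this table as a classifier we should take the better of the two possible cluster-to-class assignments rather than only the diagonal. A minimal sketch:
#Accuracy under both possible cluster-to-class assignments
acc1<-sum(diag(conf_matrix))/sum(conf_matrix)
acc2<-sum(diag(conf_matrix[,2:1]))/sum(conf_matrix)
100*max(acc1,acc2)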
Let's evaluate the quality of the partitions.
d<-daisy(train[, -ncol(train)])
library(cluster)
library(factoextra)
sil<-silhouette(pameucl$clustering,d)
fviz_silhouette(sil,label=FALSE,print.summary=FALSE)
In this case many silhouette widths fall below zero, so this algorithm does not seem to produce good (cohesive, well-separated) partitions, at least when forced to create 2 of them.
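To check whether some other number of clusters is more natural for these data, the average silhouette width can be computed for several values of k, reusing the dissimilarity matrix d obtained above; a sketch (pam is expensive on ~10000 rows, so in practice one might subsample first):
#Average silhouette width for k = 2..6 medoid partitions
avg_sil<-sapply(2:6,function(k){
  pam_k<-pam(train[,-ncol(train)],k,metric="euclidean")
  mean(silhouette(pam_k$clustering,d)[,"sil_width"])
})
avg_sil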
We repeat the process with the Manhattan distance.
library(cluster)
pammanh<-pam(train[, -ncol(train)], 2, metric="manhattan")
table(pammanh$clustering)
##
## 1 2
## 5587 4314
conf_matrix<-table(pammanh$clustering,train$popularity)
conf_matrix
##
## NP P
## 1 2779 2808
## 2 1817 2497
porcentaje_correcto<-100 * sum(diag(conf_matrix)) / sum(conf_matrix)
porcentaje_correcto
## [1] 53.28755
We obtain a somewhat higher predictive power, reaching 53.29%. We study the silhouettes:
library(cluster)
library(factoextra)
sil<-silhouette(pammanh$clustering,d)
fviz_silhouette(sil,label=FALSE,print.summary=FALSE)
As with the previous distance, we do not obtain cohesive groups when forcing 2 partitions with this algorithm.
Supervised Models
We move on to the supervised models, starting with a logistic model on the data.
We use the tic() and toc() functions from the pracma library to measure how long the logistic model takes to run; this will let us compare performance with and without SVD.
library(pracma)
tic("Logit:")
logit<-glm(formula = popularity ~ ., family = binomial(link = "logit"),
data = train[,1:ncol(train)])
toc()
## elapsed time is 0.350000 seconds
summary(logit)
##
## Call:
## glm(formula = popularity ~ ., family = binomial(link = "logit"),
## data = train[, 1:ncol(train)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7641 -1.0558 0.5594 1.0134 2.1526
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.879545 0.122491 7.181 6.95e-13 ***
## n_tokens_title 0.033466 0.022558 1.484 0.137928
## n_tokens_content 0.031245 0.037785 0.827 0.408289
## n_unique_tokens -2.342211 1.240585 -1.888 0.059027 .
## num_hrefs 0.118144 0.029309 4.031 5.55e-05 ***
## num_self_hrefs -0.086719 0.026527 -3.269 0.001079 **
## num_imgs 0.031888 0.027089 1.177 0.239140
## num_videos -0.033925 0.025017 -1.356 0.175080
## average_token_length -0.226168 0.072230 -3.131 0.001741 **
## num_keywords 0.067116 0.026710 2.513 0.011979 *
## data_channel_is_lifestyle1 -0.242671 0.149726 -1.621 0.105068
## data_channel_is_entertainment1 -0.317804 0.093865 -3.386 0.000710 ***
## data_channel_is_bus1 -0.278576 0.147472 -1.889 0.058890 .
## data_channel_is_socmed1 0.762917 0.146636 5.203 1.96e-07 ***
## data_channel_is_tech1 0.477328 0.138928 3.436 0.000591 ***
## data_channel_is_world1 -0.147545 0.139054 -1.061 0.288661
## kw_min_min 0.183267 0.044253 4.141 3.45e-05 ***
## kw_max_min 0.236125 0.076744 3.077 0.002092 **
## kw_avg_min -0.285507 0.073322 -3.894 9.87e-05 ***
## kw_min_max -0.072657 0.024510 -2.964 0.003033 **
## kw_max_max -0.027829 0.047715 -0.583 0.559733
## kw_avg_max -0.051563 0.042765 -1.206 0.227923
## kw_min_avg -0.090672 0.033649 -2.695 0.007047 **
## kw_max_avg -0.586559 0.067550 -8.683 < 2e-16 ***
## kw_avg_avg 0.979314 0.079906 12.256 < 2e-16 ***
## self_reference_min_shares 0.251175 0.080886 3.105 0.001901 **
## self_reference_max_shares 0.052557 0.073067 0.719 0.471952
## self_reference_avg_sharess -0.021476 0.109253 -0.197 0.844166
## weekday_is_monday1 -0.770710 0.100580 -7.663 1.82e-14 ***
## weekday_is_tuesday1 -0.861900 0.099141 -8.694 < 2e-16 ***
## weekday_is_wednesday1 -0.908351 0.099443 -9.134 < 2e-16 ***
## weekday_is_thursday1 -0.839257 0.099766 -8.412 < 2e-16 ***
## weekday_is_friday1 -0.667387 0.102985 -6.480 9.15e-11 ***
## weekday_is_saturday1 0.132031 0.128103 1.031 0.302699
## weekday_is_sunday1 NA NA NA NA
## LDA_00 0.199159 0.045982 4.331 1.48e-05 ***
## LDA_01 -0.106461 0.041766 -2.549 0.010803 *
## LDA_02 -0.134101 0.047958 -2.796 0.005170 **
## LDA_03 -0.125147 0.052985 -2.362 0.018181 *
## LDA_04 NA NA NA NA
## global_subjectivity 0.131618 0.036899 3.567 0.000361 ***
## global_sentiment_polarity 0.038739 0.059724 0.649 0.516574
## global_rate_positive_words -0.053122 0.047153 -1.127 0.259916
## global_rate_negative_words -0.001018 0.056486 -0.018 0.985616
## rate_positive_words 0.252735 0.097770 2.585 0.009738 **
## rate_negative_words 0.205334 0.090707 2.264 0.023593 *
## avg_positive_polarity 0.046422 0.052809 0.879 0.379376
## min_positive_polarity -0.102391 0.030658 -3.340 0.000838 ***
## max_positive_polarity -0.092068 0.039731 -2.317 0.020489 *
## avg_negative_polarity -0.043859 0.059994 -0.731 0.464743
## min_negative_polarity 0.019014 0.050049 0.380 0.704012
## max_negative_polarity 0.004562 0.037635 0.121 0.903513
## title_subjectivity -0.008470 0.033457 -0.253 0.800134
## title_sentiment_polarity 0.061587 0.024984 2.465 0.013697 *
## abs_title_subjectivity 0.010106 0.025644 0.394 0.693519
## abs_title_sentiment_polarity 0.011383 0.034051 0.334 0.738155
## data_channel_is_other1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13675 on 9900 degrees of freedom
## Residual deviance: 12263 on 9847 degrees of freedom
## AIC: 12371
##
## Number of Fisher Scoring iterations: 5
Let's look at the resulting coefficients and run the prediction with the test data. Since the logistic model returns probabilities, we will classify pages with a predicted probability > 0.5 as popular and the rest as unpopular.
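A quick sanity check, not in the original code: glm() with a factor response models the probability of the second factor level, so the probabilities returned by predict() refer to 'P' only because the levels are ordered as NP, P.
#Confirm which level the fitted probabilities refer to
levels(train$popularity)
## [1] "NP" "P"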
predict.logit<-predict(logit,newdata=test[,1:ncol(test)],type='response')
glm_prediction = ifelse(predict.logit>0.5,'P','NP')
conf_matrix<-table(glm_prediction,test$popularity)
conf_matrix
##
## glm_prediction NP P
## NP 1205 714
## P 765 1560
porcentaje_correcto<-100 * sum(diag(conf_matrix)) / sum(conf_matrix)
porcentaje_correcto
## [1] 65.1508
For the moment this is the best model, reaching an accuracy of 65.15%.
We will repeat the SVD decomposition, this time to obtain a reduced dataset, in order to compare the effectiveness of the models.
#We create a dataset with the continuous variables
datanum <- (datascale[,c(1:9,16:27,35:55)])
#We apply SVD and display the data.
datanum.svd<-svd(datanum)
"d"
## [1] "d"
datanum.svd$d
## [1] 2.596782e+02 2.530943e+02 2.307989e+02 2.068307e+02 1.842217e+02
## [6] 1.791178e+02 1.760621e+02 1.575913e+02 1.526338e+02 1.398370e+02
## [11] 1.340989e+02 1.315987e+02 1.271044e+02 1.237501e+02 1.165461e+02
## [16] 1.108530e+02 1.072220e+02 1.005792e+02 9.917303e+01 9.685914e+01
## [21] 9.324016e+01 9.171836e+01 8.681378e+01 8.591147e+01 8.069453e+01
## [26] 7.924628e+01 7.614582e+01 7.035524e+01 6.634008e+01 6.043407e+01
## [31] 5.831404e+01 5.140172e+01 4.266649e+01 3.806221e+01 3.699571e+01
## [36] 3.345848e+01 2.914867e+01 2.416142e+01 2.207835e+01 1.916849e+01
## [41] 2.161844e+00 4.760031e-03
"u"
## [1] "u"
datanum.svd$u[1:5,]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8.474179e-03 0.008079700 -0.009693002 0.009105234 -0.004559022
## [2,] 1.038957e-02 0.006289088 0.002466005 0.016794292 -0.003163306
## [3,] -6.297939e-05 0.010933419 0.004489008 0.015856386 -0.001354212
## [4,] 5.849472e-03 0.007731413 0.005256751 0.014962748 0.006906951
## [5,] -9.503979e-03 0.017174825 -0.008828758 0.007019208 0.003609519
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.007226643 -0.001884535 0.0042915846 -0.017627939 0.0003603647
## [2,] 0.008988401 0.001297720 0.0009061188 -0.007334306 0.0053942348
## [3,] 0.003140518 0.002411026 0.0071538148 -0.004324472 -0.0040684787
## [4,] 0.007174044 0.008848170 -0.0053930446 -0.009860199 0.0068677606
## [5,] -0.018368086 0.017539816 0.0154743975 0.001335152 0.0041225923
## [,11] [,12] [,13] [,14] [,15]
## [1,] 0.0039004145 0.004936470 0.002215434 -0.011899893 -0.0055226990
## [2,] 0.0001152835 0.011917052 -0.003835803 0.007224432 0.0016276479
## [3,] 0.0050747695 0.007461346 -0.003871376 0.004200978 0.0100594438
## [4,] 0.0047457141 0.008824711 -0.002744474 0.010127089 0.0001342847
## [5,] -0.0172380132 -0.012027651 -0.011031114 -0.003976230 -0.0120551978
## [,16] [,17] [,18] [,19] [,20]
## [1,] -0.000015124 0.0012275351 -0.0007308813 0.0088210900 0.008710317
## [2,] -0.003993509 -0.0009475676 0.0045604629 0.0037604752 0.010817725
## [3,] 0.006271079 0.0092262992 -0.0083708614 -0.0001654949 -0.003966483
## [4,] 0.005199082 -0.0164739853 -0.0020013972 0.0029761297 0.002852818
## [5,] -0.009684398 -0.0139561749 -0.0134461229 0.0172416368 0.025383085
## [,21] [,22] [,23] [,24] [,25]
## [1,] -0.0008102396 -0.006748125 0.0029034781 0.003591092 0.011470259
## [2,] -0.0044325270 -0.001530149 -0.0010068089 0.002261690 0.004816694
## [3,] -0.0024950028 0.003310255 -0.0083926890 -0.003564886 -0.002293547
## [4,] -0.0085145953 -0.005032175 -0.0005624262 0.003002982 0.006233457
## [5,] -0.0094081141 0.021090796 -0.0034651220 0.007640199 -0.004159618
## [,26] [,27] [,28] [,29] [,30]
## [1,] -1.402691e-05 0.003146952 0.003066231 -0.005250015 0.02873764
## [2,] 3.316266e-03 -0.008761224 0.002712635 -0.011310352 0.01974892
## [3,] 1.117306e-02 0.009038712 0.006292482 -0.001414496 0.01282847
## [4,] 8.430057e-03 -0.012473031 0.006056042 0.006398241 0.01840958
## [5,] -3.216749e-03 -0.010951087 -0.015777807 0.004219008 0.01213607
## [,31] [,32] [,33] [,34] [,35]
## [1,] -3.606729e-05 -0.0008944496 0.05576603 -0.0008911391 0.012876513
## [2,] 2.657058e-04 0.0075175455 0.05796864 0.0033542159 0.006767369
## [3,] -5.539388e-04 -0.0039071984 0.06088392 0.0028363995 0.011805009
## [4,] 1.651099e-03 0.0074245570 0.05944609 0.0101394976 0.006532310
## [5,] 8.440541e-03 0.0036476529 0.06057276 -0.0025637345 0.012897826
## [,36] [,37] [,38] [,39] [,40]
## [1,] -0.001051347 -0.005638723 -0.005238395 -0.002156088 -0.004314705
## [2,] -0.006127420 -0.008578038 -0.003908962 -0.001695845 -0.007080223
## [3,] -0.001481918 -0.005834722 -0.012182896 -0.003789634 -0.001670958
## [4,] -0.012681872 -0.003496313 -0.006511559 -0.002951859 0.001895029
## [5,] 0.008860331 -0.002030300 -0.013504874 -0.001456773 -0.007647298
## [,41] [,42]
## [1,] -0.0009605776 0.008749506
## [2,] -0.0022078429 0.009019184
## [3,] -0.0032563586 0.009384242
## [4,] 0.0122139336 0.004732981
## [5,] 0.0019817326 0.008232328
"v"
## [1] "v"
datanum.svd$v[1:5,]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.025227076 -0.032697512 5.136301e-02 -0.0581870583 0.081661759
## [2,] -0.138246486 0.092560975 1.165227e-01 0.0829187899 0.027496268
## [3,] -0.006356406 0.004057403 7.890069e-05 0.0006009658 -0.003169818
## [4,] -0.173302173 0.044862694 1.021083e-01 -0.0138737246 0.026709426
## [5,] -0.118805464 0.069250555 2.470946e-02 0.0064507175 -0.020787458
## [,6] [,7] [,8] [,9] [,10]
## [1,] -0.01886185 0.018799858 -0.149344207 0.084890438 -0.124581632
## [2,] -0.42889078 0.128305465 0.001539783 -0.187479176 -0.007797895
## [3,] 0.01346192 -0.007586421 -0.003646484 0.003714064 0.003578140
## [4,] -0.34198290 0.108884647 0.191988647 0.028107349 -0.027239147
## [5,] -0.30398631 0.123840342 0.148683175 -0.011160306 -0.033660528
## [,11] [,12] [,13] [,14] [,15]
## [1,] 0.2561483274 -0.0523054447 -0.2796331388 0.343038139 -0.678467550
## [2,] 0.0112902831 0.0422143250 -0.0630508738 0.049522196 0.028779106
## [3,] -0.0004902339 -0.0001171761 -0.0002540196 -0.000906006 -0.003562594
## [4,] -0.1325059597 0.2305854118 -0.0082068666 -0.125667409 -0.127666930
## [5,] -0.1768113779 -0.1600904356 -0.0930413840 -0.035432880 -0.341247854
## [,16] [,17] [,18] [,19] [,20]
## [1,] 0.341870625 0.072729442 0.154295673 -0.11981656 -0.043375251
## [2,] 0.168200283 0.166076622 -0.122681222 0.12216379 -0.125656200
## [3,] -0.008240845 -0.007719391 0.006261413 -0.00419713 0.007483101
## [4,] -0.209080808 -0.129269838 -0.007312209 -0.16870180 -0.214482025
## [5,] -0.506048124 -0.085408014 0.057905037 -0.02503298 0.127046988
## [,21] [,22] [,23] [,24] [,25]
## [1,] -0.084778131 -0.167787318 0.057509332 0.0218303100 0.030011908
## [2,] 0.128567994 0.114270007 0.085701032 0.0035703583 -0.281922621
## [3,] 0.002061368 -0.008066023 0.004630831 0.0005913157 -0.001108898
## [4,] 0.297354415 -0.401714131 -0.271499888 -0.1336835314 0.281053005
## [5,] -0.451290382 0.174485154 -0.164054917 0.1356755053 -0.157185467
## [,26] [,27] [,28] [,29] [,30]
## [1,] 0.040601152 -0.094620468 -0.007696249 -0.014418779 0.039604710
## [2,] 0.006356591 0.087381363 0.642047181 0.198554876 -0.013846631
## [3,] -0.002439620 0.005348887 -0.003015765 -0.008360102 -0.001697863
## [4,] 0.210135011 -0.277577107 0.005364675 -0.057518124 -0.039651802
## [5,] -0.252450718 0.075422582 -0.065022969 0.044364710 0.011956800
## [,31] [,32] [,33] [,34] [,35]
## [1,] 0.002929870 0.0011377400 -0.0006836195 -0.0079184605 -0.0134180687
## [2,] -0.112010940 0.0488705607 0.0118542824 0.0673414947 -0.0331112205
## [3,] 0.001287894 0.0037107881 0.0003464149 0.0001536687 0.0002638498
## [4,] 0.010824662 -0.0272014611 0.0123458061 0.0337237657 0.0120722701
## [5,] -0.008062561 0.0001922159 -0.0200877547 -0.0079844699 -0.0131771982
## [,36] [,37] [,38] [,39] [,40]
## [1,] 2.566376e-03 0.002428335 0.012217254 0.0008316503 -0.01546642
## [2,] 6.729817e-02 -0.043327542 0.017621452 0.0025713634 -0.03283003
## [3,] 1.040524e-05 0.003684227 -0.002507583 -0.0010774284 0.02784410
## [4,] -1.710710e-02 0.010112608 -0.025594205 0.0001165700 0.04204715
## [5,] -6.229666e-03 0.000290615 0.003209535 0.0348461130 -0.01666532
## [,41] [,42]
## [1,] -0.0002961890 -2.321305e-08
## [2,] -0.0154374966 9.294279e-06
## [3,] -0.9991890687 6.154922e-04
## [4,] -0.0006005106 2.731273e-08
## [5,] 0.0004486355 -2.361950e-07
The following graph shows the proportion of variance explained by each value of d; as expected, the values appear in decreasing order. From roughly the 30th singular value onwards the explained variability is very close to 0, so those values contribute little. This means we can drop them without fear of losing much information when fitting a model.
plot(prop.table(datanum.svd$d^2),ylab="Percent variability explained")
We can see that the first 20 singular values already explain 87% of the variability, so we can dispense with the remaining ones without losing much information.
sum(prop.table(datanum.svd$d^2)[1:20])
## [1] 0.8712191
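The same cut-off can be made explicit with the cumulative proportion of variability; for example, a minimal sketch of how many singular values would be needed to retain 90%:
#Cumulative variability explained by the singular values
var_explained<-cumsum(prop.table(datanum.svd$d^2))
which(var_explained>=0.90)[1]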
We obtain the reduced dataset, with 20 variables instead of 42.
datareduced=datanum.svd$u[,1:20] %*% diag(datanum.svd$d[1:20])
datareduced[1:5,]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 2.20055960 2.044926 -2.2371338 1.883242 -0.8398708 1.2944206
## [2,] 2.69794575 1.591732 0.5691511 3.473575 -0.5827496 1.6099829
## [3,] -0.01635437 2.767186 1.0360579 3.279588 -0.2494753 0.5625227
## [4,] 1.51898040 1.956776 1.2132522 3.094756 1.2724103 1.2849992
## [5,] -2.46797621 4.346850 -2.0376673 1.451788 0.6649517 -3.2900518
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] -0.3317951 0.6763164 -2.6906192 0.05039231 0.52304129 0.6496331
## [2,] 0.2284792 0.1427964 -1.1194630 0.75431348 0.01545939 1.5682687
## [3,] 0.4244903 1.1273790 -0.6600606 -0.56892375 0.68052100 0.9819035
## [4,] 1.5578272 -0.8498969 -1.5049995 0.96036687 0.63639504 1.1613206
## [5,] 3.0880964 2.4386305 0.2037894 0.57649083 -2.31159859 -1.5828234
## [,13] [,14] [,15] [,16] [,17] [,18]
## [1,] 0.2815915 -1.4726128 -0.64364905 -0.001676542 0.1316188 -0.07351144
## [2,] -0.4875476 0.8940241 0.18969602 -0.442692509 -0.1016001 0.45868757
## [3,] -0.4920691 0.5198714 1.17238899 0.695168025 0.9892626 -0.84193429
## [4,] -0.3488348 1.2532282 0.01565036 0.576333988 -1.7663742 -0.20129886
## [5,] -1.4021036 -0.4920588 -1.40498634 -1.073544860 -1.4964094 -1.35239987
## [,19] [,20]
## [1,] 0.87481423 0.8436738
## [2,] 0.37293772 1.0477955
## [3,] -0.01641263 -0.3841901
## [4,] 0.29515180 0.2763215
## [5,] 1.70990537 2.4585839
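As a hedged sketch of what this reduction means: since X = U D V', the scores U D equal X V, i.e. each article is projected onto the first 20 right singular vectors, and the same projection could be applied to new rows.
#The reduced scores equal the data projected onto the right singular vectors
projected<-as.matrix(datanum) %*% datanum.svd$v[,1:20]
max(abs(projected-datareduced)) #should be numerically ~0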
Next, we add back the target and split variables, together with the binary variables that we had removed.
datareduced<-as.data.frame(datareduced)
summary(datareduced)
## V1 V2 V3
## Min. :-54.08354 Min. :-93.43009 Min. :-25.53550
## 1st Qu.: -1.11052 1st Qu.: -0.42645 1st Qu.: -1.25335
## Median : -0.09406 Median : 0.26744 Median : -0.00099
## Mean : -0.00428 Mean : -0.02186 Mean : 0.01753
## 3rd Qu.: 0.88382 3rd Qu.: 0.86922 3rd Qu.: 1.25108
## Max. : 9.12531 Max. : 4.34685 Max. : 10.79651
## V4 V5 V6
## Min. :-16.59956 Min. :-30.835767 Min. :-11.158572
## 1st Qu.: -0.95477 1st Qu.: -0.942126 1st Qu.: -0.786552
## Median : 0.08090 Median : -0.165464 Median : -0.019286
## Mean : 0.00674 Mean : -0.005707 Mean : 0.006638
## 3rd Qu.: 1.05642 3rd Qu.: 0.966642 3rd Qu.: 0.824490
## Max. : 35.11486 Max. : 4.495073 Max. : 11.946039
## V7 V8 V9
## Min. :-8.752088 Min. :-10.860565 Min. :-7.46464
## 1st Qu.:-0.836347 1st Qu.: -0.936993 1st Qu.:-0.71057
## Median :-0.115613 Median : -0.197204 Median : 0.13105
## Mean : 0.005024 Mean : 0.004595 Mean : 0.01229
## 3rd Qu.: 0.704222 3rd Qu.: 0.838462 3rd Qu.: 0.84251
## Max. :28.052381 Max. : 6.125843 Max. : 5.78144
## V10 V11 V12
## Min. :-9.577826 Min. :-6.713680 Min. :-3.474411
## 1st Qu.:-0.623855 1st Qu.:-0.685392 1st Qu.:-0.814898
## Median : 0.061604 Median :-0.011178 Median : 0.041616
## Mean : 0.004169 Mean :-0.006783 Mean : 0.004939
## 3rd Qu.: 0.684528 3rd Qu.: 0.679685 3rd Qu.: 0.811483
## Max. : 6.231708 Max. : 6.908861 Max. : 5.119551
## V13 V14 V15
## Min. :-6.014161 Min. :-5.001655 Min. :-6.243757
## 1st Qu.:-0.495247 1st Qu.:-0.688332 1st Qu.:-0.639673
## Median : 0.146845 Median : 0.073099 Median : 0.030930
## Mean : 0.001888 Mean : 0.001817 Mean :-0.003354
## 3rd Qu.: 0.687477 3rd Qu.: 0.681013 3rd Qu.: 0.671317
## Max. : 5.524902 Max. : 7.026784 Max. : 3.304626
## V16 V17 V18
## Min. :-13.280350 Min. :-4.643394 Min. :-8.033320
## 1st Qu.: -0.475241 1st Qu.:-0.500606 1st Qu.:-0.447538
## Median : 0.048955 Median : 0.059384 Median : 0.016957
## Mean : -0.000829 Mean :-0.001992 Mean :-0.001881
## 3rd Qu.: 0.552918 3rd Qu.: 0.530947 3rd Qu.: 0.528577
## Max. : 7.709455 Max. : 4.902544 Max. : 7.259807
## V19 V20
## Min. :-7.575508 Min. :-6.50378
## 1st Qu.:-0.514284 1st Qu.:-0.45468
## Median : 0.021577 Median : 0.01933
## Mean :-0.005453 Mean : 0.00600
## 3rd Qu.: 0.516713 3rd Qu.: 0.47236
## Max. : 5.017422 Max. : 4.97175
datareduced$popularity<-datascale$popularity
datareduced$division<-datascale$division
datareduced$weekday_is_monday<-datascale$weekday_is_monday
datareduced$weekday_is_tuesday<-datascale$weekday_is_tuesday
datareduced$weekday_is_wednesday<-datascale$weekday_is_wednesday
datareduced$weekday_is_thursday<-datascale$weekday_is_thursday
datareduced$weekday_is_friday<-datascale$weekday_is_friday
datareduced$weekday_is_saturday<-datascale$weekday_is_saturday
datareduced$weekday_is_sunday<-datascale$weekday_is_sunday
datareduced$data_channel_is_lifestyle<-datascale$data_channel_is_lifestyle
datareduced$data_channel_is_entertainment<-datascale$data_channel_is_entertainment
datareduced$data_channel_is_bus<-datascale$data_channel_is_bus
datareduced$data_channel_is_socmed<-datascale$data_channel_is_socmed
datareduced$data_channel_is_tech<-datascale$data_channel_is_tech
datareduced$data_channel_is_world<-datascale$data_channel_is_world
datareduced$data_channel_is_other<-datascale$data_channel_is_other
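A more compact way to do the same step, sketched under the assumption that all the binary indicators are stored as factors in datascale (as the summaries suggest), is to copy every factor column in one assignment, which avoids copy-paste slips:
#Copy every factor column from datascale at once
factor_cols<-names(datascale)[sapply(datascale,is.factor)]
datareduced[factor_cols]<-datascale[factor_cols]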
summary(datareduced)
## V1 V2 V3
## Min. :-54.08354 Min. :-93.43009 Min. :-25.53550
## 1st Qu.: -1.11052 1st Qu.: -0.42645 1st Qu.: -1.25335
## Median : -0.09406 Median : 0.26744 Median : -0.00099
## Mean : -0.00428 Mean : -0.02186 Mean : 0.01753
## 3rd Qu.: 0.88382 3rd Qu.: 0.86922 3rd Qu.: 1.25108
## Max. : 9.12531 Max. : 4.34685 Max. : 10.79651
## V4 V5 V6
## Min. :-16.59956 Min. :-30.835767 Min. :-11.158572
## 1st Qu.: -0.95477 1st Qu.: -0.942126 1st Qu.: -0.786552
## Median : 0.08090 Median : -0.165464 Median : -0.019286
## Mean : 0.00674 Mean : -0.005707 Mean : 0.006638
## 3rd Qu.: 1.05642 3rd Qu.: 0.966642 3rd Qu.: 0.824490
## Max. : 35.11486 Max. : 4.495073 Max. : 11.946039
## V7 V8 V9
## Min. :-8.752088 Min. :-10.860565 Min. :-7.46464
## 1st Qu.:-0.836347 1st Qu.: -0.936993 1st Qu.:-0.71057
## Median :-0.115613 Median : -0.197204 Median : 0.13105
## Mean : 0.005024 Mean : 0.004595 Mean : 0.01229
## 3rd Qu.: 0.704222 3rd Qu.: 0.838462 3rd Qu.: 0.84251
## Max. :28.052381 Max. : 6.125843 Max. : 5.78144
## V10 V11 V12
## Min. :-9.577826 Min. :-6.713680 Min. :-3.474411
## 1st Qu.:-0.623855 1st Qu.:-0.685392 1st Qu.:-0.814898
## Median : 0.061604 Median :-0.011178 Median : 0.041616
## Mean : 0.004169 Mean :-0.006783 Mean : 0.004939
## 3rd Qu.: 0.684528 3rd Qu.: 0.679685 3rd Qu.: 0.811483
## Max. : 6.231708 Max. : 6.908861 Max. : 5.119551
## V13 V14 V15
## Min. :-6.014161 Min. :-5.001655 Min. :-6.243757
## 1st Qu.:-0.495247 1st Qu.:-0.688332 1st Qu.:-0.639673
## Median : 0.146845 Median : 0.073099 Median : 0.030930
## Mean : 0.001888 Mean : 0.001817 Mean :-0.003354
## 3rd Qu.: 0.687477 3rd Qu.: 0.681013 3rd Qu.: 0.671317
## Max. : 5.524902 Max. : 7.026784 Max. : 3.304626
## V16 V17 V18
## Min. :-13.280350 Min. :-4.643394 Min. :-8.033320
## 1st Qu.: -0.475241 1st Qu.:-0.500606 1st Qu.:-0.447538
## Median : 0.048955 Median : 0.059384 Median : 0.016957
## Mean : -0.000829 Mean :-0.001992 Mean :-0.001881
## 3rd Qu.: 0.552918 3rd Qu.: 0.530947 3rd Qu.: 0.528577
## Max. : 7.709455 Max. : 4.902544 Max. : 7.259807
## V19 V20 popularity division
## Min. :-7.575508 Min. :-6.50378 NP:6566 Mode :logical
## 1st Qu.:-0.514284 1st Qu.:-0.45468 P :7579 FALSE:4244
## Median : 0.021577 Median : 0.01933 TRUE :9901
## Mean :-0.005453 Mean : 0.00600
## 3rd Qu.: 0.516713 3rd Qu.: 0.47236
## Max. : 5.017422 Max. : 4.97175
##  weekday_is_monday weekday_is_tuesday weekday_is_wednesday
##  0:11752           0:11468            0:11526
##  1: 2393           1: 2677            1: 2619
##
##
##
##
##  weekday_is_thursday weekday_is_friday weekday_is_saturday
##  0:11596             0:12123           0:13276
##  1: 2549             1: 2022           1:  869
##
##
##
##
##  weekday_is_sunday data_channel_is_lifestyle data_channel_is_entertainment
##  0:13129           0:13374                   0:11627
##  1: 1016           1:  771                   1: 2518
##
##
##
##
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  0:11942             0:13347                0:11550
##  1: 2203             1:  798                1: 2595
##
##
##
##
##  data_channel_is_world data_channel_is_other
##  0:11104               0:13374
##  1: 3041               1:  771
We separate the training and testing sets and remove the division column from the reduced dataset.
traindr <- datareduced[datareduced$division==TRUE,]
testdr<- datareduced[datareduced$division==FALSE,]
traindr$division<-NULL
testdr$division<-NULL
We run the logistic model, again measuring the time it takes.
tic("Logit start:")
logitsvd<-glm(formula = popularity ~ ., family = binomial(link = "logit"),
data = traindr)
toc(echo=TRUE)
## elapsed time is 0.150000 seconds
summary(logitsvd)
##
## Call:
## glm(formula = popularity ~ ., family = binomial(link = "logit"),
## data = traindr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2854 -1.0930 0.6587 1.0366 1.7874
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.51244 0.04817 10.639 < 2e-16 ***
## V1 -0.10284 0.01145 -8.978 < 2e-16 ***
## V2 -0.04157 0.01133 -3.670 0.000242 ***
## V3 -0.02350 0.01206 -1.949 0.051273 .
## V4 -0.12063 0.01501 -8.035 9.39e-16 ***
## V5 -0.04888 0.02018 -2.422 0.015433 *
## V6 -0.02929 0.01539 -1.903 0.057029 .
## V7 0.10828 0.02059 5.260 1.44e-07 ***
## V8 0.10901 0.02020 5.398 6.75e-08 ***
## V9 -0.07211 0.02237 -3.224 0.001266 **
## V10 -0.06537 0.01894 -3.452 0.000556 ***
## V11 -0.05343 0.02089 -2.557 0.010545 *
## V12 -0.06214 0.02613 -2.378 0.017417 *
## V13 0.13936 0.02406 5.792 6.95e-09 ***
## V14 -0.16460 0.02761 -5.963 2.48e-09 ***
## V15 -0.05259 0.02503 -2.101 0.035663 *
## V16 0.19763 0.02428 8.141 3.92e-16 ***
## V17 0.01803 0.02433 0.741 0.458451
## V18 0.16194 0.02572 6.295 3.07e-10 ***
## V19 -0.11853 0.02643 -4.485 7.29e-06 ***
## V20 -0.16759 0.02699 -6.208 5.36e-10 ***
## weekday_is_monday1 -0.07794 0.05673 -1.374 0.169501
## weekday_is_tuesday1 NA NA NA NA
## weekday_is_wednesday1 NA NA NA NA
## weekday_is_thursday1 NA NA NA NA
## weekday_is_friday1 NA NA NA NA
## weekday_is_saturday1 NA NA NA NA
## weekday_is_sunday1 NA NA NA NA
## data_channel_is_lifestyle1 -0.41973 0.10334 -4.062 4.87e-05 ***
## data_channel_is_entertainment1 -0.66880 0.07741 -8.639 < 2e-16 ***
## data_channel_is_bus1 -0.65253 0.10965 -5.951 2.66e-09 ***
## data_channel_is_socmed1 0.49320 0.11696 4.217 2.48e-05 ***
## data_channel_is_world1 -0.54174 0.09919 -5.462 4.72e-08 ***
## data_channel_is_other1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13675 on 9900 degrees of freedom
## Residual deviance: 12648 on 9874 degrees of freedom
## AIC: 12702
##
## Number of Fisher Scoring iterations: 4
We run the prediction in the same way as before.
predict.logitsvd<-predict(logitsvd,newdata=testdr,type='response')
glm_prediction = ifelse(predict.logitsvd>0.5,'P','NP')
conf_matrix<-table(glm_prediction,testdr$popularity)
conf_matrix
##
## glm_prediction NP P
## NP 1149 700
## P 821 1574
porcentaje_correcto<-100 * sum(diag(conf_matrix)) / sum(conf_matrix)
porcentaje_correcto
## [1] 64.16117
We see that the accuracy of the model with SVD applied drops slightly, staying at 64.16%. So, in this case, dimensionality reduction does not improve the predictive capacity of the logistic model; however, it yields similar results with many fewer variables and, as the elapsed times of the two runs show, in considerably less time.
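A complementary comparison, sketched with the pROC package (an extra dependency not used elsewhere in this post), would be the AUC of both logistic models, which does not depend on the 0.5 cut-off:
library(pROC)
#AUC of the full-variable and the SVD-reduced logistic models
auc(roc(test$popularity,predict.logit))
auc(roc(testdr$popularity,predict.logitsvd))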
Discussion
Different models have been run to address the problem of predicting the popularity of a published news item, using data from the mashable.com portal. The generated decision tree exceeds 60% accuracy with a fairly compact tree, giving special importance to the keyword-related fields and to the channel where the news was published.
With regard to the unsupervised models, in this case they show little predictive ability. This may be because we forced them to generate 2 classes, and the algorithm used (PAM, a k-medoids method) does not seem to produce good partitions under these conditions and with these data. We also observed that the distance chosen (Euclidean or Manhattan) yields somewhat different results.
As for the supervised models, the logistic regression with all fields achieved a somewhat better result than the decision tree. Applying the SVD decomposition to the fields brought no improvement in predictive results, but it reached similar results faster and with considerably less data, which can be a great advantage when dealing with large amounts of data. In the Pr(>|z|) column of the regression summaries we also observe that with the complete data many variables are not statistically significant (Pr(>|z|) > 0.05), whereas in the model with SVD applied almost all of the retained components are significant.