Predicting NO2 levels in Madrid
While looking for data to develop my data science skills, I came up with the idea of searching open data portals to see what real-world datasets are actually like. For this purpose, I chose open data from the Madrid Open Data Portal (https://datos.madrid.es/portal/site/egob). I will try to predict NO2 concentration using weather and traffic data.
This is not meant to be a definitive prediction algorithm, as I am only using a small dataset restricted to specific geographic and temporal points, but just a small exercise with real data to learn how certain machine learning algorithms work. After preparing the data, I realized I was dealing with an imbalanced dataset, so I also try to handle that.
For this exercise, I obtained, cleaned, and merged three different datasets to create a model. As you will read later, the pollution dataset is heavily imbalanced because there are few cases with high NO2 concentrations. I will try to address this using class weighting and resampling techniques.
The datasets I have chosen contain traffic, weather, and pollution data. I am trying to predict the pollution data (NO2) from traffic and weather data. The data ranges from January 2019 to January 2021, and since the weather data starts in 2019, earlier data is not available.
To keep the datasets manageable, I filtered them down to a specific area of Madrid (Plaza Elíptica), selecting one pollution sensor, one weather sensor, and 5 nearby traffic sensors.
Data acquisition and cleaning¶
Let's start by importing the required libraries.
import pandas as pd
import glob
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
Mounting the Google Drive folder where the data was uploaded.
from google.colab import drive
drive.mount('/content/drive')
Retrieving the traffic data, filtering on the fly for the sensors we are interested in. Since the files are large, I read them in chunks to avoid memory errors. This was executed several times in order to process all the files uploaded to Google Drive. Then I save the dataset to a pkl file and load it from there to speed up subsequent executions.
#Import traffic data from January 2019. All files in the folder
#path = r'/content/drive/MyDrive/Traffic/'
#all_csv = glob.glob(path + "/*.csv")
#traffic_df=pd.DataFrame()
#id_list = [5067,5085,5090,10580,10250] #Selected traffic sensors.
#for filename in all_csv:
# iter_csv = pd.read_csv(filename, index_col=None, header=0,sep=';', iterator=True, chunksize=10000)
# aux_df = pd.concat((chunk for chunk in iter_csv), axis=0, ignore_index=True)
# traffic_df=pd.concat([traffic_df,aux_df[aux_df['id'].isin(id_list)]],ignore_index=True)
#save dataframe to load it faster in other executions.
#traffic_df.to_pickle(r'/content/drive/MyDrive/Traffic/traffic_df.pkl')
traffic_df=pd.read_pickle(r'/content/drive/MyDrive/Traffic/traffic_df.pkl')
#traffic_df=traffic_df.loc[traffic_df['id'].isin([5067,5085,5090])]
Let's check what we've got.
traffic_df.describe()
As we can see, the column vmed (average speed) is always 0 because the selected sensors do not measure traffic speed, so we will delete this column. The column periodo_integracion can also be deleted, because it is always 15 minutes, the measurement period of each row. The tipo_elem column also always has the same value (the sensor type), so it is of no interest to us. I will also translate the remaining column names into English for better understanding.
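As a quick check, we can confirm that these columns really are constant by counting their distinct values (a column with a single distinct value carries no information):
#Number of distinct values per column
traffic_df[['vmed','periodo_integracion','tipo_elem']].nunique()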
traffic_df.drop(['tipo_elem','vmed','periodo_integracion'], inplace=True,axis=1)
traffic_df.rename(columns={'fecha':'DATE','intensidad':'INTENSITY','ocupacion':'OCCUPANCY','carga':'LOAD'},inplace=True)
According to the data definition found on the same website, an E or an S in the error column means that the reading is not valid, so we should check for those values.
traffic_df[(traffic_df['error']=='E') | (traffic_df['error']=='S')]
The data seems to be in order, so let's proceed with preparing our dataset. Since we have 5 traffic sensors, we want the hourly average of all their readings, so we extract the hour and day from the date and will average over those columns. Then we get rid of the DATE and error columns.
traffic_df["HOUR"] = pd.DatetimeIndex(traffic_df.DATE).hour
traffic_df["DAY"] = pd.DatetimeIndex(traffic_df.DATE).date
traffic_df=traffic_df.drop(['DATE','error'],axis=1)
traffic_df.head()
We proceed to group the data and average all the sensors.
traffic_grouped_df=traffic_df.groupby(['HOUR','DAY']).mean()
traffic_grouped_df=traffic_grouped_df.drop(['id'],axis=1)
traffic_grouped_df.head()
We need to reset the index to ungroup the data once we have the mean.
traffic_grouped_df=traffic_grouped_df.reset_index()
traffic_grouped_df.head()
All right, we have the traffic data set where we want it. Let's move on to the pollution data set. Let's import the data.
#Import pollution data from January 2019. All files in the folder
#path = r'/content/drive/MyDrive/Pollution/'
#all_csv = glob.glob(path + "/*.csv")
#li = []
#for filename in all_csv:
# df = pd.read_csv(filename, index_col=None, header=0,sep=';')
# li.append(df)
#pollution_df = pd.concat(li, axis=0, ignore_index=True)
#pollution_df.to_pickle(r'/content/drive/MyDrive/Traffic/pollution_df.pkl')
pollution_df = pd.read_pickle(r'/content/drive/MyDrive/Traffic/pollution_df.pkl')
pollution_df.iloc[:,4:16].head()
This dataset is structured differently: the hourly data is spread across columns (some columns are hidden above for clarity). Let's start by filtering for the sensor we are interested in.
pollution_df=pollution_df[pollution_df['PUNTO_MUESTREO'].str.contains('28079056')]
We also want to filter for the magnitude we are interested in (NO2, which has code 8).
pollution_df=pollution_df.loc[pollution_df['MAGNITUD'] == 8]
#Get the Vxx columns
v_cols = [col for col in pollution_df.columns if col.startswith('V')]
print(v_cols)
According to the data specification, columns Vxx contain the validity of the measured values. If a column contains an N, the reading is invalid. Let's check how many we have.
#Let's count the V and N values
pollution_df[v_cols].stack().value_counts()
We have some invalid readings. To eliminate them, I will replace each invalid reading with the previous day's value at the same hour. To do this, we define a function and apply it.
#Create a function to set the previous day value
def setPrevious(vcol,df):
  #Get the colname with the hour
  hcolname=vcol.name.replace('V','H')
  #Set the previous values (previous row = previous day, same hour column)
  df[hcolname]=np.where(df[vcol.name] == 'N',df[hcolname].shift(1), df[hcolname])
#Apply the function to set the values. Use _ (throwaway variable) to suppress output.
_ =pollution_df.apply(lambda x: setPrevious(x,pollution_df) if x.name.startswith('V') else x)
Now we can drop the V columns.
pollution_df=pollution_df.drop(v_cols,axis=1)
We need to have the NO$_{2}$ readings in one column to be able to merge with the traffic dataset. Let's get the Hxx columns, and melt the data.
#Get the Hxx columns
h_cols = [col for col in pollution_df.columns if col.startswith('H')]
print(h_cols)
pollution_df=pd.melt(pollution_df, id_vars=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','MAGNITUD'], value_vars=h_cols,var_name='Hour', value_name='NO2')
Good. We need a DATE column in the same format as in the traffic dataset. The hour format will also need to match (the traffic data uses 0 to 23, the pollution data uses H01 to H24).
pollution_df['DATE'] = pd.to_datetime(pollution_df['ANO'].astype(str)+'/'+pollution_df['MES'].astype(str)+'/'+pollution_df['DIA'].astype(str),dayfirst=True)
pollution_df.head()
Now we need to convert the format of the hour column in our traffic data to match the format in the pollution data set.
traffic_grouped_df['HOUR']=traffic_grouped_df['HOUR'].apply(lambda x: 'H'+str(x+1).zfill(2))
traffic_grouped_df.head()
Let's drop some unnecessary columns and change the column names in the pollution dataset.
pollution_df=pollution_df.drop(['ANO','MES','DIA'],axis=1)
pollution_df.rename(columns={'DATE':'DAY','Hour':'HOUR'},inplace=True)
traffic_grouped_df['DAY']=pd.to_datetime(traffic_grouped_df['DAY'])
pollution_df.head()
Now we can merge the data! Let's also drop the unnecessary columns.
df=pd.merge(traffic_grouped_df, pollution_df, how="left", on=['DAY','HOUR'])
df=df.drop(['PROVINCIA','MUNICIPIO','MAGNITUD'],axis=1)
df.describe()
All right, we have got the traffic and pollution data together, we are almost done. Now let's read in the weather data. This data set is similar to the pollution data, so we need to repeat the same data cleaning and transformation.
#Import weather data
path = r'/content/drive/MyDrive/Meteo/'
all_csv = glob.glob(path + "/*.csv")
li = []
for filename in all_csv:
  auxdf = pd.read_csv(filename, index_col=None, header=0,sep=';')
  li.append(auxdf)
meteo_df = pd.concat(li, axis=0, ignore_index=True)
meteo_df.iloc[:,4:12].head()
Let's get the sensor we are interested in.
meteo_df=meteo_df.loc[meteo_df['PUNTO_MUESTREO'].str.contains('28079056')]
Let's fix the invalid values and melt the data.
#Apply the function to set the previous values. Use _ (throwaway variable) to suppress output.
_ =meteo_df.apply(lambda x: setPrevious(x,meteo_df) if x.name.startswith('V') else x)
#Get the Vxx columns
v_cols = [col for col in meteo_df.columns if col.startswith('V')]
meteo_df=meteo_df.drop(v_cols,axis=1)
#Get hcols
h_cols = [col for col in meteo_df.columns if col.startswith('H')]
#Melt the data
meteo_df=pd.melt(meteo_df, id_vars=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','MAGNITUD'], value_vars=h_cols,var_name='Hour', value_name='VALUE')
meteo_df.head()
Since we want to use multiple measures in this case, we need an extra step. Let's pivot the data to convert the measures into columns.
meteo_df=meteo_df.pivot(index=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','Hour'],columns='MAGNITUD',values='VALUE')
meteo_df.reset_index(inplace=True)
meteo_df.head()
We rename the columns so that they have proper names instead of magnitude codes, as given in the data specification.
meteo_df.columns=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','Hour','WIND_SPEED','WIND_DIR','TEMPERATURE','REL_HUMIDITY','BAR_PRESSURE','PRECIPITATION']
Now we create a date column in DateTime format.
meteo_df['DATE'] = pd.to_datetime(meteo_df['ANO'].astype(str)+'/'+meteo_df['MES'].astype(str)+'/'+meteo_df['DIA'].astype(str),dayfirst=True)
meteo_df.iloc[:,5:].head()
Drop unnecessary columns and change the key column names to prepare the dataset for merging.
meteo_df=meteo_df.drop(['PROVINCIA','MUNICIPIO','ANO','MES','DIA'],axis=1)
meteo_df.rename(columns={'DATE':'DAY','Hour':'HOUR'},inplace=True)
meteo_df.head()
Great. Let's merge!
df=pd.merge(df, meteo_df, how="left", on=['DAY','HOUR'])
df.describe()
Finally, let's check for NA values.
df.isnull().any()
We have some null values in the WIND_DIR (wind direction) column. Let's check how many there are.
df[df['WIND_DIR'].isnull()]
We have some rows with null values. Since they represent only a small percentage of the dataset, we simply drop them.
df.dropna(inplace=True)
df.describe()
Right. To conclude this section, I will categorize the NO$_{2}$ data so that the exercise is clearer and we can build a classification model.
The hourly NO$_{2}$ level that triggers a pollution alarm in Madrid is 180 µg/m³, so we will use this threshold to classify each row as Low or High pollution.
df['NO2_CAT']=pd.cut(df['NO2'], bins=[0, 181,330], include_lowest=True,right=False,labels=['Low','High'])
df.head()
df[['NO2_CAT']].value_counts()
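To see the imbalance as a proportion rather than raw counts, we can also normalize the counts:
#Share of each category; High is only a small fraction of the data
df['NO2_CAT'].value_counts(normalize=True)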
A little bit of data exploration.¶
Let's check the data in both categories (High and Low).
df[df['NO2_CAT']=='Low'].describe()
df[df['NO2_CAT']=='High'].describe()
The traffic-related variables (intensity, occupancy, and load) differ clearly between the two categories: their means, standard deviations, and quantiles are far apart. Wind speed also takes different values in the two categories, suggesting that these variables may be important in determining NO$_{2}$ levels.
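The same comparison can be made more compactly by grouping on the category (just another view of the statistics above):
#Mean of the main variables per NO2 category
df.groupby('NO2_CAT')[['INTENSITY','OCCUPANCY','LOAD','WIND_SPEED']].mean()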
Let's check now the correlation among the variables.
correlation=df.corr()
fig, ax = plt.subplots(figsize=(10,10))
sn.heatmap(correlation,annot=True,linewidths=.5, ax=ax)
plt.show()
It seems that the variable most strongly correlated with NO$_{2}$ is wind speed, which makes sense given its role in dispersing pollutants. The traffic variables also seem to have an important effect on NO$_{2}$.
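To rank the variables by their correlation with NO$_{2}$ directly (the same information as in the heatmap, just sorted):
#Correlation of each variable with NO2, sorted by absolute value
correlation['NO2'].drop('NO2').abs().sort_values(ascending=False)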
Let's look at some scatter plots. As we can see below, the low points are not separated from the high points; only the INTENSITY-OCCUPANCY plots show some separation, but the two classes are still mixed, with no clear boundary between them.
#scatter plots
data=df[['WIND_SPEED','INTENSITY','OCCUPANCY','NO2_CAT']]
sn.pairplot(data,hue='NO2_CAT')
Predictive analysis.¶
In this section, we will use a classification method (the SGD classifier) to try to classify and predict NO$_{2}$ levels from the traffic and weather data. As you can see from the value counts below, this is an imbalanced dataset. Let's take a look at how the algorithm performs. The ranges of some parameters were obtained from earlier hyperparameter optimization runs, which are not shown here.
df['NO2_CAT'].value_counts()
First, we split the data into training and test sets. Obviously, we exclude the NO$_{2}$ values from the independent variables. We also select the variables with the greatest impact on NO$_{2}$.
# Import train_test_split function
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight
x=df[['INTENSITY','OCCUPANCY','WIND_SPEED']]
y=df[['NO2_CAT']]
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30,random_state=12,stratify=y)
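Since we stratified the split on the target (stratify=y), both sets keep roughly the same class proportions, which we can verify:
#Class proportions in the training and test sets; they should be almost identical
print(y_train['NO2_CAT'].value_counts(normalize=True))
print(y_test['NO2_CAT'].value_counts(normalize=True))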
Let's fit the model.
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
pipe=Pipeline([('ss',StandardScaler()),('sgd',SGDClassifier(random_state=42))])
params={'sgd__loss':['hinge','log','modified_huber','squared_hinge','perceptron','squared_loss','huber','epsilon_insensitive','squared_epsilon_insensitive'],
'sgd__epsilon':[1e-3,1e-2,0.1,1],
'sgd__fit_intercept':[True,False],
'sgd__alpha':[0.0001,0.001,0.1,1],
'sgd__max_iter':[100,1000,10000],
'sgd__tol':[1e-3,1e-2,0.1,1],
'sgd__early_stopping':[True,False],
'sgd__validation_fraction':[0.1,0.3,0.5,0.7],
'sgd__n_iter_no_change':[1,3,5,10],
'sgd__average':[None,1,3,5,7,10]
}
rscv = RandomizedSearchCV(pipe, param_distributions=params,
n_iter=500,scoring='f1_macro',verbose=0,random_state=42)
rscv.fit(X_train, np.array(y_train).ravel())
rscv.best_score_
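We can also inspect which hyperparameter combination the search selected:
#Best hyperparameter combination found by the randomized search
rscv.best_params_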
from sklearn.metrics import classification_report
y_pred_rscv = rscv.predict(X_test)
print(classification_report(y_test, y_pred_rscv))
The results are really disappointing: low precision and recall scores for the minority (High) class. This is because we have a very imbalanced dataset, so the instances with high pollution are largely ignored. Let's try to fix that. We can use the class_weight parameter to make the algorithm take the imbalanced nature of the dataset into account.
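For illustration, this is roughly what class_weight='balanced' does: it weights each class inversely to its frequency, following the formula n_samples / (n_classes * class_count). A minimal sketch using the training labels prepared above:
from sklearn.utils.class_weight import compute_class_weight
#Weights that 'balanced' assigns to each class: the rare High class gets a much larger weight
classes = np.array(['Low','High'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=np.array(y_train).ravel())
dict(zip(classes, weights))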
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
pipe=Pipeline([('ss',StandardScaler()),('sgd',SGDClassifier(random_state=42,class_weight='balanced'))])
rscv = RandomizedSearchCV(pipe, param_distributions=params,
n_iter=500,scoring='f1_macro',verbose=0,random_state=42)
rscv.fit(X_train, np.array(y_train).ravel())
rscv.best_score_
y_pred_rscv = rscv.predict(X_test)
print(classification_report(y_test, y_pred_rscv))
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rscv, labels=['Low','High']).ravel() #pass labels so that High is treated as the positive class
(tn, fp, fn, tp)
OK, the prediction has improved: the recall for the High class is much better, which is what we are looking for. On the other hand, there are many false positives (low-pollution hours predicted as high). In this case that is not too bad, because if we use this algorithm we want to be sure we catch all the high values, even at the cost of some false alarms, but we should still try to improve the model.
Let's try other methods to deal with imbalanced datasets. Oversampling methods create synthetic instances of the minority class so that the dataset becomes more balanced. The imblearn library provides several such algorithms; the SMOTETomek algorithm combines oversampling (SMOTE) with undersampling (Tomek links). We should only resample the training data, never the test data, otherwise the evaluation would be biased, so we include the resampling step inside the pipeline (imblearn's Pipeline applies it only when fitting).
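Just to see what the resampler does to the class balance (an illustration only, with default settings; in the model itself the resampling runs inside the pipeline on each training fold):
from imblearn.combine import SMOTETomek
#Resample a copy of the training data and compare class counts before and after
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, np.array(y_train).ravel())
print(pd.Series(np.array(y_train).ravel()).value_counts())
print(pd.Series(y_res).value_counts())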
from imblearn.combine import SMOTETomek
oversample=SMOTETomek()
from sklearn import svm
from imblearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
pipe=Pipeline([('ss',StandardScaler()),('ov',oversample),('sgd',SGDClassifier(random_state=42))])
rscv = RandomizedSearchCV(pipe, param_distributions=params,
n_iter=500,scoring='f1_macro',verbose=0)
rscv.fit(X_train, np.array(y_train).ravel())
y_pred_rscv = rscv.predict(X_test)
print(classification_report(y_test, y_pred_rscv))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rscv, labels=['Low','High']).ravel() #again, High as the positive class
(tn, fp, fn, tp)
Worse results, in my opinion: slightly better recall but much worse precision.
We have seen how to download real (open) data from an open data portal and clean and prepare it for analysis. We have also learned the impact of an imbalanced dataset and how we can try to improve our models. We have seen that we need to balance precision and recall depending on our goals. In this case, we may want to predict all cases of high pollution even if we have a high number of false alarms.