WINE QUALITY PREDICTION

Karthik V M
Feb 16, 2023


Wine Quality Prediction is an EDA of the Kaggle Playground Series Season 3, Episode 5 competition data.

Wine Quality Prediction code: click here

1. Summary:

· This is an EDA of the Playground series data for season 3 episode 5.

· The data is synthetic, and the task is to predict wine quality on a scale of 0–10, with 10 being the highest quality.

· This is a multi-class classification problem.

· The dataset for this competition (both train and test) was generated from a deep learning model trained on the original Wine Quality dataset.

· The evaluation metric required by the competition is the quadratic weighted kappa.

2. Problem statement (Business Level):

The goal of this project is to build a model that can predict the quality of a wine based on the various chemical and physical properties provided. Wine quality should be predicted on a scale of 0–10.

3. Problem Statement (ML/DL):

a. Objective: To predict the quality of the wine on a scale of 0–10 with the help of the given features.

b. Data: The train and test datasets contain 2056 and 1372 rows respectively, each describing a different combination of chemical properties. The columns include Id, fixed acidity, citric acid, residual sugar, free sulfur dioxide, total sulfur dioxide, pH, density, sulphates, and alcohol, along with the target column quality.

c. Evaluation Metric: The model will be evaluated using the quadratic weighted kappa; for our own understanding we also use the F1 score along with the confusion matrix (a small metric sketch is shown after this list).

d. Constraints: The model should be able to run on a standard laptop with 8 GB of RAM and a standard CPU.

e. Expected Result: The model should be able to predict the wine quality with an accuracy of at least 75%.
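
Both metrics are available in scikit-learn. Below is a minimal sketch with made-up labels, purely to illustrate how the metric calls would look (not the competition's actual scoring code):

from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score

# Hypothetical true and predicted quality labels, only to illustrate the metric calls
y_true = [5, 6, 6, 7, 5, 8]
y_pred = [5, 6, 7, 7, 5, 6]

# Quadratic weighted kappa: the official competition metric
print("QWK:", cohen_kappa_score(y_true, y_pred, weights="quadratic"))

# Macro F1 score and confusion matrix for our own understanding
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))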

4. Importing the libraries:
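
The exact import cell is not reproduced here; the following is an assumed but typical set of libraries for this kind of analysis, based on the code shown later in the post:

# Core data handling and numerics
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics and preprocessing used later in the analysis
from scipy import stats
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.decomposition import PCA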

5. Loading data set:

Now let’s load the Wine Quality datasets into pandas DataFrames, and then print the top 5 rows of each dataset (train and test) along with its shape.

# Loading the dataset
train_data=pd.read_csv("/content/train.csv")
test_data=pd.read_csv("/content/test.csv")

Train dataset:

# train dataset 1st 5 rows(head) and shape
print("Train dataset shape is:",train_data.shape)
train_data.head()

Test dataset:

# test dataset 1st 5 rows(head) and shape
print("Test dataset shape is:",test_data.shape)
test_data.head()

6. Data Description

#information of the train dataset
train_data.info()

#information of the test dataset
test_data.info()

Here we can see the total number of columns, the column names, and the data type of each feature. The entire dataset is numerical; there are no categorical variables.

7. Data Exploration:

Data pre-processing steps:

#Train dataset
#checking the duplicate values in the dataset
train_data.duplicated().sum()

#checking for the Null Values in the dataset
train_data.isnull().sum()

#Test dataset
#checking the duplicate values in the dataset
test_data.duplicated().sum()

#checking for the Null Values in the dataset
test_data.isnull().sum()

Neither dataset contains any null, missing, or duplicate values.

8. EDA Results:

Target Analysis:

This is a multi-class classification problem, where we are expected to predict the quality of wine on a 0–10 rating scale.

From the plot below, we can see that the wine quality values only range from 3 to 8.

  • the dataset is also imbalanced
  • this is a multi-class classification problem
  • so we remap the labels as 3–0, 4–1, 5–2, 6–3, 7–4, and 8–5, as shown in the sketch below
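
As a minimal sketch (assuming the target column is named quality, as in the skewness code further below), the class counts and the remapping could be done like this:

# Class distribution of the target: only values 3 to 8 appear, and they are imbalanced
print(train_data["quality"].value_counts().sort_index())

# Remap the observed qualities 3..8 to contiguous labels 0..5
quality_map = {3: 0, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5}
train_data["quality"] = train_data["quality"].map(quality_map)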

Distribution and skewness:

Skewness checking in the independent variable of both datasets

Skewness: if the skewness is between -1 and -0.5 the data is moderately negatively skewed, and if it is between 0.5 and 1 it is moderately positively skewed. If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data is highly skewed; values between -0.5 and 0.5 indicate a fairly symmetric distribution.

In both the train and test datasets, residual sugar, chlorides, total sulfur dioxide, and sulphates are highly skewed, so we need to transform them toward a normal distribution.

# Concatenate test data with train data to get a full view of the feature distributions
skew_df = pd.concat((train_data.drop(['quality'], axis=1), test_data), axis=0).skew(numeric_only=True).sort_values()
print("Columns sorted by skewness value:\n")
display(skew_df)

Kurtosis is a measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution.

The further a feature’s distribution deviates from a normal distribution, the more it can degrade the performance of certain models.
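
To check this, kurtosis can be computed the same way as skewness; a sketch reusing the concatenated train/test view from the previous snippet:

# Kurtosis of each feature across the combined train and test data
kurt_df = pd.concat((train_data.drop(['quality'], axis=1), test_data), axis=0).kurtosis(numeric_only=True).sort_values()
print("Columns sorted by kurtosis value:\n")
display(kurt_df)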

So, to normalize the distribution in the data set we can use the distribution transformations such as,

· Log Transformation

· BoxCox Transformation

· Square Root Transformation

· Quantile Transformation

So let’s apply each of these transformations to the skewed features, namely sulphates, residual sugar, and chlorides, as they show high kurtosis and skewness.
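
Below is a sketch of one way to compare the four transformations on these columns (the column names are taken from the data description; note that the Box-Cox transformation requires strictly positive values):

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import QuantileTransformer

skewed_cols = ["sulphates", "residual sugar", "chlorides"]

for col in skewed_cols:
    x = train_data[col]
    candidates = {
        "log": np.log1p(x),                       # log(1 + x), safe when zeros are present
        "boxcox": pd.Series(stats.boxcox(x)[0]),  # needs strictly positive values
        "sqrt": np.sqrt(x),
        "quantile": pd.Series(QuantileTransformer(output_distribution="normal", random_state=0).fit_transform(x.values.reshape(-1, 1)).ravel()),
    }
    for name, t in candidates.items():
        print(f"{col:15s} {name:10s} skew after transform = {t.skew():.3f}")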

We are looking for the transformation that produces the most normally distributed features. From the above, we see that the Quantile Transformation works best for most of the features, as it gives a nice bell-shaped curve.

Correlation and Mutual Information:

Features that contain similar (correlated/mutual) information negatively impact certain models, as this causes overfitting.

We can observe above that there is high correlation between a number of features: for example, ‘pH’ and ‘fixed acidity’ have a large negative correlation, while ‘density’ and ‘fixed acidity’ have a large positive correlation.

So in order to avoid this, we can drop one feature from each highly correlated pair, or we can reduce the correlation by applying feature decomposition with PCA (Principal Component Analysis).
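
Below is a sketch of how the correlation heatmap, the mutual information scores, and a PCA decomposition could be produced; the Id and quality column names come from the data description, and the parameters are illustrative:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = train_data.drop(["Id", "quality"], axis=1)
y = train_data["quality"]

# Correlation heatmap of the features
sns.heatmap(X.corr(), annot=True, cmap="coolwarm")
plt.show()

# Mutual information between each feature and the target
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns).sort_values(ascending=False)
print(mi)

# PCA on standardized features, keeping enough components to explain 95% of the variance
X_pca = PCA(n_components=0.95).fit_transform(StandardScaler().fit_transform(X))
print("Shape after PCA:", X_pca.shape)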

Outliers/ Distribution:

Outliers will skew certain models, lead them toward poor local optima during training, and reduce model performance. As such, we need to identify the outliers so that our models are not misled by them:

We will identify features with a large number of outliers through boxplot visualization (a code sketch follows the list below).

Columns with a larger number of outliers will skew our models. From eyeballing the above graphs, the following columns contain a noticeable number of outliers:

  • Fixed acidity
  • Residual sugar
  • Chlorides
  • Density
  • pH
  • Sulphates
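
A minimal sketch of how these boxplots and an IQR-based outlier count could be produced (the exact column names, e.g. pH, are assumed to match the CSV):

import matplotlib.pyplot as plt

outlier_cols = ["fixed acidity", "residual sugar", "chlorides", "density", "pH", "sulphates"]

# Boxplots to eyeball outliers per feature
train_data[outlier_cols].plot(kind="box", subplots=True, layout=(2, 3), figsize=(12, 6))
plt.tight_layout()
plt.show()

# Count points lying outside 1.5 * IQR for each feature
for col in outlier_cols:
    q1, q3 = train_data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (train_data[col] < q1 - 1.5 * iqr) | (train_data[col] > q3 + 1.5 * iqr)
    print(f"{col:20s} outliers: {int(mask.sum())}")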
