The dataset was scraped from websites dedicated to wine.
It consists of 48 features and 3,600 rows, where each row is one vintage of a wine.
The features mix numerical and categorical attributes, such as: vintage, id and name of the wine, type of the wine (red or white), id and name of the winery, country, region, appellation of the wine, whether the wine is biodynamic, whether it is natural, level of acidity, type of body, alcohol content (%), number and type of awards won, rating of the wine (grade from 1 to 5), year of production, and the grapes that compose the wine.
As the database is extremely imbalanced in favor of French (conventional) wine, I decided to focus on French wine only.
I used the feature median_price as a label for my predictive model, since I wanted to predict the price of a specific wine. The median has been computed by taking into account the volume of the priced bottles.
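The text does not spell out how bottle volume enters the median. One possible reading, sketched below, is that each listed price is rescaled to a standard 75 cl bottle before taking the median. The function name, the 75 cl reference, and the sample offers are all assumptions made for illustration, not the project's actual code or data.

```python
from statistics import median

# Assumption: prices are compared on a standard 75 cl bottle basis.
STANDARD_VOLUME_CL = 75.0


def volume_adjusted_median(offers):
    """Return the median price after rescaling each offer to 75 cl.

    `offers` is a list of (price_in_euros, bottle_volume_in_cl) pairs.
    """
    adjusted = [price * STANDARD_VOLUME_CL / volume_cl for price, volume_cl in offers]
    return median(adjusted)


# Hypothetical offers for one vintage: a standard bottle, a magnum, a half bottle.
offers = [(30.0, 75.0), (64.0, 150.0), (17.0, 37.5)]
median_price = volume_adjusted_median(offers)
```

Rescaling first keeps a magnum or a half bottle from distorting the label. A volume-weighted median would be another valid interpretation.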
The whole dataset is the private property of Eldiias, but you can find a pickle file named wine_dataset_sql.
Cleaning the data was an important part of the project, as the database came from webscraping.
I cleaned the data in SQL and Python: removing duplicates and missing values, and normalizing the format of the values.
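The Python side of such a cleaning pass might look like the pandas sketch below. The column names (other than median_price) and the sample rows are hypothetical and only illustrate the three steps mentioned: formatting values, dropping missing values, and dropping duplicates.

```python
import pandas as pd

# Hypothetical raw rows as they might come out of a scraper.
raw = pd.DataFrame({
    "name": ["  Château A ", "Château A", "Domaine B", "Domaine C"],
    "vintage": ["2015", "2015", "2012", None],
    "median_price": ["45,5", "45,5", "120", "30"],
})


def clean(df):
    """Format values, drop rows with a missing vintage, then drop duplicates."""
    out = df.copy()
    # Formatting: trim whitespace and convert decimal commas to floats.
    out["name"] = out["name"].str.strip()
    out["median_price"] = (
        out["median_price"].str.replace(",", ".", regex=False).astype(float)
    )
    # Missing values: a row without a vintage cannot be used here.
    out = out.dropna(subset=["vintage"])
    out["vintage"] = out["vintage"].astype(int)
    # Duplicates: only detectable once the formatting is consistent.
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean(raw)
```

Formatting comes before deduplication on purpose: the first two rows only become identical once the stray whitespace is trimmed.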
The exploratory analysis led me to suspect a strong relationship, either linear or quadratic, between ratings and prices, and also between years of production and prices, as shown below:
We also see a large disparity in prices across grapes and regions:
Paradoxically, however, there seems to be an inverse relationship between the number of awards won and prices:
This can be explained by at least two reasons:
- most wines in the dataset have no award, so the award-winning wines may be too few to be significant
- famous and expensive wines do not compete for awards, as they are already considered top of the range; in France, awards go to cheaper wines that need to build a reputation.
After visualizing the distribution of the prices, it seemed relevant to compute a confidence interval for the mean of the prices, using scipy.
With 95% confidence, the average price of French wine falls between 69.3€ and 73.7€.
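A confidence interval of this kind can be computed with scipy roughly as follows. This is a minimal sketch using a Student t interval around the sample mean. The prices below are synthetic, so the bounds will not match the figures above, and the helper name is an assumption.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(values, confidence=0.95):
    """Return (low, high) bounds of a t-based confidence interval for the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = stats.sem(values)  # standard error of the mean (ddof=1)
    return stats.t.interval(confidence, df=len(values) - 1, loc=mean, scale=sem)


# Synthetic, right-skewed prices standing in for the real (private) data.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=3.5, sigma=0.8, size=500)

low, high = mean_confidence_interval(prices)
print(f"95% CI for the mean price: [{low:.1f}, {high:.1f}]")
```

Even though prices are skewed, the interval for the mean stays reasonable on large samples thanks to the central limit theorem.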
Is French natural/biodynamic wine significantly more acidic than French conventional wine, on average?
The two populations are all conventional French wines and all biodynamic and natural French wines, both following a binomial distribution.
p1 is the proportion of wines with acidity "élevée" (high) among conventional wines; p2 is the proportion of wines with acidity "élevée" among biodynamic and natural wines.
The samples contain 215 conventional French wines, with a proportion p1 = 0.81 of wines with acidity "élevée", and 14 biodynamic and natural French wines, with a proportion p2 = 0.93 of wines with acidity "élevée".
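One common way to compare two such proportions is a one-sided two-proportion z-test with a pooled standard error, sketched below with the sample sizes and proportions quoted above. The choice of test is an assumption, since the text does not name the one actually used, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(p_a, n_a, p_b, n_b):
    """One-sided z-test of H0: p_b = p_a against H1: p_b > p_a.

    Returns (z statistic, p-value), using the pooled proportion under H0.
    """
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Figures from the text: share of wines with acidity "élevée" in each sample.
p_conv, n_conv = 0.81, 215  # conventional French wines
p_bio, n_bio = 0.93, 14     # biodynamic and natural French wines

z, p_value = two_proportion_z_test(p_conv, n_conv, p_bio, n_bio)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.3f}")
```

With only 14 biodynamic and natural wines, the normal approximation is fragile. An exact test such as Fisher's would be a safer check on the same counts.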
After working separately on three sets of variables with linear regression and random forest (1: ratings and year, 2: ratings and region, 3: ratings and grapes), I combined all the variables into a single regression model that reached a satisfactory R-squared.
Train score: 0.88 Test score: 0.88
(taking the log of the datapoints since it appeared more like a quadratic relationship in the visualization)
Train score: 0.75 Test score: 0.76
(taking the log of the datapoints)
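The overall workflow (log-transform the price, split the data, fit a linear regression and a random forest, compare train and test R-squared) can be sketched with scikit-learn as below. The data is synthetic and restricted to ratings and year, and the hyperparameters are arbitrary, so the scores will not reproduce the ones reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the private dataset: price grows with rating and age.
rng = np.random.default_rng(42)
n = 600
rating = rng.uniform(3.0, 5.0, n)
year = rng.integers(1990, 2020, n)
price = np.exp(1.2 * rating - 0.03 * (year - 1990) + rng.normal(0, 0.2, n))

X = np.column_stack([rating, year])
y = np.log(price)  # the log turns the curved price relationship into a linear one

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # .score returns R-squared for scikit-learn regressors.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R2 = {scores[name][0]:.2f}, test R2 = {scores[name][1]:.2f}")
```

Comparing the train and test scores side by side is what reveals overfitting: a random forest often scores much higher on the training set than on held-out data.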