In my previous post, I went over the basic concepts in machine learning using a very small amount of data. I got great feedback, but also requests for a more complex example with a bigger dataset. In this post I will use a bigger dataset, together with pandas, seaborn and scikit-learn, to illustrate the process.

#### Linear regression

Linear regression is a very simple supervised machine learning algorithm: we have data (X, Y) with a linear relationship, and we want to predict unknown Y values for a given X. X can consist of one or more parameters.
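To make the idea concrete before bringing in any libraries, here is a minimal sketch of linear regression with a single parameter, using the closed-form least-squares formulas. The data points and the `fit_line` helper are made up for illustration and are not part of the tips dataset.

```python
# Toy data (made up for illustration): y is roughly 2*x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor,
    # which cancels out in the ratio).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# Predict the unknown Y for a new X using the fitted line.
new_x = 6.0
predicted = slope * new_x + intercept
print(slope, intercept, predicted)
```

With more than one parameter the idea is the same, only the line becomes a plane (or hyperplane), which is what scikit-learn handles for us later in this post.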

Before we start we need to import some libraries:

    import matplotlib.pyplot as py
    import numpy as np  # used later to build the input for predictions
    import seaborn as sb
    import pandas as pd

In this example we will use a dataset from the seaborn library (seaborn provides statistical graphics as an extension to matplotlib):

    df = sb.load_dataset('tips')

There are other datasets built into the seaborn library; see the documentation.

Let's look at the data:

    df.head()

The dataset contains tip data from different customers: females and males, smokers and non-smokers, on days from Thursday to Sunday, at lunch or dinner, and for different table sizes.

We want to predict how much tip the waiter will earn based on the other parameters.

First, let's look into the data using some dataframe methods:

    df.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 244 entries, 0 to 243
    Data columns (total 7 columns):
    total_bill    244 non-null float64
    tip           244 non-null float64
    sex           244 non-null category
    smoker        244 non-null category
    day           244 non-null category
    time          244 non-null category
    size          244 non-null int64
    dtypes: category(4), float64(2), int64(1)
    memory usage: 7.2 KB

    df.describe()

    df.sample(5)

### Data analysis with Pandas

Now let's answer some questions:

What is the hardest day to work (based on the number of tables served)?

    df.groupby('day').count()

Let's find out which is the best day to work, i.e. the day with the most tips (as a sum and as a percentage of the bill):

    # numeric_only=True skips the category columns, which cannot be summed
    df2 = df.groupby('day').sum(numeric_only=True)         # sum per day
    df2.drop('size', inplace=True, axis=1)                 # sum of size column is not relevant
    df2['percent'] = df2['tip'] / df2['total_bill'] * 100  # add percents

We can see that the tips are around 15% of the bill.
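Note that the percent column above is the ratio of the summed tips to the summed bills, which is not the same thing as averaging each table's own tip percentage. A small sketch with made-up numbers (the `toy` frame below is hypothetical, not the tips dataset) shows the difference:

```python
import pandas as pd

# Hypothetical mini dataset: two tables on each of two days.
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri"],
    "total_bill": [10.0, 30.0, 20.0, 20.0],
    "tip": [2.0, 3.0, 3.0, 3.0],
})

# Same approach as in the post: sum per day, then divide the sums.
per_day = toy.groupby("day")[["total_bill", "tip"]].sum()
per_day["percent"] = per_day["tip"] / per_day["total_bill"] * 100

# Alternative: compute the percentage per table, then average per day.
toy["row_percent"] = toy["tip"] / toy["total_bill"] * 100
mean_of_rows = toy.groupby("day")["row_percent"].mean()

print(per_day)
print(mean_of_rows)
```

The first version weights big bills more heavily; the second treats every table equally. Either can be useful, as long as we know which one we are looking at.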

Who eats more (and tips more), smokers or non-smokers?

    df3 = df.groupby('smoker').sum(numeric_only=True)  # sum of the numeric columns per group
    df3['percent'] = df3['tip'] / df3['total_bill'] * 100

Let's group by day and table size:

    df4 = df.groupby(['day', 'size']).sum(numeric_only=True)
    df4['percent'] = df4['tip'] / df4['total_bill'] * 100
    df4.dropna()  # drop null rows (day/size combinations with no tables)

(smaller tables are better to serve)

### Visualization with Seaborn

Let's draw some seaborn graphs.

Tables per day:

    sb.countplot(x='day', data=df)

Tables per day per size:

    sb.countplot(x='day', hue='size', data=df)

Smokers or not?

    sb.countplot(x='day', hue='smoker', data=df)

### Transform and clean the data

Before we start building our model, we need to convert all the text values into numbers. We can do it in many ways:

- Using update statements
- Using the replace method
- Iterating over the rows
- Using dummy variables

**Using replace:**

Convert the sex and smoker columns to numeric values:

    # map each text value to a number, column by column
    df.replace({'sex': {'Male': 0, 'Female': 1},
                'smoker': {'No': 0, 'Yes': 1}}, inplace=True)
    df.head()

**Using dummy variables:**

The values in the day column are Thur, Fri, Sat and Sun. We could convert them to 1, 2, 3, 4, but that would impose an artificial order and spacing on the days, so to get a good model it is better to use boolean variables. We can achieve this by converting the column into 4 columns, one for each day, with 0 or 1 as values. In pandas this can be done using get_dummies:

    days = pd.get_dummies(df['day'])
    days.sample(5)

Actually, we can drop one of the columns without losing data. For example, if we drop the 'Thur' column, we still know that a row such as index 135 is Thur because all the other day columns are 0. This is also supported by the same function:

    days = pd.get_dummies(df['day'], drop_first=True)
    days.sample(6)
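To see why dropping the first column loses nothing, here is a small self-contained sketch. The five-row `toy_days` series is made up for illustration, and the category order is set by hand so that 'Thur' comes first, which is what we assume about the day column in the tips dataset.

```python
import pandas as pd

# Hypothetical mini column with an explicit category order (Thur first).
day_type = pd.CategoricalDtype(["Thur", "Fri", "Sat", "Sun"])
toy_days = pd.Series(["Thur", "Fri", "Sat", "Sun", "Thur"], name="day").astype(day_type)

full = pd.get_dummies(toy_days)                      # one column per day
reduced = pd.get_dummies(toy_days, drop_first=True)  # 'Thur' column removed

# A row where all remaining columns are 0 can only be 'Thur'.
is_thur = reduced.sum(axis=1) == 0

print(list(full.columns))
print(list(reduced.columns))
print(is_thur.tolist())
```

Dropping one column also avoids feeding the model a set of columns that always sum to 1, which is redundant information for a linear model with an intercept.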

We do the same with the time column and concatenate all the data frames. We also don't need the original day and time columns anymore, so we drop them:

    days = pd.get_dummies(df['day'], drop_first=True)
    df = pd.concat([df, days], axis=1)
    times = pd.get_dummies(df['time'], drop_first=True)
    df = pd.concat([df, times], axis=1)
    df.drop(['day', 'time'], inplace=True, axis=1)
    df.head()

### Building our Machine Learning model

Now we are ready to build the linear regression model:

We select the feature columns as X and the value we want to predict as Y:

    X = df[['sex', 'smoker', 'size', 'Fri', 'Sat', 'Sun', 'Dinner']]
    Y = df[['tip']]

Now let's split the data into training and test sets so we can test our model before we use it. We hold out 25% of the rows for testing and train on the remaining 75%:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=26)
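As a quick sanity check of what `test_size=0.25` means for a dataset of 244 rows, here is a minimal sketch using placeholder arrays instead of the real features (the `X_demo` and `y_demo` names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the tips dataset.
X_demo = np.arange(244).reshape(-1, 1)
y_demo = np.arange(244)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=26
)

# How many rows ended up in each part?
print(len(Xd_train), len(Xd_test))
```

Fixing `random_state` makes the shuffle reproducible, so we get the same split every time we run the code.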

Now let's train the model with X_train and y_train:

    model = LinearRegression()
    model.fit(X_train, y_train)
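After fitting, a linear regression model is just one coefficient per feature plus an intercept. The following sketch uses made-up data that follows an exact linear rule (the rule and the `toy_model` name are assumptions for illustration) to show where scikit-learn stores those numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up features and a target that follows y = 1 + 2*x1 + 3*x2 exactly.
X_toy = np.array([[0, 1], [1, 0], [1, 1], [2, 1], [3, 5], [4, 2]], dtype=float)
y_toy = 1.0 + 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# One coefficient per feature column, plus the intercept.
print(toy_model.coef_)
print(toy_model.intercept_)
```

On the tips model, `model.coef_` and `model.intercept_` can be read the same way: each coefficient is the expected change in the tip when that feature increases by one and the others stay the same.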

And predict the X_test values:

    predictions = model.predict(X_test)

We can now look at the predictions and compare them with y_test.

We can draw a graph to see the difference distribution:

    sb.distplot(y_test - predictions)

We can see from the graph that most of the differences are concentrated around 0, meaning the predictions are usually close to the actual tips. We can continue working on the model by adding data and tuning the parameters.
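A plot is one way to judge the fit; a couple of numbers can back it up. The following is a minimal sketch on synthetic data (the table sizes and tips are randomly generated under an assumed linear rule, not taken from the tips dataset) that computes the differences between actual and predicted values, the mean absolute error and the R² score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: one feature and a noisy linear target (assumed rule).
rng = np.random.default_rng(0)
X_syn = rng.uniform(1, 6, size=(200, 1))
y_syn = 1.0 + 0.7 * X_syn[:, 0] + rng.normal(0, 0.5, size=200)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=26
)
syn_model = LinearRegression().fit(Xs_train, ys_train)
syn_predictions = syn_model.predict(Xs_test)

# The same quantity the distribution plot shows: actual minus predicted.
residuals = ys_test - syn_predictions

mae = mean_absolute_error(ys_test, syn_predictions)  # average size of the error
r2 = r2_score(ys_test, syn_predictions)              # 1.0 would be a perfect fit

print(residuals.mean(), mae, r2)
```

If the differences are centered near 0 and the mean absolute error is small compared to a typical tip, the model is doing a reasonable job; a wide or lopsided distribution suggests it is missing something.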

Now we can predict a new value. For example, for a table of 3 with a male smoker on Friday at lunch:

    >>> # order: sex, smoker, size, Fri, Sat, Sun, Dinner
    >>> myvals = np.array([0,1,3,1,0,0,0]).reshape(1,-1)
    >>> model.predict(myvals)
    array([[ 3.12444493]])

We expect to get a tip of $3.12.

And the same table at dinner:

    >>> # same table, but with the Dinner column (last value) set to 1
    >>> myvals = np.array([0,1,3,1,0,0,1]).reshape(1,-1)
    >>> model.predict(myvals)
    array([[ 3.73414562]])

We expect to get $3.73.
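Because the model only sees a row of numbers, the position of each value has to match the column order used for X. One way to avoid mistakes is a small helper that builds the row from readable values. The `make_row` function below is a hypothetical helper, not part of scikit-learn, and it assumes the same encoding used in this post (Male = 0, Female = 1, non-smoker = 0, smoker = 1):

```python
# Column order used when the model was trained.
FEATURES = ['sex', 'smoker', 'size', 'Fri', 'Sat', 'Sun', 'Dinner']


def make_row(sex, smoker, size, day, time):
    """Encode one table as a list of numbers in FEATURES order."""
    values = {
        'sex': 0 if sex == 'Male' else 1,
        'smoker': 1 if smoker == 'Yes' else 0,
        'size': size,
        # Thur is the dropped day: all three day columns stay 0.
        'Fri': int(day == 'Fri'),
        'Sat': int(day == 'Sat'),
        'Sun': int(day == 'Sun'),
        # Lunch is the dropped time: Dinner stays 0.
        'Dinner': int(time == 'Dinner'),
    }
    return [values[name] for name in FEATURES]


lunch = make_row('Male', 'Yes', 3, 'Fri', 'Lunch')
dinner = make_row('Male', 'Yes', 3, 'Fri', 'Dinner')
print(lunch)
print(dinner)
```

The resulting lists can be passed to the model in the same way as above, for example with `np.array(lunch).reshape(1, -1)`.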


Hi,

Can you help me understand your conclusion about the distribution plot?

I am new to Python and its concepts and am trying to learn more about the language.

I am sorry to have asked such a silly question.

thank you