2.3.4 Data Handling
2.3.4.1 Missing Values and Data Cleaning
The first step is cleaning the data, for which we need to find the null values in the dataset. Figures 2.2 and 2.3 show the number of null values in every column. There are two ways of handling null values: we can drop all rows containing them, which results in data loss, or we can replace each null value with the mean of the remaining values in that column. Before cleaning the null values, we therefore drop columns with many null values, such as society and balcony. We also drop the area type and availability columns, since our main goal is to predict the price.
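A minimal sketch of this step in pandas follows; the file name is a placeholder, and the column names (society, balcony, area_type, availability) are assumed to match the underlying dataset:

```python
import pandas as pd

# Load the raw dataset (placeholder file name).
df = pd.read_csv("bengaluru_house_prices.csv")

# Inspect the null count per column (the basis of Figures 2.2 and 2.3).
print(df.isnull().sum())

# Drop columns with many nulls or no bearing on the price target.
df = df.drop(columns=["society", "balcony", "area_type", "availability"])

# Drop the remaining rows that still contain null values.
df = df.dropna()
```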
In the size column, there are values with different attributes, such as 3 BHK and 3 BK, which mean different things; hence, to generalize, we create a new column, BHK. To fill it, we apply a function that tokenizes each entry, keeps the leading number, and discards the remaining words. In the total square feet column, some entries specify a range rather than an exact number; in those cases, we replace the range with the average of the two numbers.
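A sketch of both transformations, assuming the columns are named size and total_sqft and that df continues from the cleaning step above:

```python
# 1. Extract the leading number from entries such as "3 BHK" or "3 BK".
df["bhk"] = df["size"].apply(lambda s: int(s.split(" ")[0]))

# 2. Convert total_sqft entries: a range such as "2100 - 2850" becomes
#    the average of its endpoints; plain numbers pass through unchanged.
def to_sqft(value):
    tokens = str(value).split("-")
    if len(tokens) == 2:                  # a range was mentioned
        return (float(tokens[0]) + float(tokens[1])) / 2.0
    try:
        return float(value)               # an exact number
    except ValueError:
        return None                       # unparseable entries dropped below

df["total_sqft"] = df["total_sqft"].apply(to_sqft)
df = df.dropna(subset=["total_sqft"])
```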
Figure 2.2 Missing values.
Figure 2.3 Visualizing missing values using heatmap.
2.3.4.2 Feature Engineering
Feature engineering is the base that will help us later to remove outliers. Here, we combine two columns and apply a formula to obtain the price per square foot. We then count the unique locations and find 1,304 of them. Some of the locations are mentioned only once or twice; therefore, we set a threshold of 10, so only locations that appear more than ten times are treated as distinct categories (Figure 2.4).
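The following sketch shows one way to implement this step; the assumption that the price column is recorded in lakhs (hence the factor of 100,000) and the grouping of rare locations under a single "other" label are ours:

```python
# Derived feature: price per square foot.
df["price_per_sqft"] = df["price"] * 100000 / df["total_sqft"]

# Count how often each of the ~1,304 locations appears.
df["location"] = df["location"].apply(lambda x: x.strip())
location_counts = df["location"].value_counts()

# Locations at or below the threshold are grouped under a common label.
rare_locations = location_counts[location_counts <= 10].index
df["location"] = df["location"].apply(
    lambda x: "other" if x in rare_locations else x
)
```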
Figure 2.4 Different BHK attribute.
2.3.4.3 Removing Outliers
Outliers are data points or errors that represent extreme variations in our dataset. There are several techniques to detect outliers; one of them is visualization. We can plot a box plot or a scatter plot and draw inferences from the patterns.
In the BHK data, some flats have an implausible average area per room, which appears unusual, whereas in other instances the number of bathrooms exceeds the number of rooms in the house, both of which affect the results.
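A sketch of two rule-based filters matching this description; the 300-square-foot-per-room floor and the bath <= bhk + 2 margin are assumed thresholds, not values given in the text:

```python
# Drop flats whose area per room is implausibly small (assumed floor).
df = df[df["total_sqft"] / df["bhk"] >= 300]

# Drop flats with far more bathrooms than rooms (assumed margin of 2).
df = df[df["bath"] <= df["bhk"] + 2]
```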
A scatter chart was plotted to visualize the price per square foot for 2 BHK and 3 BHK properties; the blue points represent 2 BHK flats and the red points 3 BHK flats. Based on Figures 2.5 through 2.8, the outliers were removed from the Hebbal region using the remove_bhk_outliers function.
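The text does not spell out remove_bhk_outliers, so the following is one plausible implementation: within each location, n-BHK flats whose price per square foot falls below the mean of the (n-1)-BHK flats are dropped. The minimum count of 5 is an assumption, and the helper reproduces the blue/red scatter of Figures 2.7 and 2.8:

```python
import matplotlib.pyplot as plt

def remove_bhk_outliers(df):
    """Drop n-BHK flats priced (per sqft) below the (n-1)-BHK mean
    within the same location."""
    exclude = []
    for location, location_df in df.groupby("location"):
        # Per-BHK statistics of price_per_sqft within this location.
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby("bhk"):
            bhk_stats[bhk] = {"mean": bhk_df["price_per_sqft"].mean(),
                              "count": len(bhk_df)}
        # Collect the indices of flats priced below the (n-1)-BHK mean.
        for bhk, bhk_df in location_df.groupby("bhk"):
            stats = bhk_stats.get(bhk - 1)
            if stats and stats["count"] > 5:
                exclude.extend(
                    bhk_df[bhk_df["price_per_sqft"] < stats["mean"]].index
                )
    return df.drop(index=exclude)

def plot_scatter(df, location):
    """2 BHK (blue) vs. 3 BHK (red) scatter, as in Figures 2.7/2.8."""
    bhk2 = df[(df["location"] == location) & (df["bhk"] == 2)]
    bhk3 = df[(df["location"] == location) & (df["bhk"] == 3)]
    plt.scatter(bhk2["total_sqft"], bhk2["price"], color="blue", label="2 BHK")
    plt.scatter(bhk3["total_sqft"], bhk3["price"], color="red",
                marker="+", label="3 BHK")
    plt.xlabel("Total square feet")
    plt.ylabel("Price")
    plt.title(location)
    plt.legend()
    plt.show()

df = remove_bhk_outliers(df)
plot_scatter(df, "Hebbal")
```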
Figure 2.5 Bath visualization.
Figure 2.6 BHK visualization.
Figure 2.7 Scatter plot of total square feet for 2 and 3 BHK flats.
Figure 2.8 Scatter plot of total square feet for 2 and 3 BHK flats after removing outliers.
2.4 Algorithms
2.4.1 Linear Regression
Linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. It is a predictive modeling technique that finds a relationship between independent and dependent variables. The independent variables can be categorical or continuous, while the dependent variable is continuous.
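As an illustrative sketch, an ordinary least-squares model can be fit on the prepared data with scikit-learn; the feature choice (one-hot encoded location plus total_sqft, bath, and bhk) and the 80/20 split are assumptions:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# df continues from the data-handling steps of Section 2.3.4.
X = pd.get_dummies(df.drop(columns=["price", "price_per_sqft", "size"]),
                   columns=["location"])
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

model = LinearRegression()
model.fit(X_train, y_train)
print("R^2 on the test split:", model.score(X_test, y_test))
```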
2.4.2 LASSO Regression
LASSO regression is another variant of linear regression; it makes use of shrinkage, in which data values are shrunk toward a central point such as the mean. The method encourages simple, sparse models; the acronym "LASSO" stands for Least Absolute Shrinkage and Selection Operator [4, 5]. LASSO regression performs L1 regularization: it adds a penalty equal to the absolute value of the magnitude of the coefficients. This form of regularization results in sparse models with fewer coefficients; many coefficients can become exactly zero and are eliminated from the model. Larger penalties push values closer to zero, which produces simpler models. By contrast, L2 regularization (e.g., ridge regression) does not eliminate coefficients or produce sparse models. This makes LASSO easier to interpret than ridge regression.
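A short sketch of L1-regularized fitting on the same split as above; the alpha value is illustrative, and the zero-coefficient count demonstrates the sparsity just discussed:

```python
from sklearn.linear_model import Lasso

# alpha controls the strength of the shrinkage penalty (illustrative value).
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)

# Coefficients driven exactly to zero are effectively removed from the model.
n_zero = (lasso.coef_ == 0).sum()
print(f"{n_zero} of {lasso.coef_.size} coefficients were shrunk to zero")
print("R^2 on the test split:", lasso.score(X_test, y_test))
```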
2.4.3 Decision Tree
A decision tree is a flowchart-like tree in which each internal node represents a feature, each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is called the root node. The tree partitions the data in a recursive manner, known as recursive partitioning. The time complexity is a function of the number of records and the number of attributes.
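A minimal sketch using scikit-learn's DecisionTreeRegressor on the same split as above; the depth cap on the recursive partitioning is an illustrative choice:

```python
from sklearn.tree import DecisionTreeRegressor

# max_depth limits how far the recursive partitioning may go.
tree = DecisionTreeRegressor(max_depth=8, random_state=10)
tree.fit(X_train, y_train)
print("R^2 on the test split:", tree.score(X_test, y_test))
```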