SecretDataScientist.com

What is Selection Bias?

Selection Bias is the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase “selection bias” most often refers to the distortion of a statistical … Read more

What is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. R-squared is the percentage of the response variable variation that is explained by the model, it is always between 0 and 100%: 0% indicates that the … Read more

What is Root Mean Square Error (RMSE)?

Root Mean Square Error (RMSE) is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually observed. The RMSD represents the sample standard deviation of the differences between predicted values and observed values. These individual differences are called residuals when the calculations are performed over the … Read more

What is Resampling?

Resampling is any technique of generating a new sample from an existing dataset. There is a variety of methods for estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping). Exchanging labels on data points when performing significance tests (permutation tests, also … Read more

What is Regularization?

Regularization in the field of machine learning is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. A theoretical justification for regularization is that it attempts to impose Occam’s razor on the solution, as depicted in the figure. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior … Read more

What is Regression?

Regression is a statistical measure used that attempts to determine the strength of the relationship between one dependent variable and a series of other changing (independent) variables. The two basic types of regression are linear regression and multiple linear regression, although there are non-linear regression methods for more complicated data and analysis. Linear regression uses one independent variable to explain … Read more

What is Random Sampling?

Random sampling. In this technique, each member of the population has an equal chance of being selected as the subject. The entire process of sampling is done in a single step with each subject selected independently of the other members of the population. There are many methods to proceed with simple random sampling. A sample chosen randomly is meant to … Read more

What is Random Forest?

Random Forest or Random Decision Forest are an ensemble learning method for classification, regression, and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees habit of overfitting to … Read more

What is Radial Basis Function(RBF) network?

Radial Basis Function(RBF) network is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters. Radial basis function networks have many uses, including function approximation, time series prediction, classification, and system control. They were first formulated in a 1988 … Read more

What is QQ plot?

QQ plots – Quantile-Quantile plots are a graphical technique for determining if two data sets come from populations with a common distribution. A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. By a quantile, we mean the fraction (or percent) of points below the given value. The … Read more