DATA SCIENCE Q&A

What is Gaussian Distribution (Normal Distribution)?

Gaussian Distribution (Normal Distribution) in probability theory is a very common continuous probability distribution. Normal distribution is important in statistics and is often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. The normal distribution is useful because of the central limit…
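
As a quick illustration (a minimal sketch in Python using NumPy and SciPy; the mean and standard deviation below are arbitrary), the snippet draws samples from a normal distribution and evaluates its density:

import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0                           # mean and standard deviation (arbitrary choice)
samples = np.random.normal(mu, sigma, 10000)   # draw 10,000 random values
print(samples.mean(), samples.std())           # close to mu and sigma for a large sample
print(norm.pdf(0.0, loc=mu, scale=sigma))      # density at x = 0, about 0.3989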

What is Fuzzy Clustering?

Fuzzy Clustering (also referred to as soft clustering) is a form of clustering in which each data point can belong to more than one cluster. Clustering or cluster analysis involves assigning data points to clusters (also called buckets, bins, or classes), or homogeneous classes, such that items in the same…

What is Feedforward Neural Network (FNN)?

Feedforward Neural Network (FNN) is a biologically inspired classification algorithm. It consists of a (possibly large) number of simple neuron-like processing units, organized in layers. Every unit in a layer is connected with units in the previous layer. These connections are not all equal: each connection may have a different…
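
A minimal sketch of a forward pass in NumPy (the layer sizes and random weights are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # common non-linear activation

x = np.array([0.5, -1.2, 3.0])        # input vector (3 features)
W1 = np.random.randn(4, 3)            # weights: 3 inputs -> 4 hidden units
b1 = np.zeros(4)
W2 = np.random.randn(2, 4)            # weights: 4 hidden units -> 2 outputs
b2 = np.zeros(2)

hidden = sigmoid(W1 @ x + b1)         # each hidden unit combines all inputs
output = sigmoid(W2 @ hidden + b2)    # information flows strictly forward
print(output)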

What is Feature Vector?

Feature vector in pattern recognition and machine learning is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of…
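
For example, a tiny grayscale image can be turned into a feature vector by flattening its pixel values (a made-up 2x2 image, using NumPy):

import numpy as np

image = np.array([[0, 255],
                  [128, 64]])        # 2x2 grayscale image (pixel intensities)
feature_vector = image.flatten()     # 4-dimensional feature vector: [0, 255, 128, 64]
print(feature_vector)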

What is Feature in machine learning?

Feature in machine learning and pattern recognition is an individual measurable property of a phenomenon being observed. Choosing informative, discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification, and regression. Features are usually numeric, but structural features such as strings and graphs are used…

What are False Negatives?

A false negative is a test result that indicates a condition failed when in fact it held; i.e. erroneously, no effect has been assumed. A common example is a guilty prisoner freed from jail. The condition: “Is the prisoner guilty?” is true (yes, the prisoner is guilty). But the test (a…

What are False Positives?

A false positive, commonly called a “false alarm”, is a result that indicates a given condition has been fulfilled when it has not; i.e. erroneously, a positive effect has been assumed. In the case of “crying wolf” – the condition tested for was “is there a wolf near the herd?”; the…
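
Both false positives and false negatives can be read off a confusion matrix. A minimal sketch with scikit-learn, using made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]    # actual condition (1 = positive)
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]    # test outcome
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives:", fp)        # predicted positive, actually negative
print("false negatives:", fn)        # predicted negative, actually positive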

What is Extrapolation?

Extrapolation in mathematics is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a higher risk of…
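
A minimal sketch with NumPy: fit a straight line to observations on the range 0..5 and then evaluate it beyond that range (the data are synthetic):

import numpy as np

x = np.arange(0, 6)                                     # observed range: 0..5
y = 2.0 * x + 1.0 + np.random.normal(0, 0.1, x.size)    # noisy linear data
slope, intercept = np.polyfit(x, y, deg=1)              # fit a line to the observed range

print(np.polyval([slope, intercept], 3))                # interpolation (inside the range)
print(np.polyval([slope, intercept], 10))               # extrapolation (outside, less reliable)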

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) in statistics is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task….
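
A minimal sketch of the idea with pandas (a tiny made-up data set):

import pandas as pd

df = pd.DataFrame({
    "height_cm": [160, 172, 181, 168, 190],
    "group": ["a", "b", "a", "a", "b"],
})
print(df.describe())                 # summary statistics of the numeric columns
print(df["group"].value_counts())    # frequency table of a categorical column
df.hist()                            # quick visual summary (histograms; needs matplotlib)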

What is Euclidean Distance?

Euclidean distance in mathematics is the “ordinary” (i.e. straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as a Pythagorean metric. A generalized term for the Euclidean norm…
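
In coordinates, the Euclidean distance between points p and q is d(p, q) = sqrt((p1 - q1)^2 + … + (pn - qn)^2). A minimal NumPy sketch:

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])
distance = np.linalg.norm(p - q)   # Euclidean norm of the difference vector
print(distance)                    # sqrt(3^2 + 4^2 + 0^2) = 5.0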

What is Estimation?

Estimation is the process of finding an estimate, or approximation, which is a value that is usable for some purpose even if input data may be incomplete, uncertain, or unstable. The value is nonetheless usable because it is derived from the best information available. Typically, estimation involves “using the value…

What are Eigenvectors?

Eigenvectors are a special set of vectors associated with a linear system of equations that are sometimes also known as characteristic vectors, proper vectors, or latent vectors. The determination of the eigenvectors and eigenvalues of a system is extremely important in physics and engineering, where it is equivalent to matrix…
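
A minimal sketch with NumPy: for a matrix A, an eigenvector v satisfies A v = lambda v, where lambda is the corresponding eigenvalue:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are the eigenvectors
v = eigenvectors[:, 0]
lam = eigenvalues[0]
print(np.allclose(A @ v, lam * v))             # True: A v = lambda v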

What is Deep Learning?

Deep Learning (also known as deep structured learning, hierarchical learning or deep machine learning) is a class of machine learning algorithms that use a cascade of many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input. The…

What is Deep Belief Network?

Deep Belief Nets are probabilistic generative models that are composed of multiple layers of stochastic, latent variables. The latent variables typically have binary values and are often called hidden units or feature detectors. The top two layers have undirected, symmetric connections between them and form an associative memory. The lower…

What is Decision Tree?

Decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help…
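
A minimal sketch of fitting a classification tree with scikit-learn on its built-in iris data set:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))        # the learned splits, printed as a tree
print(tree.predict(X[:3]))      # class predictions for the first three flowers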

What is Data Mining?

Data Mining is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. It is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data…

What is Cross-Validation?

Cross-Validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. The idea…
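
A minimal sketch of k-fold cross-validation with scikit-learn (5 folds on the iris data set):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)   # train/validate on 5 different splits
print(scores, scores.mean())                  # per-fold accuracy and its average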

What is Correlation?

Correlation is a statistical measure that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think…
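
A minimal sketch with NumPy, using made-up height and weight values:

import numpy as np

height = np.array([150, 160, 170, 180, 190])
weight = np.array([55, 62, 70, 78, 90])
r = np.corrcoef(height, weight)[0, 1]   # Pearson correlation coefficient
print(r)                                # close to 1: strong positive relationship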

What is Convolutional Neural Network (CNN)?

A Convolutional Neural Network (CNN) is made up of neurons that have learnable weights and biases. CNNs are a category of neural networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, apart from powering vision…

What is collaborative filtering?

Collaborative Filtering (CF) is a technique used by recommender systems. Collaborative filtering has two senses, a narrow one, and a more general one. Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The…
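
A minimal sketch of user-based collaborative filtering with NumPy: similarity between users is measured on their rating vectors, and a missing rating is predicted as a similarity-weighted average of other users' ratings (the ratings matrix below is made up):

import numpy as np

# rows = users, columns = items; 0 means "not rated yet"
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 3, 1],
                    [1, 1, 5, 4]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target_user, target_item = 0, 2
others = [1, 2]
sims = np.array([cosine(ratings[target_user], ratings[u]) for u in others])
their_ratings = np.array([ratings[u, target_item] for u in others])
predicted = sims @ their_ratings / sims.sum()   # similarity-weighted average
print(predicted)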

What is cluster analysis?

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in one sense or another) to each other than to those in other groups (clusters). It is the main task of…
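
A minimal sketch of (hard) clustering with scikit-learn's k-means on made-up 2-D points:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [8, 8], [8, 9], [0.5, 1.5], [9, 8]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.cluster_centers_)   # the two group centres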

What is classification?

Classification in machine learning and statistics, is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Classification is an example of pattern recognition. In the…

What is Chi-squared test for variances?

Chi-squared test for variances. A chi-square test can be used to test if the variance of a population is equal to a specified value. This test can be either a two-sided test or a one-sided test. The two-sided version tests against the alternative that the true variance is either less…
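
A minimal sketch of the calculation with SciPy: the test statistic is (n - 1) * s^2 / sigma0^2, compared against a chi-squared distribution with n - 1 degrees of freedom (the sample and the hypothesized variance sigma0^2 are made up):

import numpy as np
from scipy.stats import chi2

sample = np.array([5.1, 4.8, 5.4, 5.0, 4.7, 5.3, 5.2, 4.9])
sigma0_sq = 0.02                          # hypothesized population variance
n = sample.size
statistic = (n - 1) * sample.var(ddof=1) / sigma0_sq
p_two_sided = 2 * min(chi2.cdf(statistic, df=n - 1),
                      chi2.sf(statistic, df=n - 1))
print(statistic, p_two_sided)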

What is Chi-squared test for goodness of fit?

Chi-squared test for goodness of fit, also written as a χ² test, is any statistical hypothesis test wherein the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. Without other qualification, ‘chi-squared test’ often is used as short for Pearson’s chi-squared test. Chi-squared…
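
A minimal sketch with SciPy: comparing observed die-roll counts against the uniform counts expected under the null hypothesis (the counts are made up):

from scipy.stats import chisquare

observed = [18, 22, 16, 14, 12, 18]   # counts of faces 1..6 in 100 rolls
expected = [100 / 6] * 6              # fair-die expectation
statistic, p_value = chisquare(observed, f_exp=expected)
print(statistic, p_value)             # large p-value: no evidence the die is unfair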

What is Central Limit Theorem?

Central Limit Theorem (CLT) is a statistical theory that states that given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population. Furthermore, all of the samples…
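
A minimal simulation sketch with NumPy: averages of samples drawn from a very non-normal (exponential) population still pile up around the population mean in an approximately normal shape:

import numpy as np

rng = np.random.default_rng(0)
# 10,000 samples of size 50 from an Exponential(1) population (mean 1, std 1)
sample_means = rng.exponential(1.0, size=(10000, 50)).mean(axis=1)
print(sample_means.mean())   # close to the population mean of 1
print(sample_means.std())    # close to 1 / sqrt(50), about 0.141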

What is Causation?

Causation. Two or more variables are considered to be related, in a statistical context, if their values change so that as the value of one variable increases or decreases, so does the value of the other variable (although it may be in the opposite direction). Theoretically, the difference between the two…

What is Categorical Variable?

Categorical Variable in statistics is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each unit of observation to a particular group or nominal category on the basis of some qualitative property. In computer science and some branches of mathematics, categorical…

What is CART or Classification And Regression Trees?

CART or Classification And Regression Trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are…

What is Box plot?

A box plot is a quick way of examining one or more sets of data graphically. In statistics, a box plot is a convenient way of depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper…
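
A minimal sketch with Matplotlib (two made-up groups of values):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 100)
group_b = rng.normal(12, 3, 100)
plt.boxplot([group_a, group_b])      # box = quartiles, whiskers = spread beyond them
plt.xticks([1, 2], ["A", "B"])
plt.show()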

What is Bootstrapping?

Bootstrapping. In statistics, bootstrapping is any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost…
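
A minimal sketch with NumPy: resampling a small data set with replacement to get a rough confidence interval for its mean:

import numpy as np

rng = np.random.default_rng(0)
data = np.array([3.1, 2.4, 4.0, 3.7, 2.9, 3.3, 4.2, 2.8])
boot_means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                       for _ in range(5000)])      # 5,000 bootstrap resamples
print(np.percentile(boot_means, [2.5, 97.5]))      # approximate 95% CI for the mean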

What is Boltzmann Machine?

Boltzmann machine is a network of symmetrically connected, neuronlike units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm that allows them to discover interesting features in datasets composed of binary vectors. The learning algorithm is very slow in networks with…

What is Big Data?

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. The term “big data” often refers simply to…

What is Bias-variance trade-off?

Bias-variance trade-off is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and also generalizes well to unseen data. In statistics and machine learning, the bias-variance trade-off is the problem of simultaneously minimizing two sources of error…

What is Bayesian statistics?

Bayesian statistics is a theory in the field of statistics in which the evidence about the true state of the world is expressed in terms of degrees of belief known as Bayesian probabilities. Such an interpretation is only one of a number of interpretations of probability and there are other…

What is backpropagation?

Backpropagation, or the backward propagation of errors, is a common method of training artificial neural networks and is used in conjunction with an optimization method such as gradient descent. The algorithm repeats a two-phase cycle: propagation and weight update. When an input vector is presented to the network, it is propagated…
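
A minimal sketch of the propagate-then-update cycle in NumPy, training a tiny one-hidden-layer network on the XOR problem with plain gradient descent (layer sizes, learning rate, and iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)    # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)      # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)      # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for _ in range(10000):
    # forward pass (propagation)
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the error, then update the weights
    d_out = (out - y) * out * (1 - out)            # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)             # gradient pushed back to the hidden layer
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(3))   # should approach [0, 1, 1, 0] (may need more iterations for some random starts)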

What is Autoencoder?

Autoencoder is an artificial neural network used for unsupervised learning of efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. Recently, the autoencoder concept has become more widely used for learning generative models of…

What is AUC – Area Under the Curve?

AUC stands for the Area Under the Curve. Technically, it can be used for the area under any number of curves that are used to measure the performance of a model, for example, it could be used for the area under a precision-recall curve. However, when not otherwise specified, AUC…
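
A minimal sketch with scikit-learn, computing the area under the ROC curve for made-up labels and scores:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                   # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]     # predicted probabilities of class 1
print(roc_auc_score(y_true, y_score))         # 1.0 = perfect ranking, 0.5 = random guessing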

What is ANOVA F-test?

The ANOVA F-test in a one-way analysis of variance is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. For example, suppose that a medical trial compares four treatments. The ANOVA F-test can be used to assess whether any of the…
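
A minimal sketch with SciPy, comparing three made-up treatment groups:

from scipy.stats import f_oneway

treatment_a = [5.1, 4.9, 5.3, 5.0]
treatment_b = [5.8, 6.1, 5.9, 6.0]
treatment_c = [5.2, 5.0, 5.1, 5.3]
f_stat, p_value = f_oneway(treatment_a, treatment_b, treatment_c)
print(f_stat, p_value)   # a small p-value suggests at least one group mean differs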

What is ANOVA – Analysis of variance?

ANOVA (Analysis of variance) is a form of statistical hypothesis testing used in the analysis of experimental data. A test result is called statistically significant if it is deemed unlikely to have occurred by chance, assuming the truth of the null hypothesis. A statistically significant result, when a probability (p-value)…

What is ANCOVA – Analysis of covariance?

ANCOVA (Analysis of covariance) is a general linear model which blends ANOVA and regression. ANCOVA evaluates whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) often called a treatment, while statistically controlling for the effects of other continuous variables that are…

What is Alternative Hypothesis (H1)?

Alternative Hypothesis (H1) is the hypothesis, in a scientific experiment or business process improvement initiative, that contradicts the null hypothesis. While the null hypothesis (H0) in any experiment or research project is that the connection or conclusion suggested by the experiment is false, the alternative hypothesis (H1) is always…

What is A/B Testing?

A/B Testing (also known as split testing or bucket testing) is a method of comparing two versions of a web page or app against each other to determine which one performs better. A/B testing is essentially an experiment where two or more variants of a page are shown to users…
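
A minimal sketch of evaluating an A/B test with statsmodels, using made-up conversion counts for two page variants:

from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 145]        # variant A, variant B
visitors = [2400, 2390]
z_stat, p_value = proportions_ztest(conversions, visitors)
print(z_stat, p_value)          # a small p-value suggests the conversion rates differ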

What is 80/20 rule – Pareto rule?

The 80-20 rule (Pareto rule) is a rule of thumb that states that 80% of outcomes can be attributed to 20% of all causes for a given event. In business, the 80-20 rule is often used to point out that 80% of a company’s revenue is generated by 20% of its…
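
A minimal sketch in Python, checking what share of (made-up) revenue comes from the top 20% of customers:

revenue_per_customer = sorted([500, 30, 20, 1200, 45, 60, 25, 900, 40, 35], reverse=True)
top_20_percent = revenue_per_customer[: len(revenue_per_customer) // 5]
share = sum(top_20_percent) / sum(revenue_per_customer)
print(f"top 20% of customers generate {share:.0%} of revenue")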

Artificial Neural Network

Artificial Neural Network is a system of individual processing units (neurons), usually connected in a structured manner. It is designed to mimic the operation of the human brain and model the way it solves problems. Each individual neural unit has a function which processes the values of all its inputs. The…

Machine Learning

Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed. The goal of machine learning is to develop learning algorithms that do the learning automatically, without human intervention or assistance, just by being exposed to new data. The machine learning paradigm…

Data Science Definition

A precise definition of Data Science is not easy to come up with, but as a rule of thumb, one could say that data science is a field that involves extracting insights and knowledge from data using various statistical and computational methods. Data science involves using methods from the fields of statistics,…