Trading with Python Intro – Data Import

Traditionally, there have been two general ways of analyzing market data. In recent years, computer science and mathematics have revolutionized trading: it has become dominated by computers that help analyze vast amounts of available data. Algorithms are responsible for making trading decisions faster than any human being could. Machine learning and…

How would you validate-test a predictive model?

How would you validate-test a predictive model? Why evaluate/test a model at all? Evaluating the performance of a model is one of the most important stages in predictive modeling: it indicates how successful the model has been on the dataset. It enables us to tune parameters and, in the end, to test the tuned model…
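As a minimal sketch of hold-out validation (assuming scikit-learn and its bundled Iris dataset, neither of which is named in the article), the model is fitted on a training split and then scored on data it has never seen:

    # Hold-out validation: fit on a training split, score on unseen test data
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))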

Why would you use Regularization, and what is it?

Why would you use Regularization, and what is it? In Machine Learning, the task is very often to fit a model to a set of training data and use the fitted model to make predictions or classify new (out-of-sample) data points. Sometimes a model fits the training data very…
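One hedged illustration of the idea, assuming scikit-learn (the article's own examples are not reproduced here): ridge regression adds an L2 penalty that shrinks the coefficients and so discourages overfitting.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=50)   # only the first feature matters

    plain = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)   # alpha sets the strength of the L2 penalty
    print("unregularized:", plain.coef_.round(2))
    print("ridge:        ", ridge.coef_.round(2))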

Introduction to TensorFlow

Introduction to TensorFlow. What is TensorFlow? The shortest definition would be that TensorFlow is a general-purpose library for graph-based computation. But there are a variety of other ways to define TensorFlow; for example, Rodolfo Bonnin, in his book Building Machine Learning Projects with TensorFlow, offers a definition like this: “TensorFlow…
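As a minimal, hedged sketch of graph-based computation (using the TensorFlow 2.x API, which may differ from the version discussed in the original post): tf.function traces a Python function into a TensorFlow graph that can then be executed.

    import tensorflow as tf

    a = tf.constant(2.0)
    b = tf.constant(3.0)

    @tf.function          # traces the Python function into a TensorFlow graph
    def compute(x, y):
        return x * y + 1.0

    print(compute(a, b).numpy())   # 7.0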

Where to learn TensorFlow for Free?

Below is a list of free resources to learn TensorFlow:
TensorFlow website: www.tensorflow.org
Udacity free course: www.udacity.com
Google Cloud Platform: cloud.google.com
Coursera free course: www.coursera.org
Machine Learning with TensorFlow by Nishant Shukla: www.tensorflowbook.com
‘First Contact With TensorFlow’ by Prof. Jordi Torres: jorditorres.org (or you can order it from Amazon: First Contact With TensorFlow)
Kadenze Academy: www.kadenze.com…

TensorFlow Cheat Sheet.

TensorFlow Quick Reference Table – Cheat Sheet. TensorFlow is a very popular deep-learning library, but its complexity can be overwhelming, especially for new users. Here is a short summary of often-used functions; if you want to download it as a PDF, it is available here: TensorFlow Cheat Sheet – SecretDataScientist.com. If you…

Popular Pandas snippets used in data analysis.

Popular Pandas snippets used in data analysis. Pandas is a very popular Python library for data analysis, manipulation, and visualization; I would like to share my personal view on the list of most often used functions/snippets for data analysis. 1. Import Pandas to Python: import pandas as pd 2. Import data from…
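A short, hedged sketch of the kind of snippets the post lists (the file name prices.csv and the column name close are placeholders, not taken from the original):

    import pandas as pd

    df = pd.read_csv("prices.csv")   # import data from a CSV file
    df.head()                        # first rows of the DataFrame
    df.describe()                    # summary statistics
    df["close"].mean()               # mean of a single column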


DATA SCIENCE QUESTIONS AND ANSWERS


What is Hadoop YARN?

Hadoop YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored on…

What is Hadoop Flume?

Hadoop Flume was created as an Apache incubator project to allow you to flow data from a source into your Hadoop environment. In Flume, the entities you work…

What is Apache Kafka?

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for…

What is Hadoop Zookeeper?

Hadoop Zookeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments….

What is Hadoop HBase?

Hadoop HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use…

What is Hadoop Sqoop?

Hadoop Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop…

What is Hadoop Hive?

Hadoop Hive is a runtime Hadoop support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right…

What is Hadoop Pig?

Hadoop Pig was initially developed at Yahoo to allow people using Hadoop to focus more on analyzing large datasets and spend less time writing mapper and reducer programs. This would…

What is Z-Score or Standard Score?

Z-Score or Standard Score in statistics is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what…
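A small worked example with illustrative numbers only (assuming NumPy, which the entry itself does not mention):

    import numpy as np

    data = np.array([12.0, 15.0, 9.0, 14.0, 11.0, 13.0])
    z_scores = (data - data.mean()) / data.std()   # signed distance from the mean, in standard deviations
    print(z_scores.round(2))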

What is Unsupervised Learning?

Unsupervised Learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labelled responses. The most common unsupervised learning method is cluster…
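A minimal clustering sketch, assuming scikit-learn and synthetic two-dimensional data (not part of the original answer): k-means groups the points without ever seeing labels.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:10])
    print(kmeans.cluster_centers_)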

What is Type II Error?

Type II Error in statistical hypothesis testing is incorrectly retaining a false null hypothesis (a “false negative”). A type II error (or error of the second kind) is the failure…

What is Type I Error?

Type I Error in statistical hypothesis testing is the incorrect rejection of a true null hypothesis (a false positive). More simply stated, a type I error is detecting an effect…

What is True Positive Rate (Sensitivity)?

True Positive Rate (Sensitivity) is a statistical measure of the proportion of positives that are correctly identified as such (for example, the percentage of sick people who are correctly…

What is True Negative Rate (Specificity)?

True Negative Rate (Specificity) is a statistical measure of the proportion of negatives that are correctly identified as such (for example, the percentage of healthy people who are correctly…
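The sensitivity entry above and this specificity entry can both be illustrated from a 2x2 confusion matrix; the counts below are purely illustrative and not from the original answers.

    tp, fn = 80, 20   # sick people correctly / incorrectly identified
    tn, fp = 90, 10   # healthy people correctly / incorrectly identified

    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    print(sensitivity, specificity)   # 0.8 0.9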

What is Three Sigma Rule?

Three Sigma Rule in the empirical sciences expresses a conventional heuristic that “nearly all” values are taken to lie within three standard deviations of the mean, i.e. that it is…
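A quick empirical check of the rule, assuming NumPy and normally distributed data (an assumption not stated in the entry):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

    within = np.abs(sample - sample.mean()) <= 3 * sample.std()
    print(within.mean())   # roughly 0.997 for a normal distribution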

What is Support Vector Machines (SVM)?

Support Vector Machines (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which…
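A hedged sketch with scikit-learn (synthetic blobs stand in for the labeled training data): a linear SVM fits a separating hyperplane whose coefficients can be inspected.

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, random_state=0)
    clf = SVC(kernel="linear").fit(X, y)
    print("hyperplane coefficients:", clf.coef_, "intercept:", clf.intercept_)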

What is Supervised Learning?

Supervised Learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example…

What is Statistical Significance?

Statistical Significance in statistical hypothesis testing is attained whenever the observed p-value of a test statistic is less than the significance level defined for the study. The p-value is the…
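An illustrative significance test, assuming SciPy and simulated data: the observed p-value is compared against a 0.05 significance level.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(0.0, 1.0, size=100)
    group_b = rng.normal(0.5, 1.0, size=100)

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(p_value, p_value < 0.05)   # statistically significant at the 0.05 level if True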

What is Statistical Power?

Statistical Power of any test of statistical significance is defined as the probability that it will reject a false null hypothesis. Statistical power is inversely related to beta or the…

What is Sentiment Analysis?

Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis…

What is Semi-Supervised Learning?

Semi-Supervised Learning is a class of supervised learning tasks that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount…

What is Self-Organizing Map (SOM)?

Self-Organizing Map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the…

What is Selection Bias?

Selection Bias is the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not…

What is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple…
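A brief sketch, assuming scikit-learn and simulated data: R-squared approaches 1 as the fitted line explains more of the variance.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=100)

    model = LinearRegression().fit(X, y)
    print(r2_score(y, model.predict(X)))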

What is Root Mean Square Error (RMSE)?

Root Mean Square Error (RMSE) is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually…
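A tiny worked example with made-up numbers (assuming NumPy):

    import numpy as np

    actual    = np.array([3.0, 5.0, 2.5, 7.0])
    predicted = np.array([2.5, 5.0, 4.0, 8.0])

    rmse = np.sqrt(np.mean((predicted - actual) ** 2))
    print(rmse)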

What is Resampling?

Resampling is any technique of generating a new sample from an existing dataset. There is a variety of methods for estimating the precision of sample statistics (medians, variances, percentiles) by…
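One common resampling technique is the bootstrap; here is a hedged sketch with NumPy and simulated data, estimating the variability of the median by resampling with replacement:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.exponential(scale=2.0, size=200)

    medians = [np.median(rng.choice(data, size=data.size, replace=True)) for _ in range(1000)]
    print(np.percentile(medians, [2.5, 97.5]))   # rough 95% interval for the median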

What is Regularization?

Regularization in the field of machine learning is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. A theoretical justification for regularization…

What is Regression?

Regression is a statistical method that attempts to determine the strength of the relationship between one dependent variable and a series of other changing (independent) variables. The two basic…

What is Random Sampling?

Random sampling is a technique in which each member of the population has an equal chance of being selected as the subject. The entire process of sampling is done in a single…
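A minimal sketch using Python's standard library (the population here is just a placeholder range):

    import random

    population = list(range(1, 101))          # members of the population
    sample = random.sample(population, k=10)  # each member has an equal chance of selection
    print(sample)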