Data Scientist Interview Questions – Explain what precision and recall are?

After the predictive model has been finished, the most important question is: How good is it? Does it predict well? Evaluating the model is one of the most important tasks in the data science project,  it indicates how good predictions are. Very often for classification problems we look at metrics called precision and recall, to define them in detail let’s quickly … Read more

How would you validate-test a predictive model?

Why evaluate/test model at all? Evaluating the performance of a model is one of the most important stages in predictive modeling, it indicates how successful model has been for the dataset. It enables to tune parameters and in the end test the tuned model against a fresh cut of data. Below we will look at few most common validation metrics used … Read more

Why would you use Regularization and what it is?

Why would you use Regularization and what it is? In Machine Learning, very often the task is to fit a model to a set of training data and use the fitted model to make predictions or classify new (out of sample) data points. Sometimes model fits the training data very well but does not well in predicting out of sample … Read more

Introduction to TensorFlow

Introduction to TensorFlow. What is TensorFlow? The shortest definition would be, TensorFlow is a general-purpose library for graph-based computation. But there is a variety of other ways to define TensorFlow, for example, Rodolfo Bonnin in his book – Building Machine Learning Projects with TensorFlow brings up definition like this: “TensorFlow is an open source software library for numerical computation using … Read more

Where to learn TensorFlow for Free?

Below a list of free resources to learn TensorFlow: TensorFlow website: www.tensorflow.org Udacity free course: www.udacity.com Google Cloud Platform: cloud.google.com Coursera free course: www.coursera.org Machine Learning with TensorFlow by Nishant Shukla : www.tensorflowbook.com ‘First Contact With TensorFlow’ by Prof. JORDI TORRES: jorditorres.org  or you can order from Amazon: First Contact With Tensorflow Kadenze Academy: www.kadenze.com OpenShift: blog.openshift.com Tutorial by pkmital : github.com Tutorial by HyunsuLee : github.com … Read more

Tensor Flow Cheat Sheet.

TensorFlow Quick Reference Table – Cheat Sheet. TensorFlow is a very popular deep-learning library, with its complexity can be overwhelming, especially for new users. Here is a short summary of often-used functions, if you want to download it in pdf it is available here: TensorFlow CheetSheet – SecretDataScientist.com If you find it useful please share on social media. Import TensorFlow: … Read more

Popular Pandas snippets used in data analysis.

Popular Pandas snippets used in data analysis. Pandas is very popular Python library for data analysis, manipulation, and visualization, I would like to share my personal view on the list of most often used functions/snippets for data analysis. 1.Import Pandas to Python import pandas as pd 2. Import data from CSV/Excel file df=pd.read_csv(‘C:/Folder/mlhype.csv’)   #imports whole csv to pd dataframe … Read more

What is Hadoop YARN?

Hadoop YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored on a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data … Read more

What is Hadoop Flume?

Hadoop Flume was created in the course of incubator Apache project to allow you to flow data from a source into your Hadoop environment. In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation … Read more

What is Apache Kafka?

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected as a distributed transaction log, making it highly valuable for enterprise infrastructures to process … Read more