BLOG - SecretDataScientist.com

Fine Tuning LLM

Fine-tuning large language models (LLMs) has become an indispensable tool for enterprises that rely on LLMs to enhance their operational processes. While the foundational training of LLMs offers a broad understanding of language, the fine-tuning process molds these models into…

Embeddings

Embeddings are a fundamental concept in machine learning and natural language processing (NLP). They are used to convert non-numeric data, such as text or categorical variables, into numerical vectors that machine learning algorithms can process. These vectors, known as embeddings,…

LangChain Cheatsheet

LangChain simplifies building AI applications using large language models (LLMs) by providing an intuitive interface for connecting to state-of-the-art models like GPT-4 and optimizing them for custom applications. It supports chains combining multiple models and modular prompt engineering for more…

Ollama Cheatsheet

Here is a comprehensive Ollama cheat sheet containing the most often used commands and explanations: Installation and Setup; Running Ollama; Model Library and Management; Advanced Usage; Integration with Visual Studio Code; AI Developer Scripts; Additional Resources; Other Tools and Integrations; Community…

Autonomous AI Agents

Autonomous AI agents are intelligent computer programs that operate independently, making decisions and taking actions without human intervention. These agents are powered by advanced machine learning algorithms and large language models (LLMs), enabling them to process vast amounts of data…

What is AGI – Artificial General Intelligence?

Artificial General Intelligence (AGI): A Comprehensive Overview for Professionals. Artificial General Intelligence (AGI) is a concept that has garnered significant attention in recent years, particularly with the emergence of advanced AI tools like ChatGPT. As a researcher in the field,…

Microsoft DP-100 – Designing and Implementing a Data Science Solution on Azure – free questions.

Part of the Microsoft Certified Azure Data Scientist Associate certification, the DP-100 exam measures your ability to accomplish a range of technical tasks. Example questions: You need to resolve the local machine learning pipeline performance issue. What should you do? A. Increase Graphics Processing Units (GPUs). B…

Trading with Python Intro – Data Import

Traditionally, there have been two general ways of analyzing market data. In recent years, computer science and mathematics have revolutionized trading, which has become dominated by computers that help analyze vast amounts of available data. Algorithms are responsible for making trading…

Data Scientist Interview Questions – Explain what precision and recall are?

Once a predictive model has been built, the most important question is: How good is it? Does it predict well? Evaluating the model is one of the most important tasks in a data science project; it indicates how good the predictions are…

How would you validate-test a predictive model?

Why evaluate or test a model at all? Evaluating the performance of a model is one of the most important stages in predictive modeling; it indicates how successful the model has been for the dataset. It enables you to tune parameters and in the end test…

Why would you use Regularization and what is it?

In Machine Learning, very often the task is to fit a model to a set of training data and use the fitted model to make predictions or classify new (out of…

Introduction to TensorFlow

What is TensorFlow? The shortest definition would be: TensorFlow is a general-purpose library for graph-based computation. But there are a variety of other ways to define TensorFlow; for example, Rodolfo Bonnin in his book – Building Machine…

Where to learn TensorFlow for Free?

Below is a list of free resources to learn TensorFlow: TensorFlow website: www.tensorflow.org; Udacity free course: www.udacity.com; Google Cloud Platform: cloud.google.com; Coursera free course: www.coursera.org; Machine Learning with TensorFlow by Nishant Shukla: www.tensorflowbook.com; ‘First Contact With TensorFlow’ by Prof. Jordi Torres: jorditorres.org or you…

TensorFlow Cheat Sheet.

TensorFlow Quick Reference Table – Cheat Sheet. TensorFlow is a very popular deep-learning library, but its complexity can be overwhelming, especially for new users. Here is a short summary of often-used functions; if you want to download it in PDF…

Popular Pandas snippets used in data analysis.

Pandas is a very popular Python library for data analysis, manipulation, and visualization. I would like to share my personal list of the most often used functions and snippets for data analysis. 1. Import Pandas…

What is Hadoop YARN?

Hadoop YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored on a single platform, unlocking an entirely new approach to analytics….

What is Hadoop Flume?

Hadoop Flume was created as an Apache incubator project to allow you to flow data from a source into your Hadoop environment. In Flume, the entities you work with are called sources, decorators, and sinks. A source can…

What is Apache Kafka?

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a…

What is Hadoop Zookeeper?

Hadoop Zookeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments. Examples of these objects include configuration information, hierarchical naming space,…

What is Hadoop HBase?

Hadoop HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases. An HBase system comprises a set of tables. Each…

What is Hadoop Sqoop?

Hadoop Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the enterprise data warehouse (EDW) to Hadoop for efficient execution at a much lower cost. Sqoop can…

What is Hadoop Hive?

Hadoop Hive is a runtime Hadoop support structure that allows anyone who is already fluent in SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate. Hive allows SQL developers to write…

What is Hadoop Pig?

Hadoop Pig was initially developed at Yahoo to allow people using Hadoop to focus more on analyzing large datasets and spend less time writing mapper and reducer programs. This would allow people to do what they want to do instead…

What is Z-Score or Standard Score?

Z-Score or Standard Score in statistics is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured. Observed values above the mean…
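As a quick illustration of the definition above, here is a minimal sketch in Python that computes z-scores for a small sample. The function name `z_scores` and the sample values are made up for this example, and the population standard deviation is assumed; a post working with sample data might use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the z-score of each value: (value - mean) / standard deviation.

    Assumption: the population standard deviation (pstdev) is used.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


# Hypothetical sample: mean is 5 and population standard deviation is 2.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(sample)

# Values above the mean get positive scores, values below get negative ones.
for value, score in zip(sample, scores):
    print(f"{value}: {score:+.2f}")
```

In this sample the value 9 sits two standard deviations above the mean, so its z-score is positive, while 2 sits below the mean and gets a negative score.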

Numerai – deep learning example code.

In a previous post on Numerai, I described very basic code for getting into the world of machine learning competitions. This post is a continuation, so if you haven’t read the earlier one, I recommend doing so here. In…

Intro to machine learning competitions with ‘Numerai’ – example code.

In this post, I want to share how simple it is to start competing in the Numerai machine learning tournament. I will go step by step, line by line, explaining what each part does and why it is required. Numerai…

Intro to Machine Learning

Machine Learning Definition: Machine Learning is a subfield of computer science that gives computers the ability to learn without being explicitly programmed. The goal of Machine Learning is to develop learning algorithms that do the learning automatically without human intervention or assistance,…