Big Data Processing

Big data collection, storage and processing technologies

data-science data-analysis data-engineering big-data pandas statistics dataanalytics spark hadoop
microsoft
ML-For-Beginners
microsoft
72.0k

Microsoft Azure cloud advocates are pleased to offer a 12-week, 26-lesson machine learning course. In this course, you will learn what is sometimes called classical machine learning, using Scikit-learn as the library, avoiding deep learning, which will be covered in our upcoming "Beginner AI" course. Pair these courses with our "Beginner Data Science" course!

grafana
grafana
grafana
67.7k

Grafana - A tool for monitoring, metric analysis and dashboards for Graphite, InfluxDB and Prometheus, etc.

apache
superset
apache
66.0k

A data visualization and exploration platform that provides a range of visualization templates and interactive dashboards for clearer data presentation. It includes a built-in SQL IDE for querying data directly, and its open, flexible API makes it highly customizable.

scikit-learn
scikit-learn
scikit-learn
61.9k

scikit-learn is a Python module for machine learning built on top of SciPy.

binhnguyennus
awesome-scalability
binhnguyennus
61.7k

A project dedicated to large-scale system design, which gathers the patterns and best practices of scalable, reliable and high-performance systems. It provides developers with rich resources and references to help them design and implement efficient large-scale systems.

Asabeneh
30-Days-Of-Python
Asabeneh
46.0k

A Python tutorial for beginners. Over 30 days of coding exercises, it teaches basic programming knowledge and more advanced Python development skills, such as web crawling, data analysis, statistical analysis, setting up virtual environments, and building APIs.

metabase
metabase
metabase
41.8k

A quick data analysis and visualization tool that provides users with a friendly user experience and integration capabilities. It helps companies easily explore and understand their own data without the need for complex data queries and analytical skills. For enterprises and data analysts who need to quickly obtain data insights, Metabase is a powerful and easy-to-use BI tool.

run-llama
llama_index
run-llama
41.3k

A data framework for LLM (large language model) applications. It provides a solution for data storage and management for LLM applications, helping users build and manage LLM applications more efficiently.

apache
spark
apache
41.0k

Spark - Apache Spark is a fast and general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python and R, as well as an optimized engine for generic computation graphs that support data analysis.

coollabsio
coolify
coollabsio
40.6k

An open-source, self-hosted platform that can be used as an alternative to Heroku and Netlify. It supports reverse proxying, free SSL certificate configuration, several common databases, one-click installation and upgrades of projects, and other features. Coolify aims to provide a flexible self-hosted solution that allows developers to easily deploy and manage their applications.

ClickHouse
ClickHouse
ClickHouse
40.4k

A free big data analysis database management system (DBMS) designed for handling massive amounts of data. It provides powerful analytical functions that can be used for real-time queries and analysis of large-scale data sets, helping users quickly extract valuable information from massive data.

apache
airflow
apache
39.9k

A scheduled task management platform, which manages and schedules various offline scheduled tasks with a built-in web management interface. When the number of scheduled tasks reaches hundreds, it becomes impossible to effectively and conveniently manage these tasks using crontab. This project was born to solve this problem.
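The core idea behind managing many interdependent scheduled tasks can be sketched with a dependency graph. The example below is only an illustration using Python's standard library, not Airflow's API; the task names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def run_order(deps):
    """Return an execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

With plain crontab entries this ordering has to be kept consistent by hand through start times; a scheduler that knows the graph can derive it.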

streamlit
streamlit
streamlit
39.1k

Streamlit is an open-source Python library that makes it easy to create and share beautiful custom web applications for machine learning and data science. Streamlit converts data scripts into sharable web applications in minutes. It's all written in pure Python. No front-end experience is required, so you can build and share data applications faster than ever before.

gradio-app
gradio
gradio-app
37.7k

Gradio is an open-source project that can generate a simple, elegant UI for a machine learning model in just a few minutes, so you can demonstrate your project in the browser. Through this interface, you can drag and drop images to upload them, paste text, record audio, and view the model's output.

mendableai
firecrawl
mendableai
37.2k

DataTalksClub
data-engineering-zoomcamp
DataTalksClub
30.3k

Data Engineering Zoomcamp (DataTalksClub/data-engineering-zoomcamp) offers a free data engineering course designed to help learners master the basic concepts and skills of data engineering. Whether it's data stream processing, data warehouse construction, or ETL process design, this course provides valuable learning resources for those aspiring to enter the field of data engineering.

AMAI-GmbH
AI-Expert-Roadmap
AMAI-GmbH
29.8k

An AI technology roadmap initiated by the German software company AMAI GmbH. It covers the key topics in the field of AI, each accompanied by detailed documentation.

microsoft
Data-Science-For-Beginners
microsoft
29.4k

Microsoft's Azure cloud advocates are happy to offer a 10-week, 20-lesson course on data science. Each lesson includes pre- and post-lesson quizzes, written instructions for completing the course, solutions, and assignments. Our project-based teaching method allows you to learn while building, which is a proven way to "stick" with a new skill.

eugeneyan
applied-ml
eugeneyan
27.9k

A selection of papers, technical articles and well-known blogs related to data science and machine learning, covering 24 technical directions such as data engineering, natural language processing, computer vision, reinforcement learning, etc. Most of the articles come from world-renowned universities and enterprises.

DataExpert-io
data-engineer-handbook
DataExpert-io
27.6k

A learning guide for data engineers covering books, courses, interview materials, excellent blogs, communities and bloggers worth following.

getredash
redash
getredash
27.3k

An open source BI tool that provides web-based database querying and data visualization.

PostHog
posthog
PostHog
26.1k

PostHog provides open-source product analytics, session recording, feature flagging, and A/B testing that you can self-host.

d2l-ai
d2l-en
d2l-ai
25.7k

An interactive deep learning book that provides code, math, and discussions across multiple frameworks. This project has been adopted at over 500 universities in 70 countries around the world, including Stanford University, Massachusetts Institute of Technology, Harvard University, Cambridge University, etc. It provides rich resources and an interactive learning experience for learning deep learning.

apache
flink
apache
24.8k

Flink - Apache Flink is an open source stream processing framework with powerful streaming and batch processing capabilities.

fastai
fastbook
fastai
23.0k

Jupyter notebooks from the non-profit technology organization fast.ai that accompany its deep learning course and introduce deep learning, fastai, and PyTorch.

dataease
dataease
dataease
20.0k

An open-source data visualization analysis tool that helps users quickly analyze data and gain insights into business trends, thereby achieving business improvement and optimization.

ml-tooling
best-of-ml-python
ml-tooling
20.0k

It includes some practical machine learning and Python open source projects and tools. There are more than 900 projects in total, including data visualization, natural language processing, text and image data, web crawling, etc.

sinaptik-ai
pandas-ai
sinaptik-ai
19.8k

PrefectHQ
prefect
PrefectHQ
19.1k

A workflow orchestration platform for Python. If the programs that acquire, clean, and process data are treated as individual tasks, this project combines them into a workflow that can be deployed, scheduled, and monitored from a web platform.

Avaiga
taipy
Avaiga
18.0k

Quickly build data-driven web applications. This project is based on Python and Flask, combined with front-end technologies such as React, and gives developers a simple, efficient development framework. It simplifies data processing, API development, and user interface construction. Whether you are a data scientist, machine learning engineer, or web developer, you can use Taipy to go from prototype to web application quickly. Shared by @Liu Sanfei.

airbytehq
airbyte
airbytehq
18.0k

An open-source data integration platform. Integrations can be set up in just a few minutes through its API, application, command-line tools, and other methods, making the data available for later use and management.

Tencent
APIJSON
Tencent
17.8k

A framework for quickly developing API services. It provides fully automated APIs for simple create, read, update, and delete (CRUD) operations as well as complex queries and simple transactions. With APIJSON, users no longer need to write interfaces and documentation by hand, which greatly improves development efficiency.

dair-ai
ML-YouTube-Courses
dair-ai
16.5k

The ML YouTube Courses project is dedicated to providing users with the latest machine learning and artificial intelligence courses, all of which can be found on YouTube. By aggregating various educational resources, this project offers learners and practitioners a convenient platform to easily browse, filter, and select course content that suits their learning needs. Whether you are a beginner or a professional, ML YouTube Courses is an ideal choice for discovering quality machine learning educational resources.

heibaiying
BigData-Notes
heibaiying
16.4k

A Big Data Primer

bharathgs
Awesome-pytorch-list
bharathgs
15.8k

A list of open-source libraries related to PyTorch on GitHub, containing learning tutorials, examples, etc.

argoproj
argo-workflows
argoproj
15.6k

GaiZhenbiao
ChuanhuChatGPT
GaiZhenbiao
15.4k

ChuanhuChatGPT is an open-source chatbot project based on Transformers that provides dialogue generation with a choice of pre-trained models. Developers can use ChuanhuChatGPT to quickly build interactive, natural-sounding chatbots for a variety of applications.

FavioVazquez
ds-cheatsheets
FavioVazquez
15.2k

Data Science Cheat Sheet

apache
hadoop
apache
15.1k

Hadoop - Apache Hadoop uses a simple programming model to distribute large data sets across computer clusters for processing.
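The simple programming model referred to here is MapReduce. The single-process sketch below only illustrates its map, shuffle, and reduce phases in plain Python; it does not use Hadoop's API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data pipelines"])
print(counts)
```

On a real cluster the map and reduce calls run in parallel on different machines, and the shuffle step moves data across the network; the sketch keeps everything in one process to show only the data flow.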

openobserve
openobserve
openobserve
15.0k

OpenObserve is a cloud-native observability platform for logs, metrics, traces, and analytics, engineered for petabyte scale. The project describes itself as 10 times easier to use, with 140 times lower storage costs and high performance, and positions itself as an alternative to Elasticsearch, Splunk, and Datadog.

Kanaries
pygwalker
Kanaries
14.7k

A Python library that simplifies the data analysis and data visualization workflow in Jupyter Notebook.

aalansehaiyang
technology-talk
aalansehaiyang
14.4k

A summary of Java ecosystem common technology frameworks, open source middleware, system architecture, project management, classic architecture cases, databases, commonly used third-party libraries, online operation and maintenance, etc.

andkret
Cookbook
andkret
14.3k

Provides practical guidance and best practices for data engineers on data processing, analysis, and management. This project collects the knowledge shared by experienced experts to help data engineers better address challenges in the data field.

microsoft
nni
microsoft
14.2k

A lightweight but powerful toolkit that helps users automate feature engineering, neural architecture search, hyperparameter tuning, and model compression.

virgili0
Virgilio
virgili0
14.1k

A machine learning guide that can serve as your mentor, providing a complete learning path for getting to know the tools and building up your skills.

apache
doris
apache
13.6k

A high-performance, real-time analytical database based on MPP architecture, which performs excellently in scenarios with massive data and high concurrency. Currently, it is widely used in many well-known companies to build applications such as user analysis, log retrieval analysis, and user profiling.

bbfamily
abu
bbfamily
13.5k

A free and open-source quantitative trading and investment system based on Python, supporting stocks, futures, foreign exchange, digital currencies (BTC/ETH/LTC/ETC/BCC), etc.

marimo-team
marimo
marimo-team
12.9k

An innovative reactive Python notebook. When you interact with the UI, it automatically re-executes the dependent code cells and updates their output, keeping code and output consistent. Notebooks are stored as pure Python files, which makes them easy to manage and run, and they can be executed as scripts or deployed as interactive web applications.
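The automatic re-execution of dependent cells described above can be sketched as a small dependency graph. This is only a conceptual illustration in plain Python, not marimo's implementation or API; the `ReactiveNotebook` class and its methods are hypothetical.

```python
from graphlib import TopologicalSorter


class ReactiveNotebook:
    """Toy model: each cell is a function of the values of its input cells."""

    def __init__(self):
        self.cells = {}   # cell name -> (function, names of input cells)
        self.values = {}  # cell name -> last computed value

    def cell(self, name, func, inputs=()):
        self.cells[name] = (func, tuple(inputs))

    def _order(self):
        graph = {name: set(inputs) for name, (_, inputs) in self.cells.items()}
        return list(TopologicalSorter(graph).static_order())

    def run_all(self):
        for name in self._order():
            func, inputs = self.cells[name]
            self.values[name] = func(*(self.values[i] for i in inputs))

    def update(self, name, func):
        """Replace a cell, then re-run it and every cell downstream of it."""
        _, inputs = self.cells[name]
        self.cells[name] = (func, inputs)
        stale = {name}
        for current in self._order():
            f, ins = self.cells[current]
            if current in stale or stale.intersection(ins):
                stale.add(current)
                self.values[current] = f(*(self.values[i] for i in ins))


nb = ReactiveNotebook()
nb.cell("x", lambda: 2)
nb.cell("y", lambda x: x * 10, inputs=["x"])
nb.cell("z", lambda: 7)
nb.run_all()
nb.update("x", lambda: 5)  # "y" is recomputed; "z" is left alone
print(nb.values)
```

Changing `x` recomputes `y` because `y` lists `x` as an input, while the unrelated cell `z` keeps its previous value.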

datastacktv
data-engineer-roadmap
datastacktv
12.6k

A 2020 learning roadmap for data engineers, covering CS fundamentals, database fundamentals, relational databases, cluster computing fundamentals, data processing, data pipeline monitoring, data security and privacy, etc.

ludwig-ai
ludwig
ludwig-ai
11.4k

A low-code framework designed for building custom deep learning models, neural networks, and other AI models. The project aims to lower the development barrier for AI applications, enabling developers to create and deploy custom AI models more easily without requiring expertise in deep learning.

vesoft-inc
nebula
vesoft-inc
11.3k

Nebula - Nebula Graph is an open-source graph database that excels at handling ultra-large-scale datasets with billions of vertices and trillions of edges.

OpenRefine
OpenRefine
OpenRefine
11.3k

A desktop tool for data cleaning, which analyzes and organizes data through visualization. It supports multiple platforms, including Windows, Linux, and Mac operating systems. The tool has functions such as querying, filtering, deduplication, and analysis, allowing users to organize messy data into "clean" spreadsheets in a simple and intuitive way. Without the need for programming and SQL knowledge, OpenRefine provides users with a powerful and user-friendly data cleaning experience.

trinodb
trino
trinodb
11.2k

pwxcoo
chinese-xinhua
pwxcoo
11.2k

The Chinese Xinhua Dictionary database, including common idioms, proverbs, words, and characters.

wangzhiwubigdata
God-Of-BigData
wangzhiwubigdata
10.1k

A collection of big data interview questions and answers, divided into three major chapters: Big Data Development Foundation, Framework Learning, and Practical Advanced. It includes frequently asked interview questions on topics such as high concurrency, distributed systems, Hadoop, Spark, Flink, and Kafka.

wandb
wandb
wandb
9.8k

A lightweight machine learning visualization tool. It is used for visualizing and tracking machine learning experiments, allowing experiments to be tracked, compared, and visualized with just a few lines of code. For machine learning engineers and data scientists, this tool provides a convenient and efficient way to manage experiments and results.

microsoft
computervision-recipes
microsoft
9.7k

"Computer Vision Recipes" is a computer vision guide that provides code examples and best practices for building computer vision systems.

alexeygrigorev
data-science-interviews
alexeygrigorev
9.3k

A collection of data science interview questions, divided into two parts: theory (such as linear regression, neural networks, decision trees, and text classification) and technical application (such as SQL, Python, and algorithms).

finos
perspective
finos
9.1k

We recommend Perspective, an interactive data analysis and visualization tool on GitHub. It can be used to create data reports, dashboards, research notes, and applications. To help developers and data scientists get started, the development team also provides more than ten examples for reference and learning, covering topics such as movies, supermarkets, subways, and streaming media.

oceanbase
oceanbase
oceanbase
9.1k

OceanBase is a distributed relational database developed by Ant Group. It is based on the Paxos protocol and a distributed architecture, which realizes high availability and linear scalability. The OceanBase database can run on common server clusters without relying on special hardware architectures. This project aims to provide a reliable relational database solution for enterprise-level applications.

© 2025 GitHub Fun. All rights reserved.