Posts tagged

#Data Engineering

12 posts

From Native Installation to a More Stable Hadoop + Hive Stack with Coolify

Part 7 in the Hadoop and Hive Tutorial Series

Apr 11, 2026 Hadoop

Querying Apache Hive from DBeaver: Starting HiveServer2 and Connecting a Desktop SQL Client

Part 6 in the Hadoop and Hive Tutorial Series

Mar 24, 2026 Hadoop

From HDFS to SQL Queries: Loading CSV Files into Hive External Tables and Querying with SQL

Part 5 in the Hadoop and Hive Tutorial Series

Mar 18, 2026 Hadoop

Building a Modern Frontier Data Stack: Hadoop 3.4.3, Hive 4.2.0, and MinIO S3 Integration in 2026

A few days ago, I published posts about how to install Hadoop 3.3.6 natively on Ubuntu. At that time, I thought it was the state of the art. But things in the Big Data world move fast.

Mar 12, 2026 Hadoop

Apache Hive 3.1.3 on Ubuntu: Native Installation on Top of Hadoop 3.3.6

In Part 1 of this series, I installed Hadoop 3.3.6 natively on Ubuntu 24.04 and configured HDFS in pseudo-distributed mode. In Part 2, I configured YARN and ran the canonical WordCount job on War and Peace. In Part 3, I improved the text processing pipeline by...

Mar 10, 2026 Hadoop

Correcting Word Frequencies with Data Normalization: MapReduce Text Processing on War and Peace — Part 3

In Part 1 of this series, we installed Hadoop 3.3.6 natively on Ubuntu and configured HDFS for distributed storage. In Part 2, we configured YARN, wrote our first MapReduce program (WordCount), and executed it against the full text of War and Peace.

Mar 8, 2026 Hadoop

Running Your First MapReduce Job on Hadoop: WordCount on War and Peace

In Part 1 of this series we installed Hadoop 3.3.6 natively on Ubuntu 24.04 and got HDFS running in pseudo-distributed mode. That gave us a working distributed file system, but Hadoop is much more than storage — its true power lies in processing large datasets in...

Mar 1, 2026 Hadoop

Hadoop 3.3.6 on Ubuntu: Native Installation without Virtual Machine

When I started working with Hadoop in a learning environment, the course guide indicated using Linux Mint in a virtual machine. However, I already had Ubuntu 24.04 installed natively on my Dell Vostro with 32 GB of RAM, and it seemed smarter to leverage it directly. In...

Feb 23, 2026 Hadoop

Running Local Notebooks on Databricks Using VS Code

Execute Jupyter notebooks locally in VS Code while leveraging Databricks compute infrastructure. Your code runs on Databricks’ serverless compute, but you edit and manage files locally—giving you the best of both worlds: local development experience with cloud compute power.

Nov 11, 2025 Data Science

The Powerful COPY Command in DuckDB / MotherDuck: A Quick Reference Guide

The COPY command in DuckDB and MotherDuck is a versatile tool for importing and exporting data. This guide provides a concise overview of how to use COPY both from the DuckDB CLI (SQL only) and from Python, including workflows with Ibis and pandas. Use this as a quick...

Jul 8, 2025 Data Science, Python, SQL

Building a Complete DuckLake Solution: From Local Development to Cloud Production

DuckLake is revolutionizing the lakehouse architecture by combining the simplicity of DuckDB with the power of modern data lake formats. In this comprehensive guide, I’ll walk you through building a complete DuckLake solution in two parts: first creating a local...

Jul 7, 2025 Data Science, SQL

Azure Databricks Quick Reference Guide

Databricks is an analytics and data engineering platform that sits on top of Spark, an analytics engine for big data processing and machine learning. Spark uses in-memory processing using a distributed computer platform of clusters that work as if they were a single one.

Feb 24, 2024 Data Science