A bit of personal context
A few days ago, I published posts about how to install Hadoop 3.3.6 natively on Ubuntu. At that time, I thought it was the state of the art. But things in the Big Data world move fast.
Fast forward a few days. When I had the chance to revisit my old setup, I decided to do something different: not just update, but build a “Frontier Data Stack” from scratch, using the newest versions released in early 2026.
Why build it on a Mac Mini with Ubuntu 24.04 instead of in the cloud? Because I believe it’s important to understand that you don’t need a massive cloud budget to learn modern Big Data. With a modest desktop machine (Intel i5, 16GB RAM), we can build something that looks and works like the stacks you see in production.
This post is as much for me as it is for you: a reference I can check when I need to reinstall in six months, and a path you can follow if you want to experiment with these technologies.
What are we going to build?
Imagine your Mac Mini is a small “data company.” You need:
- Distributed file storage (Hadoop HDFS)
- A SQL-like data warehouse (Hive)
- Cloud-like storage locally that’s S3-compatible (MinIO)
- Everything working together, without components fighting over ports
That’s exactly what we’ll accomplish in this guide.
The architecture we’ll build
┌─────────────────────────────────────────────┐
│           Mac Mini (Ubuntu 24.04)           │
│                                             │
│   ┌─────────────────────────────────────┐   │
│   │ Hadoop 3.4.3 (HDFS + YARN)          │   │
│   │  - NameNode on port 9010            │   │
│   │  - Local DataNode                   │   │
│   └─────────────────────────────────────┘   │
│                                             │
│   ┌─────────────────────────────────────┐   │
│   │ Hive 4.2.0 (Data Warehouse)         │   │
│   │  - Metastore for schemas            │   │
│   │  - Beeline CLI                      │   │
│   └─────────────────────────────────────┘   │
│                                             │
│   ┌─────────────────────────────────────┐   │
│   │ MinIO (S3-compatible Object Store)  │   │
│   │  - Buckets for data                 │   │
│   │  - S3A connector                    │   │
│   └─────────────────────────────────────┘   │
│                                             │
│   ┌─────────────────────────────────────┐   │
│   │ Java 21 LTS (Runtime)               │   │
│   │  - Engine for everything            │   │
│   └─────────────────────────────────────┘   │
└─────────────────────────────────────────────┘
Prerequisites: A Modern Java Runtime
Before we start, we need to prepare the ground. Hadoop 3.4 and especially Hive 4.2.0 were designed with Java 21 in mind. While Java 17 works, Java 21 includes performance optimizations worth having.
Why Java 21?
Java 21 introduced “Foreign Function & Memory” (FFM) features that allow modern libraries to access native system functionality more efficiently. Hive 4.2.0 uses JLine, a command-line library that takes advantage of these features.
Installation
sudo apt update && sudo apt install -y openjdk-21-jdk
Verify it installed correctly:
java -version
# You should see something like:
# openjdk version "21.0.x" ...
This may take a few minutes. Time for a coffee break ☕
Step 1: Hadoop 3.4.3 — The Engine of Your Data Lake
Hadoop 3.4.3 was released in January 2026 and is a significant leap forward. It includes important improvements in:
- S3A Connector: More stable with better concurrency handling
- ARM64 compatibility: Native support for ARM architectures
- Java 21 optimizations: Leverages the new runtime
The challenge I faced: Port conflict
When I first installed Hadoop, HDFS tried to use port 9000 by default. But MinIO (already running on the Mac Mini) was occupying that port.
The solution: Move Hadoop’s NameNode to a different port — in my case, port 9010.
This is a good example of something you’ll see in production: when you have multiple services, you need to coordinate which services use which ports.
Hadoop Configuration: core-site.xml
This file is the heart of Hadoop’s configuration. Here we tell it where HDFS lives and what port to contact it on.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9010</value>
</property>
</configuration>
What does this mean?
- fs.defaultFS: The default file system. We’re saying “when someone wants to write to HDFS, connect to localhost on port 9010”
- hdfs://: The protocol, in other words “I’m using HDFS, not S3, not GCS”
- :9010: The port where our NameNode listens
Once configured, Hadoop will know where to find its data even if you restart it.
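Not part of the setup itself, but before starting Hadoop it can be handy to sanity-check the file programmatically. This is a small sketch (the helper name `get_property` is my own, and in practice you would read the real file from `$HADOOP_HOME/etc/hadoop/core-site.xml`):

```python
import xml.etree.ElementTree as ET

def get_property(xml_text, name):
    """Return the <value> of the named <property> in a Hadoop *-site.xml."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

# Inline sample mirroring the core-site.xml above; with a real install you
# would load the file from $HADOOP_HOME/etc/hadoop/core-site.xml instead.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9010</value>
  </property>
</configuration>"""

print(get_property(sample, "fs.defaultFS"))  # hdfs://localhost:9010
```

If the printed URI is not the host and port you expect, fix the XML before chasing connection errors elsewhere.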
Step 2: Hive 4.2.0 — Your SQL Warehouse in the Data Lake
If Hadoop is distributed storage, Hive is the schema manager and query engine. Hive 4.2.0 is a major release beyond the 3.x series and represents significant maturation of the project.
Why Hive and not Spark or Presto?
- Compatibility: Hive is the de facto standard in Hadoop legacy ecosystems
- Metastore: Hive maintains a metastore — a registry of all your schemas, tables, and columns
- SQL: You can use standard SQL, making it accessible even if you come from traditional database worlds
The problem I encountered: “Unable to create a terminal”
During Hive initialization, when I tried to use schematool or beeline, I got a strange error:
java.lang.IllegalStateException: Unable to create a terminal
What was happening?
The new JLine libraries in Hive 4.2 try to use Java 21’s FFM features to create native terminals. When you run commands in non-interactive contexts (like remote SSH or scripts), this fails because there’s no real terminal to interact with.
The solution: Add flags to HADOOP_OPTS to explicitly enable these features:
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native --enable-preview --enable-native-access=ALL-UNNAMED"
This tells Java: “I know this is a preview feature, but go ahead, allow it.”
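You can see the underlying condition for yourself. This is not Hive’s actual detection code, just an illustration of the TTY check that trips up tools in non-interactive sessions:

```python
import io
import sys

def has_real_terminal(stream):
    """Rough stand-in for the check JLine performs: is this stream a TTY?"""
    return hasattr(stream, "isatty") and stream.isatty()

# An in-memory stream behaves like stdin inside a pipe, a cron job, or an
# SSH session started without a pseudo-terminal:
piped = io.StringIO()
print(has_real_terminal(piped))      # False, the "no terminal" situation

# On real stdin this is True only when you run from an interactive shell:
print(has_real_terminal(sys.stdin))
```

When the answer is False and the tool insists on building a native terminal anyway, you get exactly the kind of “Unable to create a terminal” failure described above.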
Step 3: Connecting Hadoop to MinIO with S3A
This is where the real magic happens. MinIO is an S3-compatible object storage server you can run locally. But for Hadoop and Hive to read data from MinIO, they need the S3A connector.
The important change: AWS SDK v2
If you search for old tutorials, you’ll find references to “aws-java-sdk-1.x.jar”. That no longer works. Hadoop 3.4.3 has completely migrated to AWS SDK v2.
What does this mean?
- v1 and v2 are completely different APIs
- v1 jars are incompatible with Hadoop 3.4.3
- You need the correct version
Installing the AWS SDK v2 bundle
- Download the correct bundle: the version that matches Hadoop 3.4.3’s POM is 2.35.4. Note that the SDK v2 bundle lives under the software.amazon.awssdk:bundle artifact (the old aws-java-sdk-bundle name belongs to the v1 line):

wget https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.35.4/bundle-2.35.4.jar

- Copy it to two locations:

# In the Hadoop folder
cp bundle-2.35.4.jar $HADOOP_HOME/share/hadoop/common/lib/

# Also in Hive (so Hive can access S3A)
cp bundle-2.35.4.jar $HIVE_HOME/lib/
Why two locations? Because both Hadoop and Hive need access to the connector. It’s like having a key in two different places to ensure whoever needs it, finds it.
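If you reinstall often, the two copies are easy to script. A minimal sketch (the `install_jar` helper is my own; the commented call shows the typical layout under `$HADOOP_HOME` and `$HIVE_HOME`, adjust to yours):

```python
import os
import shutil

def install_jar(jar_path, target_dirs):
    """Copy one jar into every directory that needs it on its classpath."""
    copied = []
    for target in target_dirs:
        os.makedirs(target, exist_ok=True)
        dest = os.path.join(target, os.path.basename(jar_path))
        shutil.copy2(jar_path, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied

# With the real stack you would call it roughly like this
# (placeholder jar name, paths assume the usual install layout):
# install_jar("path/to/the-sdk-bundle.jar", [
#     os.path.join(os.environ["HADOOP_HOME"], "share/hadoop/common/lib"),
#     os.path.join(os.environ["HIVE_HOME"], "lib"),
# ])
```

One source jar, many classpaths: the helper keeps the copies from drifting apart on the next upgrade.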
Configuring S3A in core-site.xml
After copying the jars, Hadoop needs to know how to connect to MinIO. Add this to your core-site.xml:
<!-- S3A Configuration for MinIO -->
<property>
<name>fs.s3a.endpoint</name>
<value>http://minio-server:9000</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>YOUR_MINIO_ACCESS_KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>YOUR_MINIO_SECRET_KEY</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>
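If you manage configs with scripts, the four properties above can also be generated rather than hand-edited. A sketch, assuming nothing beyond the standard library (the property names are the real S3A keys; the endpoint and credentials are placeholders):

```python
from xml.sax.saxutils import escape

def s3a_properties(endpoint, access_key, secret_key):
    """Render the minimal S3A-for-MinIO property block as core-site.xml XML."""
    props = {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # MinIO serves buckets as http://host:9000/bucket (path style),
        # not bucket.host (virtual-host style), so this must be on:
        "fs.s3a.path.style.access": "true",
    }
    lines = []
    for name, value in props.items():
        lines.append("<property>")
        lines.append(f"  <name>{escape(name)}</name>")
        lines.append(f"  <value>{escape(str(value))}</value>")
        lines.append("</property>")
    return "\n".join(lines)

snippet = s3a_properties("http://minio-server:9000", "YOUR_KEY", "YOUR_SECRET")
print(snippet)
```

The output can be pasted directly inside the `<configuration>` element of core-site.xml.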
Step 4: Your First MapReduce Job with Python
This is where we test that everything works. Instead of writing complex Java code, we’ll use Hadoop Streaming — an elegant way to run MapReduce jobs using any language that can read stdin and write to stdout.
What is Hadoop Streaming?
Hadoop Streaming is a utility that lets you use any executable as a mapper or reducer. It doesn’t care what language you use — Python, Bash, Ruby, Perl, Node.js, whatever. As long as your program reads from stdin and writes to stdout, Hadoop can orchestrate it.
This is a paradigm shift from the traditional approach:
Traditional way (Java MapReduce):
You write Java code → Compile to .jar → Submit to Hadoop → Hadoop runs your compiled Java → Get results
Hadoop Streaming way:
You write a script in any language → Submit to Hadoop → Hadoop pipes data through your script → Get results
Why Hadoop Streaming is better than traditional Java MapReduce
- Language agnostic: Write in Python, Go, R, whatever you’re comfortable with. You’re not locked into Java.
- Development speed: A Python script takes minutes to write and test. Java requires compilation, packaging, JAR creation, debugging classpath issues… hours.
- Lower barrier to entry: Most data engineers and data scientists know Python. Very few want to write Java MapReduce code.
- Easy debugging: You can test your mapper/reducer locally with simple shell pipes before submitting to Hadoop:
# Test mapper locally
cat input.txt | python3 mapper.py

# Test full pipeline locally
cat input.txt | python3 mapper.py | sort | python3 reducer.py

- Version control friendly: Plain text scripts. No need to commit binary JARs.
- Flexibility: Sometimes you need to call external tools, parse complex formats, or use ML libraries. Much easier in Python than Java.
- Production-ready: Hadoop Streaming isn’t “toy” code. Companies like Spotify, Airbnb, and Netflix have used it for serious production workloads.
Common Hadoop Streaming parameters explained
The command we’ll use has many optional parameters. Here are the most important ones:
hadoop jar hadoop-streaming.jar \
-files <files> # Files to distribute to nodes (your scripts)
-mapper <command> # The mapper command/script
-reducer <command> # The reducer command/script
-input <path> # Input file path (HDFS or S3)
-output <path> # Output directory path
-numReduceTasks <n> # Number of reducers (default: 1)
-partitioner <class> # Custom partitioner (advanced)
-combiner <command> # Optional combiner for optimization
-jobconf <key=value> # Additional job configuration
-inputformat <format> # Input format (TextInputFormat by default)
-outputformat <format> # Output format (TextOutputFormat by default)
-lazyOutput # Don't create output files until job finishes
The ones you’ll use 90% of the time:
- -files: Distributes your mapper/reducer to all nodes
- -mapper: The script that processes each line
- -reducer: The script that aggregates results
- -input: Where your data lives
- -output: Where results go
- -numReduceTasks: How many parallel reducers (more = faster, but more network I/O)
Example with numReduceTasks:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.4.3.jar \
-files mapper.py,reducer.py \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py" \
-input s3a://hive-data/test_folder/warandpeace.txt \
-output s3a://hive-data/output_python \
-numReduceTasks 4 # Use 4 parallel reducers instead of default 1
Why Python specifically?
For this guide:
- It’s accessible for beginners
- Scripts are simple and readable
- Hadoop handles the distribution and coordination
- It’s excellent for learning how MapReduce works without Java complexity
The Mapper: mapper.py
#!/usr/bin/env python3
import sys
for line in sys.stdin:
words = line.strip().split()
for word in words:
print(f'{word}\t1')
What does it do?
- Reads each line from input
- Splits the line into words
- For each word, emits word\t1 (one word, count 1)
Example: If the input is “hello world”, it emits:
hello   1
world   1
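Because the mapper is just a stdin-to-stdout filter, you can exercise its logic in plain Python before going anywhere near Hadoop. This sketch wraps the same split-and-emit logic in a function (`map_line` is my own name for it, not part of mapper.py):

```python
def map_line(line):
    """Same logic as mapper.py: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in line.strip().split()]

# Feed it the example input from the text:
for word, count in map_line("hello world"):
    print(f"{word}\t{count}")
# hello	1
# world	1
```

Testing the logic as a function first makes the later debugging-with-shell-pipes step much less mysterious.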
The Reducer: reducer.py
#!/usr/bin/env python3
import sys
current_word = None
current_count = 0
for line in sys.stdin:
line = line.strip()
if not line:
continue
word, count = line.split('\t', 1)
count = int(count)
if word != current_word:
if current_word:
print(f'{current_word}\t{current_count}')
current_word = word
current_count = 0
current_count += count
if current_word:
print(f'{current_word}\t{current_count}')
What does it do?
- Receives key-value pairs from the Mapper (already grouped by word)
- Sums all the counts for each word
- Emits the word with its total count
Example: If it receives:
hello   1
hello   1
world   1
It emits:
hello   2
world   1
Running the job on MinIO data
Suppose you have a file warandpeace.txt in MinIO, in the hive-data bucket with path test_folder/:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.4.3.jar \
-files mapper.py,reducer.py \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py" \
-input s3a://hive-data/test_folder/warandpeace.txt \
-output s3a://hive-data/output_python
Breaking down the command:
- hadoop jar: Run a Hadoop job
- -files mapper.py,reducer.py: “Send these scripts to each participating node”
- -mapper "python3 mapper.py": “The mapper is this Python command”
- -reducer "python3 reducer.py": “The reducer is this Python command”
- -input s3a://...: “Read from MinIO (note the s3a:// prefix)”
- -output s3a://...: “Write results here”
Hadoop automatically:
- Reads the input file
- Divides it into chunks
- Sends each chunk to a Mapper in parallel
- Groups the outputs by key
- Sends each group to a Reducer
- Writes the final result
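The steps above can be simulated in a few lines of plain Python, which makes the map, shuffle/sort, and reduce phases concrete. This mimics what Hadoop does, on one machine with no parallelism (the `wordcount` helper is illustrative, not Hadoop code):

```python
from itertools import groupby
from operator import itemgetter

def wordcount(lines):
    """Miniature MapReduce: map each line, sort by key, reduce each group."""
    # Map phase: every word becomes a (word, 1) pair
    pairs = [(w, 1) for line in lines for w in line.strip().split()]
    # Shuffle/sort phase: Hadoop groups mapper output by key before reducing
    pairs.sort(key=itemgetter(0))
    # Reduce phase: sum the counts within each key group
    return {word: sum(c for _, c in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

print(wordcount(["hello world", "hello hadoop"]))
# {'hadoop': 1, 'hello': 2, 'world': 1}
```

The sort-then-group step is exactly why the streaming reducer above can rely on all lines for the same word arriving consecutively.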
Interpreting the results
After the job completes successfully, results will be in s3a://hive-data/output_python/. For a 562,488-word file (like War and Peace), you’ll typically see:
the   31,550
and   20,498
to    16,252
...
These are the most frequent words in the book, counted distributively across Hadoop.
Step 5: Troubleshooting — The MRAppMaster Problem
If you get to this point, you’ll probably encounter this error:
java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.v2.app.MRAppMaster
This error is more common than you’d think, and it points to a specific problem: YARN containers can’t find MapReduce libraries.
What’s happening?
YARN (Yet Another Resource Negotiator) is Hadoop’s resource manager. When you run a job:
- YARN creates containers (processes) on nodes
- Those containers need access to MapReduce libraries
- If they don’t know where to look, it fails
The solution: Two key configurations
1. In mapred-site.xml:
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
2. In bash (before running jobs):
export HADOOP_MAPRED_HOME=$HADOOP_HOME
This tells Hadoop: “When you create containers, include these paths in the CLASSPATH.”
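To make the fix less abstract: mapred-site.xml stores the value with the variable unexpanded, and each container resolves it at launch using its environment. A quick illustration, using Python’s os.path.expandvars as a stand-in for that expansion (the /opt/hadoop install path is an assumption for the example):

```python
import os

# What mapred-site.xml stores, verbatim (variable not yet expanded):
raw = ("$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:"
       "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*")

# Assumed install location, matching `export HADOOP_MAPRED_HOME=$HADOOP_HOME`:
os.environ["HADOOP_MAPRED_HOME"] = "/opt/hadoop"

expanded = os.path.expandvars(raw)
print(expanded)
# /opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/mapreduce/lib/*
```

If HADOOP_MAPRED_HOME were unset, the `$HADOOP_MAPRED_HOME` text would survive unexpanded, and the container would look for jars under a literal path that does not exist, which is precisely the ClassNotFoundException scenario.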
Validating that everything works
Before considering it “ready for production” (or at least, “ready to experiment”), verify each component:
1. Java is installed
java -version
# You should see Java 21
2. Hadoop is running
jps
# You should see: NameNode, DataNode, SecondaryNameNode
3. Hive can connect
beeline -u jdbc:hive2://localhost:10000/
# You should be able to connect (though it may spam warnings)
4. MinIO is accessible
# Using an S3 client or curl:
aws s3 --endpoint-url http://localhost:9000 ls s3://hive-data/
5. A simple job works
echo -e "hello world\nhello hadoop" | \
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.4.3.jar \
-mapper "cat" \
-reducer "uniq -c"
If you see output without errors, you’re ready!
What are the numbers?
After running our Word Count on War and Peace:
- Total words processed: 562,488
- Map tasks executed: 2 (Hadoop split the file into 2 chunks)
- Reduce tasks executed: automatic (Hadoop optimizes this)
- Total time: Seconds (on a Mac Mini, incredibly fast)
Most frequent words:
- “the” — 31,550 times
- “and” — 20,498 times
- “to” — 16,252 times
Interesting, right? The most common function words dominate the counts, exactly as you’d expect in English text.
Reflection: Why build this locally?
When I tell people I’m setting up a Big Data stack on a 2012 desktop machine running Ubuntu, they typically give this look 😐
But I think there’s something valuable here you won’t get in the cloud:
- You understand each component — There’s no magic abstraction. You see exactly how Hadoop divides files, how Hive maintains its metastore, how MinIO serves data.
- You can experiment fearlessly — Want to know what happens if you change the port? Try it. Want to understand how YARN behaves under pressure? Run 100 jobs. No need to pay for every mistake.
- You learn what matters in production — The same challenges we overcome here (port conflicts, classpath issues, SDK versions) are exactly what you’ll face in an AWS or Databricks cluster.
- It’s reproducible — This guide is technically for me (for when I need to reinstall in six months), but it’s also a map for you.
Next steps: Now it’s your turn
If you made it this far and you have a Mac Mini, an old server, a laptop with enough RAM… give it a try.
Here’s what you could do:
- Start with Java 21 — Following the prerequisites section exactly
- Then Hadoop — Once jps shows the processes, you know it works
- Then Hive and MinIO — The background services
- Finally, a MapReduce job — That first time you see a distributed job work on your machine is magical
You don’t have to do it all at once. Even if you just experiment with Hadoop + Python streaming, you’ll have learned something important: how distributed systems think.
And if something doesn’t work — if you find a port is occupied, or classpath behaves strangely — it’s not a failure. It’s exactly how you learn. I’ve been through all these problems, and documenting them here is as much about reminding myself as it is about saving you frustration.
Will you give it a try? If you do, tell me in the comments what surprised you most. Was it the speed? The conceptual simplicity versus configuration complexity? Or just that feeling of “wow, it actually worked”?
Quick Reference: URLs and Endpoints
A quick reference table to access the main services. Keep this handy so you don’t have to search through logs or remember port numbers.
| Service | URL/Endpoint | Port | Purpose |
|---|---|---|---|
| Hadoop NameNode Web UI | http://localhost:9870 | 9870 | Monitor HDFS, file system status, job tracking |
| Hadoop HDFS | hdfs://localhost:9010 | 9010 | HDFS NameNode (configured port) |
| Hadoop Secondary NameNode | http://localhost:9868 | 9868 | Backup NameNode monitoring |
| YARN ResourceManager | http://localhost:8088 | 8088 | Monitor running jobs, cluster resources |
| YARN NodeManager | http://localhost:8042 | 8042 | Individual node resource status |
| Hive Server 2 | jdbc:hive2://localhost:10000 | 10000 | JDBC connection for SQL queries |
| Beeline (Hive CLI) | localhost:10000 | 10000 | Interactive Hive query shell |
| MinIO Console | http://localhost:9001 | 9001 | MinIO web interface (if running) |
| MinIO S3 API | http://localhost:9000 | 9000 | S3-compatible API endpoint |
| Java Version Check | java -version | N/A | Verify Java 21 installation |
| Hadoop Version Check | hadoop version | N/A | Verify Hadoop 3.4.3 installation |
| Hive Version Check | beeline --version | N/A | Verify Hive 4.2.0 installation |
Quick connectivity tests
# Test Hadoop HDFS
hdfs dfs -ls /

# Test Hive connectivity
beeline -u jdbc:hive2://localhost:10000/ -e "SELECT 1;"

# Test MinIO S3
aws s3 --endpoint-url http://localhost:9000 ls s3://hive-data/

# Check all Java processes running
jps -l

# Check active ports
netstat -tlnp | grep -E ':(9010|9870|10000|9000|9001)'
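If you prefer one script over five commands, a short Python checker can probe every port from the table. It only tests TCP connectivity, not that the service behind the port is actually healthy, and the service-to-port mapping below simply restates the table above:

```python
import socket

def port_open(host, port, timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

SERVICES = {
    "HDFS NameNode (RPC)": 9010,
    "NameNode Web UI": 9870,
    "YARN ResourceManager": 8088,
    "HiveServer2": 10000,
    "MinIO S3 API": 9000,
    "MinIO Console": 9001,
}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "UP" if port_open("localhost", port) else "DOWN"
        print(f"{name:<22} :{port:<6} {status}")
```

Run it after each restart; a row flipping to DOWN tells you which component to investigate before you waste time staring at job logs.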
References and resources
- Apache Hadoop Official Docs: https://hadoop.apache.org/docs/
- Apache Hive Documentation: https://hive.apache.org/
- AWS SDK for Java v2: https://github.com/aws/aws-sdk-java-v2
- MinIO Documentation: https://min.io/docs/
- Java 21 Release Notes: https://www.oracle.com/java/technologies/javase/21-0-1-relnotes.html