Set up a local Spark cluster with one master node and one worker node on Ubuntu, from scratch, completely free.

Tortle with 4 legs, by Charles Zhu, my 6-year-old son

This is an action list for installing the open-source Spark master and worker on local Ubuntu machines, completely free (in contrast to Databricks for $$$).

The following setup runs on a home intranet: one physical Linux (Ubuntu) machine (a Jetson Nano) and one WSL2 (Ubuntu) instance inside Windows 10.

Make sure you have Java installed

sudo apt install default-jdk

Check that Java is installed:

java --version

If you are going to use PySpark, install Python as well:

sudo apt install python3

Check that Python is installed:

python3 --version

From the Spark download page, select your version; I selected the newest…
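The teaser cuts off here, but the remaining steps follow Spark's standalone-mode scripts. A minimal sketch, assuming Spark 3.3.0 pre-built for Hadoop 3 (substitute whichever version you picked on the download page):

```shell
# Download and unpack Spark (the version number here is an assumption -- use yours).
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar -xzf spark-3.3.0-bin-hadoop3.tgz
cd spark-3.3.0-bin-hadoop3

# On the master machine: start the master; its log prints a spark://HOST:7077 URL.
./sbin/start-master.sh

# On the worker machine: point the worker at the master's URL.
./sbin/start-worker.sh spark://<master-ip>:7077
```

Once both are up, the master's web UI at http://&lt;master-ip&gt;:8080 should list the worker. These commands depend on your network and machines, so adjust hostnames and versions accordingly.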

Get started working with Spark and Databricks with pure plain Python


In the beginning, the Master Programmer created the relational database and file system. But the file system on a single machine became limited and slow. The data darkness was on the surface of the database. The spirit of map-reducing was brooding upon the surface of the big data.

And Master Programmer said, let there be Spark, and there was Spark.

If the relational database is a well-maintained data garden, Hadoop is a cluttered data forest, and it can grow to an unlimited size.

To put data into the garden, the data needs to be carefully cleaned and grown there structurally. While in…

A solution to extract keywords from documents automatically. Implemented in Python with NLTK and Scikit-learn.

Image by Andrew Zhu, an old Reuters news

Imagine you have millions (maybe billions) of text documents in hand. Whether they are customer support tickets, social media data, or community forum posts, there were no tags when the data was generated. You are scratching your head hard about how to give tags to those random documents.

Manual tagging is impractical; a fixed tagging list will soon be outdated. Hiring a vendor company to do the tagging work is far too expensive.

You may say, why not use Machine Learning? Like neural-network deep learning. But a NN needs training data first, and training data that actually fits your dataset.

Get started with NLTK and Python text analysis with a use case.

Photo By Andrew Zhu, Library of UW, Seattle

When I was still a student, I read articles saying that linguists could use text-analysis techniques to determine the author of an anonymous book. I thought it was cool at the time.

Looking back, I feel this technique is still cool. But nowadays, with the help of NLTK and Python, you and I can be a “real” linguist with several lines of code.

You don’t need to write a crawler to scrape an analysis corpus. For learning and research purposes, a huge, well-maintained text database is already there in the NLTK package. …

Code highlighting and LaTeX math formulas in markdown files, edited in VS Code and post-processed with Node.js.

Image by Andrew Zhu, afternoon sunset in Issaquah, WA

If you are using Visual Studio Code, it is easy to enable both code highlighting and math by installing one extension: Markdown All in One.

Initialize a new markdown document ending with .md. To enable code highlighting, surround code with ``` (three backticks, the key usually under Esc), like this:

def func():
    print("hello python")
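Math works through the same extension: wrap LaTeX in $ for inline formulas or $$ for display blocks, like this:

```markdown
Inline math: $e^{i\pi} + 1 = 0$

Block math:

$$
\int_0^1 x^2 \, dx = \frac{1}{3}
$$
```

The preview pane renders both as typeset math rather than raw LaTeX.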

To select data from a pandas DataFrame, we can use df_data['column'], and can also use df_data.column; then there is df_data.loc[:, 'column'], and yes, we can also use df_data.iloc[:, 0] for positional access. Next, pd.eval(), and don't forget df_data.query(). If the above is not enough, there is a package called numexpr, and many more.

The Zen of Python said:

There should be one — and preferably only one — obvious way to do it.

Hey pandas DataFrame, is there one best and obvious way to select data? Let’s go through 10 ways one by one and see if we can find the answer.

Say we have a sample DataFrame:

import pandas…
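The excerpt cuts off, but the selection styles it alludes to can be sketched side by side on a toy frame (the column names here are made up for illustration):

```python
import pandas as pd

# A small sample frame standing in for the article's df_data.
df_data = pd.DataFrame({"name": ["a", "b", "c"], "score": [90, 85, 77]})

# Several equivalent ways to pull the "score" column:
s1 = df_data["score"]          # bracket indexing
s2 = df_data.score             # attribute access
s3 = df_data.loc[:, "score"]   # label-based
s4 = df_data.iloc[:, 1]        # position-based

# And two expression-style ways to filter rows:
high = df_data.query("score > 80")
mask = pd.eval("df_data.score > 80")

print(high)
```

All four column selections return the same Series; query() and eval() shine when the condition is a string built at runtime or when numexpr can accelerate it.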

How we can use TF-IDF to weight text data, and why the result from scikit-learn differs from the formula in textbooks

Image by Andrew Zhu, my son Charles’s Lego board

When dealing with text data, we want to measure the importance of a word to a document within a full text collection. One of the most intuitive solutions is to count the word’s occurrences: the higher, the better. But a simple word count favors long documents and articles; after all, a longer document contains more words.

We need a solution that appropriately measures the importance of a word in the overall context. TF-IDF is one of the effective solutions, and it also functions as a backbone of modern search engines like Google.
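The scikit-learn discrepancy mentioned in the title comes from its smoothed IDF. A sketch comparing the textbook formula log(N/df) with scikit-learn's log((1+N)/(1+df)) + 1, on made-up toy documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat", "the dog barked"]

# scikit-learn defaults: smooth_idf=True, norm='l2'.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Textbook IDF: log(N / df). scikit-learn instead uses
# log((1 + N) / (1 + df)) + 1, then L2-normalizes each row --
# which is why its numbers differ from the classic formula.
n = len(docs)
df_cat = sum("cat" in d.split() for d in docs)  # document frequency of "cat"
textbook_idf = np.log(n / df_cat)
sklearn_idf = np.log((1 + n) / (1 + df_cat)) + 1
print(textbook_idf, sklearn_idf)
```

The smoothing avoids division by zero for unseen terms, and the "+ 1" keeps terms that appear in every document from being zeroed out entirely.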

The core…

I like the idea that we need to rethink technologies and modernization from the perspective of human beings. Like New York’s Times Square: they fixed the traffic jam by simply blocking some unnecessary roads and junctions, rebuilding and turning those areas into walking streets.

But on the other side, the trend of technologization may be unstoppable.

Thousands of years ago, Socrates insisted that writing destroys memory and weakens the mind, and even doubted the merit of introducing ‘letters’. But nowadays, none of us can live without writing, reading, and books.

Like iPads, Macs, and computers. My kid is also super…

The Rabbit And The Turtle — By Charles Zhu, my 6-year-old son

Whenever there is a programming speed competition, Python usually lands at the bottom. Some say that is because Python is an interpreted language, and all interpreted languages are slow. But we know that Java is, in a sense, also interpreted: its bytecode is interpreted by the JVM. Yet as shown in this benchmark, Java is much faster than Python.

Here is a sample that demonstrates Python’s slowness: use a traditional for-loop to produce reciprocal numbers:

import numpy as np

values = np.random.randint(1, 100, size=1000000)

def get_reciprocal(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

%timeit get_reciprocal(values)

The result:
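For contrast, here is the same computation vectorized as a NumPy ufunc (a sketch; exact timings vary by machine):

```python
import numpy as np

values = np.random.randint(1, 100, size=1_000_000)

# Vectorized ufunc: the loop runs in compiled C instead of the
# Python interpreter, typically orders of magnitude faster than
# the element-by-element Python loop above.
output = 1.0 / values

print(output[:3])
```

The speedup comes from pushing the per-element work out of the interpreter, not from any algorithmic change.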



Warning: this piece is purely for fun, with no intention to offend anyone, Data Scientists included.

If your title is Data “Scientist”, do you ever doubt whether you are really a “Scientist” doing “Science”, or just a data analyst?

Recently, I came across a tweet, which says:

Offend a Data “Scientist” with one tweet — Ben Lindsay

Then come many amusing replies, like this one:

machine learning is just regression with extra steps — Mike Henry

This one, from my peer, is an underrated tweet:

What is the scientific method, but “guess and check?”…

Andrew Zhu

Daddy of two kids, husband, and Applied Data Scientist @ Azure CGA Microsoft, Redmond.
