The Rabbit And The Turtle, by Charles Zhu, my 6-year-old son

People say Python is slow. How slow can it be?

Whenever there is a programming speed competition, Python usually lands at the bottom. Some say that is because Python is an interpreted language, and all interpreted languages are slow. But Java is also an interpreted language of a kind: its bytecode is interpreted by the JVM. As shown in this benchmark, Java is much faster than Python.

Here is a sample that demonstrates Python’s slowness, using a traditional for-loop to produce reciprocal numbers:
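
The original listing is not reproduced in this preview. A minimal sketch of what such a loop might look like follows; the function name `reciprocals` and the input size `n` are assumptions made for illustration, not the author's code:

```python
import time

def reciprocals(n):
    """Return [1/1, 1/2, ..., 1/n] using a plain for-loop."""
    result = []
    for i in range(1, n + 1):
        result.append(1.0 / i)
    return result

# Time the loop; n is an arbitrary size chosen for illustration.
n = 1_000_000
start = time.perf_counter()
values = reciprocals(n)
elapsed = time.perf_counter() - start
print(f"{n} reciprocals in {elapsed:.3f} s")
```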

The result:


Three common mistakes you could make in your PySpark projects. Those mistakes may confuse your colleagues and also make you pull your hair out.

Image by my boy, Charles Zhu

After one crazy week of working on a Databricks project, I made a lot of mistakes and hence learned a lot. Here are some tips on how to avoid the mistakes I made.

Using concatenated Spark SQL strings in functions

With PySpark, we can query a Spark DataFrame with either Spark SQL or the DataFrame DSL (domain-specific language).

The Spark SQL way:

With the DataFrame DSL, you can query the data without creating any views, almost…


Understand how a Decision Tree Classifier works, in plain language and with minimal math. Figure out how Gini Impurity and Information Gain work from scratch.

By my son, Charles Zhu

Compared with machine learning models like Neural Networks, I thought the Decision Tree Classifier would be the simplest one. But I was wrong: this model is a bit more complex than I thought. It also lays the foundation for more advanced models like LightGBM and Random Forest. So I spent some time learning it and trying to figure out how a Decision Tree Classifier works.

How a decision tree works

The model works very much like how a human mind classifies objects in the real world.
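
The article's own derivation is not part of this preview. As a rough sketch of the two measures named in the summary, the code below uses the standard textbook definitions: Gini impurity as one minus the sum of squared class proportions, and information gain as the parent's entropy minus the size-weighted entropy of the child nodes. The function names and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k (labels must be non-empty)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)) (labels must be non-empty)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy labels, made up for illustration: a split that separates the classes perfectly.
parent = ["cat", "cat", "dog", "dog"]
left, right = ["cat", "cat"], ["dog", "dog"]

print(gini_impurity(parent))                    # 0.5
print(information_gain(parent, [left, right]))  # 1.0
```

A perfectly mixed two-class node has the highest Gini impurity a binary node can have (0.5), and a split that produces two pure children recovers the full bit of entropy, which is why a tree would prefer that split.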


Photo by Tech Daily on Unsplash

In the previous short article, Track Dogecoin Real-Time Price with Python, I used Python’s requests and BeautifulSoup packages to scrape web HTML and grab the real-time price of Dogecoin (or any other crypto traded on Robinhood.com).

My Dogecoin holding, like the coin itself, is a joke. The main purpose was not trading but to get my hands dirty and see how I could use Python to scrape the web with minimal lines of code, and it seems to work pretty well.

The next question follows: How can I get the historical price information at daily or even hourly granularity for…


Dogecoin was plummeting this morning and surging tonight. I was thinking: what if there were an alert that could send out mail saying, “Hey, the Dogecoin price has dropped 20%, it is time to buy in”?

Hmm, why not create one myself with Python? Here are my overall steps.

  1. Find a crypto price API or scrape the web.
  2. Send out mail if the price meets the rule, say, a 20% drop.
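
The second step can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `should_alert` and `build_alert`, the prices, and the email addresses are all hypothetical, and the message is built but not sent.

```python
from email.message import EmailMessage

def should_alert(reference_price, current_price, drop_threshold=0.20):
    """True when current_price is at least drop_threshold below reference_price."""
    return current_price <= reference_price * (1.0 - drop_threshold)

def build_alert(reference_price, current_price, sender, recipient):
    """Build (but do not send) an alert email describing the drop."""
    drop_pct = (reference_price - current_price) / reference_price * 100
    msg = EmailMessage()
    msg["Subject"] = f"Dogecoin dropped {drop_pct:.0f}%"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Hey, the Dogecoin price has dropped, it is time to buy in.")
    return msg

# Placeholder prices and addresses, chosen only for illustration.
if should_alert(100.0, 75.0):
    alert = build_alert(100.0, 75.0, "me@example.com", "me@example.com")
    print(alert["Subject"])  # Dogecoin dropped 25%
```

Actually sending the message would go through something like the standard-library `smtplib` module, with mail server details that this preview does not specify.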

Get the real-time price

After some searching and googling, I found that Robinhood is the best place to grab real-time price info: no need to sign in, no call limits, and free.

Using…


Set up a local Spark cluster with one master node and one worker node on Ubuntu, completely from scratch and for free.

Turtle with 4 legs by Charles Zhu, my 6-year-old son

This is an action list for installing the open-source Spark master (or driver) and worker on local Ubuntu machines, completely for free (in contrast to Databricks for $$$).

The following setup runs on a home intranet, on one physical Linux (Ubuntu) machine (a Jetson Nano) and one WSL2 (Ubuntu) instance inside Windows 10.

Step 1. Prepare environment

Make sure you have Java installed

Check whether Java is installed

If you are going to use PySpark, install Python as well

Check whether Python is installed
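
The commands for these checks are not included in this preview. As one possible way to run both checks from Python itself, here is a small sketch; the helper name `tool_version` is hypothetical, and it assumes the executables are called `java` and `python3` on your PATH.

```python
import shutil
import subprocess

def tool_version(executable, version_flag):
    """Return the tool's version text, or None if it is not found on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, version_flag], capture_output=True, text=True)
    # `java -version` writes to stderr, while `python3 --version` writes to stdout.
    return (result.stdout or result.stderr).strip()

# Report each prerequisite; the names and flags are assumptions about a typical Ubuntu setup.
for name, flag in [("java", "-version"), ("python3", "--version")]:
    version = tool_version(name, flag)
    print(f"{name}: {version if version is not None else 'not installed'}")
```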

Step 2. Download and install Spark on the driver machine

From the Spark download page, select your version. I selected the newest…


Get started working with Spark and Databricks with pure plain Python

Image from https://unsplash.com/s/photos/spark

In the beginning, the Master Programmer created the relational database and the file system. But the file system on a single machine became limited and slow. The data darkness was on the surface of the database. The spirit of map-reducing was brooding upon the surface of the big data.

And the Master Programmer said, let there be Spark, and there was Spark.

There is already Hadoop, so why bother with Spark?

If the relational database is a well-maintained data garden, Hadoop is a cluttered data forest that can grow to an unlimited size.

To put data into the garden, the data needs to be carefully cleaned and grown there in a structured way. While in…


A solution to extract keywords from documents automatically. Implemented in Python with NLTK and Scikit-learn.

Image by Andrew Zhu, an old Reuters news

Imagine you have millions (maybe billions) of text documents in hand, whether they are customer support tickets, social media data, or community forum posts. There were no tags when the data was generated, and you are scratching your head over how to tag those random documents.

Manual tagging is impractical, and a predefined tag list will soon be outdated. Hiring a vendor company to do the tagging work is far too expensive.

You may say, why not use Machine Learning, such as Neural Network deep learning? But a NN needs training data first: training data that fits your dataset well.


Get started with NLTK and Python text analysis through a use case.

Photo By Andrew Zhu, Library of UW, Seattle

When I was still a student, I read articles saying that linguists can use text analysis techniques to determine the author of an anonymous book. I thought it was cool at the time.

Looking back, I feel this technique is still cool. But nowadays, with the help of NLTK and Python, you and I can be “real” linguists with a few lines of code.

Prepare analysis target

You don’t need to write a crawler to scrape the analysis corpus. For learning and research purposes, a huge, well-maintained text database is already available in the NLTK package. …


Code highlighting and LaTeX math formulas in Markdown files, edited in VS Code and post-processed by Node.js.

Image by Andrew Zhu, afternoon sunset in Issaquah, WA

Enable code highlighting and math formulas in Visual Studio Code

If you are using Visual Studio Code, it is easy to enable both code highlighting and math by installing one extension: Markdown All in One.

Create a new Markdown document with a name ending in .md. To enable code highlighting, surround the code with ``` (three backticks; the backtick key is usually under the Esc key), like this:

Andrew Zhu

Daddy of two kids, husband, programmer, blogger, and Applied Data Scientist @ Azure CGA Microsoft, Redmond.
