Whenever there is a programming speed competition, Python usually comes in at the bottom. Some say that is because Python is an interpreted language, and all interpreted languages are slow. But we know that Java is also interpreted in a sense: its bytecode is interpreted by the JVM. Yet as the benchmark shows, Java is much faster than Python.
Here is a sample that demonstrates Python's slowness: using a traditional for-loop to produce the reciprocals of one million numbers.
import numpy as np

# one million random integers in [1, 100)
values = np.random.randint(1, 100, size=1000000)
output = np.empty(len(values))
# compute each reciprocal one element at a time in a Python-level loop
for i in range(len(values)):
    output[i] = 1.0 / values[i]
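For contrast, here is the vectorized form of the same computation. NumPy applies the division element-wise in compiled code instead of a Python-level loop, which is exactly the overhead the loop version exposes:

```python
import numpy as np

# one million random integers in [1, 100)
values = np.random.randint(1, 100, size=1000000)

# vectorized: the whole array is divided in one compiled operation,
# no per-element Python interpreter work
output = 1.0 / values
```

On typical hardware the vectorized version runs orders of magnitude faster than the explicit loop, while producing the same result.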
After one crazy week of working on a Databricks project, I made a lot of mistakes and hence learned a lot. Here are some tips on how to avoid the mistakes I made.
With PySpark, we can query a Spark DataFrame either with Spark SQL or with the DataFrame DSL (domain-specific language).
The Spark SQL way:
# create a view from a Spark DataFrame
sdf.createOrReplaceTempView("sdf_view")
# define your SQL query as a string
sql_string = "select * from sdf_view"
# execute the Spark SQL query
result_df = spark.sql(sql_string)
With the DataFrame DSL, you can query the data without creating any views, almost…
Compared with machine learning models like neural networks, I thought the Decision Tree Classifier would be the simplest one. But I was wrong; this model is a bit more complex than I thought. It also lays the foundation for more advanced models like LightGBM and Random Forest. So I spent some time learning it and trying to figure out how the Decision Tree Classifier works.
The model works very much like how a human mind classifies objects in the real world.
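As a concrete starting point, here is a minimal scikit-learn example; the toy data and feature meanings are my own illustration, not from the article:

```python
from sklearn.tree import DecisionTreeClassifier

# toy data: [weight_kg, has_fur] -> "cat" or "dog" (illustrative only)
X = [[4, 1], [5, 1], [25, 1], [30, 1]]
y = ["cat", "cat", "dog", "dog"]

# the fitted tree learns yes/no questions about features,
# much like a human asking "is it heavier than ~15 kg?"
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

print(clf.predict([[6, 1], [28, 1]]))  # → ['cat' 'dog']
```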
In the previous short article, Track Dogecoin Real-Time Price with Python, I leveraged Python's
BeautifulSoup package to scrape web HTML and grab the real-time price of Dogecoin (or any other crypto traded on Robinhood.com).
My Dogecoin holding is, like the coin itself, a joke. The main purpose is not trading but getting my hands dirty and seeing how I can use Python to scrape the web with a minimum of code, and it looks to work pretty well.
The next question follows: how can I get historical price information at daily or even hourly granularity for…
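The scraping pattern itself really is just a few lines. Here is a minimal sketch with BeautifulSoup; the HTML snippet and the `price` class name are made up for illustration, since the real page structure is not shown in the excerpt:

```python
from bs4 import BeautifulSoup

# stand-in for the HTML you would fetch, e.g. with requests.get(url).text
html = '<html><body><span class="price">$0.0712</span></body></html>'

# parse the page and pull out the element holding the price
soup = BeautifulSoup(html, "html.parser")
price = soup.find("span", class_="price").text
print(price)  # → $0.0712
```

With the real page, the only extra work is finding which tag and class actually hold the price in the site's HTML.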
Dogecoin was plummeting this morning and surging tonight. I was thinking: what if there were an alert that could send out an email saying, "Hey, the Dogecoin price dropped 20%, it is time to buy in"?
Hmm, why not create one with Python myself? Here are my overall steps.
After some searching and googling, I found Robinhood is the best place to grab real-time price info: no sign-in needed, no call limits, and free.
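The core alert logic can be sketched in a few lines; the 20% threshold function and the commented-out smtplib mail step are my own illustration, not the article's exact code:

```python
def should_alert(previous_price: float, current_price: float, threshold: float = 0.20) -> bool:
    """Return True when the price has dropped by at least `threshold` (20% by default)."""
    drop = (previous_price - current_price) / previous_price
    return drop >= threshold

if should_alert(0.10, 0.075):
    # here you would send the mail, e.g. with the standard library:
    # import smtplib
    # with smtplib.SMTP("smtp.example.com") as server:
    #     server.sendmail(sender, receiver, "Dogecoin dropped 20%, time to buy in")
    print("alert!")
```

Wrap this in a loop that polls the real-time price on a schedule, and you have the whole alert.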
This is an action list for installing the open-source Spark master (or driver) and a worker on local Ubuntu machines, completely for free (in contrast to Databricks for $$$).
The following setup runs on a home intranet: one physical Linux (Ubuntu) machine (a Jetson Nano) and one WSL2 (Ubuntu) instance inside Windows 10.
Make sure you have Java installed:
sudo apt install openjdk-8-jdk
Check that Java installed correctly with java -version.
If you are going to use PySpark, get Python installed:
sudo apt install python3
Check that Python installed correctly with python3 --version.
From the Spark download page, select your version; I selected the newest…
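The download-and-start steps typically look like the following; the version number in the tarball name is an assumption here, so substitute the release you selected:

```shell
# download and unpack Spark (version number is illustrative -- use the one you chose)
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
cd spark-3.1.1-bin-hadoop3.2

# start the master on this machine; its web UI defaults to port 8080
./sbin/start-master.sh

# on the worker machine, point the worker at the master's spark:// URL
./sbin/start-worker.sh spark://<master-host>:7077
```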
In the beginning, the Master Programmer created the relational database and the file system. But the file system on a single machine became limited and slow. The data darkness was upon the surface of the database. The spirit of map-reduce was brooding upon the surface of the big data.
And Master Programmer said, let there be Spark, and there was Spark.
If the relational database is a well-maintained data garden, Hadoop is a cluttered data forest, and it can grow to unlimited size.
To put data into the garden, data needs to be carefully cleaned and grown there structurally. While in…
Imagine you have millions (maybe billions) of text documents in hand, whether customer support tickets, social media data, or community forum posts. There were no tags when the data was generated, and you are scratching your head hard over how to give tags to those random documents.
Manually tagging is impractical; a predefined tagging list will be outdated soon; hiring a vendor company to do the tagging work is too expensive.
You may say: why not use machine learning? Like neural-network deep learning. But a neural network needs training data first, training data that actually fits your dataset.
When I was still a student, I read articles saying that linguists could use text-analytic techniques to determine the author of an anonymous book. I thought it was cool at the time.
Looking back, I feel this technique is still cool. But nowadays, with the help of NLTK and Python, you and I can be "real" linguists with several lines of code.
You don't need to write a crawler to scrape the analysis corpus. For learning and research purposes, a huge existing text database is already there, well maintained, in the NLTK package. …
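As a tiny taste of the kind of counting a stylometric analysis starts from, here is a word-frequency sketch with NLTK's FreqDist; the sample sentence is mine, while the real corpora ship with the NLTK data downloads (e.g. nltk.download("gutenberg")):

```python
from nltk import FreqDist

# a stand-in sentence; with NLTK's built-in corpora you would load
# whole books, e.g. from nltk.corpus.gutenberg
words = "the quick brown fox jumps over the lazy dog".split()

# count how often each word occurs
freq = FreqDist(words)
print(freq.most_common(1))  # → [('the', 2)]
```

Word- and character-frequency profiles like this are one of the classic signals linguists compare across texts when guessing at authorship.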
If you are using Visual Studio Code, it is easy to enable both code highlighting and math by installing one extension: Markdown All in One.
Create a new markdown document ending with .md. To enable code highlighting, surround the code with ``` (backticks, usually under the Esc key).
Daddy of two kids, husband, programmer, blogger, and Applied Data Scientist @ Azure CGA Microsoft, Redmond.