This is an action list for installing open-source Spark, master (or driver) and worker, on local Ubuntu machines completely for free (in contrast to Databricks for $$$).
The following setup runs in a home intranet, on one physical Linux (Ubuntu) machine (a Jetson Nano) and one WSL2 (Ubuntu) instance inside Windows 10.
Make sure you have Java installed:
sudo apt install default-jdk
Check that Java installed correctly.
If you are going to use PySpark, get Python installed:
sudo apt install python3
Check that Python installed correctly.
From the Spark download page, select your version; I select the newest…
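The steps above can be sketched as one shell session. The Spark version and mirror URL below are placeholders, and `<master-host>` is your master machine's hostname; check the download page for the current release:

```shell
# Verify the prerequisites are on the PATH
java -version
python3 --version

# Download and unpack Spark (version and mirror are assumptions --
# pick yours from the download page)
wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar -xzf spark-3.5.1-bin-hadoop3.tgz
cd spark-3.5.1-bin-hadoop3

# Start a master on this machine, then attach a worker to it
./sbin/start-master.sh
./sbin/start-worker.sh spark://<master-host>:7077
```

The master's web UI (port 8080 by default) shows the worker once it registers.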
In the beginning, the Master Programmer created the relational database and the file system. But the file system on a single machine became limited and slow. The data darkness was on the surface of the database. The spirit of map-reduce was brooding upon the surface of the big data.
And Master Programmer said, let there be Spark, and there was Spark.
If a relational database is a well-maintained data garden, Hadoop is a cluttered data forest, and it can grow to an unlimited size.
To put data into the garden, data needs to be carefully cleaned and grown there structurally. While in…
Imagine you have millions (maybe billions) of text documents in hand, whether customer support tickets, social media data, or community forum posts. There were no tags when the data was generated, and you are scratching your head over how to give tags to those random documents.
Manual tagging is impractical; a pre-existing tag list will soon be outdated; and hiring a vendor company to do the tagging is far too expensive.
You may say: why not use machine learning, like neural-network deep learning? But a neural network needs training data first, and it has to be training data that fits your dataset.
When I was still a student, I read articles saying that linguists can use text-analytic techniques to determine the author of an anonymous book. I thought it was cool at the time.
Looking back, I feel this technique is still cool. But nowadays, with the help of NLTK and Python, you and I can be a “real” linguist with several lines of code.
You don't need to write a crawler to scrape a corpus to analyze. For learning and research purposes, a huge, well-maintained text database already ships with the NLTK package. …
If you are using Visual Studio Code, it is easy to enable both code highlighting and Math by installing one extension: Markdown All in One.
Initialize a new markdown document ending with .md. To enable code highlighting, surround the code with ``` (the backtick character, usually under the Esc key), like this:
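A minimal sketch of a document using both features (the snippet itself is only illustrative):

````markdown
```python
print("hello, markdown")
```

Inline math like $e^{i\pi} + 1 = 0$, or a block:

$$
\sum_{k=1}^{n} k = \frac{n(n+1)}{2}
$$
````

With the extension installed, the fenced block gets syntax colors and the `$…$` / `$$…$$` math renders in the preview pane.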
To select data from a pandas DataFrame, we can use
df_data['column'], and can also use
df_data.loc[:, 'column']; yeah, we can also use
pd.eval(), and don't forget
df_data.query(). If the above is not enough, there is a package called numexpr, and many more.
The Zen of Python said:
There should be one — and preferably only one — obvious way to do it.
Hey Pandas DataFrame, is there one best and obvious way to select data? Let's go through 10 ways one by one and see if we can find the answer.
Say, we have a sample pd data:
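Since the original sample is not reproduced here, below is a minimal stand-in DataFrame exercising a few of the selection styles named above (column names and values are assumptions):

```python
import pandas as pd

# A stand-in sample frame (the article's own sample data is not shown here)
df_data = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [90, 85, 70],
})

# 1. Bracket selection returns one column as a Series
s1 = df_data["score"]

# 2. .loc selects by label: all rows, one column
s2 = df_data.loc[:, "score"]

# 3. .query filters rows with an expression string
high = df_data.query("score > 80")

# 4. pd.eval evaluates a string expression against the frame
mask = pd.eval("df_data.score > 80")

print(high["name"].tolist())  # names of rows where score > 80
```

All four roads lead to the same data; which is "obvious" is exactly the question the article explores.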
When dealing with text data, we want to measure the importance of a word to a document within a full text collection. One of the most intuitive solutions is counting word occurrences: the higher the count, the more important the word. But a simple word count favors long documents/articles; after all, a longer document contains more words.
We need another solution that appropriately measures the importance of a word in the overall context. TF-IDF is one such effective solution, and it also serves as a backbone of modern search engines like Google.
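A minimal from-scratch sketch of the idea (the toy corpus and the plain log-based IDF are assumptions; production implementations typically add smoothing):

```python
import math

# A toy corpus -- an assumption, the article's own data is not shown
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dogs and the cats are friends",
]
tokens = [d.split() for d in docs]

def tf_idf(term, doc, corpus):
    # term frequency: occurrences normalized by document length,
    # so long documents get no unfair advantage
    tf = doc.count(term) / len(doc)
    # inverse document frequency: the rarer a term is across the
    # corpus, the higher its weight
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "the" appears in every document, so its weight collapses to zero
print(tf_idf("the", tokens[0], tokens))   # 0.0
# "cat" is frequent in this doc but rarer in the corpus, so it scores higher
print(round(tf_idf("cat", tokens[0], tokens), 4))
```

Note how the normalization by document length handles the long-document bias, while IDF zeroes out words that carry no distinguishing information.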
I like the idea that we need to rethink technology and modernization from the perspective of human beings. Like New York's Times Square: they fixed the traffic jam by simply blocking some unnecessary roads and junctions, then rebuilt and turned those areas into pedestrian streets.
But on the other side, the trend of technologizing may be unstoppable.
Thousands of years ago, Socrates insisted that writing destroys memory and weakens the mind, and even doubted the merit of introducing 'letters'. But nowadays, we all can't live without writing, reading, and books.
Like iPads, Macs, and computers, my kid is also super…
Whenever there is a programming speed competition, Python usually finishes near the bottom. Some say that is because Python is an interpreted language, and all interpreted languages are slow. But we know that Java is also interpreted in a sense: its bytecode is interpreted (and JIT-compiled) by the JVM, and as shown in this benchmark, Java is much faster than Python.
Here is a sample that demonstrates Python's slowness. Use a traditional for-loop to produce reciprocal numbers:
import numpy as np

# one million random integers in [1, 100)
values = np.random.randint(1, 100, size=1000000)

# compute each reciprocal in a Python-level loop
output = np.empty(len(values))
for i in range(len(values)):
    output[i] = 1.0 / values[i]
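For contrast, the same computation as a NumPy vectorized expression; the timing harness below is a sketch, and the exact numbers will vary by machine:

```python
import time
import numpy as np

values = np.random.randint(1, 100, size=1_000_000)

# Element-by-element Python loop: interpreter overhead on every iteration
start = time.perf_counter()
output_loop = np.empty(len(values))
for i in range(len(values)):
    output_loop[i] = 1.0 / values[i]
loop_time = time.perf_counter() - start

# Vectorized ufunc: the same loop runs in compiled C inside NumPy
start = time.perf_counter()
output_vec = 1.0 / values
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s  "
      f"speedup: {loop_time / vec_time:.0f}x")
```

The results are identical; only the per-element dispatch cost differs, which is why "slow Python" code is usually fixed by pushing loops down into NumPy rather than by switching languages.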
Warning: this piece is purely for fun, with no intention to offend anyone, including Data Scientists.
If you hold the title Data “Scientist”, have you ever doubted yourself: am I really a “Scientist”? Am I really working on “Science”, or am I just a data analyst?
Recently, I came across a tweet, which says:
Offend a Data “Scientist” with one tweet — Ben Lindsay
Then came many amusing replies, like this one:
machine learning is just regression with extra steps — Mike Henry
This one, from a peer of mine, is an underrated tweet:
Daddy of two kids, husband, and Applied Data Scientist @ Azure CGA Microsoft, Redmond.