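"Demystifying Data Science" is an excellent free 12-hour live-streamed conference held in the US. It invited 28 senior data scientists from well-known companies such as Facebook, Airbnb, Quora, Etsy, and Fast.ai to talk about how to make a career change into data analysis.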

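The stream ran from 10 a.m. to 10 p.m. US time, which is 10 p.m. to 10 a.m. the next day in Taiwan, so I only watched the five talks between 10 p.m. and 12:30 a.m. and took notes on some of what the speakers shared. A few parts I could not write down in time were filled in afterwards from memory, so some of the wording may not be exact. Please bear with me.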

Read More

Elasticsearch is a distributed, real-time, search and analytics platform.

Through a RESTful API, Elasticsearch stores data and indexes it automatically. It assigns a type to each field, so searches can run quickly and precisely using filters and different kinds of queries.

It runs on the JVM. It splits each index into “shards” of data and replicates those shards across different nodes, so the cluster is distributed and can keep functioning even if not all nodes are operational. Adding nodes is easy, and that is what makes it so scalable.

ES uses Lucene to execute searches. This is quite an advantage compared with, for example, Django query strings. A RESTful API call lets us perform searches using JSON objects as parameters, which makes searching much more flexible and lets us give each search parameter within the object a different weight, importance, or priority.
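As a rough illustration of a JSON search body with per-parameter weights, here is a minimal sketch that only builds the request body and does not contact a cluster. The field names `title` and `body`, the helper name `build_search_body`, and the boost values are assumptions made for this example. The `bool`, `should`, `match`, and `boost` keys are part of the Elasticsearch Query DSL.

```python
import json


def build_search_body(text, title_boost=3.0, body_boost=1.0):
    """Build a JSON-serializable search body (hypothetical helper).

    A hit in the assumed `title` field counts more toward the score
    than a hit in the assumed `body` field, because of the boosts.
    """
    return {
        "query": {
            "bool": {
                # A document matches if at least one clause matches;
                # each matching clause contributes its boosted score.
                "should": [
                    {"match": {"title": {"query": text, "boost": title_boost}}},
                    {"match": {"body": {"query": text, "boost": body_boost}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


search_body = build_search_body("clustering")
print(json.dumps(search_body, indent=2))
```

A body like this would typically be sent as the payload of a `_search` request against an index. The index name and the transport (HTTP client or official client library) are left out here on purpose.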

Read More

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes

Cluster analysis finds groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups, so that:

  • Intra-cluster distances are minimized
  • Inter-cluster distances are maximized
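The two criteria above can be made concrete with a small sketch. It assumes Euclidean distance, measures intra-cluster spread as the mean pairwise distance within a cluster, and measures inter-cluster separation as the smallest distance between cluster centroids. These are only one of several reasonable choices, and the function names and toy points are made up for the example.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def mean_intra_distance(cluster):
    """Mean pairwise Euclidean distance inside one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def min_inter_distance(clusters):
    """Smallest Euclidean distance between any two cluster centroids."""
    cents = [centroid(c) for c in clusters]
    return min(math.dist(a, b) for a, b in combinations(cents, 2))


# Toy data: two tight, well-separated groups.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]

for i, c in enumerate(clusters):
    print(f"cluster {i}: mean intra-cluster distance = {mean_intra_distance(c):.3f}")
print(f"min inter-cluster (centroid) distance = {min_inter_distance(clusters):.3f}")
```

A good clustering in this sense has small values for the first quantity and a large value for the second.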

Read More

In iterative clustering algorithms, the procedure used to choose the initial cluster centers is extremely important because it has a direct impact on the final clusters. Selecting outliers as initial centers is dangerous, since they lie far from the normal samples.
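To see how much the starting centers matter, here is a minimal sketch of plain k-means (Lloyd's iterations) on one-dimensional toy data, run from two different initializations. The data, the function names, and the choice of starting centers are assumptions for illustration. The sketch shows sensitivity to initialization in general and is not the initialization method discussed in this post.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    # `clusters` reflects the last assignment step that was run.
    return centers, clusters


def sse(centers, clusters):
    """Sum of squared distances from each point to its cluster center."""
    return sum((p - m) ** 2 for m, c in zip(centers, clusters) for p in c)


points = [0, 1, 10, 11, 20, 21]  # three obvious pairs

spread_init = kmeans_1d(points, [0, 10, 20])  # one start per natural group
crowded_init = kmeans_1d(points, [0, 1, 10])  # two starts in the same group

print("spread start :", spread_init, "SSE =", sse(*spread_init))
print("crowded start:", crowded_init, "SSE =", sse(*crowded_init))
```

With the spread-out start, the three natural pairs are recovered. With two starting centers in the same pair, the algorithm converges to a much worse partition and never leaves it.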

The Cluster Center Initialization Algorithm (CCIA) is a density-based multi-scale data condensation procedure. It is applicable to clustering algorithms for continuous data. In CCIA, we assume that an individual attribute may provide some information about the initial cluster centers.

Read More