Problem: Distributed computing

Distributed join


To join data together you need the data to be colocated. The idea here is to colocate each subset of joinable data on one node, so that the overall dataset can be split up.


If you've got lots of data, more than can fit in memory, you need a way to split the data up between machines. This idea is to prejoin data at insert time and route the records that join together to the same machine.

This way the overall dataset can be split between machines, yet the join can be executed on every node and the results aggregated into the total joined data.

In more detail, we decide which machine stores a record by consistently hashing its join keys, so the matching records always get stored on the same machine.
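As a minimal sketch of this placement rule (class and node names are hypothetical, not from the project itself), a toy consistent-hash ring can map every join-key value to a node, so rows from different tables that share a key land together:

```python
import bisect
import hashlib

def _h(value):
    """Stable integer hash of any value's string form."""
    return int(hashlib.sha256(str(value).encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each node owns the arc up to its point."""

    def __init__(self, nodes):
        self.ring = sorted((_h(n), n) for n in nodes)

    def node_for(self, join_key_value):
        # Walk clockwise from the key's point to the next node point,
        # wrapping around at the end of the ring.
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, _h(join_key_value)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
# Rows from different tables that share join key 42 map to the same node.
assert ring.node_for(42) == ring.node_for(42)
```

A ring (rather than plain `hash(key) % n`) keeps most keys in place when a node joins or leaves, which matters for a P2P cluster.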



Isn't this what a distributed NFT does? We aggregate a project from many accounts. I like to call it composition. I do think composition is the problem to solve, not just for storage concerns, but to manage complexity.

    :  -- 
    : Mindey
    :  -- 




I don't know how NFTs store ownership in the blockchain. But essentially this idea is about a mid-level implementation detail of a distributed SQL database storage engine.

I find the idea of a distributed P2P database very useful. I have a project where I implement a very simple SQL database; it supports distributed joins by distributing join keys to every node in the cluster. In this idea I take a different approach: I hash the join key and use it for node placement.

So everybody in the cluster gets a subset of the data, and everybody needs to be online to answer a query.



// prejoin data at insert time //

I think it's a reasonable idea, but how would this be done? There are many possible joins. In fact, suppose the number of tables in the database is %%n%%; then the number of all table pairs that may need to be joined is the number of %%k=2%% subsets:

$${\binom {n}{2}} = {\frac {n!}{2!(n-2)!}} = {\frac {n^2-n}{2}} $$

For example, if the database has 15 tables, this number is 105, and if there are 42 tables, it is 861. Add the possibility that you need to do joins on different fields, and the number of pre-computed joins may be even higher. Still, it seems reasonable to do it at insert time, as the joins would change and need to be recomputed or modified on every insert.
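The counts above follow directly from the binomial formula and are easy to verify:

```python
from math import comb

# Number of table pairs that might need a precomputed join: C(n, 2).
assert comb(15, 2) == 105   # 15 tables -> 105 possible pairwise joins
assert comb(42, 2) == 861   # 42 tables -> 861 possible pairwise joins
```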



In my SQL database I introduced a statement called create join.

You tell the database ahead of time which fields are joinable.

create join inner join people on people.id = items.people inner join products on items.search = products.name

The database then checks on every insert and does the associated consistent hashing and node placement.
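A sketch of what that insert-time check might look like, assuming the `create join` statement has populated a registry mapping each table to the field that drives placement (the registry and function names here are hypothetical):

```python
import hashlib

# Hypothetical join registry, filled in by `create join`:
# for each table, the field whose value determines node placement.
JOIN_KEY_FIELD = {
    "people": "id",       # people.id = items.people
    "items":  "people",   # items hash the referenced people id, not their own
}

def stable_hash(value):
    """Stable integer hash of a value's string form."""
    return int(hashlib.sha256(str(value).encode("utf-8")).hexdigest(), 16)

def route_insert(table, row, nodes):
    """Pick the storage node for a new row by hashing its join-key field."""
    key = row[JOIN_KEY_FIELD[table]]
    return nodes[stable_hash(key) % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
# A person and the items referencing that person land on the same node.
assert route_insert("people", {"id": 7}, nodes) == \
       route_insert("items", {"id": 99, "people": 7}, nodes)
```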

    : Mindey
    :  -- 
    :  -- 



I should point out that some data may be movable once the create join statement has run.

Suppose a join exists between products.id and search.product_id, and a product is inserted. A query for matching searches would run select id from search where product_id = X.

search.product_id depends on products.id: they have the same value. The consistent hash for products can be its id. The consistent hash for search can ignore its own id and use product_id instead. This distributes the data to the same machine because the hashes are identical.

If there are multiple joins, this scheme might need to be more complicated. I think the fields can be concatenated and hashed.
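The concatenate-and-hash idea could be sketched like this (a hypothetical helper, not the project's actual code); sorting the field names first makes the key independent of declaration order:

```python
import hashlib

def placement_key(row, join_fields):
    """Combine all of a row's joinable field values into one placement hash.

    Sorting the field names makes the result independent of the order
    the joins were declared in.
    """
    joined = "|".join(str(row[f]) for f in sorted(join_fields))
    return int(hashlib.sha256(joined.encode("utf-8")).hexdigest(), 16)

row = {"product_id": 7, "category_id": 3, "text": "ignored"}
# Same fields, any declaration order -> same placement hash.
assert placement_key(row, ["product_id", "category_id"]) == \
       placement_key(row, ["category_id", "product_id"])
```

One caveat worth noting: once several fields are folded into one hash, a query that matches on only one of those fields can no longer be routed to a single node, so this trades query routing precision for simpler placement.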

    :  -- 
    : Mindey
    :  -- 



Providing a search engine for the internet is outrageously expensive because the data is so large and doesn't fit in memory. We could use this approach to split up the storage requirements of doing search.

If everybody hosts a fraction of the search index, then every query goes to everybody before the results are returned.

If there are 1000 storage nodes, then every search query produces 1000 subqueries, one to each storage node. Each node returns what it knows about.
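The fan-out can be sketched as a simple scatter-gather loop (the `query_node` callback and the toy index are illustrative assumptions, not part of the original design):

```python
def search(query, nodes, query_node):
    """Scatter the query to every storage node and gather the results.

    `query_node(node, query)` is a hypothetical per-node lookup that
    returns that node's matching entries.
    """
    results = []
    for node in nodes:  # one subquery per storage node
        results.extend(query_node(node, query))
    return results

# Toy index: each node holds a fraction of the search data.
index = {"node-a": {"cats": ["page1"]}, "node-b": {"cats": ["page2"]}}
hits = search("cats", ["node-a", "node-b"],
              lambda n, q: index[n].get(q, []))
assert hits == ["page1", "page2"]
```

In a real cluster the subqueries would run in parallel and the merge step would rank the combined results, but the shape is the same: every node answers, and the aggregation happens at the querying node.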

    : Mindey
    :  -- 
    :  --