9

We have flat files (CSV) with more than 200,000,000 rows which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the import process on a single computer and it takes around 15 hours. As this is too long, we want to utilize 40 computers to do the import.

My question

How can we efficiently utilize the 40 computers to do the import? The main concern is that replicating the dimension tables across all the nodes will take a lot of time, since they need to be identical on every node. This could mean that if we used 1000 servers for the import in the future, it might actually be slower than using a single one, because of the extensive network communication and coordination between the servers.

Does anyone have any suggestions?

Edit:

The following is a simplification of the CSV files:

"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue" 
"bvalue";"evenanothervalue"
"avalue";"anothervalue"

After the import, the tables look like this:

dimension_table1

id  name
1   "avalue"
2   "bvalue"

dimension_table2

id  name
1   "anothervalue"
2   "evenanothervalue"

Fact table

dimension_table1_ID  dimension_table2_ID
1                    1
2                    2
1                    2
1                    2
2                    2
1                    1

8 Answers

10

You could consider using a 64bit hash function to produce a bigint ID for each string, instead of using sequential IDs.

With 64-bit hash codes, you can store 2^(32 - 7), i.e. over 30 million, items in your hash table before the birthday bound puts the chance of a collision at roughly 0.0031%.

This would allow you to have identical IDs on all nodes, with no communication whatsoever between servers between the 'dispatch' and the 'merge' phases.

You could even increase the number of bits to further lower the chance of a collision; the drawback is that the resulting hash would no longer fit in a 64-bit integer database field.

See:

http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash

http://code.google.com/p/smhasher/wiki/MurmurHash

http://www.partow.net/programming/hashfunctions/index.html
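
As a rough illustration of the idea (my own sketch, not part of the answer), here is what a 64-bit FNV-1a hash could look like in Python; the sample values come from the question, everything else is invented:

    # Minimal sketch: map each dimension string to a deterministic 64-bit ID
    # via FNV-1a, so every node computes the same ID without coordination.

    FNV_OFFSET_64 = 0xcbf29ce484222325
    FNV_PRIME_64 = 0x100000001b3

    def fnv1a_64(value):
        """Return the 64-bit FNV-1a hash of a UTF-8 encoded string."""
        h = FNV_OFFSET_64
        for byte in value.encode("utf-8"):
            h ^= byte
            h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF  # keep it in 64 bits
        return h

    # Both columns of one CSV row get stable IDs on every one of the 40 machines.
    row = ["avalue", "anothervalue"]
    print([fnv1a_64(v) for v in row])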

answered 2011-04-26T22:10:34.607
3

Loading CSV data into a database is slow because it needs to read, split, and validate the data.

So what you should try is:

  1. Set up a local database on every computer. This gets rid of the network latency.

  2. Load a different part of the data on every computer. Try to give every computer the same-sized chunk. If that is not easy for some reason, give each computer, say, 10'000 rows. When they are done, give them the next chunk.

  3. Dump the data with the DB tools

  4. Load all dumps into a single database

Make sure that your loader tool can import data into a table which already contains data. If you can't do this, check your DB documentation for "remote tables". Many databases allow you to make a table from another database server visible locally.

That allows you to run commands like

    insert into TABLE (....) select .... from REMOTE_SERVER.TABLE

If you need primary keys (and you should), you will also have the problem of assigning PKs during the import into the local databases. I suggest adding the PKs to the CSV file.

[EDIT] After checking your edit, here is what you should try:

  1. Write a small program which extracts the unique values from the first and second column of the CSV file. That could be a simple script like:

     cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
    

     This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.

  2. Write a program which reads the new ID-value files, caches them in memory, then reads the huge CSV file and replaces the values with the IDs (there is a sketch at the end of this answer).

     If the ID-value files are too big, just do this step for the small ones and load the huge ones into each of the 40 per-machine databases.

  3. Split the huge file into 40 chunks and load one chunk on each machine.

     If you had huge ID-value files, you can use the tables created on each machine to replace the remaining values.

  4. Use backup/restore or remote tables to merge the results.

     Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That is how Google can create search results from billions of web pages in a few milliseconds.

See here for an introduction.
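
As a rough sketch of step 2 (my own illustration; the file names "dim1_ids.csv", "dim2_ids.csv", "huge.csv" and "facts.csv" are invented), the value-to-ID replacement could look like this in Python:

    import csv

    # Cache the ID-value files from step 1 in memory, then stream the huge
    # CSV and emit ID pairs for the fact table.

    def load_id_map(path):
        """Read an 'id;value' file from step 1 into a value -> id dict."""
        mapping = {}
        with open(path, newline="") as f:
            for id_str, value in csv.reader(f, delimiter=";"):
                mapping[value] = int(id_str)
        return mapping

    dim1 = load_id_map("dim1_ids.csv")   # IDs for the first column
    dim2 = load_id_map("dim2_ids.csv")   # IDs for the second column

    with open("huge.csv", newline="") as src, \
         open("facts.csv", "w", newline="") as dst:
        writer = csv.writer(dst, delimiter=";")
        for col1, col2 in csv.reader(src, delimiter=";"):
            writer.writerow([dim1[col1], dim2[col2]])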

answered 2011-04-12T08:17:51.733
2

This is a very generic question and does not take the database backend into account. Firing 40 or 1000 machines at a database backend that cannot handle the load will get you nothing. Such a problem is truly too broad to answer in a specific way... you should first get in touch with people inside your organization who have enough skills at the DB level, and then come back with a more specific question.

answered 2011-04-12T08:05:57.187
2

The simplest thing to do is to have one computer responsible for handing out new dimension item IDs; you can have one per dimension. If the dimension-handling computers are on the same network, you can have them broadcast the IDs. That should be fast enough.
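
As a loose illustration of such a central allocator (my own sketch, not part of the answer; the class and names are invented), one per-dimension ID dispenser could be as small as:

    import threading

    # One allocator per dimension, running on the coordinating machine.
    # A real deployment would expose get_id over RPC/HTTP or broadcast new IDs.

    class DimensionIdAllocator:
        def __init__(self):
            self._ids = {}              # value -> id
            self._next_id = 1
            self._lock = threading.Lock()

        def get_id(self, value):
            """Return the existing ID for a value, or hand out the next one."""
            with self._lock:
                if value not in self._ids:
                    self._ids[value] = self._next_id
                    self._next_id += 1
                return self._ids[value]

    dim1_allocator = DimensionIdAllocator()
    print(dim1_allocator.get_id("avalue"))   # 1
    print(dim1_allocator.get_id("bvalue"))   # 2
    print(dim1_allocator.get_id("avalue"))   # 1 again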

What database are you planning to use for the 23-dimension star schema? The import is probably not the only performance bottleneck. You might want to do this in a distributed main-memory system; that avoids a lot of the materialisation problems.

You should investigate whether there are highly correlated dimensions.

In general, with a 23-dimension star schema with large dimensions, a standard relational database (SQL Server, PostgreSQL, MySQL) performs extremely badly on data-warehouse questions. In order to avoid doing full table scans, relational databases use materialized views; with 23 dimensions you cannot afford them. A distributed main-memory database might be able to do full table scans fast enough (in 2004 I did about 8 million rows/sec/thread on a 3 GHz Pentium 4 in Delphi). Vertica might be another option.

Another question: how large are the files when you zip them? That gives a good first-order estimate of how much normalisation you can do.

[EDIT] I have looked at your other question. This does not look like a good fit for PostgreSQL (or MySQL or SQL Server). How long are you willing to wait for query results?

answered 2011-04-21T15:40:37.773
2

Assuming N computers, X files at about 50GB each, and a goal of having 1 database containing everything at the end.

Question: It takes 15 hours now. Do you know which part of the process is taking the longest? (Reading data, cleansing data, saving read data in tables, indexing… you are inserting data into unindexed tables and indexing after, right?)

To split this job up amongst the N computers, I’d do something like (and this is a back-of-the-envelope design):

  • Have a “central” or master database. Use this to manage the overall process, and to hold the final complete warehouse.
  • It contains lists of all X files and all N-1 (not counting itself) “worker” databases
  • Each worker database is somehow linked to the master database (just how depends on RDBMS, which you have not specified)
  • When up and running, a "ready" worker database polls the master database for a file to process. The master database doles out files to worker systems, ensuring that no file gets processed by more than one at a time (see the sketch after this list). (Have to track success/failure of loading a given file; watch for timeouts (worker failed), manage retries.)
  • Worker database has local instance of star schema. When assigned a file, it empties the schema and loads the data from that one file. (For scalability, might be worth loading a few files at a time?) “First stage” data cleansing is done here for the data contained within that file(s).
  • When loaded, the master database is updated with a “ready flag” for that worker, and it goes into waiting mode.
  • Master database has its own to-do list of worker databases that have finished loading data. It processes each waiting worker set in turn; when a worker set has been processed, the worker is set back to “check if there’s another file to process” mode.
  • At start of process, the star schema in the master database is cleared. The first set loaded can probably just be copied over verbatim.
  • For second set and up, have to read and “merge” data – toss out redundant entries, merge data via conformed dimensions, etc. Business rules that apply to all the data, not just one set at a time, must be done now as well. This would be “second stage” data cleansing.
  • Again, repeat the above step for each worker database, until all files have been uploaded.
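
A very rough sketch of the dispatch step above (my own illustration; the 'files' table and its columns are invented, and SQLite only stands in for whichever RDBMS you actually use):

    import sqlite3
    import time

    TIMEOUT_SECONDS = 2 * 60 * 60   # treat a worker as failed after 2 hours

    def assign_next_file(conn, worker_id):
        """Hand the next pending (or timed-out) file to a polling worker."""
        now = time.time()
        conn.execute("BEGIN IMMEDIATE")          # serialize competing workers
        row = conn.execute(
            "SELECT name FROM files "
            "WHERE status = 'pending' "
            "   OR (status = 'assigned' AND assigned_at < ?) "
            "LIMIT 1",
            (now - TIMEOUT_SECONDS,),
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None                          # nothing left to process
        name = row[0]
        conn.execute(
            "UPDATE files SET status = 'assigned', assigned_to = ?, assigned_at = ? "
            "WHERE name = ?",
            (worker_id, now, name),
        )
        conn.execute("COMMIT")
        return name

    # Self-contained demo with an in-memory database and two dummy files.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE files (name TEXT, status TEXT, assigned_to TEXT, assigned_at REAL)")
    conn.executemany("INSERT INTO files VALUES (?, 'pending', NULL, NULL)",
                     [("part01.csv",), ("part02.csv",)])
    print(assign_next_file(conn, "worker-07"))   # e.g. part01.csv
    print(assign_next_file(conn, "worker-13"))   # e.g. part02.csv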

Advantages:

  • Reading/converting data from files into databases and doing “first stage” cleansing gets scaled out across N computers.
  • Ideally, little work (“second stage”, merging datasets) is left for the master database

Limitations:

  • Lots of data is first read into worker database, and then read again (albeit in DBMS-native format) across the network
  • Master database is a possible chokepoint. Everything has to go through here.

Shortcuts:

  • It seems likely that when a workstation “checks in” for a new file, it can refresh a local store of data already loaded in the master and add data cleansing considerations based on this to its “first stage” work (i.e. it knows code 5484J has already been loaded, so it can filter it out and not pass it back to the master database).
  • SQL Server table partitioning or similar physical implementation tricks of other RDBMSs could probably be used to good effect.
  • Other shortcuts are likely, but it totally depends upon the business rules being implemented.

Unfortunately, without further information or understanding of the system and data involved, one can’t tell if this process would end up being faster or slower than the “do it all on one box” solution. At the end of the day it depends a lot on your data: does it submit to “divide and conquer” techniques, or must it all be run through a single processing instance?

answered 2011-04-21T14:32:42.460
1

Rohita,

I suggest you take a lot of the work out of the load by summarising the data outside the database first. I work in a Solaris unix environment. I would lean towards a korn-shell script which cuts the file up into more manageable chunks, then farms those chunks out equally to my two other servers. I would process the chunks with a nawk script (nawk has an efficient hashtable, which they call "associative arrays") to calculate the distinct values (dimension tables) and the fact table. Just associate each new name with an incrementor for that dimension, and write the facts.

If you do this through named pipes, you can push, process remotely, and read the data back "on the fly", while the "host" computer just sits there loading the data straight into tables.
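
A rough Python analogue of the associative-array pass described above (my own sketch; the file names are invented and the two-column layout from the question is assumed):

    import csv

    # One pass over a chunk: assign incrementing IDs per dimension and write
    # the fact rows, in the spirit of nawk associative arrays.

    dim1, dim2 = {}, {}            # value -> id, one dict per dimension

    def get_id(table, value):
        """Assign the next incrementing ID the first time a value is seen."""
        if value not in table:
            table[value] = len(table) + 1
        return table[value]

    with open("chunk.csv", newline="") as src, \
         open("facts.csv", "w", newline="") as facts:
        writer = csv.writer(facts, delimiter=";")
        for col1, col2 in csv.reader(src, delimiter=";"):
            writer.writerow([get_id(dim1, col1), get_id(dim2, col2)])

    # Dump the dimension tables collected along the way.
    for path, table in [("dim1.csv", dim1), ("dim2.csv", dim2)]:
        with open(path, "w", newline="") as f:
            csv.writer(f, delimiter=";").writerows(
                (id_, value) for value, id_ in table.items())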

Remember, whatever you do with 200,000,000 rows of data (how many Gig is that?), it is going to take some time. Sounds like you are in for some fun. It is interesting to read how other people propose to tackle this problem... The old adage "there is more than one way to do it!" has never been so true. Good luck!

Cheers. Keith.

answered 2011-04-27T08:16:39.073
0

On another note, you could make use of the Windows Hyper-V cloud computing add-on for Windows Server: http://www.microsoft.com/virtualization/en/us/private-cloud.aspx

answered 2011-04-26T18:42:43.907
0

Your implementation seems to be very inefficient, as it loads at a rate of less than 1 MB/sec (50GB in 15 hours).

A proper implementation on a modern single server (2x Xeon 5690 CPUs + enough RAM to hold all the dimensions in hash tables + 8GB) should give you at least a 10x speedup, i.e. at least 10MB/sec.

answered 2011-07-21T17:18:51.157