Assuming N computers, X files of about 50 GB each, and a goal of having one database containing everything at the end.
Question: it takes 15 hours now. Do you know which part of the process is taking the longest? (Reading the data, cleansing it, writing it into tables, indexing… you are inserting into unindexed tables and building the indexes afterwards, right?)
To split this job up amongst the N computers, I’d do something like the following (and this is strictly a back-of-the-envelope design):
- Have a “central” or master database. Use this to manage the overall process and to hold the final, complete warehouse.
- It contains lists of all X files and of the N-1 “worker” databases (every machine except itself).
- Each worker database is somehow linked to the master database (exactly how depends on the RDBMS, which you have not specified).
- When up and running, a “ready” worker database polls the master database for a file to process. The master doles out files to workers, ensuring that no file is processed by more than one worker at a time. (You have to track success/failure of loading a given file, watch for timeouts (worker failed), and manage retries.) A minimal sketch of this dispatch step follows the list.
- Each worker database has a local instance of the star schema. When assigned a file, it empties that schema and loads the data from that one file. (For scalability, it might be worth loading a few files at a time.) “First stage” data cleansing (the rules that only need the data within that file or files) is done here. See the worker-loop sketch after the list.
- When the load finishes, the master database is updated with a “ready” flag for that worker, and the worker goes into waiting mode.
- The master database has its own to-do list of worker databases that have finished loading data. It processes each waiting worker set in turn; once a worker’s set has been processed, that worker is set back to “check if there’s another file to process” mode.
- At the start of the process, the star schema in the master database is cleared. The first worker set loaded can probably just be copied over verbatim.
- For the second set onwards, the data has to be read and “merged”: toss out redundant entries, merge data via conformed dimensions, and so on. Business rules that apply to all of the data, not just one set at a time, must be applied now as well. This is the “second stage” data cleansing; see the merge sketch after the list.
- Repeat the above step for each remaining worker set until all files have been loaded.
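
For concreteness, here is roughly what the “dole out files” step could look like. This is a sketch only, assuming SQL Server reached from Python via pyodbc; the `work_queue` table and its columns (`file_id`, `file_path`, `status`, `claimed_by`, `claimed_at`, `retries`) are hypothetical names, not anything your system already has. The point is that the claim is a single atomic UPDATE, so two workers can never grab the same file.

```python
import pyodbc  # assumes SQL Server; any RDBMS with an atomic "update and return" works similarly

def claim_next_file(conn: pyodbc.Connection, worker_name: str):
    """Atomically hand one pending file to a worker; returns (file_id, file_path) or None."""
    sql = """
        SET NOCOUNT ON;
        UPDATE TOP (1) dbo.work_queue WITH (ROWLOCK, READPAST)
        SET status     = 'in_progress',
            claimed_by = ?,
            claimed_at = SYSUTCDATETIME()
        OUTPUT inserted.file_id, inserted.file_path
        WHERE status = 'pending';
    """
    row = conn.cursor().execute(sql, worker_name).fetchone()
    conn.commit()
    return (row.file_id, row.file_path) if row else None

def requeue_stalled_files(conn: pyodbc.Connection, timeout_minutes: int = 120):
    """Handle worker failure: put timed-out files back in the queue for a retry."""
    sql = """
        UPDATE dbo.work_queue
        SET status = 'pending', claimed_by = NULL, retries = retries + 1
        WHERE status = 'in_progress'
          AND claimed_at < DATEADD(MINUTE, ?, SYSUTCDATETIME())
          AND retries < 3;   -- arbitrary retry cap for the sketch
    """
    conn.cursor().execute(sql, -timeout_minutes)
    conn.commit()
```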
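
On the worker side, the poll / load / cleanse / flag-ready cycle from the steps above boils down to a loop like the one below. Again a sketch under the same assumptions: `load_and_cleanse_locally` is a placeholder for however you actually bulk load and cleanse (BULK INSERT, bcp, SSIS, etc.), and the `worker_status` table is made up.

```python
import time

POLL_SECONDS = 30  # arbitrary polling interval for the sketch

def load_and_cleanse_locally(file_path: str):
    """Placeholder: truncate the local star schema, bulk load file_path,
    then run the "first stage" cleansing rules that only need this file's data."""
    raise NotImplementedError("site-specific ETL goes here")

def mark_worker_ready(conn, worker_name: str, file_id: int):
    """Tell the master this worker's local star schema is loaded and cleansed."""
    conn.cursor().execute(
        "UPDATE dbo.worker_status SET state = 'ready', current_file = ? WHERE worker = ?",
        file_id, worker_name)
    conn.commit()

def wait_until_merged(conn, worker_name: str):
    """Wait for the master to pull this worker's set and reset its state."""
    while True:
        row = conn.cursor().execute(
            "SELECT state FROM dbo.worker_status WHERE worker = ?", worker_name).fetchone()
        if row and row.state != 'ready':
            return
        time.sleep(POLL_SECONDS)

def worker_loop(conn, worker_name: str):
    """Poll for a file, load and cleanse it locally, flag ready, repeat."""
    while True:
        claim = claim_next_file(conn, worker_name)   # from the previous sketch
        if claim is None:
            time.sleep(POLL_SECONDS)                 # nothing pending yet
            continue
        file_id, file_path = claim
        load_and_cleanse_locally(file_path)
        mark_worker_ready(conn, worker_name, file_id)
        wait_until_merged(conn, worker_name)
```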
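
The master’s “second stage” merge is where your business rules live, so no sketch can be definitive, but mechanically it tends to be one MERGE per conformed dimension plus an insert of the facts, pulled over a linked server (or whatever cross-database link your RDBMS offers). A deliberately tiny example with one made-up dimension and fact table:

```python
def merge_worker_set(conn, worker_server: str):
    """Fold one worker's staged star schema into the master warehouse.

    worker_server is the linked-server name of that worker (e.g. 'WORKER01');
    the tables and columns are invented for the sketch. Real code would repeat
    this for every conformed dimension and apply the cross-file business rules.
    """
    merge_dim = f"""
        MERGE dbo.dim_product AS tgt
        USING [{worker_server}].StageDb.dbo.dim_product AS src
            ON tgt.product_code = src.product_code
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (product_code, product_name)
            VALUES (src.product_code, src.product_name);
    """
    insert_facts = f"""
        INSERT INTO dbo.fact_sales (product_key, sale_date, amount)
        SELECT d.product_key, s.sale_date, s.amount
        FROM [{worker_server}].StageDb.dbo.fact_sales AS s
        JOIN dbo.dim_product AS d ON d.product_code = s.product_code;
    """
    cur = conn.cursor()
    cur.execute(merge_dim)     # redundant dimension rows are skipped by the MERGE
    cur.execute(insert_facts)  # facts are re-keyed against the master's dimensions
    conn.commit()
```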
Advantages:
- Reading/converting data from files into databases and doing “first stage” cleansing gets scaled out across N computers.
- Ideally, little work (“second stage” cleansing and merging of data sets) is left for the master database.
Limitations:
- Lots of data is first read into a worker database and then read again (albeit in DBMS-native format) across the network.
- The master database is a potential chokepoint; everything has to pass through it.
Shortcuts:
- It seems likely that when a workstation “checks in” for a new file, it can refresh a local copy of the data already loaded into the master and fold that into its “first stage” work (i.e. it knows code 5484J has already been loaded, so it can filter it out and not pass it back to the master database); see the sketch after this list.
- SQL Server table partitioning or similar physical implementation tricks of other RDBMSs could probably be used to good effect.
- Other shortcuts are likely, but it totally depends upon the business rules being implemented.
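
For that first shortcut, the filter is cheap to sketch (reusing the hypothetical names from the earlier code): on check-in the worker snapshots the keys the master already holds and drops matching rows during its own first-stage cleanse, so they never cross the network again.

```python
def fetch_loaded_codes(master_conn) -> set:
    """Snapshot the codes already merged into the master (hypothetical dim table)."""
    rows = master_conn.cursor().execute(
        "SELECT product_code FROM dbo.dim_product").fetchall()
    return {r.product_code for r in rows}

def filter_already_loaded(local_rows, loaded_codes: set):
    """Keep only rows the master has not seen yet (e.g. drop code 5484J if present)."""
    return [r for r in local_rows if r["product_code"] not in loaded_codes]
```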
Unfortunately, without further information about, or understanding of, the system and data involved, one can’t tell whether this process would end up faster or slower than the “do it all on one box” solution. At the end of the day it depends a lot on your data: does it lend itself to “divide and conquer” techniques, or must it all be run through a single processing instance?