Alright, let’s say you have a pretty big database. I’m talking, like, 100GB for a medium-sized table here. (Some tables in excess of 3TB!)
Now let’s say that you’re going to be sharding this dataset. I’m not going to
cover sharding itself in this post, but the concept is pretty simple: we want
to split the data out across multiple servers in an easy-to-look-up fashion. In
this scenario, we’re going to split the data up 1024 ways and do the lookup by
taking the userId modulo 1024.
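As a quick concrete check of the lookup, this is nothing more than shell
arithmetic:
echo $(( 584685 % 1024 ))    # prints 1005, so userId 584685 lands on shard 1005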
You’re now faced with a multipart problem: How do you export all this data, split it up for all the shards, and then load it where it is supposed to go?
Typically, you’re going to see the advice to use mysqldump. Don’t. It’s not
going to work. For starters, there can be a lot of issues with it timing out,
making sure it doesn’t lock things it shouldn’t, and some other problems.
Primarily, however, its default output is SQL statements, and while SQL
statements can be reasonably fast to load, they are not the fastest way to
load data.
Instead, you should use the SELECT INTO OUTFILE command. It will look
something like this:
SELECT
userId, username, hashedPassword, secretData
INTO OUTFILE '/path/to/users.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'
FROM schema.users;
This will create a standard ol’ CSV file. However, there’s a problem here. That
secretData column is a binary field with encrypted (or any other kind of
binary) data. Binary data can contain any bytes at all, including problem
characters like \n or " which will ruin the format of the CSV file and lead
to lost data. Not to mention that most systems expect CSVs to be ASCII
plaintext.
So, how do you export this data in a sane CSV? Turns out it’s pretty easy:
just hex() the binary values:
SELECT
userId, username, hashedPassword, hex(secretData)
INTO OUTFILE '/path/to/users.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'
FROM schema.users;
This will convert the binary data into its hexadecimal representation which, in turn, creates a perfectly acceptable CSV file.
Now that we have every row exported, we need to split the data up according to
which shard each row of data will ultimately reside in. So, in our case, if a
row belongs to userId=584685, then since 584685 % 1024 = 1005, the record
should go to shard 1005 (of 1024 total). We need to evaluate every record in
the CSV and append that line to a separate CSV containing the records for its
shard.
I toiled on this for several days, mostly trying to add multi-threading to the
process. Ultimately, however, disk I/O is the limiting factor and your disk
will never out-perform your CPU. (Maybe you could put the file on a tmpfs
mount and multi-thread it, but it’s really not worth it.)
So, in order to read the file, parse it, and dump each record into the
appropriate CSV file, we turn to the incomprehensible awk Linux tool:
awk -F, '{ x = sprintf("%.4d", $1 % 1024); print > (x ".csv") }' /path/to/users.csv
In short, what this does is pull the first column of data (split on commas),
take it modulo 1024, format that into the variable x, and then append the
whole line to x.csv. The sprintf is there so we get 0090.csv instead of
90.csv, which is a bit nicer when listing the files.
Now, awk will happily run this until it finishes, but as it can take a very
long time to run, I highly recommend you install the pv utility and run the
command like this to get a visual progress bar and ETA:
pv /path/to/users.csv | awk -F, '{ x = sprintf("%.4d", $1 % 1024); print > (x ".csv") }'
Finally, run these commands in a screen session so that if you lose your SSH
connection, all is not lost.
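If you haven’t used screen much, the flow is roughly this (the session name is
just an example):
screen -S shard-split    # start a named session
pv /path/to/users.csv | awk -F, '{ x = sprintf("%.4d", $1 % 1024); print > (x ".csv") }'
# detach with Ctrl-a d, then later:
screen -r shard-split    # reattach and check on progress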
Great, you’ve got the data all split out, but if you do a normal LOAD DATA
INFILE, that hex()’d data isn’t what you actually want in the column.
Fortunately, LOAD DATA INFILE has a trick which lets us get the data imported
in the correct format:
LOAD DATA LOCAL INFILE '/path/to/shard.csv'
REPLACE INTO TABLE schema.users
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'
(userId, username, hashedPassword, @secretData)
SET secretData = unhex(@secretData);
This is an interesting extra little feature of LOAD DATA wherein you can
specify each portion of the CSV to go into a column or into a variable. You
place the hex()’d data into the @secretData variable and then use the SET
portion of the command to unhex() the value into the secretData column, thus
restoring your binary data.
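It’s worth spot-checking the round trip on a row or two. A quick sanity check
might look something like this (the host names here are placeholders for
whatever your source and shard servers are actually called):
mysql -h source-db -e "SELECT userId, hex(secretData) FROM schema.users WHERE userId = 584685"
mysql -h shard-1005 -e "SELECT userId, hex(secretData) FROM schema.users WHERE userId = 584685"
# the two hex strings should be identical if the unhex() round trip worked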
Getting these CSVs loaded into each shard, reader, is an exercise I leave in
your hands. I will say that some loops, shell commands, and LOAD DATA LOCAL
INFILE are your friends.
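That said, here’s a rough sketch of what such a loop could look like. This is
not the exact script I used, and it assumes shard N is reachable at a host
named shard-N and that the split files are named 0000.csv through 1023.csv as
above:
for f in [0-9][0-9][0-9][0-9].csv; do
  shard=$(( 10#${f%.csv} ))    # strip .csv; the 10# keeps 0090 from being read as octal
  mysql -h "shard-${shard}" --local-infile=1 -e "
    LOAD DATA LOCAL INFILE '${PWD}/${f}'
    REPLACE INTO TABLE schema.users
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'
    (userId, username, hashedPassword, @secretData)
    SET secretData = unhex(@secretData);"
done
To get the parallel load mentioned below, you can background a handful of
these at a time or drive the loop with something like xargs -P.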
The previous data migration system was a program that SELECTed rows from the
source on one connection, then used a separate connection to the target to
insert them. For the 100GB table, this method had an ETA of over 120 days.
The dump/split/load method worked in a grand total of 12 hours. Most of that
time (~10.5 hours) was spent on the splitting of the file itself. The dump took
40 minutes and the (parallel) load another 45.