How to import millions of records into MySQL
With a gem called activerecord-import, importing a huge number of records into a MySQL table becomes a much easier task. This post demonstrates how.
Requirement
Suppose we have a MySQL table called visitor_ids that stores the visitor ids for some really high-traffic website. The SQL to create such a table:
CREATE TABLE `visitor_ids` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `visitor_id` varchar(255) NOT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
)
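If you manage the schema with Rails instead, a hypothetical migration along these lines should produce an equivalent table (the migration version tag here is an assumption):

class CreateVisitorIds < ActiveRecord::Migration[6.0]
  def change
    # id: :integer matches the int(11) primary key in the SQL above
    create_table :visitor_ids, id: :integer do |t|
      t.string :visitor_id, null: false
      t.timestamps null: true  # created_at/updated_at allow NULL, as in the SQL
    end
  end
end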
And an ActiveRecord model is defined as VisitorId.
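The model itself needs nothing special; a bare class mapping to the visitor_ids table is enough:

class VisitorId < ActiveRecord::Base
end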
Now suppose we have more than 1 million such visitor ids that need to be inserted into this table.
Solution
Use CSV as the data source
visitor id
4B1A9AFD6614529C3C37D5F18C5E070DD34D33BFA2B7E04F70D23B0FB1945401
A72BCFA9DC6A089C5D465531C988CFEC84F83CA0159250CD5A314D67544F7DEF
...
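If you want to follow along without real data, a small hypothetical script like this one can generate a sample file (SecureRandom.hex(32) yields 64-character hex strings similar to the ids above; the output path is just an example):

require 'csv'
require 'securerandom'

CSV.open('/tmp/visitor_ids.csv', 'w') do |csv|
  csv << ['visitor id']  # header row the importer expects
  1_000_000.times { csv << [SecureRandom.hex(32).upcase] }
end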
Use activerecord-import for bulk import
- require the dependencies
require 'csv'
require 'activerecord-import/base'
require 'activerecord-import/active_record/adapters/mysql2_adapter'
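These requires assume the gems are already installed; in a Bundler-managed project that means declaring them in the Gemfile first (versions omitted here):

gem 'activerecord-import'
gem 'mysql2'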
- initialize a queue to buffer the record objects
class VisitorIdsImporter
  SOURCE_CSV_FILE = '/path/to/visitor_ids.csv'
  CHUNK_SIZE = 10_000
  MAX_IMPORT_THREADS = 3

  def initialize
    # Thread-safe queue that buffers chunks of rows between the CSV reader
    # (producer) and the import threads (consumers)
    @queue = Queue.new
  end
end
- produce the records
class VisitorIdsImporter
  def build_visitor_ids
    Thread.new do
      counter = 0
      visitor_ids = []
      CSV.foreach(SOURCE_CSV_FILE, headers: true) do |row|
        visitor_ids << [row['visitor id']]
        counter += 1
        # Once a chunk is full, hand it to the consumers and start a new one
        if (counter % CHUNK_SIZE).zero?
          @queue << visitor_ids
          visitor_ids = []
        end
      end
      # Flush the last, partially filled chunk
      @queue << visitor_ids unless visitor_ids.empty?
      # One :done sentinel per import thread so each consumer can exit
      MAX_IMPORT_THREADS.times do
        @queue << :done
      end
    end
  end
end
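One design note: Ruby's Queue is unbounded, so if the CSV reader outpaces the import threads, buffered chunks pile up in memory. If that becomes a problem, swapping in a SizedQueue inside initialize adds backpressure; the capacity below is just an illustrative value:

# The producer blocks on << when 10 chunks are already waiting,
# capping memory usage at roughly 10 * CHUNK_SIZE buffered rows.
@queue = SizedQueue.new(10)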
- import the records
class VisitorIdsImporter
  def import_visitor_ids
    MAX_IMPORT_THREADS.times.map do
      Thread.new do
        while (visitor_ids = @queue.pop)
          break if visitor_ids == :done
          # Insert the whole chunk with a single multi-row INSERT statement
          VisitorId.import! [:visitor_id], visitor_ids, validate: false
          puts "[#{Thread.current.object_id}]: #{Time.now}: Imported #{visitor_ids.size}."
        end
      end
    end.each(&:join)
  end
end
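A note on the options: import! raises if the import fails, while the non-bang import returns a result object you can inspect instead, and validate: false skips ActiveRecord validations for speed. If you would rather collect invalid rows than raise, something like this should work (assuming activerecord-import's result API):

# validate: true runs model validations; invalid rows land in failed_instances
result = VisitorId.import [:visitor_id], visitor_ids, validate: true
warn "#{result.failed_instances.size} rows failed validation" if result.failed_instances.any?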
- run the import
class VisitorIdsImporter
  def start
    puts "#{Time.now}: Import start"
    build_visitor_ids   # kick off the producer thread
    import_visitor_ids  # run the consumers; blocks until all chunks are imported
    puts "#{Time.now}: Import end"
  end
end
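With all the pieces in place, kicking off the import is a one-liner:

VisitorIdsImporter.new.start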
Summary
How long does it take to import 1 million records?
It depends.
As a reference, though: it took around 2 minutes for an AWS instance with 16 GB of memory and 4 CPUs to import about 1 million records into a MySQL instance also hosted on AWS.