How to import millions of records into MySQL
With a gem called activerecord-import, importing a huge number of records into a MySQL table becomes a much easier task, and this post demonstrates how to do it.
Requirement
Suppose we have a MySQL table called visitor_ids
that stores the visitor ids for some really high-traffic website. The SQL to create such a table:
CREATE TABLE `visitor_ids` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`visitor_id` varchar(255) NOT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
)
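In a Rails app, the same table could also come from a migration along the lines of the sketch below (the file name and the migration version tag are assumptions, not from the original post):

# db/migrate/20180101000000_create_visitor_ids.rb -- hypothetical migration
class CreateVisitorIds < ActiveRecord::Migration[5.2]
  def change
    # id: :integer gives the int(11) primary key shown in the SQL above;
    # newer Rails defaults to bigint.
    create_table :visitor_ids, id: :integer do |t|
      t.string :visitor_id, null: false
      t.timestamps null: true
    end
  end
end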
And an ActiveRecord model is defined as VisitorId.
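The model itself can stay empty, since activerecord-import adds its bulk-import methods to ActiveRecord::Base. A minimal sketch:

# app/models/visitor_id.rb
class VisitorId < ActiveRecord::Base
end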
Now suppose we have more than 1 million such visitor ids that need to be inserted into this table.
Solution
Use CSV as the data source
visitor id
4B1A9AFD6614529C3C37D5F18C5E070DD34D33BFA2B7E04F70D23B0FB1945401
A72BCFA9DC6A089C5D465531C988CFEC84F83CA0159250CD5A314D67544F7DEF
...
Use activerecord-import for bulk import
- require dependencies
require 'csv'
require 'activerecord-import/base'
require 'activerecord-import/active_record/adapters/mysql2_adapter'
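These requires load the gem and its mysql2 adapter support directly; this assumes activerecord-import is already installed, typically by declaring it in the Gemfile:

# Gemfile
gem 'activerecord-import'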
- initialize a queue to buffer the record objects
class VisitorIdsImporter
  SOURCE_CSV_FILE = '/path/to/visitor_ids.csv'
  CHUNK_SIZE = 10_000
  MAX_IMPORT_THREADS = 3

  def initialize
    # Thread-safe buffer between the CSV reader and the import threads.
    @queue = Queue.new
  end
end
- produce the records
class VisitorIdsImporter
  def build_visitor_ids
    Thread.new do
      counter = 0
      visitor_ids = []

      CSV.foreach(SOURCE_CSV_FILE, headers: true) do |row|
        # Each element is a row array matching the columns passed to import!.
        visitor_ids << [row['visitor id']]
        counter += 1

        # Push a full chunk onto the queue and start a new one.
        if (counter % CHUNK_SIZE).zero?
          @queue << visitor_ids
          visitor_ids = []
        end
      end

      # Flush the final partial chunk.
      @queue << visitor_ids unless visitor_ids.empty?

      # One :done marker per consumer thread so every consumer can exit.
      MAX_IMPORT_THREADS.times do
        @queue << :done
      end
    end
  end
end
- import the records
class VisitorIdsImporter
  def import_visitor_ids
    MAX_IMPORT_THREADS.times.map do
      Thread.new do
        # Queue#pop blocks until a chunk (or the :done marker) arrives.
        while (visitor_ids = @queue.pop)
          break if visitor_ids == :done

          VisitorId.import! [:visitor_id], visitor_ids, validate: false
          puts "[#{Thread.current.object_id}]: #{Time.now}: Imported #{visitor_ids.size}."
        end
      end
    end.each(&:join)
  end
end
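A note on the options: import! is the raising variant of the gem's import method, so a failed insert aborts the run instead of being silently collected, while validate: false skips ActiveRecord validations on each record for extra speed.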
- run the import
class VisitorIdsImporter
  def start
    puts "#{Time.now}: Import start"
    build_visitor_ids
    import_visitor_ids
    puts "#{Time.now}: Import end"
  end
end
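With all the pieces defined, a run is a one-liner, assuming the ActiveRecord connection to the target database is already established and SOURCE_CSV_FILE points at the real file:

# Reads the CSV on one producer thread and bulk-inserts chunks
# on MAX_IMPORT_THREADS consumer threads.
VisitorIdsImporter.new.start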
Summary
How long does it take to import 1 million records?
It depends.
As a reference, though:
It took around 2 minutes on an AWS instance with 16 GB of memory and 4 CPUs to import about 1 million records into a MySQL instance also hosted on AWS.