With a gem called activerecord-import, importing a huge number of records into a MySQL table becomes a much easier task. This post demonstrates how to do it.
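
The gem can be pulled in the usual way; a minimal sketch, assuming a Gemfile-based project that already uses the mysql2 adapter:

# Gemfile
gem 'activerecord-import'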

Requirement

Suppose we have a MySQL table called visitor_ids that stores the visitor ids of a really high-traffic website. The SQL to create such a table:

CREATE TABLE `visitor_ids` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `visitor_id` varchar(255) NOT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
)

An ActiveRecord model is defined as VisitorId. Now suppose we have more than 1 million such visitor ids that need to be inserted into this table.
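
The model itself can stay minimal; a sketch of what is assumed here (the real model may of course carry validations or associations):

# app/models/visitor_id.rb
class VisitorId < ActiveRecord::Base
end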

Solution

Use CSV as the data source

The visitor ids come from a CSV file with a single column whose header, visitor id, is the key later used by CSV.foreach:
visitor id
4B1A9AFD6614529C3C37D5F18C5E070DD34D33BFA2B7E04F70D23B0FB1945401
A72BCFA9DC6A089C5D465531C988CFEC84F83CA0159250CD5A314D67544F7DEF
...

Use activerecord-import for bulk import

  • require dependencies
    require 'csv'  # needed for CSV.foreach below
    require 'activerecord-import/base'
    require 'activerecord-import/active_record/adapters/mysql2_adapter'
    
  • initialize a queue to buffer the record rows
    class VisitorIdsImporter
      SOURCE_CSV_FILE = '/path/to/visitor_ids.csv'
      CHUNK_SIZE = 10_000          # rows per bulk insert
      MAX_IMPORT_THREADS = 3       # concurrent importer threads

      def initialize
        # the producer pushes chunks of rows here; the importers pop them off
        @queue = Queue.new
      end
    end
    
  • produce the records
    class VisitorIdsImporter
      def build_visitor_ids
        # producer thread: stream the CSV and push chunks of rows onto the queue
        Thread.new do
          counter = 0
          visitor_ids = []
          CSV.foreach(SOURCE_CSV_FILE, headers: true) do |row|
            visitor_ids << [row['visitor id']]
            counter += 1
            if (counter % CHUNK_SIZE).zero?
              @queue << visitor_ids
              visitor_ids = []
            end
          end
          # flush the last, possibly partial, chunk
          @queue << visitor_ids unless visitor_ids.empty?
          # push one :done marker per importer thread so each one knows to stop
          MAX_IMPORT_THREADS.times do
            @queue << :done
          end
        end
      end
    end
    
  • import the records
    class VisitorIdsImporter
      def import_visitor_ids
        # consumer threads: pop chunks from the queue and bulk insert them
        MAX_IMPORT_THREADS.times.map do
          Thread.new do
            while (visitor_ids = @queue.pop)
              break if visitor_ids == :done

              # validate: false skips per-record validations for speed
              VisitorId.import! [:visitor_id], visitor_ids, validate: false
              puts "[#{Thread.current.object_id}]: #{Time.now}: Imported #{visitor_ids.size}."
            end
          end
        end.each(&:join)
      end
    end
    
  • run the import (a short usage sketch follows this list)
    class VisitorIdsImporter
      def start
        puts "#{Time.now}: Import start"
        build_visitor_ids   # start the producer thread
        import_visitor_ids  # start the importer threads and wait for them
        puts "#{Time.now}: Import end"
      end
    end
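
To run the whole import, something like the following should work; a minimal sketch, assuming the class above is loaded in a context where ActiveRecord is already connected (for example a rake task or a rails runner script):

# kick off the producer and the importer threads, then wait for completion
VisitorIdsImporter.new.start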
    

Summary

How long does it take to import 1 million records? It depends. As a reference, it took around 2 minutes for an AWS instance with 16 GB of memory and 4 CPUs to import about 1 million records into a MySQL instance also hosted on AWS.

Post tagged with: rails, ruby