Using Sequel and Ruby to import the Geonames database 2008-03-21


GeoNames is a project to gather geographical information under the creative commons attribution license, which means it's ideal for commercial use and hobbyists alike.

The version I have contains a whopping 6.7 million. names of populated places, locations and all kinds of other things that have a name and location. It's great for putting place names on a map for example.

I've been playing with it a bit and wanted to load the dataset into a SQL database to make it easier to work with. Here's an example Ruby script to use Sequel to load it into a Sqlite database.

It should be trivial to make it work with MySQL or Postgres (or any of the other databases supported by Sequel) too. In theory it's just a matter of changing the "DB =" line. In practice you may have to change the schema too.

To make this code work you need Rubygems, Sequel and Sqlite installed. You can install sequel from a gem with "gem install sequel". For Fedora/Redhat at least, Sqlite and the ruby bindings are included - if you don't have it installed do "yum install ruby-sqlite" or "up2date -i ruby-sqlite" if you're on RHEL.

First "model.rb", a tiny helper file requiring rubygems and Sequel, and initializing the database:


    require 'rubygems'
    require 'sequel'
    
    DB = Sequel("sqlite:///allCountries.db")
    
    require 'geoname'

Here's the main script:

    require 'model'
    
    ARGF.each do |line|
      g = nil
      begin
        g = GeoName.parse(line)
      rescue Exception => e
        puts "Exception during parsing: #{e.message}"
      end
      if g
        g.save
      else
        puts "Unable to parse line: #{line.chomp}"
      end
    end

As you can see it's a pretty trivial little driver that just loads the data from a file provided to standard input or as an argument on the command line, tries to parse the line, and insert it into the database (g.save), with some minimal error handling.

The guts of the import is in the GeoName model class ("geoname.rb"), which I'll break in two:


    require 'time'
    
    class GeoName < Model
      set_schema do
        primary_key :geonameid
        text :name
        text :asciiname
        text :alternatenames
        real :latitude
        real :longitude
        text :feature
        text :feature_code
        text :country_code
        text :cc2
        text :admin1
        text :admin2
        text :admin3
        text :admin4
        integer :population
        integer :elevation
        integer :gtopo30
        text :timezone
        timestamp :modification_date
    
        index :population
        index :name
        index [:country_code, :feature, :population]
        index [:country_code, :feature_code, :population]
      end
    
      create_table if !table_exists?

This is a simple way of specifying the database schema straight in your model. Note that Sequel doesn't require this - it will be perfectly happy to just work with whatever is in your database, and you can also use migrations to give you more flexibility in changing the database schema when your application changes. In this case, though, I'm expecting to blow away the existing table whenever I load a new version of the file, so it's convenient to just have the schema in the same location.

The "index" statement does exactly what they say: They create indexes on the columns listed after them. The indexes above are the ones I found convenient for my use - change as you please.

The column names have a 1-to-1 correspondence with the column names in the Geonames file.

Note the last line. It does exactly what it says: It creates the table it the table doesn't already exists. Combined with Sqlite this lets me just blow away the database file ("allCountries.db" in the same directory) and rerun the script and the database and table is recreated.

Here's the rest of the code:


      def self.parse line
        g = GeoName.new
        s = line.chomp.split("\t")
        g.geonameid = s[0]
        g.name = s[1]
        g.asciiname = s[2]
        g.alternatenames = s[3]
        g.latitude = s[4]
        g.longitude = s[5]
        g.feature = s[6]
        g.feature_code = s[7]
        g.country_code = s[8]
        g.cc2 = s[9]
        g.admin1 = s[10]
        g.admin2 = s[11]
        g.admin3 = s[12]
        g.admin4 = s[13]
        g.population = s[14]
        g.elevation = s[15]
        g.gtopo30 = s[16]
        g.timezone = s[17]
        g.modification_date = Time.parse(s[18])
        return g
      end
    end

Trivial, isn't it? It just splits up the line and assign each field to the appropriate column. You could use some hack to make it more compact, but for a simple script like this I prefer just keeping it simple and straightforward.

That's all there's to it - after that you can use the database (and the model class above), to do queries using Ruby. Here's an example using irb:



blog comments powered by Disqus