Vidar Hokstad V2.0

Home Blog

Tag: geonames

2008-03-21 14:38 UTC Using Sequel and Ruby to import the Geonames database

GeoNames is a project to gather geographical information under the creative commons attribution license, which means it's ideal for commercial use and hobbyists alike.

The version I have contains a whopping 6.7 million. names of populated places, locations and all kinds of other things that have a name and location. It's great for putting place names on a map for example.

I've been playing with it a bit and wanted to load the dataset into a SQL database to make it easier to work with. Here's an example Ruby script to use Sequel to load it into a Sqlite database.

It should be trivial to make it work with MySQL or Postgres (or any of the other databases supported by Sequel) too. In theory it's just a matter of changing the "DB =" line. In practice you may have to change the schema too.

To make this code work you need Rubygems, Sequel and Sqlite installed. You can install sequel from a gem with "gem install sequel". For Fedora/Redhat at least, Sqlite and the ruby bindings are included - if you don't have it installed do "yum install ruby-sqlite" or "up2date -i ruby-sqlite" if you're on RHEL.

First "model.rb", a tiny helper file requiring rubygems and Sequel, and initializing the database:

require 'rubygems'
require 'sequel'                                                                                                                                                                                                                                                                                                                                        

DB = Sequel("sqlite:///allCountries.db")

require 'geoname'

Here's the main script:

require 'model'

ARGF.each do |line| g = nil begin g = GeoName.parse(line) rescue Exception => e puts "Exception during parsing: #{e.message}" end if g g.save else puts "Unable to parse line: #{line.chomp}" end end

As you can see it's a pretty trivial little driver that just loads the data from a file provided to standard input or as an argument on the command line, tries to parse the line, and insert it into the database (g.save), with some minimal error handling.

The guts of the import is in the GeoName model class ("geoname.rb"), which I'll break in two:

require 'time'

class GeoName < Model set_schema do primary_key :geonameid text :name text :asciiname text :alternatenames real :latitude real :longitude text :feature text :feature_code text :country_code text :cc2 text :admin1 text :admin2 text :admin3 text :admin4 integer :population integer :elevation integer :gtopo30 text :timezone timestamp :modification_date

index :population index :name index [:country_code, :feature, :population] index [:country_code, :feature_code, :population] end

create_table if !table_exists?

This is a simple way of specifying the database schema straight in your model. Note that Sequel doesn't require this - it will be perfectly happy to just work with whatever is in your database, and you can also use migrations to give you more flexibility in changing the database schema when your application changes. In this case, though, I'm expecting to blow away the existing table whenever I load a new version of the file, so it's convenient to just have the schema in the same location.

The "index" statement does exactly what they say: They create indexes on the columns listed after them. The indexes above are the ones I found convenient for my use - change as you please.

The column names have a 1-to-1 correspondence with the column names in the Geonames file.

Note the last line. It does exactly what it says: It creates the table it the table doesn't already exists. Combined with Sqlite this lets me just blow away the database file ("allCountries.db" in the same directory) and rerun the script and the database and table is recreated.

Here's the rest of the code:

  def self.parse line
    g = GeoName.new
    s = line.chomp.split("\t")
    g.geonameid = s[0]
    g.name = s[1]
    g.asciiname = s[2]
    g.alternatenames = s[3]
    g.latitude = s[4]
    g.longitude = s[5]
    g.feature = s[6]
    g.feature_code = s[7]
    g.country_code = s[8]
    g.cc2 = s[9]
    g.admin1 = s[10]
    g.admin2 = s[11]
    g.admin3 = s[12]
    g.admin4 = s[13]
    g.population = s[14]
    g.elevation = s[15]
    g.gtopo30 = s[16]
    g.timezone = s[17]
    g.modification_date = Time.parse(s[18])
    return g
  end
end

Trivial, isn't it? It just splits up the line and assign each field to the appropriate column. You could use some hack to make it more compact, but for a simple script like this I prefer just keeping it simple and straightforward.

That's all there's to it - after that you can use the database (and the model class above), to do queries using Ruby. Here's an example using irb:

# irb -rmodel
irb(main):001:0> GeoName.order_by(:population.desc).limit(10).each{|g| printf "%10.d %s\n",g.population,g.name }
1306313812 Peoples Republic of China
1080264388 Republic of India
 295734134 United States
 241973879 Republic of Indonesia
 186112794 Federative Republic of Brazil
 166052900 State of Uttar Pradesh
 162419946 Islamic Republic of Pakistan
 144319628 Peoples Republic of Bangladesh
 143420309 Russian Federation
 128771988 Federal Republic of Nigeria
=> #<Sequel::SQLite::Dataset: "SELECT * FROM geo_names ORDER BY population DESC LIMIT 10">
irb(main):002:0> 


Older Entries

About me

E-mail: vidar@hokstad.com
Skype: vhokstad
View my LinkedIn profile

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Categories

StumbleUpon My link page

(Links I have stumbled and like)