2008-03-21 14:38 UTC Using Sequel and Ruby to import the Geonames database
The version I have contains a whopping 6.7 million. names of populated places, locations and all kinds of other things that have a name and location. It's great for putting place names on a map for example.
I've been playing with it a bit and wanted to load the dataset into a SQL database to make it easier to work with. Here's an example Ruby script to use Sequel to load it into a Sqlite database.
It should be trivial to make it work with MySQL or Postgres (or any of the other databases supported by Sequel) too. In theory it's just a matter of changing the "DB =" line. In practice you may have to change the schema too.
To make this code work you need Rubygems, Sequel and Sqlite installed. You can install sequel from a gem with "gem install sequel". For Fedora/Redhat at least, Sqlite and the ruby bindings are included - if you don't have it installed do "yum install ruby-sqlite" or "up2date -i ruby-sqlite" if you're on RHEL.
First "model.rb", a tiny helper file requiring rubygems and Sequel, and initializing the database:
require 'rubygems' require 'sequel' DB = Sequel("sqlite:///allCountries.db") require 'geoname'
Here's the main script:
require 'model' ARGF.each do |line| g = nil begin g = GeoName.parse(line) rescue Exception => e puts "Exception during parsing: #{e.message}" end if g g.save else puts "Unable to parse line: #{line.chomp}" end end
As you can see it's a pretty trivial little driver that just loads the data from a file provided to standard input or as an argument on the command line, tries to parse the line, and insert it into the database (g.save), with some minimal error handling.
The guts of the import is in the GeoName model class ("geoname.rb"), which I'll break in two:
require 'time' class GeoName < Model set_schema do primary_key :geonameid text :name text :asciiname text :alternatenames real :latitude real :longitude text :feature text :feature_code text :country_code text :cc2 text :admin1 text :admin2 text :admin3 text :admin4 integer :population integer :elevation integer :gtopo30 text :timezone timestamp :modification_date index :population index :name index [:country_code, :feature, :population] index [:country_code, :feature_code, :population] end create_table if !table_exists?
This is a simple way of specifying the database schema straight in your model. Note that Sequel doesn't require this - it will be perfectly happy to just work with whatever is in your database, and you can also use migrations to give you more flexibility in changing the database schema when your application changes. In this case, though, I'm expecting to blow away the existing table whenever I load a new version of the file, so it's convenient to just have the schema in the same location.
The "index" statement does exactly what they say: They create indexes on the columns listed after them. The indexes above are the ones I found convenient for my use - change as you please.
The column names have a 1-to-1 correspondence with the column names in the Geonames file.
Note the last line. It does exactly what it says: It creates the table it the table doesn't already exists. Combined with Sqlite this lets me just blow away the database file ("allCountries.db" in the same directory) and rerun the script and the database and table is recreated.
Here's the rest of the code:
def self.parse line g = GeoName.new s = line.chomp.split("\t") g.geonameid = s[0] g.name = s[1] g.asciiname = s[2] g.alternatenames = s[3] g.latitude = s[4] g.longitude = s[5] g.feature = s[6] g.feature_code = s[7] g.country_code = s[8] g.cc2 = s[9] g.admin1 = s[10] g.admin2 = s[11] g.admin3 = s[12] g.admin4 = s[13] g.population = s[14] g.elevation = s[15] g.gtopo30 = s[16] g.timezone = s[17] g.modification_date = Time.parse(s[18]) return g end end
Trivial, isn't it? It just splits up the line and assign each field to the appropriate column. You could use some hack to make it more compact, but for a simple script like this I prefer just keeping it simple and straightforward.
That's all there's to it - after that you can use the database (and the model class above), to do queries using Ruby. Here's an example using irb:
# irb -rmodel irb(main):001:0> GeoName.order_by(:population.desc).limit(10).each{|g| printf "%10.d %s\n",g.population,g.name } 1306313812 Peoples Republic of China 1080264388 Republic of India 295734134 United States 241973879 Republic of Indonesia 186112794 Federative Republic of Brazil 166052900 State of Uttar Pradesh 162419946 Islamic Republic of Pakistan 144319628 Peoples Republic of Bangladesh 143420309 Russian Federation 128771988 Federal Republic of Nigeria => #<Sequel::SQLite::Dataset: "SELECT * FROM geo_names ORDER BY population DESC LIMIT 10"> irb(main):002:0>