Vidar Hokstad V2.0

Home Blog

Tag: sequel

2008-03-21 14:38 UTC Using Sequel and Ruby to import the Geonames database

GeoNames is a project to gather geographical information under the creative commons attribution license, which means it's ideal for commercial use and hobbyists alike.

The version I have contains a whopping 6.7 million. names of populated places, locations and all kinds of other things that have a name and location. It's great for putting place names on a map for example.

I've been playing with it a bit and wanted to load the dataset into a SQL database to make it easier to work with. Here's an example Ruby script to use Sequel to load it into a Sqlite database.

It should be trivial to make it work with MySQL or Postgres (or any of the other databases supported by Sequel) too. In theory it's just a matter of changing the "DB =" line. In practice you may have to change the schema too.

To make this code work you need Rubygems, Sequel and Sqlite installed. You can install sequel from a gem with "gem install sequel". For Fedora/Redhat at least, Sqlite and the ruby bindings are included - if you don't have it installed do "yum install ruby-sqlite" or "up2date -i ruby-sqlite" if you're on RHEL.

First "model.rb", a tiny helper file requiring rubygems and Sequel, and initializing the database:

require 'rubygems'
require 'sequel'                                                                                                                                                                                                                                                                                                                                        

DB = Sequel("sqlite:///allCountries.db")

require 'geoname'

Here's the main script:

require 'model'

ARGF.each do |line| g = nil begin g = GeoName.parse(line) rescue Exception => e puts "Exception during parsing: #{e.message}" end if g g.save else puts "Unable to parse line: #{line.chomp}" end end

As you can see it's a pretty trivial little driver that just loads the data from a file provided to standard input or as an argument on the command line, tries to parse the line, and insert it into the database (g.save), with some minimal error handling.

The guts of the import is in the GeoName model class ("geoname.rb"), which I'll break in two:

require 'time'

class GeoName < Model set_schema do primary_key :geonameid text :name text :asciiname text :alternatenames real :latitude real :longitude text :feature text :feature_code text :country_code text :cc2 text :admin1 text :admin2 text :admin3 text :admin4 integer :population integer :elevation integer :gtopo30 text :timezone timestamp :modification_date

index :population index :name index [:country_code, :feature, :population] index [:country_code, :feature_code, :population] end

create_table if !table_exists?

This is a simple way of specifying the database schema straight in your model. Note that Sequel doesn't require this - it will be perfectly happy to just work with whatever is in your database, and you can also use migrations to give you more flexibility in changing the database schema when your application changes. In this case, though, I'm expecting to blow away the existing table whenever I load a new version of the file, so it's convenient to just have the schema in the same location.

The "index" statement does exactly what they say: They create indexes on the columns listed after them. The indexes above are the ones I found convenient for my use - change as you please.

The column names have a 1-to-1 correspondence with the column names in the Geonames file.

Note the last line. It does exactly what it says: It creates the table it the table doesn't already exists. Combined with Sqlite this lets me just blow away the database file ("allCountries.db" in the same directory) and rerun the script and the database and table is recreated.

Here's the rest of the code:

  def self.parse line
    g = GeoName.new
    s = line.chomp.split("\t")
    g.geonameid = s[0]
    g.name = s[1]
    g.asciiname = s[2]
    g.alternatenames = s[3]
    g.latitude = s[4]
    g.longitude = s[5]
    g.feature = s[6]
    g.feature_code = s[7]
    g.country_code = s[8]
    g.cc2 = s[9]
    g.admin1 = s[10]
    g.admin2 = s[11]
    g.admin3 = s[12]
    g.admin4 = s[13]
    g.population = s[14]
    g.elevation = s[15]
    g.gtopo30 = s[16]
    g.timezone = s[17]
    g.modification_date = Time.parse(s[18])
    return g
  end
end

Trivial, isn't it? It just splits up the line and assign each field to the appropriate column. You could use some hack to make it more compact, but for a simple script like this I prefer just keeping it simple and straightforward.

That's all there's to it - after that you can use the database (and the model class above), to do queries using Ruby. Here's an example using irb:

# irb -rmodel
irb(main):001:0> GeoName.order_by(:population.desc).limit(10).each{|g| printf "%10.d %s\n",g.population,g.name }
1306313812 Peoples Republic of China
1080264388 Republic of India
 295734134 United States
 241973879 Republic of Indonesia
 186112794 Federative Republic of Brazil
 166052900 State of Uttar Pradesh
 162419946 Islamic Republic of Pakistan
 144319628 Peoples Republic of Bangladesh
 143420309 Russian Federation
 128771988 Federal Republic of Nigeria
=> #<Sequel::SQLite::Dataset: "SELECT * FROM geo_names ORDER BY population DESC LIMIT 10">
irb(main):002:0> 


2008-03-20 15:17 UTC Sequel with Sqlite caveat: Sorting on dates

Posted in: , , ,
One slight hickup it's worth being aware of when storing timestamps in Sequel is that if you use the Sqlite adapter, the timestamp is stored as an ISO 8601 string with timezone, like this:

2008-03-20T11:26:52Z

The "Z" at the end signifies UTC.

This isn't a problem if you're consistent, but if you parse times with Time.parse() in Ruby, and some of the strings you parse contain another timezone, you might end up getting times in different timezones in your database.

Needless to say that screws up sorting.

Avoiding this is pretty simple, though the best solution would probably be to fix the Sequel Sqlite adapter - I'll take a look at that later. Here's my quick workaround, which also illustrates Sequel hooks:

class Item < Sequel::Model
  before_save do
      self.published_at = Time.parse(self.published_at.to_s).utc
    end
  end
end

Anything in the "before_save" block gets executed before an insert or update. Similarly you can add hooks after insert/update with "after_save". In this case all I do is convert whatever is in published_at to a string and reparsing it and forcing it to UTC. Which timezone you force it too doesn't matter, as long as you're consistent it will sort correctly.

(As a side effect, the code above also happily lets me do things like Item.published_at = :now to set the timestamp automatically on save)


2008-03-20 11:26 UTC Sequel ORM: Right level of abstraction

Posted in: , , , ,
The more I use Sequel, the more I am coming to believe the level of abstraction it picks is just right. You use operations that mirror SQL, but in a mostly Ruby-ish way. Most importantly SQL is so completely wrapped that I've yet to see any need to write "raw" SQL.

Let me take a couple of examples based on my blog code.

To get the list of the 10 most recent pages for a tag, I do this:

Item.join(:tags,:item_id).filter{:name == tag}.published.latest(10)

"published" and "latest" are methods in the model, that both work on the Sequel dataset:

class Item < Sequel::Model
  subset(:published) {:published_at != nil}

def dataset.latest(l = nil) return order_by(:published_at.desc).limit(l) if (l) return order_by(:published_at.desc) end end

"subset" above is Sequel shorthand for:

  def dataset.published
     filter{:published != nil}
  end

Which means the original statement translates into SQL like this:

irb(main):005:0> Item.join(:tags,:item_id).filter{:name == "ruby"}.published.latest(10).sql
=> "SELECT items.* FROM items INNER JOIN tags ON (tags.item_id = items.id) WHERE (name = 'ruby') AND (NOT (published_at IS NULL)) ORDER BY published_at DESC LIMIT 10"

Notice the "sql" method call? You can call that on any Sequel dataset to see what SQL it will use. If I want to run it, I use any of the normal enumerable methods, such as each, collect etc. to process the dataset, as you'd expect.

The syntax feels so natural that I find myself almost exclusively using IRB to execute queries against my database. Since it doesn't need a model to work (you can replace "Item" above with DB[:items], and any of the pure Sequel calls will still work without having the Item class) it works fine for ad-hoc queries too. It's in effect pretty close to being a replacement for Ruby-DBI with far more of a Ruby feel.

The key thing for me, though, is not feeling constrained and at the same time not having to dip down into ugly hacks such as adding actual SQL queries (If you really need to, Sequel will let you do that too - so far I've not had a reason - and if you do, you can still add support for it by extending the dataset so you can contain all the SQL ugliness in one place).


2008-03-18 11:42 UTC Sequel praise and Sqlite type translation problems

Posted in: , , ,
I ran into a very annoying problem with Sqlite yesterday, triggered by Sequel. But first some praise for how simple Sequel makes things most of the time:

One of the nice things about Sequel is that it allows me to write complex queries in pure ruby, without ever (so far, anyway) dipping into actual Sequel.

Last night I was throwing together a tag histogram for this blog, and suddenly everything blew up.

The offending code was this, which nicely illustrates Sequel:

Tag.select(:name,:count[:name].as(:count).
  group_by(:name).order_by(:count.desc)

Which I during my debugging discovered could be reduced to:

Tag.group_and_count(:name).order_by(:count.desc)

It does order by count, but ascending by default. What's even nicer is Sequels "dataset" concept, which let me write this:

class Tag
  def dataset.histogram
     group_and_count(:name).order_by(:count.desc)
  end
end

By creating the method on the dataset I can chain it with other Sequel methods:

   Tag.histogram.limit(10)

Very nice and neat. However, the above all blew up for me. I figured I'd try running everything off Sqlite since I'm not exactly expecting this blog to draw a ton of traffic given the extreme geekery likely to be found here on a regular basis.

Part of the problem is that Sqlite is dynamically typed. It's one of the things I love about Sqlite usually, and it makes it a very nice match for Ruby most of the time. The problem is that when reading things out of Sqlite it all comes back as strings. So the Ruby bindings add type translation to translate the strings back into native Ruby. The combination is potentially very powerful, as nothing stops you from automatically serializing and unserializing full Ruby objects (and combined with Sqlite callbacks you could actually do a lot of wonderfully evil things with that, including calling methods on the objects stored in the database).

The problem is that at least as used by Sequel this blows up when one of the columns doesn't have a type, for example because it's not a column but the return value of a function.

For now I've solved this by changing this part of sqlite3/resultset.rb from this:

if @db.type_translation
  row = @stmt.types.zip( row ).map do |type, value|
    @db.translator.translate( type , value )
  end
end

Into:

if @db.type_translation
  row = @stmt.types.zip( row ).map do |type, value|
    @db.translator.translate( type ? type : "NUMERIC" , value )
  end
end

It's not a good solution, so I won't submit it upstream, but it works for now. The problem is that it will break for any function returning a string instead of a number.


Older Entries

About me

E-mail: vidar@hokstad.com
Skype: vhokstad
View my LinkedIn profile

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Categories

StumbleUpon My link page

(Links I have stumbled and like)