Tuesday, March 30, 2010

Word stemming

One of the recurring issues in working with (bibliometric) data is data cleaning. While we already have an experimental tool called Record Grouper that is being refactored at the moment, we also needed a quick-and-dirty way to do some basic cleaning on words.

To meet this need, I have added a new feature to the Word Splitter tool. The tool now builds an additional table called Wordstems, and it adds an extra field to the Words table with a reference to a record in this Wordstems table. Each word is now stemmed using the Porter stemming algorithm. I am looking into replacing this algorithm with a more accurate one, but the idea will stay the same if that happens.

So now, each word gets its stem (according to this Porter algorithm) associated with it through the stem's ID number. This makes it easy to treat words that share a stem as the same word, so we can easily get rid of the difference between "robot", "robots", "robotic" and "robotics" if we want.
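To illustrate the idea (in Python rather than the C++/Qt code SAINT actually uses), here is a toy sketch of that stem-ID grouping. The suffix rules below are a crude simplification for illustration only, not the real Porter algorithm:

```python
# Toy illustration of stem grouping: map each word to a stem, assign each
# distinct stem an ID, and link every word to its stem ID (like the
# Wordstems table). The suffix rules are a crude stand-in for Porter.

def toy_stem(word):
    """Strip a few common suffixes; longest match first."""
    for suffix in ("ics", "ies", "ic", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_stem_tables(words):
    """Return (wordstems, word_to_stem_id): stem -> ID, word -> stem ID."""
    wordstems = {}          # stem text -> stem ID
    word_to_stem_id = {}    # word text -> stem ID
    for word in words:
        stem = toy_stem(word.lower())
        if stem not in wordstems:
            wordstems[stem] = len(wordstems) + 1
        word_to_stem_id[word] = wordstems[stem]
    return wordstems, word_to_stem_id

stems, mapping = build_stem_tables(["robot", "robots", "robotic", "robotics"])
# All four words share the stem "robot", so they all get stem ID 1.
```

Once every word carries a stem ID, treating stem-mates as one word is just a matter of grouping on that ID instead of on the word itself.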

On a note that relates to the last posting on database compatibility: while the Word Splitter still only works with MS Access, a lot has changed under the hood. I have implemented a driver system that extends the database drivers available in Qt by default, and works around a couple of bugs along the way. Based on this, I have created an Access driver that is now used by the Word Splitter tool. While the tool is not completely database independent yet, it is a good start and a good test for the new driver.

Wednesday, February 24, 2010

Database compatibility

At the moment, SAINT only supports Microsoft Access as a database to store your data. Because most of the tools are tightly coupled to this data structure, you are effectively locked into Access. That is becoming a real problem.

First of all, Access itself is a problem. Access can handle only about 2GB of data, and that limit includes the temporary space needed to execute some complex queries. In practice, you will therefore run into the limit much earlier, in the form of a vague error message. Access has other problems too, but this is the main issue. Access does have two big strengths. First, its databases are easy to handle: you can just copy the files around, and Access is already installed in many professional environments. Second, it has quite a good visual query designer, which in the context of SAINT makes Access less a database than an analysis environment.

The second problem with Access has to do with Qt, the toolkit that SAINT is built on. Qt supplies a set of database drivers that make it possible to interface with different databases. However, it does not have a driver for Access/JET, so SAINT accesses the databases through the ODBC layer. That has an impact on the speed of the connection, as well as on the supported features. One solution would of course be to write a database driver for Qt myself, but that is not a trivial task.

The third problem is cross-platform compatibility. Access only runs on Microsoft Windows, but in the scientific world, Macs and Linux machines are not at all uncommon. It would be very good if SAINT could work on those systems as well. Because Qt itself is cross-platform, that would not be so hard, were SAINT not bound (too) tightly to Access and thus to Windows.

So, it was time to come up with a solution.
As I said, Qt supports database drivers for different databases, and these work cross-platform. However, they do not entirely abstract away the differences between the databases themselves. Different databases support different data types, for instance. They also use slightly different query languages: the base is the same (SQL), but the dialects differ. That is easy to understand: the databases are different for a reason, and they support different features. That also means that you sometimes have to talk to them in a slightly different way. Qt helps with this, but not enough.

This issue makes it hard to simply exchange one database for another. That would also create other problems for the users: what to do with existing data sets that are already in Access? An optimal solution would be to build in support for several data back-ends, and let the user choose in an easy way which one to use. That way, the tools can be used for small, simple sets that work perfectly on Access or some other small local database, but also for huge sets that run on a remote server.

To achieve that, I need to extend Qt's abstraction of the database back-end. Currently, it supports basic manipulation and querying of the data in the database, but it does not support changing the database structure. That feature is needed in many tools though: many parts of SAINT create new tables, or extend existing ones. The solution I have decided on focuses on extending the Qt SQL driver model to support these operations, and then refactoring the existing SAINT code to use this abstraction and to remove the explicit SQL code that is currently used everywhere. All SQL will then be generated by the abstraction layer, making it possible to support different databases as a back-end for SAINT by just replacing the database driver. Of course, there is also some work to be done in the interface, as connecting to a remote database over the Internet needs a different setup than just pointing to a file on your local file system.
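The core idea of such an abstraction layer can be sketched in a few lines. The real work happens in C++ on top of Qt's QSqlDriver model, and the type mappings below are illustrative examples rather than the actual ones, but the principle is the same: the tools describe a table abstractly, and a per-backend driver turns that description into dialect-specific SQL.

```python
# Sketch of the driver-abstraction idea: one abstract table description,
# different CREATE TABLE dialects per backend. The type mappings here are
# illustrative, not SAINT's real ones.

DIALECTS = {
    # backend name -> mapping from abstract column type to SQL type
    "access": {"id": "COUNTER PRIMARY KEY", "text": "TEXT(255)", "int": "LONG"},
    "mysql":  {"id": "INT AUTO_INCREMENT PRIMARY KEY", "text": "VARCHAR(255)", "int": "INT"},
    "sqlite": {"id": "INTEGER PRIMARY KEY AUTOINCREMENT", "text": "TEXT", "int": "INTEGER"},
}

def create_table_sql(backend, table, columns):
    """columns: list of (name, abstract_type). Returns a CREATE TABLE statement."""
    types = DIALECTS[backend]
    cols = ", ".join(f"{name} {types[atype]}" for name, atype in columns)
    return f"CREATE TABLE {table} ({cols});"

# The same abstract description yields different SQL per backend:
wordstems = [("ID", "id"), ("Stem", "text")]
for backend in DIALECTS:
    print(create_table_sql(backend, "Wordstems", wordstems))
```

With a layer like this, a tool never writes literal SQL; swapping the back-end means swapping the driver, not rewriting the tools.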

I am developing this code in a separate branch for now. It will become SAINT version 2.0. So far, progress is promising, but it is still far from production ready. Expect a first beta release of this code somewhere in April.

Wednesday, February 10, 2010

Changes in ISI data importer: increased compatibility with other sources

Recently, I was contacted by somebody who had tried to use SAINT on a dataset converted from PubMed to the ISI text format. That did not work. It turned out that the ISI data importer relied on the existence of a truly ISI-specific field to ensure records were unique. That field does not exist in the data created by the converter tool, so the application regarded all records as identical: the code field was empty for every one of them.

A small update fixed that. Instead of relying only on the ISI code field, the program now tries other options. The first is to check whether there is a valid DOI code; that code is unique too, if it is available, and if so, it is used as the unique identifier. If there is no DOI code, an artificial code is generated based on the journal name, the ISSN number, the year, the full name of the first author and the title of the article. That should yield a reasonably unique identifier. The code is prefixed with a tag that tells you where it came from: "isi", "doi" or "saint".
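The fallback chain looks roughly like this sketch in Python. The field names and the MD5 hash are illustrative choices of mine; the importer itself combines the bibliographic fields in its own way, but the order of preference is as described above:

```python
# Sketch of the identifier fallback chain: prefer the ISI code, then the DOI,
# and otherwise derive an artificial "saint" code from bibliographic fields.
# Hashing the fields with MD5 is just one illustrative way to combine them.

import hashlib

def unique_code(record):
    """record: dict with optional 'isi', 'doi', and bibliographic fields."""
    if record.get("isi"):
        return "isi:" + record["isi"]
    if record.get("doi"):
        return "doi:" + record["doi"]
    # Fall back to a code built from journal, ISSN, year,
    # the first author's full name and the title.
    parts = (record.get("journal", ""), record.get("issn", ""),
             str(record.get("year", "")), record.get("first_author", ""),
             record.get("title", ""))
    digest = hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()
    return "saint:" + digest

print(unique_code({"isi": "000123456700001"}))   # isi:000123456700001
print(unique_code({"doi": "10.1000/xyz"}))       # doi:10.1000/xyz
print(unique_code({"journal": "Scientometrics", "year": 2009,
                   "first_author": "Doe, John", "title": "A study"}))
```

Note that the prefix makes the origin of each code visible at a glance, which matters for the two pitfalls described below.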

As a side effect of this change, I have also added a new field to the Articles table output: the DOI field is now included. That also creates an interesting matching opportunity with the cited references.

All very nice perhaps, but... the changes can lead to two problems:
1) If you import data from multiple sources, you may run into duplicates. This can happen if an article occurs in, for instance, both an ISI source and another source. In the ISI source there will be an ISI identifier, so that will be used as the unique code, even if there is a DOI field too. If the same article appears in another data source that lacks the ISI identifier, we run into a problem: the DOI field may be used as the identifier there, and the two records are no longer recognized as the same article.

2) If you use this newest version of the ISI data importer to augment data imported with an earlier version, old articles will no longer be recognized as the same, because the previous version did not add the "isi" prefix to the code. If you plan to do that, you should first prepend the string "isi:" (without the quotes) to every code field. You can do that by running the query below:

UPDATE Articles SET Articles.code = "isi:"+[Articles].[code];

Monday, January 25, 2010

ISI importer demonstration

After some good responses to my video demonstrating the Matrix Builder tool, I decided to make a video demonstrating the ISI Data Importer tool. Again, the video is available on YouTube.

The video demonstrates the complete process from selecting your input file to starting the actual parsing, and highlights the new features of the parser that help you identify the next step. I hope it will be useful to you!

Update (January 27, 2010):
OK, I am still learning how to deal with YouTube and how to optimize the videos properly. I have replaced the video with an HD version. If all goes well, it should work now. Watch in full screen at 720p quality for optimal viewing. It may take a little time for YouTube to process the video, though.

Friday, January 15, 2010

Video demonstration

I made a first attempt at creating a demonstration video for our SAINT tools. The first tool to get one is the Matrix Builder tool. The video is very basic and runs for less than a minute and a half. I will create more elaborate videos later on, but it may already be useful to somebody.

Update: the video has been replaced with a higher-resolution version. This should improve readability quite a lot. It may take a little while before the HD version is actually processed and you notice the improved quality.

Wrong file in download link

When uploading all the files and updating the CMS to point to them, I made a mistake. The installer file I pointed to from our CMS on the Rathenau website was actually a very old installer. The installer linked from this blog was the correct one.

The mistake has been fixed now, but still: embarrassing. Sorry for any confusion or inconvenience!

Wednesday, January 13, 2010

Release 2010.01 beta 1 is out!

Yes, I finally made a new release ready for testing.

This new release fixes several bugs, the one most probably bugging you being the ISI parser issue. Even though there was a work-around, it was an annoying issue. Sorry about that!

Another big thing is that this is the first release to include community detection. As part of the Network Tools program, you can now employ the algorithm described in V. Blondel, J.-L. Guillaume, R. Lambiotte and E. Lefebvre (2008), Fast unfolding of community hierarchies in large networks. You can find more on the algorithm on their website.
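The algorithm works by greedily optimising the modularity of the partition. As a minimal illustration of the quantity being optimised (not of SAINT's implementation), here is a Python sketch that computes the modularity Q of a given partition of an undirected graph:

```python
# Modularity of a partition of an undirected graph:
#   Q = sum over communities c of [ L_c / m - (d_c / 2m)^2 ]
# where m is the total number of edges, L_c the number of edges inside
# community c, and d_c the sum of the degrees of its nodes.

def modularity(edges, community):
    """edges: list of undirected (u, v) pairs; community: dict node -> label."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Fraction of edges that fall inside communities:
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # Expected fraction of such edges under the configuration model:
    expected = sum(
        (sum(d for n, d in degree.items() if community[n] == c) / (2 * m)) ** 2
        for c in set(community.values())
    )
    return inside - expected

# Two triangles joined by a single edge, split into their natural communities:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, community))  # 6/7 - 0.5, about 0.357
```

The Blondel et al. method repeatedly moves nodes between communities to increase this Q, then collapses communities into super-nodes and repeats, which is what makes it fast enough for large networks.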

As already described in the last blog posting, there are also some changes under the hood of the Matrix Builder tool. Besides labels and edge values, you can now also output other properties to the Pajek .net output file, allowing you to display more aspects of your network at the same time.
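To give an idea of what such output looks like, here is a small Python sketch of a Pajek .net file with labels plus one extra vertex property. This is a sketch of the file format, not the Matrix Builder's actual code; I use `ic` (interior colour), one of Pajek's standard vertex parameters, as the example property:

```python
# Sketch of Pajek .net output: a vertices section with labels and an extra
# display property per vertex, followed by a weighted edge list.

def write_pajek(nodes, edges):
    """nodes: list of (label, colour); edges: list of (u, v, weight), 1-based."""
    lines = [f"*Vertices {len(nodes)}"]
    for i, (label, colour) in enumerate(nodes, start=1):
        lines.append(f'{i} "{label}" ic {colour}')
    lines.append("*Edges")
    for u, v, w in edges:
        lines.append(f"{u} {v} {w}")
    return "\n".join(lines) + "\n"

net = write_pajek(
    nodes=[("robotics", "Red"), ("ethics", "Blue")],
    edges=[(1, 2, 3)],
)
print(net)
```

Because the extra properties ride along in the same file, one .net export can drive both the layout and the visual encoding of node attributes in Pajek.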

There are many other small tweaks and bug fixes in this version. Expect more additions before the final release. When will that be released? When it is done!

Download the latest beta toolkit (installer for Microsoft Windows) here.