Tuesday 30 June 2009

Development priorities

We have set some development priorities after our recent first release. These are partly reflected in the Ticket tracker, but to be honest, that doesn't quite do them justice yet.

A small but important update is to change the output format of the Matrix Compiler tool. Currently, it outputs a full matrix in DL format. This is a bit inflexible, as it does not allow attributes other than a label for the nodes, and it outputs a full matrix even for a large, sparse one. That results in bigger output files, and thus longer processing and I/O times. In the (hopefully near) future, it will allow attributes to be attached to nodes as well as connections. You can track the progress on this issue here. We're already testing it...

A larger task has to do with reworking the Record Grouper and the Relation Calculator. The first part of that job is to specialize the Record Grouper to do only that: group objects based on some kind of relation between them. This means ripping out a large piece of complicated code (but not throwing it away, see below) and focusing on a good UI that makes the tool easier to work with. That means fine-tuning the layout, but also adding options to undo, save and re-load your work, etc. That is a lot of work in itself, but it will simplify the code considerably, making it easier to maintain.

Another large task is to get the Relation Calculator into a usable shape. This is a complex tool. The basic idea is that it will become a specialized tool to calculate a similarity or distance measure between any two objects in the database, and be as flexible as possible about how that measure is calculated. Currently, we only use SQL queries to calculate such scores, but that is sometimes limited, often complex, and usually relatively slow, because most SQL backends don't use multi-threading for single queries, let alone offload work to the video card.

You can express a lot of such measures in SQL, but it is often complex to do, especially if you are not used to working with databases. That makes the current way of working hard on novices, but also on seasoned researchers who are just not that into SQL. It is, however, more flexible than being tied to a fixed set of defaults. In this tool, we want to make the standard things easy, yet remain as flexible as possible to enable more advanced use.
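As an illustration of what such a measure looks like in plain SQL (the table and column names here are made up, and we assume each distinct word is stored only once per record), counting the number of title words two records share could be written as:

    SELECT w1.record_id AS record_a,
           w2.record_id AS record_b,
           COUNT(*)     AS shared_words
    FROM record_words w1
    JOIN record_words w2
      ON w1.word_id = w2.word_id
     AND w1.record_id < w2.record_id
    GROUP BY w1.record_id, w2.record_id;

This is still a simple measure. Once you start adding normalization, thresholds or time slices, such queries quickly become long and slow, and that is exactly the kind of work the Relation Calculator should take off your hands.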

The goal is to make the standard analysis that you would normally run on a database generated by the ISI Data Importer and processed by the Word Splitter (and optionally the Record Grouper) as easy as selecting those tools and optionally setting some time slices or thresholds, after which ready-to-analyze output is generated. We aim to release a first working combination of these tools at the end of August.

Monday 29 June 2009

New name: SAINT

We did it: we came up with a nice acronym as a name for the toolkit. Since Science Research Tools is a bit generic, and SciSA toolkit too closely linked to our department name, we settled on a new name: SAINT. That is an acronym for Science Assessment (or Analysis, as you prefer) Integrated Network Toolkit. From now on, this name will be used to refer to the toolkit, though it will take a bit of time before it is used everywhere consistently. Note that our URLs will not change, so there is no need to update your bookmarks.

SAINT will save you (time).
That's just one of the many possible catchphrases of course... We'll come up with some new ones in due time to advertise the toolkit to the outside world. If you have any suggestions, please let us know!

Friday 26 June 2009

Mini-demo during coffee breaks

The official demonstration session at the e-Social Science conference today is scheduled for the last 20 minutes of the lunch break. Because of the nature of the lunches (a three-course affair that has run over the allotted time every time so far), I have decided to cancel this demonstration. Others' experience so far is that no one shows up for these sessions.

For those interested, I will be giving mini demonstrations on just my laptop during the coffee breaks in the morning and/or afternoon. Simply find me and ask! I am excited to fold open my laptop and show you our work.

Introducing: Matrix compiler

In this third installment of the 'Introducing...' series, I will be talking about the Matrix Compiler tool. While you can already see a lot of interesting things by just looking at the tables and views in the database that contains your data, visualization can be a huge help in recognizing patterns. That means that you will need to somehow get your data out of the database and into a format that you can use for visualization.

The Matrix Compiler is a tool that can do this. It reads the database and translates its data into a format that you can load into Pajek, a well-known visualization and network analysis package.

Analyzing and visualizing networks means that you will be dealing with two kinds of things. First, there are the objects that are connected, which we* will call the nodes. Next, there are the connections between those nodes themselves. At least the latter should be available as a table or query/view in your database; for brevity, I will simply call these views from here on. Depending on how you build up your database, your connections view may contain complete labels (or other attributes) for the nodes, but quite possibly it only contains an ID for each node, with the label defined in some other view. The Matrix Compiler supports both modes of operation: the labels for the nodes can be retrieved either from the connections view, or looked up from an external view.

After opening the database and indicating the name of the output file, the matrix can be set up. When constructing the matrix, you start by selecting the connection view: place the cursor in the corresponding box and either type the name or select it by double-clicking on it. You then select which fields in that view represent, respectively, the value of the relationship, the row and the column. A connection view thus needs to have at least three fields: two to represent the nodes you are connecting, and one to indicate the strength of that connection.
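For the co-authorship example used below, a hypothetical connection view (all names made up) could look like this; depending on the database backend, you would save it as a view or a stored query:

    CREATE VIEW coauthorships AS
    SELECT aa.author_id AS row_author,     -- the row
           ab.author_id AS column_author,  -- the column
           COUNT(*)     AS strength        -- the value of the relationship
    FROM article_authors aa
    JOIN article_authors ab
      ON aa.article_id = ab.article_id
     AND aa.author_id <> ab.author_id
    GROUP BY aa.author_id, ab.author_id;

The three fields row_author, column_author and strength then map onto the row, column and value boxes in the Matrix Compiler.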

The next step is to define the structure of your matrix. There are many options for that. If the types of objects in the rows and columns are the same, you may want to create a square matrix where all the nodes appear as both a row and a column. This makes sense if you are, for instance, constructing a matrix to represent co-authorships, but not if you want to display in which journals authors publish.

If you choose to create a square matrix, you can also choose whether or not the matrix should be symmetric. In a symmetric matrix, the value of M(a,b) is identical to M(b,a). Again, in the case of co-authorships, author a sharing a co-authorship with author b means the same as author b sharing a co-authorship with author a, but for citations, a citation from a to b is different from one from b to a.
In both cases, you can also choose whether to set the diagonal, that is M(a,a), to a fixed value, or to use the values occurring in your data. This can be useful to filter out things like self-citations from your data.
If your data contains the same relation more than once, you can choose how to deal with that. The options are to use the first occurrence, use the last, add the occurring values, or multiply them. For example, if the connection (a,b) occurs with the values 2 and 3, these options yield 2, 3, 5 and 6 respectively.

The last step of creating your matrix is to choose where the labels for the nodes should come from. As explained above, there are two basic options. If the connection view already contains the labels, just select the appropriate option and you are done. If that view only contains references to the nodes, you can now select where to get the actual labels. To continue our co-authorship example, it would make sense to select the Authors table as the place to find the labels, and to use the full author name as the label for your nodes in the network.

Note that if you use an external source of labels, you can choose how to deal with nodes in your label view that do not appear in the connection view. For instance, some authors in your Authors table may not have any co-authorships, which means they will not show up in your connection view. You may or may not want to include these unconnected nodes; the choice is yours.
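Conceptually, this option boils down to the kind of join you would use if you were writing the lookup yourself (again with made-up names): keeping every author simply means taking all rows of the Authors table, while keeping only the connected ones corresponds to something like

    SELECT DISTINCT a.author_id, a.full_name
    FROM Authors a
    JOIN coauthorships c
      ON c.row_author = a.author_id OR c.column_author = a.author_id;

The Matrix Compiler handles this for you; the query is only meant to illustrate the difference.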

Note that outputting very big matrix files can take some time, as the output size is O(n²). We are planning to change the output format shortly, from a matrix form to a list form. That will result in smaller output files for big matrices, and will also allow the inclusion of attributes other than a label on both the nodes and the connections.
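To give a rough idea of the difference (the exact syntax of the new format has not been decided yet), a full matrix stores a value for every cell, so even three nodes take nine numbers:

    0 1 0
    1 0 2
    0 2 0

A list format only stores the connections that actually exist, along the lines of Pajek's vertex and edge lists:

    *Vertices 3
    1 "node a"
    2 "node b"
    3 "node c"
    *Edges
    1 2 1
    2 3 2

For a large, sparse network, the list grows with the number of connections rather than with the square of the number of nodes.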



* Pajek itself uses a different terminology. It instead talks about Vertices for the nodes, and Arcs and Edges for the connections between these nodes.

Wednesday 24 June 2009

Introducing: Word Splitter

In the second installment of the "Introducing" series, I will tell you something about our small utility called the Word Splitter. The idea of the Word Splitter is simple. You point it to a field in your database that contains text, like a title or an abstract. The Word Splitter then builds up a table with all the words that occur in that field, and a couple table with pointers between the identifier of the record the text came from and the identifier of the word in the words table, so you can find which words belong to which text record. To make it possible to reconstruct the text, the position the word was in is also stored in that couple table.
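To make this concrete, here is a minimal sketch of what such a layout could look like (the actual table and column names used by the Word Splitter may differ):

    CREATE TABLE words (
        word_id INTEGER PRIMARY KEY,
        word    TEXT
    );

    CREATE TABLE record_words (              -- the couple table
        record_id         INTEGER,           -- the record the text came from
        word_id           INTEGER,           -- points into the words table
        position          INTEGER,           -- position after stopword removal
        original_position INTEGER            -- position in the original text
    );

Every word is stored once in the words table, and every occurrence of a word in a text becomes one row in the couple table.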

Our tool can do more than that, though. First of all, it can process multiple text fields from your database at the same time, making it more efficient for you to work with. So, you can split both that title and that abstract simultaneously.

Furthermore, the Word Splitter can use stopwords. Stopwords are, in their basic form, lists of words that are ignored when splitting the text, for instance because they are too common. That means that if you use stopwords, not all words from the text will be stored in the words table, nor will pointers to them occur in the couple table. Depending on your purpose, you can think of the procedure in two ways. One option is to first split the complete text, then remove the stopwords, and only then store the remaining words and their positions in the database. This results in consecutive word positions in the couple table, even if there used to be one or more stopwords between two words in the original text. Alternatively, you can split the complete text, note each word's position, remove the stopwords and only then store the rest in the database. This results in word positions that reflect the original position of each word in the text, but leaves them non-consecutive.
To provide maximum flexibility in the analysis, both these positions are stored in the couple table.
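A made-up example: splitting the text "mapping the dynamics of science" with "the" and "of" as stopwords keeps three words, with the two kinds of positions as follows:

    word        filtered position   original position
    mapping     1                   1
    dynamics    2                   3
    science     3                   5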

The Word Splitter can use several stopword lists at once, and it supports three kinds of lists. First, it can use simple text files that contain lists of words. Second, it can use a field in an existing database that contains a list of words. And last, it can use lists of regular expressions. These expressions are patterns that each word is matched against; if a word matches, it is regarded as a stopword. That allows you to, for instance, filter out numbers or dates without having to write them all out.
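For example (these patterns are just illustrations, not a built-in list), a regular-expression stop list could contain entries like:

    ^[0-9]+$                matches any plain number, such as 42 or 1987
    ^(19|20)[0-9]{2}$       matches four-digit years only
    ^[0-9]+(st|nd|rd|th)$   matches ordinals like 1st or 22nd

Any word matching one of these patterns is then treated as a stopword and left out.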

To make it easy to use these stopword lists, you can create sets of such lists and store a set as a file again. This way, you can easily review which stopwords you used for an analysis, and you can re-use the same set later on. You can also mark one such set as your default set of stopwords.

This was a basic introduction to our Word Splitter tool. I hope you will like using it!

Installer for windows online

With the official launch of the toolkit only a night away, I have just uploaded an installer for the Windows platform to the file storage we have on Assembla. Of course, our website has been updated to reflect that. Other files you can find there include documentation, but also test cases to reproduce bugs.

Tomorrow at 11 AM, I will give the first demonstration (note that the time has changed from the earlier announcement) at the e-Social Science conference in Cologne. I will give a little bit of background, and then quickly move on to actually showing the attendees the tools on some real-life data. Of course, I will also show some pretty pictures that we made using the tools, courtesy of Edwin (thanks!).

I hope everything will go all right. There will be another demonstration on Friday, so there are plenty of opportunities to see the tools in action!

Friday 19 June 2009

Open source repository and issue tracker online

Yesterday, we reached an important milestone in the project. We have put a public repository online that contains the complete source code for all the tools. That's right: you can download the sources, tinker with them, and use them however you want.

We selected Assembla as the hosting platform for this project. It supports the Git distributed source code repository system, and nicely integrates that with an issue tracker. So far, it seems pretty flexible and works nicely. While the institute is developing its new website, we have put up a temporary website for the toolkit as well. The address will not change once the new site is up, so bookmark away!

If you are familiar with C++, or want to learn it: try your hand at helping to develop these tools. It's really not all that difficult. Of course, just reporting issues, suggesting documentation updates or giving ideas for improvements and extensions are also very valuable contributions!

Introducing: the ISI Data Importer

This is the first of what is to become a series of postings to introduce all the tools in the toolkit. I hope it will give a clear overview of what kind of tools we offer, and what they do.

The ISI Data Importer is aimed at importing bibliographic data that you downloaded from ISI/Web of Knowledge. You can download data on the articles resulting from your searches in a text format. The ISI Data Importer tool can read these files and write them out to a structured database. The usage of structured databases is one of the basic ideas of the Science Research Toolkit. Using structured, standard databases to house the data allows us to use standard tools. Databases have been in development for decades, and are quite efficient at many of the tasks involved in the kind of work we do with the data. Also, getting the data into a form that is as structured as possible gives us maximum flexibility.
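For those who have never looked inside such a download: the Web of Knowledge plain-text export is a simple tagged format, roughly along these lines (heavily abbreviated, and the record itself is made up):

    FN ISI Export Format
    VR 1.0
    PT J
    AU Smith, J
       Jones, A
    TI Mapping the dynamics of science
    SO SCIENTOMETRICS
    PY 2008
    ER
    EF

Each two-letter tag marks a field (authors, title, source, publication year, and so on), ER ends a record and EF ends the file; the importer turns these fields into the tables and columns of the database.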

The interface of the ISI Data Importer is quite simple:
On the first tab, you select the input file or files. You can select as many files as you want, as long as they are located in a single directory. As Web of Knowledge only allows you to download a maximum of 500 records in one go, you can end up with lots of separate files that each contain a fraction of your data. Simply select them all, and they will all be imported in a single run. Duplicate records are automatically filtered out, so if you have created several sets that overlap in their results, you will still end up with a single, unified set without the duplicate data points that could ruin your similarity measures later on.

On the output tab, you can select an output file. Currently, the only supported database backend is Microsoft Access files, but we are working on extending that to other and better database backends, as Access can be a bit limiting and slow, especially if you work with large datasets. The filename you select does not need to exist yet; it will simply be created for you if it doesn't.

Optionally, you can filter the data on document type. Some of the more frequently occurring document types are included in the list on the Filter page. If you are missing an option, let me know and I'll add it. Better yet: simply patch the list yourself; the sources are available!

Wednesday 17 June 2009

Demonstration on e-Social Science conference

As announced in the introductory posting, we will be launching our toolkit at the e-Social Science conference in Cologne, Germany. We will be doing that in a 20-minute demonstration session, where we will demonstrate how easily you can use a set of data downloaded from ISI/Web of Knowledge to create some maps of a research field, using a couple of database queries and our tools.

As soon as I know the exact time, date and location of this session, I will post it here.

Update, June 18:
There will be three demonstration sessions:

11:00 – 11:30: Thursday 25 June
16:00 – 16:30: Thursday 25 June
13:40 – 14:00: Friday 26 June.

All demos have been allocated to take place in the main foyer of Maternushaus.

If you happen to be at this conference, don't hesitate to join in for this demonstration!

Friday 12 June 2009

First post

Every blog needs an introductory posting, and this one is no different. What is it about? What can we expect to hear? Why even bother blogging? Those are the kinds of questions both you and I would like to see answered. "You and I", I hear you wonder? Yes, because it is not completely clear to me either what exactly I will and will not write about yet. So let's start by giving some idea of what I am doing, and why that is worth keeping a weblog about.

The Science System Assessment department of the Rathenau Instituut deals, among other things, with applying bibliometrics and patentometrics to map the dynamics of science and knowledge transfer. The problem our department quickly ran into was that the available tools that can deal with this kind of data are few and far between, require many manual, error-prone and labour-intensive steps, and don't fit together well. Worse, we soon ran into limitations in the amount of data we could handle with them, which started to affect our research.

So, we decided to build some tools ourselves. Seeing that the tools that were (and are) available are not open, we had to start from scratch. That presented both a challenge and an opportunity, because this way we could also rethink basic issues of how these tools should work. We decided to go for a design where all tools work against standard relational databases in which we structure the available data as well as possible. We also wanted the tools to be easy to use, so a clear graphical interface was a must. Since I have experience developing software using the excellent C++-based Qt toolkit, I chose that as the environment to build these tools in. As an added bonus, cross-platform compatibility and database-backend independence come practically for free.

As the first tools became available in early versions, more and more ideas about what else we could do and needed began to pop up, and soon the idea of building some tools grew into a complete toolkit that is still growing. Now the time has come to make these tools available to you too. The toolkit will officially be launched at the 5th conference of the National Centre for e-Social Science in Cologne. The first three tools will be released in their "1.0" or "ready to use" versions, while the rest are made available "as is". Because we would have liked to contribute to the existing tools but could not, we have decided to avoid the same issue with our own initiative.

We would love to hear from you, and even better, to work with you to improve these tools! We expressly invite you to use them, test them, and improve on them. To make that possible, we are making all source code available under a liberal open source license. We will also make a public issue tracker available, as well as a forum and other collaboration tools.

And that brings us to the why of this blog: we feel that it is important to keep you up to date with what is happening, what we are planning, and what others are doing with these tools. This blog is one of the ways in which we will do that. We are also working on a nice website, and a temporary website will be up soon. If you have other ideas about how to communicate, want to aggregate your own blog, or have any other comments: don't hesitate to contact me. I'd love to hear from you!