vrijdag 26 juni 2009

Introducing: Matrix compiler

In this third instalment of the 'Introducing...' series, I will be talking about the Matrix Compiler tool. While you can already see a lot of interesting things by just looking at the tables and views in the database that contains your data, visualization can be a huge benefit to recognize patterns. That means that you will need to somehow transform your data out of the database into a format that you can use for visualization.

The Matrix Compiler is a tool that can do this. It can use the database and translate it's data into a format that you can read into Pajek, a well known visualization and network analysis package.

Analysing and visualizing networks means that you will be dealing with two kinds of things. First, there are the objects that are connected, which we* will call the nodes. Next, there are the connections between those nodes themselves. At least the latter should be available as a table or query/view in your database. For briefness, I will call them all a view from here. Depending on how you build up your database, it is possible that your connections view will contain complete labels (or other attributes), but it very possibly may only contain an ID of the node in the database, and the label for the node is defined in some other view. The Matrix Compiler supports both modes of operation: the labels for the nodes can be retrieved from either the connections view, or be looked up from an external view.

After opening the database and indicating the name of the output file, the matrix can be set up. When constructing the matrix, you start with selecting the connection view, by placing the cursor in the corresponding box and either typing the name or selecting it from the view by double clicking on it. You then select which field in that view represents, respectively, the value for the relationship, the row and the column. A connection view thus needs to have at least three fields: two to represent the nodes you are connecting, and one to indicate the strength of that connection.

The next step is to define the structure of your matrix. There are many options for that. If the types of objects in the rows and columns are the same, you may want to create a square matrix where all the nodes appear as both a row and a column. This makes sense if you are, for instance, constructing a matrix to represent co-authorships, but not if you want to display in which journals authors publish.

If you chose to create a square matrix, you may also choose if you want the matrix to be symmetric or not. In a symmetric matrix, the value of M(a,b) would be identical to M(b,a). Again, in case of co-authorships the meaning of author a sharing a co-authorship with author b is the same as saying that author b is sharing a co-authorship with author a. But for a citation relationship, a citation from a to b is different from one from b to a.
In both cases, you can also choose if you want to set the diagonal, that is M(a,a) to a set value, or if you want to use values occuring in your data. This can make sense to filter out things like self-citations from your data.
If your data contains data for the same relation more than once, you can choose how to deal with that. The options are to use the first occurrence, use the last, add the occurring values or multiply all the occurring values.

The last step of creating your matrix, is to choose where the labels for the nodes should come from. As explained above, there are two basic options. If the correlation view already contains the labels, just select the appropriate option and you are done. If that view contains references to nodes though, you can now select where to get the actual labels. To continue our example on the co-authorships, it would make sense to select the Authors table as the table to find the labels, and use the full author name as the label for your nodes in the network.

Note that if you use an external source of labels, you can choose how to deal with nodes in your label view that do not appear in the correlation view. For instance, authors in your Authors table may not have any co-authorships. That means that they will not show up in your correlation view. You may or may not want to include these unconnected nodes. The choise is yours.

Note that outputting very big matrix files, can take some time, as the output size is O(n2). We are planning to change the output format shortly, from a matrix form to a list form. That will result in smaller output files for big matrices, and will also allow the inclusion of attributes other than a label to both the nodes and the connections.



* Pajek itself uses a different terminology. It instead talks about Vertices for the nodes, and Arcs and Edges for the connections between these nodes.

Geen opmerkingen:

Een reactie posten