Tuesday, January 8, 2008

web 3.0 - Semantic Web

(EMO requested an article after our "Web 3.0 - Semantic Web" seminar at Bilgi University. This is the English translation of the article I wrote with Bekir; you can see the original article here.)


With the spread of blogs, the rise of social networks and the growth of personal media, the amount of data on the net has reached tremendous sizes. To keep this data from turning into garbage, and to prevent monopolies in the internet search business, we need a semantic web in which data, concepts and people are linked by context.

The terms "Web-3.0" and "semantic web" are usually mentioned together. Even though there is no official authority that versions the web, the names "web-1.0", "web-2.0" and "web-3.0" are popular among marketing people. These version numbers stand for eras of the net whose milestones are big economic changes on the internet. In the "web-1.0" era, we got acquainted with commerce on the internet. In the "web-2.0" era, e-communities grew around rich, distributed online content, and this steered the internet economy. The value of a company began to be measured by its number of active users.

Over time, customization became important on the net. Known identities and user habits allowed user-specific appearances, but a customizable appearance is not enough; we also need user-specific service (functionality). Search engines compete with each other to give different results depending on who is searching. Is the internet ready for that change? We still use special keywords so that search engines can find our websites. The content for people and the content for machines are separate. For truly user-specific search results, online content has to be associated with other content and with people. Without this association, we cannot ask search engines the simple questions that a human would easily understand.

Web-3.0 is a new era in which major changes await the internet: the internet will be based on content that is associated with other content; it will be possible to form simple sentences from the context that is built; the internet itself will become a huge database; we will be able to ask machines simple questions, and machines will talk to each other in a human-understandable language to find the answers, in short, machines will learn to speak like humans; and service- and server-centered approaches will give way to user-centered, distributed structures. The semantic web is the infrastructure designed for this era, and it will set the rules.

First of all, to associate online content and build a context, we need words, and dictionaries in which we can look up what those words mean. Today our websites have non-human visitors too: RSS readers that make our publications easy to follow, planets that merge related blogs, search engines that make our content easy to find, and so on. We have to keep some metadata to get the attention of search engines, which are probably the oldest non-human visitors of our sites. Metadata is data about data. To choose this metadata, we consider the routine behaviour of human visitors who search for something on the net. The semantic web aims to merge data and metadata. We are looking for new methods that let online content explain itself. In that case, keywords will no longer try to explain a whole website; every piece of content will carry its own explanation, and the aim is not to leave a single piece of content without a description. Where do we get these definitions? We can find them in dictionary-like specifications. Applications written against the same specification can speak the same language (FOAF (Friend of a Friend) is a good example of such a specification[8]).

Once the dictionaries are formed, we can turn content-tag associations into simple sentences based on simple grammar rules. We can treat our content associations as the verbs of our sentences. Let's assume we have a website where we publish people's personal information. We look up the term "personal information" in our dictionary. If we find a definition, our sentence is ready: the subject is the content itself, the object is our site, and the verb is "is a personal information". Put together, the sentence is "this content is a personal information of this site". The definition of "personal information" is: "a collection of knowledge about a human in which name-surname is mandatory, and nickname, web address, e-mail address, mail address, geo (latitude, longitude), photo, title, notes and identity number are optional". Now assume the name-surname "john doe" is published on our site; we can make a second sentence: "john doe is a name-surname of personal information". After tagging our content with the terms "personal information" and "name-surname", we can use these tags for any visual and textual design tricks that make the site easy for humans to understand. And with the context established between "site", "personal information" and "name-surname", we can make new sentences that both humans and machines understand (e.g. "john doe is a name-surname in this site", which is not written on the site but is easy to derive with simple logic rules).
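The reasoning step in this example can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the identifiers `card-1` and `this-site`, the exact verb strings and the function `infer_names_in_site` are made up for the sketch and do not come from any specification.

```python
# Each "sentence" is a (subject, verb, object) triple.
# All identifiers below are hypothetical names chosen for this sketch.
SITE = "this-site"
IS_INFO_OF = "is a personal information of"
IS_NAME_OF = "is a name-surname of"
IS_NAME_IN = "is a name-surname in"

facts = {
    # "this content is a personal information of this site"
    ("card-1", IS_INFO_OF, SITE),
    # "john doe is a name-surname of personal information"
    ("john doe", IS_NAME_OF, "card-1"),
}


def infer_names_in_site(triples):
    """Chain 'X is a name-surname of C' with 'C is a personal information of S'."""
    derived = set()
    for name, verb, card in triples:
        if verb != IS_NAME_OF:
            continue
        for content, verb2, site in triples:
            if content == card and verb2 == IS_INFO_OF:
                derived.add((name, IS_NAME_IN, site))
    return derived


new_sentences = infer_names_in_site(facts)
for subject, verb, obj in sorted(new_sentences):
    print(subject, verb, obj)
```

The derived sentence is stored nowhere in `facts`; it only appears because the two existing sentences share the same piece of content.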

With all these relations and sentences, it becomes possible to run queries on our site as if it were a database. Figure 1 shows a SPARQL (SPARQL Protocol and RDF Query Language) query that runs on a site where there is an "include" relation between continents and countries, and a "being a capital" relation between countries and cities. It asks for the names of the countries that Africa includes, and the names of the capital cities of these countries. When we say "query", SQL queries come to mind first because databases are so widely used today; this explains the similarity between SPARQL and SQL syntax. The important point here is to build an infrastructure that allows this kind of abstraction.

Figure 1: SPARQL sample
[Listing truncated in the source: PRE... ...nContinent abc:africa. }]
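What a query like the one described above does can be sketched in Python as pattern matching over sentences. This is not SPARQL, and it is not how a real query engine is implemented; the relation names `includes` and `is_capital_of`, the helper functions and the small sample data set are assumptions made for the illustration. Only the `?variable` convention is borrowed from SPARQL.

```python
# Sample sentences as (subject, relation, object) triples.
# The relation names are hypothetical; the data is a tiny hand-made sample.
TRIPLES = [
    ("africa", "includes", "egypt"),
    ("africa", "includes", "kenya"),
    ("europe", "includes", "france"),
    ("cairo", "is_capital_of", "egypt"),
    ("nairobi", "is_capital_of", "kenya"),
    ("paris", "is_capital_of", "france"),
]


def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


rows = query(
    [
        ("africa", "includes", "?country"),
        ("?capital", "is_capital_of", "?country"),
    ],
    TRIPLES,
)
answers = sorted((row["?country"], row["?capital"]) for row in rows)
print(answers)
```

The two patterns share the variable `?country`, and that shared variable is what joins the "include" relation with the "being a capital" relation.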

The end point of query language abstraction is human speech. The aim is for machines to answer a question such as "Which countries does Africa contain, and what are the capitals of these countries?" instead of the SPARQL query in Figure 1.

We need an infrastructure that lets machines ask each other questions too, not just human-machine conversation. Consider the scenario where the query in Figure 1 is run across two sites. One site holds the "include" relation between continents and countries, and the other holds the "being a capital" relation between countries and cities. The first site can answer the question "Which countries does Africa contain?", but for every country that Africa contains it has to ask the other site "What is the capital of this country?". With the information it collects from the other site, it returns the correct answer to the user who asked the combined question.
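The two-site scenario can be sketched as follows. Everything here is an assumption made for the illustration: the class names `ContinentSite` and `CapitalSite`, the question wording and the sample data are hypothetical, and plain method calls stand in for real network requests between the sites.

```python
class CapitalSite:
    """Hypothetical site that only knows the 'being a capital' relation."""

    PREFIX = "what is the capital of "

    def __init__(self, capital_of):
        self._capital_of = capital_of
        self.questions_heard = []  # every plain-language question received

    def ask(self, question):
        """Answer questions of the form 'what is the capital of <country>?'."""
        self.questions_heard.append(question)
        if question.startswith(self.PREFIX) and question.endswith("?"):
            country = question[len(self.PREFIX):-1]
            return self._capital_of.get(country)
        return None


class ContinentSite:
    """Hypothetical site that only knows the 'include' relation."""

    def __init__(self, includes, capital_site):
        self._includes = includes
        self._capital_site = capital_site

    def countries_and_capitals(self, continent):
        """Answer the combined question by asking the other site per country."""
        answer = {}
        for country in self._includes.get(continent, []):
            question = f"what is the capital of {country}?"
            answer[country] = self._capital_site.ask(question)
        return answer


capital_site = CapitalSite({"egypt": "cairo", "kenya": "nairobi", "france": "paris"})
continent_site = ContinentSite(
    {"africa": ["egypt", "kenya"], "europe": ["france"]}, capital_site
)

answer = continent_site.countries_and_capitals("africa")
print(answer)
print(capital_site.questions_heard)
```

The list `questions_heard` holds the exchange as readable sentences, which is the part of the idea that matters here: neither site needs to know how the other one stores its data.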

Communication between machines is not a new issue, but the semantic web will make that communication human-understandable. In the example above, one site asks the other "What is the capital of country Y?" for every country in Africa. A human listening to the network will understand what the two machines are talking about without reading any protocol specification, and without any training to read such specifications, because the protocol used between the two machines will be a simple one based on the simple grammar rules of everyday human speech.

The semantic web needs user-centered, distributed structures instead of today's service-centered structures. Nothing should force us to keep our contact list and our e-mail account in the same service of the same company. The aim is for a contact list created in one service to be easily accessible from any e-mail service we use. For this, the structure in which every user has a separate account for every service has to give way to a structure in which every user has one unique identity, and every service asks users for that identity like a passport. Users and publishers will be able to decide which language to speak by choosing their dictionaries. A distributed, endless and secure sharing environment is a requirement for the semantic web.

Figure 2: XHTML to RDF translation by XSLT
[Listing truncated in the source: <ht... ...-01-0 </rdf:Description>]

The W3C[1] working groups and their standards give direction to semantic web research. Work that aims to turn the internet into a database continues along two main lines. One is to use languages such as RDF (Resource Description Framework)[3] and OWL (Web Ontology Language)[4] to make semantic sentences, and to develop applications that use these sentences. Many tools built to W3C standards already exist for creating RDF and OWL and for running SPARQL queries over them, and many more are under development. The number of publications about the semantic web and the size of the developer groups working on it are quite satisfying. The other line is GRDDL (Gleaning Resource Descriptions from Dialects of Languages)[5], which aims to change existing content slightly so that it can be translated directly into a semantic format (see Figure 2). Here tag attributes are used to create RDF, or any other format, directly from existing content. A well-known method based on this tag logic is "microformats"[7], co-founded by Tantek Çelik, a Turk who is an active W3C member. With the Firefox extension "Tails Export"[9], you can see whether the page you are browsing contains content compatible with any microformat, and what that content is. To investigate existing semantic web sites and portals, you can also download and try another Firefox extension, "Piggy Bank"[11], developed under the SIMILE[10] project. There are many existing applications of the semantic web, for example these well-known dictionaries:

  • SKOS Core[12]

  • Dublin Core[13]

  • FOAF[8]

  • DOAP[14]

  • SIOC[15]

  • vCard in RDF[16]

well-known projects:

  • Pfizer[17]

  • NASA's SWEET[18]

  • Eli Lilly[19]

  • MITRE Corp.[20]

  • Elsevier[21]

  • EU projects (e.g. Sculpteur[22], Artiste[23])

  • UN FAO’s MeteoBroker

  • DartGrid[24]

  • SIMILE[10]

well-known portals:

  • Vodafone's Live Mobile Portal

  • Sun’s White Paper Collections[26] and System Handbook collections[27]

  • Nokia’s S60 support portal[25]

  • Harper’s Online magazine linking items via an internal ontology[28]

  • Oracle’s virtual press room[29]

  • Opera’s community site[30]

  • Yahoo! Food[31]

  • FAO's Food, Nutrition and Agriculture Journal portal[32]

and so on.
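To make the tag-attribute approach of GRDDL and microformats more concrete, here is a rough Python sketch that reads class attributes from existing markup and turns them into sentences. It is not GRDDL itself: it swaps XSLT for Python's standard `html.parser`, and it produces plain tuples rather than RDF. The class names `vcard`, `fn`, `nickname` and `email` come from the hCard microformat; the `HCardExtractor` class, the `card-N` identifiers and the sample page are assumptions made for the illustration.

```python
from html.parser import HTMLParser

# A small subset of hCard class names that this sketch understands.
HCARD_FIELDS = {"fn", "nickname", "email"}


class HCardExtractor(HTMLParser):
    """Collect (card, field, value) tuples from hCard-style class attributes.

    Simplification: a card is never closed, so nested or malformed markup
    is not handled.
    """

    def __init__(self):
        super().__init__()
        self.triples = []
        self._card = None   # identifier of the card being read
        self._field = None  # hCard field whose text we are waiting for
        self._count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "vcard" in classes:
            self._count += 1
            self._card = f"card-{self._count}"  # hypothetical identifier
        elif self._card:
            for name in classes:
                if name in HCARD_FIELDS:
                    self._field = name

    def handle_data(self, data):
        text = data.strip()
        if self._card and self._field and text:
            self.triples.append((self._card, self._field, text))
            self._field = None


PAGE = """
<div class="vcard">
  <span class="fn">john doe</span>
  <span class="nickname">jd</span>
</div>
"""

extractor = HCardExtractor()
extractor.feed(PAGE)
triples = extractor.triples
print(triples)
```

The page stays ordinary markup that a browser can style, while the same class attributes give a machine enough structure to rebuild the sentences.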

Semantic web research has come a long way. Figure 3 explains the semantic web concepts in layers. Up to the last three layers, concrete steps have been taken with RDF, OWL, SPARQL and GRDDL. For the "logic", "proof" and "trust" layers there is no concrete output yet, but the W3C working groups continue their research. The "logic" layer aims to make new sentences from existing ones through reasoning with simple logic rules. The "proof" layer will determine how to form evidence that establishes the truth of our logic. The top layer, "trust", will contain solutions for authenticating users securely and for protecting people's privacy rights and confidential information.

Figure 3: Semantic Web Layers
[Image: swlevels.eps]

The semantic web is not a dream; it is a need. Today's millions of internet users will soon be billions. We need to give meaning to the mass of online data, which is growing exponentially with the fast-growing internet population and the spreading culture of sharing. The semantic web offers publishers a way to be found easily, and searchers a way to reach the right information quickly. The context built on the relations established between sites turns every site into a servant with a weak artificial intelligence, capable of talking with other servants and with humans. For now we can only follow the internet through the narrow windows provided by today's search engines. In the near future, our searches will start a fire that follows the paths built by semantic relations like a detective, and finds the accurate answer to our question. To search your local network, there will be no need for a search engine that has to visit and index all the sites on that network beforehand; all search will be real-time. We will be able to set our own limits, narrowing or enlarging our living environment on the net whenever we want. The level of detail of the searches we want to do will no longer depend on the power of search engines; our choices will determine our power.

[1] http://www.w3.org/
[2] http://www.w3.org/2001/sw/
[3] http://www.w3.org/RDF/
[4] http://www.w3.org/2004/OWL/
[5] http://www.w3.org/2001/sw/grddl-wg/
[6] http://www.w3.org/TR/rdf-sparql-query/
[7] http://microformats.org/
[8] http://www.foaf-project.org/
[9] http://addons.mozilla.org/firefox/2240
[10] http://smile.mit.edu/
[11] http://simile.mit.edu/piggy-bank/
[12] http://www.w3.org/TR/swbp-skos-core-guide/
[13] http://www.dublincore.org/
[14] http://usefulinc.com/doap/
[15] http://sioc-project.org/
[16] http://www.w3.org/2006/vcard/ns
[17] http://www.pfizer.com
[18] http://sweet.jpl.nasa.gov/ontology/
[19] http://www.lilly.com/
[20] http://www.mitre.org/
[21] http://aduna.biz/dope/
[22] http://www.sculpteurweb.org/
[23] http://users.ecs.soton.ac.uk/km/projs/artiste/
[24] http://ccnt.zju.edu.cn/projects/dartgrid/intro.html
[25] http://www.forum.nokia.com/
[26] http://www.sun.com/servers/wp.jsp
[27] http://sunsolve.sun.com/handbook_pub/validateUser.do?target=index
[28] http://www.harpers.org/
[29] http://pressroom.oracle.com/
[30] http://my.opera.com/community/
[31] http://food.yahoo.com/
[32] http://www.fao.org/
[33] http://www.w3.org/2001/12/semweb-fin/w3csw
[34] http://www.w3.org/2007/Talks/0831-Singapore-IH/
[35] http://www.w3.org/2007/Talks/0424-Stavanger-IH/
[36] http://www.w3.org/TR/grddl-primer/
[37] http://en.wikipedia.org/wiki/Semantic_Web
[38] http://en.wikipedia.org/wiki/Ontology_%28computer_science%29
[39] http://en.wikipedia.org/wiki/Web_Ontology_Language
[40] http://en.wikipedia.org/wiki/Resource_Description_Framework
[41] http://en.wikipedia.org/wiki/GRDDL
[42] http://sramanamitra.com/2007/02/14/web-30-4c-p-vs
[43] http://www.ozgan.net/?sm=content.ybz&id=63
[44] http://xmlns.com/foaf/spec/