On May 27, Wikidata Lab XXIX (by Wiki Movement Brazil User Group) was held around the topic of data roundtripping with GLAM institutions and Wikidata. Richard Knipel and Andrew Lih spoke about the experiences of the Metropolitan Museum of Art in implementing its own data roundtripping practices and were joined by Susanna Ånäs and Éder Porto as discussants.
They covered work previously discussed at Wikimania, WikidataCon and museum conferences such as MuseWeb 21 and Museum Computer Network. A good summary of data roundtripping by Sandra Fauconnier can also be found in a 2019 piece on the Diff blog.
The Wikidata Lab discussion highlighted some new observations for 2021:
The Met has not only contributed images and metadata under its Open Access initiative, but has also brought Wikidata content into its own TMS collections management system, directly storing Q numbers in its database to precisely link its metadata to Wikidata. The fields from the Met API that now link to external databases such as Wikidata include:
Object ID - QID
Constituent/artist - QID and Getty ULAN
Depiction/keywords - QID and Getty AAT/ULAN
This unambiguous link between the Met's metadata and Wikidata provides the foundation for a number of exciting future projects with the Wikimedia community. The presentation concluded with an example of how the data flows have been a virtuous cycle, especially with AI and image recognition projects related to depiction information.
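To make the linkage concrete, here is a minimal sketch of pulling QIDs out of a Met API object record. The field names (`objectWikidata_URL`, `constituentWikidata_URL`, `Wikidata_URL`) are assumptions about the shape of the API response, and the sample record uses invented placeholder identifiers rather than real collection data.

```python
import re

# Invented placeholder record, shaped like a Met Collection API object response.
# The field names are assumptions about the API; the identifiers are not real data.
SAMPLE_RECORD = {
    "objectID": 1,
    "objectWikidata_URL": "https://www.wikidata.org/wiki/Q1000001",
    "constituents": [
        {
            "name": "Example Artist",
            "constituentWikidata_URL": "https://www.wikidata.org/wiki/Q1000002",
            "constituentULAN_URL": "http://vocab.getty.edu/page/ulan/500000001",
        }
    ],
    "tags": [
        {
            "term": "Trees",
            "Wikidata_URL": "https://www.wikidata.org/wiki/Q1000003",
            "AAT_URL": "http://vocab.getty.edu/page/aat/300000001",
        }
    ],
}

QID_AT_END = re.compile(r"Q\d+$")


def qid_from_url(url):
    """Return the QID at the end of a Wikidata URL, or None if there is none."""
    match = QID_AT_END.search(url or "")
    return match.group(0) if match else None


def wikidata_links(record):
    """Collect the QIDs attached to an object, its constituents, and its tags."""
    return {
        "object": qid_from_url(record.get("objectWikidata_URL")),
        "constituents": {
            c.get("name"): qid_from_url(c.get("constituentWikidata_URL"))
            for c in record.get("constituents") or []  # the field may be null
        },
        "tags": {
            t.get("term"): qid_from_url(t.get("Wikidata_URL"))
            for t in record.get("tags") or []  # the field may be null
        },
    }


print(wikidata_links(SAMPLE_RECORD))
```

Records without a Wikidata link simply yield `None` for that entry, so the same function can be run across a whole collection export.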
Wikidata Knowledge Graphs without SPARQL
The recent Wikimedia Hackathon resulted in a number of useful upgrades to the PAWS system on Wikimedia Cloud. Read more about them in this month's Special Report.
The Wikidata Graph Browser project described below is an example of a rapid prototyping project performed on the PAWS system during the Wikimedia Hackathon by User:Fuzheado.
Wikidata Graph Browser is a project to provide a friendly graphical user interface for Wikidata knowledge graph creation and browsing. It uses the Jupyter Notebook widget system so users can craft Wikidata queries interactively, without needing to learn any SPARQL or Python code. It is available on GitHub for anyone to try out on the Wikimedia Cloud PAWS service, or by using MyBinder as an execution environment.
One of the most compelling uses of Wikidata is the creation of knowledge graphs – two-dimensional, graphical visualizations showing linkages and connections among Wikidata items. For GLAM institutions, they can provide insights into collections data by displaying relationships among clusters of information.
The Wikidata Query engine already provides a "Graph" option that performs well across a wide variety of web browsers (and even mobile), bringing Wikidata's connections to life. Early work in advancing knowledge graphs in the GLAM domain was done by Martin Poulter, Wikimedian In Residence at Oxford University. His 2019 blog post "Making Wikidata Visible" showed a technique of explicitly listing all the nodes/items in Wikidata that were of interest.
However, there are a number of challenges in creating useful and compelling knowledge graphs:
The operator needs to know at least basic SPARQL language code in order to return meaningful visualizations, even if they are copying an example.
The operator needs to know advanced SPARQL code if they want to do anything beyond basic visualizations.
Even with knowledge of SPARQL, many labor-intensive cycles of trial and error are required to find compelling insights. There are currently no software systems to support this.
In 2020, Andrew Lih (User:Fuzheado) worked on a tool to help illustrate collections data as the Wikimedia strategist at the Metropolitan Museum of Art. He developed Knowledge Grapher as a front-end user interface to create knowledge graphs with lists of Wikidata items or article names. The tool also has special modes to browse artists or film makers specifically.
The Wikimedia Hackathon 2021 provided an opportunity to implement a solution as an interactive Jupyter Notebook in the Wikimedia PAWS system. This would help create a more general approach for creating graphs, allowing users to add any number of "facets" to the knowledge graph – artists, GLAM institutions, artwork type, depiction information or genre, among others.
Such an interface could then generate SPARQL code for the Wikidata Query engine, while hiding these complexities from the user. Instead, the operator could focus on exploration and discovery within the data set.
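As an illustration of the idea (not the tool's actual code), the following sketch turns a set of facet selections into a SPARQL query for the Wikidata Query Service graph view. The function name `build_graph_query`, the dictionary layout, and the default `limit` are hypothetical choices made for this example.

```python
def build_graph_query(facets, limit=200):
    """Build a graph-view SPARQL query from facet selections.

    `facets` maps a Wikidata property ID to a list of selected QIDs.
    An empty list means "Any", so that facet adds no constraint.
    """
    lines = ["#defaultView:Graph", "SELECT ?item ?itemLabel ?image WHERE {"]
    constrained = False
    for index, (prop, qids) in enumerate(facets.items()):
        if not qids:
            continue  # "Any": leave this facet unconstrained
        var = f"?value{index}"
        values = " ".join(f"wd:{qid}" for qid in qids)
        lines.append(f"  VALUES {var} {{ {values} }}")
        lines.append(f"  ?item wdt:{prop} {var} .")
        constrained = True
    if not constrained:
        # Refuse to ask the query service for every item in Wikidata.
        raise ValueError("select at least one value in at least one facet")
    lines.append("  OPTIONAL { ?item wdt:P18 ?image . }")
    lines.append('  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }')
    lines.append("}")
    lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


# Example: photographs (Q125191) in the Met's collection (Q160236), any depiction.
print(build_graph_query({"P195": ["Q160236"], "P31": ["Q125191"], "P180": []}))
```

The user only ever picks items from lists; the property IDs, `VALUES` clauses and label service stay hidden behind the function.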
Wikidata Graph Browser (run with MyBinder or visit the GitHub repository) was designed to provide a fast, point-and-click interface for browsing Wikidata knowledge graphs, utilizing the Jupyter Notebook graphical "ipywidgets" system in Python.
The interface starts up by loading three "facets" which can filter or cluster items for Wikidata Query to render as a graph. By default, clicking on the interface only prepares the query; a graph is rendered once the user switches "Live view" to "On."
Users can add any number of facets by clicking on the "Add" button or delete facets with "Remove." The "Reset" button will zero out all facets and remove the knowledge graph.
In the current prototype, the facets consist of Wikidata properties oriented towards browsing artworks, seeded with content derived from The Metropolitan Museum of Art's collection, but relevant to many art museums because of the wide-ranging and encyclopedic nature of The Met's holdings:
instance of (P31)
The list of choices for collection/institution contains more than 200 popular world-class institutions and collections. The other facets have lists of Wikidata's most used items, including more than 4,000 artists and 1,200 depicted themes. All of these are loaded from CSV spreadsheet files for speed, as live Wikidata queries would be prohibitively slow.
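Loading a facet's choices from a CSV file can be sketched as follows. The column names (`qid`, `label`, `count`) and the counts in the sample data are invented for illustration; the real files may be laid out differently.

```python
import csv
import io

# Invented sample data; the tool's real CSV files may use other columns.
SAMPLE_CSV = """qid,label,count
Q10884,tree,1200
Q467,woman,5400
Q8441,man,4300
"""


def load_facet_choices(handle):
    """Read facet choices from a CSV file object, most frequent first."""
    rows = [
        {"qid": row["qid"], "label": row["label"], "count": int(row["count"])}
        for row in csv.DictReader(handle)
    ]
    rows.sort(key=lambda row: row["count"], reverse=True)
    return rows


choices = load_facet_choices(io.StringIO(SAMPLE_CSV))
for choice in choices:
    print(f"{choice['label']} ({choice['count']})")
```

Because the file is read once at startup, populating a list of several thousand entries needs no round trip to the query service.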
Once "Live view" is turned on, a click on any list item brings up the graph in the space below. The response is typically fast, letting a user browse graphs within seconds, depending on the complexity of the search.
The interface currently has no safeguards, so the user has to judge how reasonable a request is. Trying to show a graph of all paintings across all institutions that depict a "woman" will likely bring back a flood of items and cause an extreme slowdown in the web browser, so care must be taken to ask for reasonable visualizations. Fortunately, the user can immediately click on other interface elements (or the "Reset" button) to abandon a query that is too lengthy or taxing, causing no long-term harm to the browsing session.
Filtering and clustering
The user can also choose to use the facet to filter the query, cluster like-items together around an item for that facet, or both filter and cluster at the same time. Users are encouraged to experiment with settings that achieve the desired result.
A good first test query would be to select three facets:
collection/institution (P195) - Met Museum - Filter
instance of (P31) - photograph - Filter
depiction (P180) - tree - Filter and Cluster
The user can try adding more institutions (or clicking on "Any" to show them all) or changing the "instance of" to paintings or prints. Multiple selections in a list are allowed by Control-clicking (Windows) or Command-clicking (Mac).
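One way to picture the difference between filtering and clustering is in the SPARQL each mode might produce. The sketch below is an interpretation, not the tool's implementation: a filtering facet becomes a required pattern, a cluster-only facet becomes an `OPTIONAL` pattern, and a clustering facet adds its value to the `SELECT` list so the graph view can draw it as a hub node linked to each matching item. The `Facet` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Facet:
    prop: str                 # Wikidata property ID, e.g. "P195"
    qids: List[str]           # items selected in this facet's list
    do_filter: bool = True    # restrict results to items matching the selection
    do_cluster: bool = False  # show the selected item as a hub node


def build_query(facets, limit=300):
    """Build a graph-view query honoring each facet's filter/cluster mode."""
    select = ["?item", "?itemLabel", "?image"]
    required, optional = [], []
    for index, facet in enumerate(facets):
        if not facet.qids or not (facet.do_filter or facet.do_cluster):
            continue
        var = f"?facet{index}"
        values = " ".join(f"wd:{qid}" for qid in facet.qids)
        pattern = f"VALUES {var} {{ {values} }} ?item wdt:{facet.prop} {var} ."
        if facet.do_filter:
            required.append(f"  {pattern}")
        else:
            # Cluster only: keep non-matching items, link matching ones to the hub.
            optional.append(f"  OPTIONAL {{ {pattern} }}")
        if facet.do_cluster:
            select += [var, f"{var}Label"]
    if not required:
        raise ValueError("at least one facet must filter the results")
    lines = ["#defaultView:Graph", f"SELECT {' '.join(select)} WHERE {{"]
    lines += required + optional  # required patterns first, then OPTIONALs
    lines.append("  OPTIONAL { ?item wdt:P18 ?image . }")
    lines.append('  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }')
    lines.append("}")
    lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


# The test query above: Met Museum, photograph, tree (filter and cluster).
print(build_query([
    Facet("P195", ["Q160236"]),
    Facet("P31", ["Q125191"]),
    Facet("P180", ["Q10884"], do_filter=True, do_cluster=True),
]))
```

Here only the "tree" item is added to the result columns, so each photograph is drawn connected to a single shared tree node, which is what produces the visual cluster.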
Developing this prototype in a Jupyter Notebook environment provides unmatched speed in going from concept to implementation. (The slight downside is that the tool needs a rather heavyweight container to run Python on the backend, which means a relatively long startup time when loading.)
In terms of future features and developments, please do feel free to contact the developer, or file an "Issue" at the GitHub repository. Some ideas already being discussed include:
There is currently no way to save the state of the interface, or even capture the existing query as it was sent to Wikidata Query. It would be fairly simple to save the SPARQL code for later use, but saving the interface state in some way so others can replicate it, or re-load it, would be ideal.
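A possible shape for this feature, offered only as a sketch: serialize the facet selections and the generated query to JSON, and encode the query into a Wikidata Query Service URL that others can open. The function names and the layout of the state dictionary are hypothetical.

```python
import json
from urllib.parse import quote

WDQS = "https://query.wikidata.org/"


def query_url(sparql, embed=False):
    """Return a shareable query service URL with the query in the fragment."""
    page = "embed.html" if embed else ""
    return f"{WDQS}{page}#{quote(sparql, safe='')}"


def dump_state(facets, live_view, sparql):
    """Serialize the interface state so it can be stored or shared."""
    state = {"facets": facets, "live_view": live_view, "sparql": sparql}
    return json.dumps(state, indent=2, sort_keys=True)


def load_state(text):
    """Restore a state dictionary saved by dump_state."""
    return json.loads(text)


sparql = "SELECT ?item WHERE { ?item wdt:P195 wd:Q160236 . } LIMIT 10"
facets = [{"prop": "P195", "qids": ["Q160236"], "mode": "filter"}]
saved = dump_state(facets, True, sparql)
print(saved)
print(query_url(sparql))
```

Passing `embed=True` points at the query service's embeddable result page instead of the query editor, which suits sharing a finished graph rather than the query behind it.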
The facets are currently ordered for a particular reason, but it would be nice to be able to re-sort them for different needs. For example, they are almost all sorted by "popularity" or "frequency" as indicated by a number next to the facet. The artists are sorted by the number of wiki links that exist for their Wikidata entry, and depiction items are sorted by frequency. A button to sort them alphabetically would be useful here.
One can imagine uses for the tool beyond artworks and museums. Facets don't need to be artists or depiction information, and could instead be biographical info, geodata or publication info. A whole set of parameters could be defined to browse film, literature, sports, fashion or any number of domains. This could quite easily be handled by defining a much longer list of facets. But a more scalable way may be to allow anyone to customize the interface (see next item).
With the current system, configuration parameters are loaded from files: a primary YAML configuration file provides general parameters, and one CSV file is defined for each facet. This was done for simplicity and speed. However, one path forward is to load the configuration not from a local file but from a URL, perhaps from a page on-wiki. Other tools in the Wikimedia sphere, such as Huggle, have used this method of storing a YAML configuration file on-wiki and reading the configuration from there.
In addition, the possible values for facets could be customized or generated more dynamically. Instead of being loaded from a CSV file, facets could also be loaded from a wiki page, possibly as a table, or perhaps from Commons in the Data namespace. If this configuration were read from a wiki page, one could conceivably use a tool like Listeria to generate the table from a SPARQL query, allowing people to collaborate on configuring the tool.
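For the Commons option, a page in the Data namespace stores tabular data as JSON with a `schema` describing the fields and a `data` array of rows. The sketch below parses such a document into facet choices; the sample document, its field names (`qid`, `label`, `count`) and its counts are invented, and fetching the page from Commons is left out.

```python
import json

# Invented example in the layout of a Commons tabular data (.tab) page.
SAMPLE_TABULAR_DATA = json.dumps({
    "license": "CC0-1.0",
    "description": {"en": "Example facet values for depicted themes"},
    "schema": {
        "fields": [
            {"name": "qid", "type": "string"},
            {"name": "label", "type": "string"},
            {"name": "count", "type": "number"},
        ]
    },
    "data": [
        ["Q10884", "tree", 1200],
        ["Q467", "woman", 5400],
    ],
})


def facet_choices_from_tabular_data(text):
    """Turn a tabular data JSON document into a list of row dictionaries."""
    document = json.loads(text)
    names = [field["name"] for field in document["schema"]["fields"]]
    return [dict(zip(names, row)) for row in document["data"]]


for choice in facet_choices_from_tabular_data(SAMPLE_TABULAR_DATA):
    print(choice)
```

Since the rows come back as plain dictionaries, they could feed the same list widgets that the CSV files populate today.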
On May 26, the Wikimedia Foundation and the Asian American Journalists Association (AAJA) co-hosted a virtual event on the topic of increasing representation of Asian American and Pacific Islander (AAPI) communities in the news, Wikipedia, and across the information landscape. Panelists included:
Andrew Lih, Wikimedian at Large, Smithsonian Institution (@fuzheado)
Lori Matsukawa, former KING TV news anchor, NBC Seattle
Jareen Imam, founding member Women Do News, Director of Social News Gathering, NBC News (@JareenAI)
Anusha Alikhan (moderator), Senior Director Communications, Wikimedia Foundation (@AnushaA100)
More information can be found at the following links, including the YouTube video of the event: