Creating a Conceptual Search Engine and Multimodal Corpus
For Humanities Research
The majority of scholarship on how people use the internet has tended to ignore the medium of the internet as a context shaping communication, focusing on it instead as a tool for accessing study subjects. Social science methodologies identify user demographics; focus groups are asked for their perceptions; the "digital divide" has been analyzed, as have consumer buying patterns. Pedagogical research focuses on what does or does not work in incorporating the internet into education. Internet users are treated as objects of study in ways that reflect traditional disciplinary assumptions. The growth of Web 2.0, which has seen an explosion of user-generated content published not solely for marketing or educational purposes, shows just how great the gaps in existing scholarship are. Additionally, scholars are daunted by the sheer amount of information available on the internet and the problems of gathering or quantifying it. This practical problem is crucial: How can we aggregate the huge variety of texts created by groups of users on the internet? What tools can be developed to find and mark relevant texts to generate specific data?
The original proposal for this paper was tied to an NEH Digital Humanities grant involving linguistics and computer science. However, a research problem resulted in changes in both technology and personnel. The change in technology involved moving from a conceptual search engine to a customizable web-crawler. The change in personnel involved the creation of a larger, more interdisciplinary team consisting of humanities scholars, social and behavioral scientists, and computer scientists. The single unchanged factor is the goal of the multimodal corpus.
The original collaborative project, which paired computer scientists working in data retrieval with humanities scholars working in linguistics and stylistics, involved sharing expertise in interdisciplinary methodologies of analyzing data. The proposal was to develop an initial prototype of a conceptual search engine that would be used to build a multimodal corpus of texts from online communities. The project plan was to illustrate how methods of data retrieval in computer science can be applied to the specific concerns of humanities scholars. But the computer science faculty member who was working to create the conceptual search engine was unable to continue work on the project.
As a result, while the new team continues to work on elements of the original linguistics project, our scope has now widened to make use of a more interdisciplinary mix of scholars to develop grant proposals that will be submitted to both the NEH and the National Science Foundation. One reason we are able to widen the scope of our grant-writing activities is that "linguistics" can be defined either as a humanities field or as a science. The NSF grant will focus primarily not on the prototype web-crawling program, which we created this year, but on the creation of a collaborative and interdisciplinary research team to develop new theories and methods of scholarship and teaching.
The multimodal corpus from the original project will be created using the new spider-bot program. Corpus linguistics and corpus stylistics prioritize language analysis based on databases of language samples collected in natural language situations. These databases, or corpora, can be based on transcriptions of spoken language or on scans of printed or published works. Until recently, however, corpora have not typically included language culled directly from postings by internet users, and the growing scholarship on computational linguistics and corpus linguistics published in the last few years tends to have a limited focus, either in terms of the data used (solely from Usenet or Twitter, for example) or in terms of the questions asked and elements analyzed (grammar checking, for example).
For example, Sebastian Hoffmann, in "Processing Internet-derived Text--Creating a Corpus of Usenet Messages," worked with posts taken from 12 Usenet sites; Jonas Sjöbergh's article "The Internet as a Normative Corpus: Grammar Checking with a Search Engine" used a single search engine to check grammar in internet texts in Swedish. Other linguists have constructed corpora from specific sites (such as drug information websites). Corpus linguistics and computational linguistics are beginning to conceptualize the internet as a pre-existing immense corpus in its own right, one comprised of textual, graphic, and audio files spread across multiple social networks and blogs; however, a range of difficulties in accessing and analyzing such data remains. Ongoing development of new search tools, whether interdisciplinary teams are creating them in digital humanities work or are using existing freeware or commercial programs, is an area that requires research and the collection of information on what tools are already available and how accessible the resulting corpora are (i.e., free use, fee-charging, etc.). Part of our new interdisciplinary team's work will involve gathering information about and testing these programs and tools as we develop and test our own.
With internal grant support from Texas A&M University's Office of Sponsored Research, I have been working with linguists, a psychologist, a psychology graduate student, and two computer science graduate students on our spider-bot project. The collaboration involved the creation of a prototype software program to gather, categorize, and store information from Internet-based message boards and communities. Unlike existing open-source crawlers or commercial products (which are not accessible to general users), the spider-bot we have developed is customizable. It can be adapted to search a variety of different internet sites (via specific URLs) and, even more importantly, can be customized to perform secondary analysis in much more detail than usual, because the data is saved in themed databases that will become part of the multimodal corpus. While our prototype works only with text data at this stage, we plan to incorporate ways of collecting graphic and audio material in the future.
After a semester's work, we are currently at the point where the spider-bot program exists and has been tested by the team on a limited basis. Further testing will take place over the next semester by individual members, as we develop a data management plan and work with our university and library specialists to arrange for storage and access. The spider-bot program is written in .NET and is able to access Internet-based information using any of the popular web browsers, such as Internet Explorer and Firefox. In addition, the bot is capable of being run on the Windows or Mac platforms. The Linux OS may also be capable of running the bot program; however, at the moment, the program is developed only for Windows and Mac. Currently, the primary function of the bot is to gather pre-defined public information from Internet-based bulletin boards. The information gathered by the bot will be public information; that is to say, anyone who goes to the sites where the bot gathered its information will be able to see the same information the bot recorded. Any information that requires a username and password is outside the scope of this project.
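The bot's core task, pulling predefined public fields out of bulletin-board pages, can be sketched in a few lines. The following is an illustrative Python version (our actual prototype is written in .NET), and the class="post" markup it looks for is a hypothetical example rather than any particular board's real structure; the sketch also assumes well-formed markup in which every opened element is closed:

```python
from html.parser import HTMLParser

class PostParser(HTMLParser):
    """Collect the text of elements marked with a hypothetical
    class="post" attribute, the way a bulletin board might tag
    each message body."""
    def __init__(self):
        super().__init__()
        self.posts = []
        self._depth = 0  # > 0 while inside a post element

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1          # nested tag inside a post
        elif ("class", "post") in attrs:
            self._depth = 1           # entering a new post
            self.posts.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.posts[-1] += data    # accumulate the post's text

def extract_posts(html):
    """Return the plain text of each post found in the page."""
    parser = PostParser()
    parser.feed(html)
    return [p.strip() for p in parser.posts]
```

A real crawl would fetch each page by URL before parsing; this fragment shows only the extraction step, since that is where the "pre-defined public information" is identified.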
Some of the information targeted for gathering at this stage includes: username, date, time, word count, words over six letters in length, words six letters or shorter in length, icons, and unique words (e.g., LOL, OMG). One of the team members in psychology will use James Pennebaker's Linguistic Inquiry and Word Count (LIWC) text analysis program. This validated program provides empirical results based on word categories that are considered psychologically meaningful. The LIWC can be customized, but only with specific words entered into a dictionary (rather than with parsing elements such as grammatical categories), which is why the linguists on the team are planning work using some of the existing linguistic parsing programs. Part of the process, which began in the weekly meetings we held during Fall 2011, is the way interdisciplinary conversations involve the questioning of methodological assumptions.
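The per-post measures listed above (word count, words over and under six letters, and netspeak tokens such as LOL or OMG) can be computed straightforwardly. The following is an illustrative Python sketch, not the .NET prototype's code, and the NETSPEAK set is a hypothetical stand-in for the team's actual word list:

```python
import re

# Hypothetical netspeak markers; the project's real list would be larger.
NETSPEAK = {"LOL", "OMG", "BRB", "IMHO"}

def post_metrics(text):
    """Compute the per-post measures described above for one message body."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "word_count": len(words),
        "over_six": sum(1 for w in words if len(w) > 6),
        "six_or_under": sum(1 for w in words if len(w) <= 6),
        "netspeak": sorted({w.upper() for w in words} & NETSPEAK),
    }
```

Measures like these are deliberately shallow (surface counts rather than parses), which is exactly why they can be gathered at crawl time and stored alongside each post, leaving deeper analysis to tools like LIWC or linguistic parsers.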
The information will be stored in a SQL database administered by the Texas A&M University-Commerce library system. Access to the stored information will be controlled using role-based access control (RBAC). Administrators and principal investigators will have full control over the information in the data repository. Others who want access to all information in the data repository can be granted it by submitting a request to the library system administrator. Users who want general access require no special permission. A web portal will be used to request information from the database. Users will go to the web site, select from available categories in a drop-down list, and submit the request. The output will be displayed in HTML, PDF, .txt, or .csv format and will be available to save or print.
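As a rough sketch of the storage-and-export path, the following Python fragment builds a small in-memory SQL table and returns a category query as CSV, one of the portal's planned output formats. The table layout, column names, and sample rows are illustrative assumptions, not the repository's actual schema (which will be administered by the library system):

```python
import csv
import io
import sqlite3

# Illustrative themed table; names and sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    username TEXT, posted_at TEXT, word_count INTEGER, source_url TEXT)""")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", [
    ("fan_a", "2011-09-01T10:15", 120, "http://example.org/board/1"),
    ("fan_b", "2011-09-01T11:02", 45,  "http://example.org/board/1"),
])
conn.commit()

def export_csv(conn, category_sql, params=()):
    """Run a category query (as chosen from the portal's drop-down list)
    and return the result as CSV text."""
    cur = conn.execute(category_sql, params)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur)
    return buf.getvalue()

report = export_csv(
    conn, "SELECT username, word_count FROM posts ORDER BY word_count DESC")
```

The same query result could equally be rendered as HTML or plain text; CSV is shown because it is the format most directly reusable by other analysis tools.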
My work in this larger project is primarily that of a fan studies expert who has done a number of pilot presentations drawing on sociolinguistic methodologies applied to fairly small (in internet terms) amounts of text. The multimodal corpus and our work will not be limited to fandom sites; however, since fans have been early adopters of technology and have been actively creating internet communities since the earliest iterations of the world wide web, their productions are, as Henry Jenkins has argued in Convergence Culture, useful to study as early models of what has become more mainstream and commonly seen on Web 2.0, especially with the explosion of social networking sites and special-interest communities. My work has focused on analyzing discourses of minority and majority fans in online communities, specifically fandom communities. As an active fan in online media fandom since 2003, as well as having been active in fandom during the 1970s, I have been immersed in some fan cultures. The debates that began to be widely held on the social networking site LiveJournal during 2006 and 2007 took place in a variety of media fandom communities: Stargate: Atlantis, Doctor Who, Life on Mars, Harry Potter, and even multi-fandom communities. Hotly debated were racial and class stereotypes in fan fiction, as well as racial and class stereotypes in the canon texts of the fandoms, including racist terminology used by fans that embodied histories and etymologies not widely known outside the United States and, finally, ignorance of Jewish religious practices. Additional levels of conflict occurred because of the international demographic of online fandom, with debates over the history and contemporary racial attitudes in the primarily English-speaking fandoms of the United States, United Kingdom, Canada, and Australia.
Other disagreements concerning anti-racist strategies and rhetorics, including the issue of what "tone" can or should be taken when noting the existence of racist language, imagery, or characterizations, reflect different activist theories and practices. The issue of intersectionality (questioning the single focus on "race" or "gender" while ignoring class, sexuality, or ability status) has also affected the nature of the debates in online (and, increasingly, offline) fandom cultures.
I argue that how the internet shapes communication, and, specifically, how different internet sites shape communication differently, played a major part in how the debates spread. In later years, especially 2009, the differences between LiveJournal (and its clones, such as InsaneJournal, and the fork, Dreamwidth) and blogs were highlighted by participants. In the LiveJournal debates, a single event (a story, a post, an announcement) could initiate a rapidly moving stream of posts that quickly spread outside individual fandom communities, both because of cross-fandom communities dedicated to posting news and linking to posts across fandoms and because of fans' participation in multiple online fandoms. While many white fans criticized such debates as something new and unusual in their fandoms, often insisting that the conflicts were "harshing their squee," there was and is widespread agreement among fans of color and allies that the events were simply the latest in a historical and consistent ongoing pattern of white privilege in science fiction culture. That pattern included a range of racist behaviors that institutionalized the marginalization of and discrimination against fans of color; the problems, which had always been there, were only now becoming visible to the dominant majority (white fans) on the internet in ways that differed from what could be "seen" in offline con cultures or, alternately, in the earlier periods of online fan culture that were predominantly book-oriented and existed on centrally controlled listservs and archives.
LiveJournal (begun in 1999) changed fandom structure from centralized listservs and archives to a web of individual journals and multiple communities. The ease of setting up LiveJournal communities, as opposed to maintaining the earlier listservs and archives, marked a distinct difference from having only a few listservs per fandom. Fans who felt marginalized or ignored in the central or major listservs moderated by a fan or a group were able to leave and start others, or multiple others. With less centralization, discussions moved rapidly across a number of individual journals, branching off quickly. While there are lists of the debates (often called "linkspams") made by fandom newsletters or individual fans, there is no guarantee that all posts relevant to a discussion were linked, or that people entering the discussions after some time has passed will be aware of the earlier discussions.
While discussions about race and racism in fandom and in the book and media texts are not new, nor limited to LiveJournal (or, currently, the newer social networks such as Tumblr and Twitter), I would argue that these discussions can be relatively easily accessed and viewed, compared to earlier print 'zines and listservs, which may have been locked and are not easily accessed. Since my scholarship on literary texts already involved critical race theories as well as gender and sexuality topics, I became interested in how the freely available online texts by internet users might contribute to the scholarship in these areas. My original focus was the debate known as Racefail 09, a wide-ranging series of posts on LiveJournal and on blogs that took place during the first three months of 2009. The major Racefail linkspam (by rydra_wong) lists more than 1,000 posts (many of which have hundreds of comments) by science fiction fans and professional authors and editors. The sheer amount of data in Racefail taught me early on that it was impossible to work with traditional methodologies of rhetoric or discourse analysis. I began to explore the possibilities of a corpus which allows the inclusion of a wider range of discussions and can grow in the future to include a range of internet communities as well as those relating to online fandom. A corpus linguistics methodology analyzes patterns in large collections of text rather than individual intent, resulting in a pattern analysis of aggregated data. As the development of computational linguistics shows, not only are digital tools an absolute necessity when working with the huge and messy corpus that is "the internet," but so is an interdisciplinary approach that can only be gained by working with academics trained in multiple disciplines.
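A minimal illustration of what "pattern analysis of aggregated data" means in practice: the Python sketch below counts how often chosen keywords occur across a set of posts, treating the collection as a whole rather than attending to any single author's intent. The two-post corpus and the keyword set here are invented for illustration:

```python
import re
from collections import Counter

def keyword_frequencies(corpus, keywords):
    """Aggregate keyword counts across a collection of posts: the unit
    of analysis is the corpus, not the individual author."""
    counts = Counter()
    for post in corpus:
        counts.update(w for w in re.findall(r"[a-z']+", post.lower())
                      if w in keywords)
    return counts

# Invented two-post sample standing in for a much larger collection.
corpus = [
    "The discussion of privilege spread across many journals.",
    "Privilege and tone arguments came up again and again.",
]
freq = keyword_frequencies(corpus, {"privilege", "tone", "race"})
```

At the scale of Racefail's thousand-plus posts, counts like these (and their extension to collocations and keyword-in-context concordances) are what make the aggregated data tractable at all.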