What I learned about #BigData this week

The Symposium on Digital Curation in the Era of Big Data: Career Opportunities and Educational Requirements (July 19, 2012) was hosted by the National Academies of Science. In no particular order, here are the highlights and comments that grabbed my attention.

Loved David Weinberger’s comment: “The value of a scientific information commons is to have your nerds arguing with my nerds”.

NSF Office of Cyberinfrastructure sees data as a transforming agent

  • will begin non-governmental awards and working groups across global boundaries soon
  • NSF stresses agile development, rough consensus to push forward quickly, and community involvement
  • What infrastructure is needed to move terabytes and petabytes quickly? How do we build and sustain that network?

IMLS expects more proposals to educate MLIS students and archivists in the future; it is stressing grant-funded MLIS education that prepares students for handling and life-cycle management of digital content and for analyzing data sets

David Weinberger, Harvard University, author of “Too Big to Know” and “Everything is Miscellaneous”

  • move to filter data on the way out, not on the way in as a search and retrieval strategy
  • Today we collaborate across namespaces
  • Data as cells can be modeled by modeling a domain until you hit a certain level of complexity
  • Integrating multiple complex models increases the error rates
  • “The value of a scientific information commons is to have your nerds arguing with my nerds”

Joshua Greenberg, Sloan Foundation

  • A lack of digital curation capacity at the producer level provides a shaky foundation for big data future
  • What is a data scientist? Universities scrambling to train up staff, but perhaps not in digital curation activities
  • The Sloan Foundation is funding data wrangling efforts, new skills in analysis, and computational research

Myron Gutmann, NSF

  • Need specialized training and education w/in the scientific community itself; training for methodologists
  • Must integrate digital curation into the scientific research process; libraries, archivists, and scientists join hands!
  • The big data announcement from March drives toward more attention to curation, analysis, and preservation in context
  • Committed to learning how scientific work is done, do as much training as they can by partnering w/ universities
  • How do we integrate communities of interest (COIs) into the scientific research community where the work is being done? What kinds of communities for data are out there? We need to build COIs around data in new and interesting ways
  • NSF policy requirement will try to tease out the findings of data management plans submitted since Jan 2011; one change will likely be in the NSF bio sketch, which will require a list of by-products of research (reports, data sets, videos)

Michael Stebbins, White House OSTP

  • Must be cautious of burdens on scientists; agencies are already queuing up policies on managing open data
  • Funds will naturally be shifted to solve big data problems
  • Forming private-public partnerships accelerates the research; nucleates activity
  • Administration worked hard to improve public access to technical data, technical publications and raw data sets
  • Data management plans needed for agencies; having those plans reviewed by peers was a great idea
  • At a crossroads in assessing which concerns about burdens need to be addressed; there are deep concerns related to sharing data

Margarita Gregg, NOAA

  • The digital era includes digitizing and harmonizing data that exists in tangible format only.
  • NOAA understands they will not be able to preserve everything, all data, in perpetuity.
  • The knowledge and skills they need lie at the intersection of scientist, IT specialist, and librarian
  • Requirement for the future is to find interdisciplinary-trained, hybrid workers
  • People need to be comfortable with manipulating, understanding, and extracting data, and then be able to translate it into useful products; they also need to discover which tools people really need and how to provide them so that they are understandable to the end user.
  • Most pressing personnel needs are in data mining; systems architects; scientific stewardship
  • Workers need skills in digital rights management and intellectual property management

Anne Kenney, Director, Cornell University Libraries

  • 62% of the library budget goes to electronic resources, mostly from just a few publishers; Big Science is a major driver for ACRL libraries
  • Digital curation related issues prompted eScience working group at ACRL which noted gaps in capabilities
  • Guide for research libraries published as result of NSF data management mandate; eScience Institute hosted by ACRL
  • 7 roles for librarians and archivists listed in work entitled “New Roles for New Times”
  • in Humanities, focus on digital learning and the creation of scholarly products rather than on research products
  • in eScience, focusing more on harmonization of initially captured data; social networking in virtual communities
  • Lots of interest in embedded scientists in the library
  • Reskilling for Research — identifies 9 gaps in training (data mining, metadata creation, etc.) from librarian perspective
  • there may be a role for teaching libraries, just as there are teaching hospitals

Vicki Ferrini, Associate Research Scientist, Lamont-Doherty Earth Observatory, Columbia University

  • Works as a data scientist; a marine scientist trained in geoinformatics; liaises with the scientists to translate and apply data models, develops data discovery tools, builds data compliance tools for NSF requirements, and builds education materials
  • The scientific data continuum has changed (it used to mean only making data available as part of a published report); Columbia now uses data archivists for feedback between data producer and data consumer
  • Intersect data producers, data consumers, and data providers to find the data scientist; need domain knowledge, need acquisition skills, understanding of the data requires grounding in the science

Elizabeth Liddy, Dean, Syracuse University School of Information

  • Data scientist competencies: data analytics, data structure, data mining, ability to run information extraction on unstructured data, statistical analysis, recognizing risk, noise, and data quality, data and information visualization, understanding of infoviz tools, being able to design
  • Data archiving and preservation, how to select and provide access and storage, data stewardship, migration of data
  • Task force assembled at the iConference worked on these competencies

Nancy McGovern, Curation and Preservation Services, MIT Library

  • Amazon web conference in May: challenges include scale, complexity, speed
  • Start with data, then end with BIG data
  • Create a SWOT analysis, take your library skills in and compare to desired skills to define gaps
  • She helped with UNC Chapel Hill outcome matrix with categories for skill sets
  • Look for findings from project DIPR on dissemination packages research
