Big data
Deriving new opportunities from big data
The world's data is growing at an astounding rate and is expected to double every two years. However, data collected from a wide variety of sources, such as blogs, emails, videos, social media sites, photos, GPS, and other types of sensors, often goes unused. Three "V"s make this data difficult to analyze: its volume, the velocity with which it arrives, and its variety. In fact, many of these new types of data are unstructured and do not fit the schemas that businesses and existing analytics solutions are used to handling.
Beneath such unstructured, seemingly non-relational data lie hidden treasures of new insights and opportunities. To capture them, we need to collect, organize, and analyze data, which requires highly sophisticated processing, modeling, and analytics capabilities.
PARC's key enablers
Scientists and software engineers at PARC are developing technologies for various levels of the big data stack.
It all starts with data, and lots of it. PARC regularly gets access to a variety of very interesting data sets, including retail purchases, click streams, emails, hospital admissions, health care claims, system logs, fare and toll collection logs, in-game behavior (e.g. World of Warcraft), credit and debit card transactions, Twitter, Foursquare, and Wikipedia. Equipped with HIPAA-compliant compute facilities, Hadoop clusters, and a private cloud infrastructure, PARC provides the data security and processing power required to analyze sensitive data sets.
To a significant extent, the current availability of big data solutions must be attributed to the development of Hadoop. However, while Hadoop is the right tool for so-called "embarrassingly parallel" tasks, such as many text processing tasks, it is poorly suited to various other kinds of problems. Graph analysis is one of them. Applications in which the data takes the form of a graph, and in which the analysis must take that graph structure into account, are widespread; social network analysis and PageRank are two examples. To accommodate these needs, and to enable real-time analysis of graph data, we are developing a high-performance, in-memory graph analysis engine that exploits parallelized search and compact graph representations that are amenable to traversal queries. With it, we can compute relevant graph properties and queries several orders of magnitude faster than existing solutions. For instance, we can compute single-source shortest paths for a set of 40 million Twitter users with 1.5 billion connections in less than 3 seconds, using only 6 GB of RAM.
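To make "compact graph representation" and "traversal query" concrete, here is a minimal single-threaded sketch. It is not PARC's engine, whose internals are not described here. It assumes a compressed sparse row (CSR) layout, in which all adjacency lists are packed into one flat array, and an unweighted graph, so that single-source shortest paths reduce to breadth-first search. The names `build_csr` and `bfs_distances` are hypothetical. As a rough plausibility check only: with 32-bit vertex IDs, a flat array of 1.5 billion edge targets would occupy about 6 GB, which is at least consistent with the figure quoted above.

```python
from collections import deque


def build_csr(num_vertices, edges):
    """Pack a directed edge list into two flat arrays (CSR layout).

    offsets[v]..offsets[v + 1] delimits the slice of `targets`
    that holds the out-neighbors of vertex v.
    """
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1          # count out-degree of each vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]   # prefix sum turns counts into offsets
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()       # next free slot for each vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def bfs_distances(offsets, targets, source):
    """Hop counts from `source` to every vertex; -1 marks unreachable."""
    dist = [-1] * (len(offsets) - 1)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for i in range(offsets[u], offsets[u + 1]):
            v = targets[i]
            if dist[v] == -1:          # first visit is the shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Toy graph: vertex 5 has no incoming edges and is unreachable from 0.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
offsets, targets = build_csr(6, edges)
distances = bfs_distances(offsets, targets, 0)
```

A production engine would use fixed-width integer arrays rather than Python lists and would expand each BFS frontier in parallel; the sketch only shows why a flat layout makes neighbor scans cheap.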
As more and more aspects of the data center are virtualized, it is becoming harder to diagnose problems and to configure and control operations optimally. We extend Hadoop and other cloud computing platforms by applying model-based diagnosis, machine learning, and artificial intelligence planning and scheduling to two problems: 1) optimized scheduling of jobs onto available compute resources, and 2) automatic diagnosis of hard and soft faults. Applying these techniques improves the throughput of Hadoop and reduces the manual overhead involved in diagnosing faulty nodes.
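The scheduling problem can be illustrated with a deliberately simple stand-in. The sketch below is not the learned, model-based scheduler described above. It is a plain greedy heuristic that assumes each job has a known amount of work and each node a known relative speed, and that places the largest jobs first on whichever node would finish them earliest. The function name `schedule`, the job names, and the speed values are all hypothetical.

```python
def schedule(jobs, node_speeds):
    """Greedy earliest-finish-time placement on heterogeneous nodes.

    jobs:        {job_name: units_of_work}
    node_speeds: {node_name: units_of_work_per_second}
    Returns (plan, makespan), where plan is a list of
    (job_name, node_name, finish_time) tuples.
    """
    available_at = {node: 0.0 for node in node_speeds}
    plan = []
    # Place the largest jobs first so small jobs can fill the gaps.
    for name, work in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        # Pick the node on which this job would complete soonest.
        node = min(
            node_speeds,
            key=lambda n: available_at[n] + work / node_speeds[n],
        )
        finish = available_at[node] + work / node_speeds[node]
        available_at[node] = finish
        plan.append((name, node, finish))
    return plan, max(available_at.values())


jobs = {"a": 8, "b": 4, "c": 4}
node_speeds = {"fast": 2.0, "slow": 1.0}
plan, makespan = schedule(jobs, node_speeds)
```

In practice the per-node speeds are not known in advance and differ by job type, which is where learning them from observed runs comes in.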
Machine learning is at the core of most analytics solutions, and several very good open-source toolkits are available today that let more and more people get started quickly with common data mining problems such as classification and clustering. However, as with many things, the devil is in the detail. Getting results good enough to support optimal business decisions requires a deep understanding of machine learning algorithms and features, along with a lot of tacit knowledge.
At PARC we combine strong statistical methods from machine learning with domain-specific modeling to achieve better accuracy and precision than methods that use only one of the two. Our expertise in symbolic reasoning, combined with state-of-the-art machine learning, gives us an edge over approaches that rely solely on one or the other. In our applications we use this technology not to replace human expertise entirely but to augment it. In one customer project, for example, we made fraud auditors more productive by filtering out the vast majority of false positive cases generated by existing solutions, so they could focus instead on the most likely positive cases.
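The idea of pairing a statistical score with domain knowledge can be sketched as follows. This is an illustration under invented assumptions, not the method used in the customer project: the `Claim` fields, the benign-case rule, and the 0.5 threshold are all hypothetical. A learned model supplies a fraud score, a hand-written domain rule removes cases an auditor would dismiss anyway, and the remainder is ranked for human review.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    claim_id: str
    model_score: float          # hypothetical fraud probability from a classifier
    amount: float               # hypothetical claim amount
    provider_flagged_before: bool


def domain_rule_says_benign(claim):
    """Hypothetical expert rule: small claims from clean providers are benign."""
    return claim.amount < 100.0 and not claim.provider_flagged_before


def triage(claims, score_threshold=0.5):
    """Keep claims that both the model and the domain rule consider suspicious,
    ordered so auditors see the most likely positives first."""
    kept = [
        c for c in claims
        if c.model_score >= score_threshold and not domain_rule_says_benign(c)
    ]
    return sorted(kept, key=lambda c: c.model_score, reverse=True)


claims = [
    Claim("c1", 0.9, 5000.0, True),
    Claim("c2", 0.7, 40.0, False),   # high score, but the rule marks it benign
    Claim("c3", 0.2, 900.0, False),  # below the score threshold
    Claim("c4", 0.6, 300.0, False),
]
review_queue = triage(claims)
```

The output is a shorter, ranked queue for auditors rather than an automatic decision, which matches the goal of augmenting human expertise.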
Sentiment, Topic, and Demographics Analysis
PARC's Empath platform not only understands the sentiments that customers express on various social media platforms, but also automatically breaks that understanding down into finer categories, including topics and demographics. Today's sentiment and topic analyses answer questions such as "Why are people unhappy?", "What are the major complaints?", and "Has the sentiment changed over time?" Empath goes beyond this and also uncovers the demographics of social media authors, such as their approximate location, gender, age, and level of education. Empath does this automatically, based solely on data extracted from the content.
PARC's team uses natural language and machine learning in innovative solutions for search and discovery. Our expertise in natural language contributed to the success of companies such as Powerset, ScanSoft (Nuance), Microlytics, and Inxight. Powerset, a consumer search engine based on natural language processing technologies, was subsequently acquired by Microsoft.
PARC draws on deep expertise in social and behavioral sciences to provide solutions based on context-aware services. Our systems anticipate a user's situation, proactively serve their information needs, and personalize recommendations, creating many interesting applications for big data analytics.