An “Unstructured Data” Story: My Research Journey from Prague to Palo Alto
Technology and innovation may not be the first things that come to mind when you think of the Czech Republic. After all, there’s the history, the beautiful capital, and the world-famous beer. But the Czech Republic is also the birthplace of the polarograph, the place where soft contact lenses were invented, and home to hundreds of tech start-ups.
Coming from Prague, I was always influenced by this spirit of technological innovation. Today, I’m a research assistant at PARC, far away from my hometown, focused on my main interest, unstructured data: extracting domain knowledge from digital documents and storing it in a suitable form so intelligent systems can exploit it. While I’m still only at the beginning of my career, getting here has already been quite an interesting journey, and one I wanted to share with others.
Find out what you love
After getting my undergraduate degree in Software Engineering and Management, I started an internship at IBM. I was young and eager, and the idea of helping sell IT solutions seemed like a dream. While the internship was a great learning experience, I soon realized I wanted to dive deeper into the technical side and really explore what was possible in the world of IT. So I decided to start my master’s in Computer Science with a specialization in Cybersecurity at Czech Technical University (CTU) in Prague, hoping to broaden my horizons. During my time at CTU, I worked as a release engineer, maintaining the system infrastructure so that developers could work on the product more efficiently and deliver it to customers. Yet again, I found myself not totally satisfied, and I forced myself to ask, “What excites you in IT? What will be creative and challenging for you? What is it that you really love?”
The answer came surprisingly easily. Data! Data is the water of the digital world. It is all around us. We have to learn and continually relearn how to swim in it. It also didn’t hurt that Harvard Business Review named Data Scientist the sexiest job of the 21st century. So, I made the decision to change my specialization to Knowledge Engineering and I started working on my master’s thesis under Jan Sedivy, an ex-IBMer and ex-Googler with experience in speech recognition and natural language processing. Together, we focused on extracting knowledge from unstructured data.
What I love about unstructured data
Extracting knowledge from unstructured data has huge potential for many companies, across a variety of industries. The use of correctly extracted data could be interesting in a lot of different contexts, for example:
- Leveraging information from medical records to better analyze patients, and using machine learning and statistics for predicting appropriate treatments
- Conversational assistants, like chatbots used by companies to communicate with their customers, being able to understand user demands and answer them most effectively according to the company’s business model
- Automated decision-making processes in “Industry 4.0”
To understand unstructured data, it helps to start with structured data. Structured data is information that is stored according to a data model. This model defines the exact relations between the stored values. Consider a pancake recipe, as an example. The structured data might look like this:
Step | Instruction
1 | Mix ingredients to make batter
2 | Pour batter into heated pan
3 | Flip pancake when bubbles appear
4 | Remove when both sides are golden brown
The data in this table is easy to understand. You can easily look at any row in the “Step” column and find its “Instruction” value. Unstructured data, in contrast, might look more like this:
1: Mix ingredients to make batter
2: Pour batter into heated pan
3: Flip pancake when bubbles appear
4: Remove when both sides are golden brown
As you can see, this data is not organized into predefined fields and doesn’t reside in a conventional database. As humans, we can still easily infer the meaning just by looking at it: we see the numbers, understand the order, and interpret the text as instructions. A computer, however, has to perform each of these tasks explicitly. To begin, it might, for example, have to extract the first character of each line and recognize that character as a digit. And the data can of course be even messier. Consider a pancake recipe like this:
How to make pancakes:
First, mix ingredients to make batter.
Then pour batter into heated pan.
Next, flip pancake when bubbles appear.
Lastly, remove when both sides are golden brown.
How would a computer understand that? There are no step numbers to read anymore, only words like “first” and “next” that imply an order. The data quickly becomes harder to interpret, and this is why natural language processing is so challenging. But the potential is immense. Many estimate that 80-90% of business information in any organization exists in unstructured form. Unstructured data can be everything from emails to photos to presentations to documents. In other words, there is a lot of unstructured data out there to extract, interpret and utilize.
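To make the pancake example a little more concrete, here is a minimal Python sketch of how a program might turn both versions of the recipe into structured rows. It is only an illustration under simple assumptions, not how any real system at PARC works: the names RecipeStep, parse_numbered, parse_prose and CUE_WORDS are made up for this post, and the list of cue words covers just the four words used in the recipe above.

```python
import re
from dataclasses import dataclass


@dataclass
class RecipeStep:
    """One row of the structured form: a step number and its instruction."""
    step: int
    instruction: str


def parse_numbered(lines):
    """Parse lines such as '1: Mix ingredients' into RecipeStep rows."""
    steps = []
    for line in lines:
        # Leading digits, a colon, then the instruction text.
        match = re.match(r"\s*(\d+)\s*:\s*(.+)", line)
        if match:
            steps.append(RecipeStep(int(match.group(1)), match.group(2).strip()))
    return steps


# Hypothetical cue words that signal a new step in the prose recipe.
# Real text uses far more variety, which is where this approach breaks down.
CUE_WORDS = ("first", "then", "next", "lastly")


def parse_prose(lines):
    """Number the sentences that begin with a known cue word, in order."""
    steps = []
    for line in lines:
        # First word, an optional comma, then the rest of the sentence.
        match = re.match(r"\s*(\w+),?\s+(.+)", line)
        if match and match.group(1).lower() in CUE_WORDS:
            text = match.group(2).strip().rstrip(".")
            instruction = text[0].upper() + text[1:]
            steps.append(RecipeStep(len(steps) + 1, instruction))
    return steps


NUMBERED = [
    "1: Mix ingredients to make batter",
    "2: Pour batter into heated pan",
    "3: Flip pancake when bubbles appear",
    "4: Remove when both sides are golden brown",
]

PROSE = [
    "How to make pancakes:",
    "First, mix ingredients to make batter.",
    "Then pour batter into heated pan.",
    "Next, flip pancake when bubbles appear.",
    "Lastly, remove when both sides are golden brown.",
]

for row in parse_numbered(NUMBERED):
    print(row)
for row in parse_prose(PROSE):
    print(row)
```

Both calls recover the same four rows here, but only because the recipe happens to fit the cue-word list. Change “Lastly” to “Finally”, or put two steps into one sentence, and the second function quietly misses a step. That brittleness is a small taste of why hand-written rules don’t scale to real documents.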
Continuing my journey at PARC
After graduation, Jan Sedivy suggested I contact PARC about an internship. He had recently toured the US as a finalist in the Amazon Alexa Prize contest, where the challenge was to have a meaningful 20-minute conversation with an Alexa device. Jan and his team were invited by PARC Research Scientist Filip Dvorak to visit PARC, through a connection made by CzechInvest, a business development agency that helps promote the Czech Republic abroad. Jan thought PARC would be perfect for me. I knew of course about PARC’s legacy and contributions to the history of personal computing, and its famous inventions like Ethernet and the GUI, but I also knew it is a very different company today, and I was excited to learn what it is working on now.
I have been a PARC Research Assistant for the last three months in the Interaction and Analytics Laboratory, under PARC Research Manager Kyle Dent. Our work broadly covers the application of artificial intelligence in conversational assistants, natural language processing and big data. So far, what’s been exciting for me is that PARC realizes how valuable knowledge extraction from data is for tackling new markets, especially as enterprises go through digital transformations that output so much information in a processable format. Simply put, we want to apply artificial intelligence to all the unstructured data out there, to try to understand it a bit more, and to see if we can apply it to a new domain and create new business opportunities.
What’s next for me? I want to see how far my research goes, whether here, in the Czech Republic, or somewhere new. Maybe I’ll be able to help create a new product for Xerox or one of PARC’s many other clients, or help spin off a new start-up. Just as the possibilities are endless with unstructured data, I hope the same can be true for my research.
Learn more about PARC’s career and internship opportunities.