Saturday, July 4, 2020

Digital Clone technology for devices - Thoughts from 2018

Image recognition models trained on datasets such as ImageNet have surpassed human-level accuracy. Word vectors learned from large text corpora have improved the natural language understanding capability of software systems. End-to-end deep learning has improved speech recognition and achieved parity with humans. We can apply these advancements to redefine how customers interact with businesses on their devices.

Users carry mobile devices all the time. The phone captures rich information about the user: favorite applications, search queries, and the places the user visits. Despite this rich information, the current generation of software and hardware is still not able to recommend the content the user actually wants to see. The user still needs to go to a search engine or a portal to find information and content. The user still needs to learn how to interact with each application, because there is no universal interface to help. Switching context to Augmented Reality (AR) applications, the user has to painstakingly drag virtual objects into the scene by hand to see how they look in the real world, and the current generation of AR applications has limited support for natural language interaction.

In this invention disclosure, we will describe how smarter hardware and software leveraging natural language, image, and user behavior analysis can improve experiences for users and businesses. We will discuss a personalized behavior model of the user that simulates the user's thinking process. The personalized behavior model takes in aggregated user behavior and the application context captured by the pixels on the screen the user is looking at, and generates actions and content recommendations.

It is to be noted that this is unlike the virtual agents from Apple, Google, and Amazon, which are triggered by hotwords in the user's speech and do not use visual context or the user's prior context to generate content recommendations.

Behavior processor:

The current generation of devices knows a great deal about the user. They can capture where the user was, what the user is seeing now, what the user has seen and read in the past, whom the user talked to, what messages the user sent to friends, and so on. Deep Learning and Reinforcement Learning techniques have improved image understanding, text extraction, and natural language understanding capabilities.

Despite this rich information and technology progress, the current generation of devices is still not able to predict the content that the user likes. The user still has to go to a search engine and painfully type the search query on a small keyboard to find the information he wants. The user still has to open the browser and type www.yahoo.com to read the news. The user cannot converse with the applications on the device, even though machines can now read and understand what the user is reading and answer questions in natural language.

In this invention, we will describe a hardware and software component called the behavior analyzer, which can be embedded into devices. The behavior analyzer running on the device can use the application context, by modeling what the user is looking at, together with aggregated information about the user, to generate content the user will like and to execute actions on behalf of the user.

In an embodiment, to figure out what the user is reading and viewing on the phone, the behavior processor includes a Location Analyzer, a Vision Analyzer, a Text Analyzer, an Application Context Analyzer, a Memory component, a Controller component, and a Model Manager. Using these components, the behavior processor forms a hypothesis about the user's activity, continuously learns from the user's interactions with content on the phone, and tries to generate an appropriate content item or response in a timely manner.
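A minimal structural sketch of how these components might be composed on the device is shown below. The class and method names are illustrative assumptions for exposition, not part of the disclosure.

```python
# Illustrative sketch of the behavior processor composition described above.
# All class and method names are assumptions, not a specification.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class BehaviorProcessor:
    location_analyzer: Any      # infers where the user is
    vision_analyzer: Any        # image embeddings from on-screen pixels
    text_analyzer: Any          # text embeddings from on-screen text
    app_context_analyzer: Any   # which app is open, visible UI elements
    memory: List[Dict] = field(default_factory=list)   # recent context
    model_manager: Any = None   # loads/updates the personalized model

    def step(self, screen_pixels, sensors) -> Dict:
        """Form a hypothesis about the user's current activity and
        return a content or action recommendation."""
        context = {
            "location": self.location_analyzer.analyze(sensors),
            "image_emb": self.vision_analyzer.embed(screen_pixels),
            "text_emb": self.text_analyzer.embed(screen_pixels),
            "app": self.app_context_analyzer.current(),
        }
        self.memory.append(context)                 # controller keeps history
        model = self.model_manager.current_model()  # personalized model
        return model.recommend(context, self.memory)
```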

In an embodiment, we can use a combination of multiple architectures as will be discussed in the disclosure below to generate content and action recommendations based on the context.

Behavior processor to recommend content:

Let us say a user is taking pictures of his family on New Year's Eve. If the user is generally active on a social network and posts pictures after taking them, then there is a high probability that he will share the New Year pictures on Facebook. In the current experience, the user has to open the social network, choose the pictures taken with the camera, and then post them on Facebook.

Switching context to another experience, would it not be easier to show search results before the user decides to go to a search engine such as Google.com and type a query?

Experiences like the above can be improved substantially using the behavior processor. The behavior processor can run as an offline process whenever the user starts an application, or as a process that executes every 'x' minutes, where 'x' is a configurable parameter. The Application Context Analyzer component can take the pixels on the device that the user is looking at and process them with a text recognition component to extract text embeddings. The pixels can also be fed to an object detection DNN to get image embeddings for the application. In an embodiment, we can train a general model for the behavior processor based on the user cluster associated with the device. In embodiments, the users can be clustered using a K-Means clustering algorithm.
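A hedged sketch of the two pieces mentioned here: turning the pixels the user is looking at into text and image embeddings, and grouping users into clusters with K-Means. The encoder choices and feature layout are assumptions for illustration.

```python
# Sketch: screen-context featurization plus K-Means user clustering.
# The OCR, text encoder, and image encoder are passed in as callables;
# which concrete components are used is an assumption, not specified here.
import numpy as np
from sklearn.cluster import KMeans

def screen_features(screen_pixels, ocr, text_encoder, image_encoder):
    """Turn the pixels the user is looking at into text + image embeddings."""
    text = ocr(screen_pixels)                  # text recognition component
    text_emb = text_encoder(text)              # e.g. averaged word vectors
    image_emb = image_encoder(screen_pixels)   # object-detection DNN features
    return np.concatenate([text_emb, image_emb])

def cluster_users(user_profiles: np.ndarray, k: int = 20) -> np.ndarray:
    """Group users into k behavior clusters; one general model per cluster."""
    return KMeans(n_clusters=k, random_state=0).fit_predict(user_profiles)
```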

The generalized model for the user cluster can be trained with a neural network on anonymized training data from the users in the cluster. We will use techniques borrowed from provisional document 62543400, titled "Techniques to improve Content Presentation Experiences for businesses", to build the general model. In an embodiment, the generalized model for predicting a sequence of user actions can be built by training a Recurrent Neural Network or a sequence-to-sequence algorithm with attention on the user activity.

The generalized model can be built with a Deep Neural Network fed with training data from the Location Analyzer, Vision Analyzer, Application Context Analyzer, and Memory component, together with the application actions and content follow-ups observed within the user cluster. The DNN is trained to predict application actions such as entering search engine queries, sharing to social networks, sending an SMS message to a friend, or calling a merchant, and to generate content such as proactively showing interesting news items or an update from the social network.
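One way to realize this generalized cluster model is a small feed-forward network over the concatenated analyzer features. The PyTorch sketch below is an illustrative assumption about layer sizes and the action vocabulary, not the disclosed architecture.

```python
# Sketch: generalized cluster model as a feed-forward DNN (PyTorch).
# Layer sizes and the action vocabulary are illustrative assumptions.
import torch
import torch.nn as nn

ACTIONS = ["search_query", "share_social", "send_sms", "call_merchant",
           "show_news", "show_social_update", "do_nothing"]

class ClusterBehaviorModel(nn.Module):
    def __init__(self, feature_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, len(ACTIONS)),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features = concat(location, vision, text, app-context, memory summary)
        return self.net(features)   # logits over possible actions/content types

# Trained with cross-entropy on anonymized (features, next-action) pairs
# collected within the user cluster.
```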

The trained general model for the user cluster can then be pushed to the device. In an embodiment, the Model Manager component initializes the general model either during device setup or as part of the boot process.

The general model can then be further retrained and personalized for the user. In an embodiment, this can be done using Reinforcement Learning methods. We can model content and action recommendation as an MDP. The aggregated user behavior updates from social networks and news articles form the state for the user. The action space can be: show a content recommendation, display an application action, or do nothing. The reward function can be correctly predicting the action at time t. We can then use Policy Learning or Value Iteration approaches to choose an action. To start with, a general Reinforcement Learning model can be learned offline on the user cluster, initialized from the generalized model. That model can then be personalized by adjusting it to maximize explicit user interaction. The personalized user behavior model can then be persisted on a remote server over the internet, so it can be used on the user's other devices and internet ecosystems.
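The MDP framing can be made concrete as follows. The reward values and the tabular Q-learning update below are a simplified assumption used to illustrate the state/action/reward decomposition; they are a stand-in for the policy learning or value iteration approaches named above, not a production design.

```python
# Simplified sketch of the content/action recommendation MDP.
# States, actions, and rewards mirror the description above; tabular
# Q-learning is an illustrative stand-in for policy/value iteration.
import random
from collections import defaultdict

ACTIONS = ["show_content", "show_app_action", "do_nothing"]

q = defaultdict(float)          # Q[(state, action)] -> value
alpha, gamma, eps = 0.1, 0.95, 0.1

def choose_action(state):
    if random.random() < eps:                         # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])  # exploit

def update(state, action, reward, next_state):
    """reward = 1 if the prediction at time t matched what the user did."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
```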

In another embodiment, we can use an end-to-end neural network consisting of Policy Gradient Deep Reinforcement Learning on top of a Deep Neural Network (DNN). The DNN with attention can generate user behavior embeddings from the offline user cluster behavior data. The generic model can then be personalized for the user by adjusting the loss function in the Policy Gradient Deep Reinforcement Learning to predict the user's actions.

In yet another embodiment, we can train a general model that does imitation learning for user clusters on the behavior sequence data. We can then apply techniques from One-Shot Learning to fine-tune the model to an individual user's behavior.

It is to be noted that we are proposing an architecture for personalization and for simulating user behavior that differs from the current generation of ML models. Most current systems are built on the premise of a single global model for all groups of users; personalization is done within that single global model by adding user features as additional inputs. A single model for all users substantially simplifies validation and debugging. The architecture in this disclosure instead builds one model per user. A per-user model gives the model more freedom to choose parameters that apply to that specific user, and it can be made complex enough to mimic that user's behavior and actions. It also removes the burden on the model of optimizing across all groups of users at once.

Behavior processor as a virtual agent for an application:

Patent Application US 15/356,512 describes a virtual agent for an application/website that can converse in natural language using external API integrations. The behavior processor can also act as a virtual agent that interacts in natural language or natural speech for an application, without the application having to manually add an external API service.

The behavior processor has the application context: what the user is seeing and looking at in the application, who the user is, and the buttons and text in the application. The behavior processor will also have access to external intelligence added through manual rules and/or derived by crawling the application. The behavior processor can also use information about the user aggregated from multiple ecosystems.

In an embodiment, the behavior processor can use the information identified in the above paragraph to answer questions about the service in the application and do actions in the application.

In an embodiment, the behavior processor can use Imitation Learning and One-Shot Learning approaches to execute actions in the application on behalf of the user. The behavior processor can also learn from other users' interactions that happen on the cloud.

Behavior processor to help with Augmented Reality application:

Companies such as Flipkart, Amazon, and Walmart sell furniture, dresses, shoes, and other merchandise in their mobile eCommerce apps. Before purchasing, the user wants to see how the furniture fits in their living room, or how the dress fits on them.

The eCommerce companies use augmented reality experiences to increase user engagement with merchandise in their mobile applications. For instance, a user can choose a TV stand from the furniture category, point their phone's camera at their living room, and move the chosen TV stand around to get a physical sense of how it looks in the room.

This painful experience of moving a virtual object such as furniture from the mobile app into the physical world can be improved by adding to the flow a software virtual agent that interacts in natural language. This virtual agent can be embedded within the app or triggered through a general voice agent on the phone such as Siri on iPhone or Google Assistant on Android. The virtual agent can be embedded in the app using a third-party library or be part of the application itself. The behavior processor described above can also act as the virtual agent for the eCommerce application.

The virtual agent can take the voice input, optionally convert the voice to text, and figure out the user's intent. The entities in the user utterance can be identified using slot filling algorithms. The bitmap of the physical scene captured by the camera, a textual description of the image, and the image of the object being shopped for can be provided as additional context to the virtual agent. The virtual agent can use this additional context when figuring out intents and entities.
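A hedged sketch of this intent and slot-filling step, with the visual context concatenated to the utterance features, is below. The intent labels, slot names, and encoders are assumptions for illustration.

```python
# Sketch: intent detection with visual context, plus simple slot filling.
# Intent labels, slot names, and the models passed in are assumptions.
import numpy as np

INTENTS = ["place_virtual_object", "remove_virtual_object", "ask_product_info"]

def classify_intent(utterance_emb, scene_emb, object_emb, intent_model):
    """Fuse the utterance with camera-scene and product-image context."""
    features = np.concatenate([utterance_emb, scene_emb, object_emb])
    return INTENTS[int(intent_model.predict([features])[0])]

def fill_slots(tokens, tagger):
    """BIO-style slot filling, e.g. 'guppy fish' -> object, 'aquarium' -> target."""
    tags = tagger.predict(tokens)          # e.g. a CRF or RNN tagger
    slots = {}
    for token, tag in zip(tokens, tags):
        if tag != "O":
            slots.setdefault(tag.split("-")[-1], []).append(token)
    return {slot: " ".join(words) for slot, words in slots.items()}
```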

In an embodiment, the virtual agent can use Neural Module Networks to understand the Virtual Image in the application, the title, and category of the image, the Physical Context, and the Natural Language utterance. In an implementation, the Neural Module Networks can be dynamically assembled by parsing the Natural Language utterance. In another embodiment, we can train an end to end model using Reinforcement learning.

After understanding the intent using the Neural Modules, we need to complete the action. One action can move the virtual object from the site into the user's physical environment. Another example action can take a fish image from Google Images and place it in a physical aquarium to see how the virtual fish would look in an aquarium at home.

Action sequences for an intent, such as moving an object from one location to another, can be configured manually for a Natural Language intent. A Deep Neural Network can also be trained to produce actions from training data consisting of actions, natural language utterances, and scene input. In an embodiment, we can use a Deep Reinforcement Learning approach on top of Neural Modules for Natural Language Understanding, Object Detection, and Scene Understanding to execute actions.
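A minimal sketch of the manually configured mapping from an intent to an action sequence follows. The step names and the runtime interface are assumptions about what an AR runtime might expose.

```python
# Sketch: manual configuration of action sequences per Natural Language intent.
# Step names are illustrative assumptions about the AR runtime's primitives.
ACTION_SEQUENCES = {
    "place_virtual_object": [
        ("detect_target_surface", {"from": "camera_scene"}),
        ("fetch_object_model",    {"from": "catalog_or_image_search"}),
        ("render_object",         {"anchor": "target_surface", "scale": "auto"}),
    ],
    "remove_virtual_object": [
        ("select_rendered_object", {}),
        ("remove_from_scene",      {}),
    ],
}

def execute(intent, slots, runtime):
    for step, params in ACTION_SEQUENCES[intent]:
        runtime.run(step, {**params, **slots})   # slots fill in object/target
```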

In another embodiment, we can use Imitation Learning techniques to execute action sequences. We can use techniques borrowed from search-engine query rewriting to gather training data for imitation learning. For instance, let us say a user says, "I want to see how a Guppy fish looks in my aquarium" while pointing the Augmented Reality device at his aquarium.

Let us say the behavior processor does not recognize the utterance in the context of the visual scene and says, "Sorry, I can't help you." The user will then go to an image search engine such as Google.com, search for Guppy fish, and move the Guppy fish into the aquarium.

The behavior processor can learn from this interaction for the user cluster and apply it to future interactions down the line. This can be done by applying one-shot learning techniques to the general model that we trained for AR applications.

Unified Model for different application scenarios:

In this disclosure, we talked about how a Behavior Processor can use application and user context to simplify user interactions.

We proposed different use cases for the Behavior Processor, and different DNN architectures for different use cases. We can build a unified software component by combining the use cases. In an embodiment, we can run a simple deep learning classifier on the application and user context to decide which model to run. In another embodiment, we can train an end-to-end neural network on all the use cases and build a unified model that helps the user across application contexts.
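One way to combine the use cases is a lightweight dispatcher that routes the current context to the appropriate specialized model. The scenario labels below are assumptions for illustration.

```python
# Sketch: a lightweight dispatcher that picks which specialized model to run.
# Scenario labels and the classifier choice are illustrative assumptions.
SCENARIOS = ["content_recommendation", "app_virtual_agent", "ar_assistant"]

def dispatch(context_features, scenario_classifier, models):
    """Classify the application/user context, then delegate to that model."""
    scenario = SCENARIOS[int(scenario_classifier.predict([context_features])[0])]
    return models[scenario].recommend(context_features)
```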

Summary:

In this disclosure, we propose a behavior processor on the user’s devices. The behavior processor simulates user behavior by leveraging application and user context and helps the user with different use cases using Natural Language and Vision techniques.

Monday, February 24, 2020

Generic virtual assistant platform

How can you add capabilities to Google's Dialogflow, Amazon Lex, and Microsoft Bot Framework so that every website can have a conversational agent with a few clicks? You can crawl the website offline, gather HTML tags, use a knowledge graph, and build intents that can be used in natural language conversations.

I wrote this patent back in 2015, when conversational systems were just catching on, anticipating a big product gap that could be addressed with this technology. I am happy to share that my general Conversational Assistant platform (https://lnkd.in/gC4yuAT) was approved by the Indian Patent Office (which, in general, is conservative in approvals compared to the USPTO). Please reach out to info at voicy dot ai if any corp dev/legal folks at Google, Microsoft, GoDaddy, or Amazon are interested in licensing or acquiring the patent.

How can you build a constantly learning Virtual Assistant using Graph and Search techniques

Have you ever run into a problem where your chatbot/virtual agent, at some point in time, is not able to handle a conversation sequence with the user and has to fall back to a human for help? The human then analyzes the context and answers the user's questions.

Can you use the human in the loop to constantly improve the capabilities of the virtual agent?

Let us say you are developing a virtual assistant to handle customer service calls over the telephone for a hotel chain. Your virtual assistant had to back out and take the help of a human to resolve the customer issue.

The virtual assistant can listen to the recording of the conversation between the customer service representative and the customer, convert the conversation to text using speech-to-text techniques, and analyze the conversation for future use.

The stored conversations/dialogs are used to improve the intelligence of the software system on a continuous basis, by storing the conversations in a graph data structure built on top of an inverted index for efficient future retrieval.

A dialog can be defined as the smallest element of customer and business interaction. The system can build a bipartite graph with a hierarchy of dialogs. A dialog itself can be represented by two nodes and an edge between them. The dialogs are connected and branched off as new combinations arise in business interactions across different communication platforms. The graph can be built on top of an inverted index data structure to support efficient text search.
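A minimal sketch of a dialog-graph node as described here, with the speaker annotation, parameterized text, and sentence embedding stored on each node; the field names are assumptions.

```python
# Sketch of a dialog-graph node as described above. Field names are
# illustrative assumptions; the graph sits on top of an inverted index.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DialogNode:
    node_id: int
    speaker: str                     # "customer" or "representative"
    template: str                    # e.g. "Hello {customer_name}! This is {company}."
    embedding: list                  # sentence vector for semantic matching
    children: List["DialogNode"] = field(default_factory=list)
    parent: Optional["DialogNode"] = None

    def add_child(self, child: "DialogNode") -> None:
        child.parent = self
        self.children.append(child)
```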

Elaborating further, the opening sentence from the customer service representative, such as "Hello {Customer Name}! This is {Company}. How can I help you?", is represented as the root node of the graph. Note that the node data contains placeholders for the customer name and the business name. The placeholders are identified by looking for fuzzy string matches against an input dictionary containing items such as the business name, the customer name, and the items served by the business. The node is annotated with who the speaker was (customer or customer service representative). The node also carries features such as semantic mappings of the sentence and a vector computed with a sentence2vec algorithm, obtained by training a convolutional neural network on the domain the software agent is built for.

A semantically different response from the customer is created as a child node of the question from the customer representative. Semantic equivalence with existing nodes in the graph can be determined using learning-to-rank algorithms such as LambdaMART, borrowed from search, after an inexpensive first-pass ranking on the inverted index of the conversation graph. In an implementation, the highest-scoring result whose learning-to-rank score exceeds a certain threshold is used as the representative for the customer input. The semantic equivalence comparison and scoring are done after tokenizing, stemming, normalizing, and parametrizing (recognizing placeholders in) the input query. Slot filling algorithms are used to parametrize the customer responses; they can use HMM/CRF models to identify part-of-speech tags associated with the keywords and statistical methods to identify the relationships between the words. If there is a match to an existing dialog from the customer, the software system stores the dialog context and does not create a new node. If there is no match, a new node is added under the node of the last conversation turn.
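The matching step can be sketched as a cheap first pass over the inverted index followed by a more expensive semantic re-ranking of the candidate nodes. The scoring function, the threshold, and the index interface below are assumptions; DialogNode is the node sketch above.

```python
# Sketch: match a customer utterance against existing dialog nodes.
# First pass: cheap inverted-index lookup; second pass: expensive re-ranking
# (standing in for a learning-to-rank model such as LambdaMART).
# The index interface, threshold, and helpers are illustrative assumptions.
SCORE_THRESHOLD = 0.8

def match_or_add(utterance, last_node, index, normalize, embed, score_fn):
    query = normalize(utterance)                     # tokenize/stem/parametrize
    candidates = index.lookup(query, top_k=50)       # inexpensive first pass
    query_emb = embed(query)
    scored = [(score_fn(query_emb, node.embedding), node) for node in candidates]
    best_score, best_node = max(scored, key=lambda t: t[0], default=(0.0, None))
    if best_node is not None and best_score >= SCORE_THRESHOLD:
        return best_node                             # reuse existing dialog node
    new_node = DialogNode(node_id=index.next_id(), speaker="customer",
                          template=query, embedding=query_emb)
    last_node.add_child(new_node)                    # otherwise branch off
    index.add(new_node)
    return new_node
```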

Some tasks are simple questions and answers, such as "User: What is your specialty? Customer Service Representative: Our specialty is Spicy Chicken Pad Kee Mow." These tasks can be indexed in the graph as standalone parent-child pairs.

One of the challenges we run into when building a graph that constantly learns is a change in context. If there is no change in context, we create the node as a child of the previous node. If there is a change in context, we need to start a new node separate from the previous state in the graph. To detect a change in context while the customer talks to the customer service representative, we can use a Bayesian or SVM machine learning classifier. The classifier can be trained on crowdsourced training data using features such as the number of tokens common to the current and previous task, and the matching score between what the customer said and the best-matching existing dialog. To improve accuracy, we can train a different classifier for each domain.
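The context-change decision can be sketched as a small feature vector fed to an SVM. The two features below mirror the ones named in the text; the kernel choice and the crowdsourced labels are assumptions.

```python
# Sketch: binary "did the context change?" classifier, per the description above.
# Features mirror the text: shared-token count and best-match score.
from sklearn.svm import SVC

def context_features(current_utterance, previous_utterance, best_match_score):
    current_tokens = set(current_utterance.lower().split())
    previous_tokens = set(previous_utterance.lower().split())
    return [len(current_tokens & previous_tokens), best_match_score]

def train_context_change_classifier(X, y):
    # X: crowdsourced feature rows; y: 1 if the context changed, else 0
    return SVC(kernel="rbf", probability=True).fit(X, y)
```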

It is to be noted that the graph can also be constructed manually by an interaction designer and then inserted into the inverted index. In yet another implementation, a Recurrent Neural Network can be trained on the interactions between the customer and the customer service representative, if a lot of training data is available. To add personalization to recurrent neural network models, user profiles can be clustered into several macro groups. We can use an unsupervised clustering algorithm such as K-Means, or create manually curated clusters based on information about the user such as age group, location, and gender. We can then boost the weight of the examples that had a positive conversion for the customer service representatives. In an implementation, this can be done by duplicating the positive examples in the training data. Positive examples can be characterized by signals such as the order price and customer satisfaction. Note that this idea of personalization in neural networks is not specific to conversational customer interactions and can be used in applications such as building models that send automatic responses to emails.

The graph on the inverted index is then used by a software agent to answer questions about the business. The software agent starts from the root node of the graph and greets the customer on a call, over SMS, or on Facebook Messenger. The customer can respond to the greeting with a question about the business, and the agent finds the closest match to the question using techniques borrowed from information retrieval. In an implementation, this is done by looking up possible matches for the user input in the inverted index with an inexpensive algorithm first, and then evaluating those matches with an expensive algorithm such as a Gradient Boosted Decision Tree. Before hitting the inverted index, we run stemming, tokenization, and normalization on the input query so that the matching algorithms can search it properly.
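A sketch of that preprocessing step before the inverted-index lookup is below; NLTK's tokenizer and Porter stemmer are stand-ins (an assumption) for whichever components are actually used, and the placeholder dictionary is illustrative.

```python
# Sketch: normalize an input query before the inverted-index lookup.
# NLTK's tokenizer and Porter stemmer are stand-ins (an assumption).
import re
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

_stemmer = PorterStemmer()

def preprocess(query: str, placeholders: dict) -> list:
    """placeholders maps known entities to slots, e.g. {"Acme Pizza": "{company}"}."""
    query = query.lower()
    for value, slot in placeholders.items():         # parametrize known entities
        query = query.replace(value.lower(), slot)
    tokens = word_tokenize(re.sub(r"[^\w\s{}]", " ", query))
    return [_stemmer.stem(t) for t in tokens]
```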

This was an idea I wrote up in 2016 in a patent application for Voicy.AI. Components such as sentence2vec could now be replaced with BERT, and the RNN could be augmented further with attention techniques.

This approach gives enterprises both control over the virtual agent and a path for its evolution.

Conversational AI Marketing

Stepping into 2020, I have been ruminating about progress in the Conversational AI Marketing space. I presented some of my thoughts at the Conversational Interaction conference in 2017. The slides from that talk are at https://lnkd.in/g4Paqnq.

I still see a blue ocean in the AI-backed Interactive Marketing space. What are your thoughts?

Alternative platform for AB Tests

Lots of companies use A/B test results as a way to measure the value of features to their users.

Do you think the practice is still relevant? I feel that A/B testing needs to be phased out in favor of infrastructure leveraging ideas from Contextual Bandits, Deep Reinforcement Learning, and Counterfactual Policy Estimation.

Features would get to market faster, and companies would be able to use the best algorithm for a given context.

What are your thoughts?
#reinforcementlearning #artificialintelligence

Thursday, February 13, 2020

Conference Presentation: Techniques to personalize conversations for virtual assistants

It was a great pleasure presenting at the Conversational Interaction conference (https://lnkd.in/gtcVYtQ) on the topic of "Techniques to personalize conversations for virtual assistants".

I met several interesting people and heard about research happening in the space. It is a great conference for people specializing in Conversational AI.

I have updated my slides at https://lnkd.in/gRbqYjb

It would be great to know your thoughts on my presentation.

Thursday, January 30, 2020

Catching up with developments in Recommendation Algorithms using Deep Learning



In my quest to identify technology and business gaps for Voicy.AI, I have been spending time catching up with developments in recommendation algorithms using deep learning. I started my research by reading the Recommendations paper from YouTube. Recommendation, in general, is a two-step process consisting of retrieval and re-ranking. The authors phrased retrieval as multi-class classification instead of reusing inverted-index scoring mechanisms. I liked the tricks of negative sampling and sub-linear scoring using hashing techniques to optimize training and serving in production, respectively. I then moved on to another important development in recommendation systems: joint training of Wide and Deep Learning neural nets, pioneered by the Apps team at Google. I was impressed by the authors' observation that the wide model is good for memorization while the deep model is good for generalization.
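The memorization-versus-generalization split in Wide & Deep can be sketched as a linear model over sparse cross features summed with a small embedding MLP, trained jointly. The PyTorch dimensions and feature layout below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of the Wide & Deep idea: a linear "wide" part over sparse cross
# features (memorization) plus a "deep" embedding MLP (generalization),
# trained jointly with a sigmoid output. Dimensions are assumptions.
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_cross_features, n_items, emb_dim=32, hidden=128):
        super().__init__()
        self.wide = nn.Linear(n_cross_features, 1)       # memorization
        self.item_emb = nn.Embedding(n_items, emb_dim)    # generalization
        self.deep = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, cross_features, item_ids):
        deep_in = self.item_emb(item_ids).mean(dim=1)     # average item history
        logit = self.wide(cross_features) + self.deep(deep_in)
        return torch.sigmoid(logit)                       # click probability
```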


I then stumbled upon another research paper from UCL. The authors focused on the retrieval part of recommendation in the context of journalism. I liked how they used the structure of the problem and separate attention models to construct profiles for predicting recommendations. It is impressive to see the big leaps DL algorithms have made on the recommendation problem compared with collaborative filtering algorithms from a few years back.


What is your opinion about the next DL paradigm for recommendations? Any suggestions for other popular research papers on DL-based recommendations?