Interesting things I've noticed so far:

* It does a remarkably good job of context switching between programming languages based on the semantics of the question! If the question is about SQL it often includes SQL in <code> tags. If it's about JavaScript it will include JavaScript! The syntax isn't perfect due to the tokenizer mangling some things but it's pretty close!
* The English grammar isn’t perfect but it’s pretty good.
* It doesn't seem to lose track of closing tags and quotes.
* It's learned to sometimes pre-emptively thank people for their answers and to "edit" in "updates" at the end of the post.
If you find any interesting ones you can share them with the permalink! Use the "Fresh Question" button to load a new one.
> Originally, I wanted to predict the number of upvotes and views questions would get (which, intuitively, I thought would be a good proxy for their quality). Unfortunately, after working on this for about a week straight I've come to the conclusion that there is no correlation between question content and upvotes/views.
> I tried several different models (including adapting an AWD_LSTM classifier, a random forest on a bag of words, and using Google's AutoML) and none of them produced anything better than random noise.
> I also tried using myself as a "human classifier" and given two random questions from StackOverflow I can't predict which one will be more popular.
Thanks, brilliant work! Some questions are downright hilarious (see the one asking for "a suite of automated packaging techniques" [0]), and the broken English just adds extra ESL credibility to the questions.
[0]
> I want to write an update statement with a sequence of values I can run through a database. i've written the below code to broken up my character string into the columns. All of the articles i've read seem to suggest that I 'll need a suite of automated packaging techniques for my environment to all update the database.
> Answering Questions: Right now the model only generates questions. In version 2 I want to train it to answer questions. If I could get this working it'd actually become a useful tool instead of a fun toy.
Looking forward to that part :D
I mean, those answers are probably not going to be correct, but I wonder how close they will be to something useful.
Yes, many times the questioner does not actually need an answer to the question; they just need to look a little more closely at the situation, which could potentially be automated. But one should not disguise such automation as an 'answer': it's more like a query autocheck, just with better tooling.
I wonder what percentage of questions just need a correctly working example because the questioner is unsure of how to use a given API. I imagine automating this could actually be doable.
It’s a different model, the AWD_LSTM. The inventor of this model spoke about GPT-2 on this podcast and talked a bit about the differences: https://overcast.fm/+Goog4jsR8
"I have creating a PNG Image file where I am printing out the image's with different colors and Image Types. Now I am sure I am drawing properly, but what I'm seeing is that the image is not differently jpeg (ie FF or Chrome) and Safari (for Firefox) is different from the one in Firefox. "
As a bit of a connoisseur of babblebots over the decades, one of the interesting things about this generation is that it is producing text that has a very interesting effect in my mind. There is a part of the parsing process where the above text went down smooth; yup, that's what Stack Overflow questions from early developers tend to look like. That part of my brain issues no objection. But the next layer up screams bloody murder about how nonsensical that is. And it's not just "that's a bad question but I still see the order under it", but nonsense.
It's a combination I've not experienced before. Previous generation babblebots could often produce a lot of fun text, but every processing level above raw word processing has always been able to tell it's computer garbage, even when it blundered onto a particularly entertaining piece of garbage. We've actually successfully moved up a level here.
The experience you are describing reminds me of the comparative illusion [0], which is a grammatical illusion where certain sentences seem grammatically correct when you read them, but upon further reflection actually make no sense. The classic example: "More people have been to Russia than I have."
Fascinating. There's a sentence I picked up from a friend in childhood, "Although the Moon is only an eighth the size of the Earth, it is much farther away," which seems to be similar, but not quite a CI, if I'm reading the Wikipedia article correctly. Thanks for the link.
This is like a mental tarpit: you waste time reading, trying to understand what the person is saying, only to realize it was a bot and all your effort was for nothing. That's time you will never get back.
This would be a terrifying way to destroy an online community: just flood it with nonsense content like this.
Terrifying -> inevitable. Imagine a botnet full of fake AI users trained with a corpus of legit HN posts. Let them loose commenting on random articles, beginning slowly but ramping up until they’re 99% of all comments.
In a few years, the standard Silicon Valley “Growth Hacking” job description will include using AI to deploy fake content to your competitors’ sites, destroying their user community.
Two potential solutions: Reputation and a new account fee.
Nonsense flooding will make it more difficult for people to establish their identities on a network, but once it's established, they'll be in the clear. If someone has to pay to have their first thread or two reviewed, it will take serious money to flood a site to death.
(A similar solution to email spam has been waiting to happen for decades -- charge a fraction of a penny per email, and nobody is harmed but spammers. Maybe allow exceptions for officially recognized organizations that have to send a lot of messages, like political campaigns.)
This would be even better: Email recipients grant free access to whoever they want. A tiny price would be charged only when sending to someone who has not granted such access.
Some is deliberately auto-generated, like https://twitter.com/choochoobot ; but yes, there is definitely an awful lot of auto-liked auto-shared fake engagement out there.
I'm positively sure these tactics are already being deployed as a weapon in order to shut down debate of certain inconvenient topics and disrupt problematic communities.
Indeed, there’s something almost unsettling about text that initially appears to follow a sort of internal logic, yet doesn’t. Some of the results read like a programming fever dream:
“I set each thread * pointer, adding a new thread and in a loop inside this function. The thread would be immediately on the thread, but the thread resulted in the exception. If I return the thread to the first thread and finally the thread is left, the thread doesn't hang, and I couldn't kill thread # 1 - because the thread method made first thread calls the native thread. But, the thread is waiting for the thread blocking and all the other threads to be started. In other words, the thread is always destroyed.”
Unsettling is exactly the word. https://stackroboflow.com/#!/question/16662 leaves me trying to work out what on earth the poster's really trying to do - even though I know full well that there is no poster...
A few times, I've come across Stackoverflow questions on technical topics I'm not very familiar with, and the question makes no sense to me (there are clear spelling, grammatical, and consistency errors). But there's an answer, and a comment exchange that seemed to resolve the question. So I conclude that it's just my unfamiliarity that prevents me from seeing through those errors.
A related phenomenon is seeing fundamental errors in a newspaper article on a topic you're an expert in... but believing articles on topics you're not familiar with.
This can operate as a partial Turing test: a gradient for iteration.
"What's the best way to indeed start a process on an OS x machine?
What is the best way to start a process on Mac OS x Snow Leopard?
There I just need to be able to run the OS x.exe from the command line and it's working fine (make it available in Windows). But I'm on an Mac and I haven't figured out how to do this for a Linux machine.
Another reason I ask is that I only have a Unix shell running with the Python process in it (it's my an Ubuntu machine, nothing didn't work in the shell).
Gather equal numbers of the least intelligible questions from SO (possibly using a metric based on low views/upvotes/comments/answers over long time) and a random selection from stackroboflow.
Present human judges with both sets of questions and ask them to tell the difference.
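In sketch form it might look like this (hypothetical Python; the two placeholder lists stand in for the actual question data):

    import random

    # Hypothetical sketch of the proposed test: mix real low-engagement SO
    # questions with generated ones, show them unlabeled, and score the judge.
    real = ["..."]       # least-viewed/least-upvoted SO questions (placeholder)
    generated = ["..."]  # questions sampled from stackroboflow (placeholder)

    trials = [(q, "real") for q in real] + [(q, "generated") for q in generated]
    random.shuffle(trials)

    correct = 0
    for question, label in trials:
        guess = input(question + "\nreal or generated? ")
        correct += guess.strip() == label
    # Accuracy near 50% means the judge can't tell them apart.
    print("judge accuracy: {:.0%}".format(correct / len(trials)))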
Having read numerous SO questions from newbie developers whose grasp of English was tenuous at best, I doubt I could tell the difference.
The next step up: the same test, but with mathematics or scientific papers judged by non-experts in the field.
We may actually be there already - I'm not sure.
All of which makes me wonder when we'll reach the point where the bar has been raised so high that the comparison will need to be against the best SO questions and scientific/mathematics papers judged by subject matter experts.
I can claim to have experience [0] with generating funny nonsense based on Stackoverflow data (what a weird thing to say :))
Seems like you beat me to my plan to make a neural-network-based variant, and I really like the results (especially that they stay on topic instead of totally drifting off into fun nonsense like my Markov chains did).
Have you tried also using other Stackexchange sites as a source? In my experience they result in more fun questions as they have more "human" interactions (especially the more personal advice based sites) which creates things like:
- Do Greeks driving affect the whaling industry?
- Essential windsurfing equipment to fish?
- Do mountaineers eat grass?
- Can I toast
I reviewed 1600 edits at StackOverflow. And I can say that some of the automatically generated questions are more intelligible than the average SO question. For example, this one looks fine to me: https://stackroboflow.com/#!/question/11235
Fascinating. I wonder if our current discussion boards on the interwebs can survive the coming influx of content like this and the next generations of it that follow.
There are a lot of SO questions posted by very weak non-native speakers of English, and some of these generated questions are hard to distinguish from those. Kind of scary!
What possible positive outcomes do you see for this kind of (admittedly inevitable) capability?
I am actually a bit worried that I’m already starting to see search engine traffic coming in...
I hope that the good will outweigh the bad. I’d love to create an answer generator, for example.
Once enough questions are generated I’m going to try creating a classifier to see if a neural net can differentiate between real questions and fake ones.
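A simple first cut at that discriminator might be something like this (a sketch only, not my actual plan; `real_questions` and `fake_questions` are placeholder lists of question texts):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Baseline real-vs-generated classifier: TF-IDF bag of words feeding a
    # logistic regression. The two input lists are assumed placeholders.
    texts = real_questions + fake_questions
    labels = [1] * len(real_questions) + [0] * len(fake_questions)

    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))

If a baseline like this can't beat chance, that would itself be an interesting result.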
> I am actually a bit worried that I’m already starting to see search engine traffic coming in...
We can discuss hypothetical systems that could maliciously flood us with generated content. The creator of this particular service which is being discussed here and now could also begin taking steps to ensure that his creation does not inadvertently create a problem for some hapless Google user.
Well, no. You have to read further up the thread to see the issue I was referring to.
>I wonder if our current discussion boards on the interwebs can survive the coming influx of content like this and the next generations of it that follow.
Yes, a robots.txt is a good and trivial step he could take to ensure well-behaved robots do not pick up his content. So your comment suggesting robots.txt is a good comment in its narrow frame, but one that misses the larger picture. That minor problem is solved; the interesting problem is of a different nature.
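For reference, opting out every well-behaved crawler takes just two lines of robots.txt:

    User-agent: *
    Disallow: /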
I was clicking through a couple funny ones but as soon as I got one[1] that fell into this uncanny valley I immediately forgot this was generated and tried to understand it, getting super confused.
There is immense value in training these to synthesize test data sets for sensitive information you can't safely put in a preprod environment.
Health information would be the main case I can think of now.
Having synthesized data for testing new services in govt would be a huge improvement.
De-identification is basically impossible and there are a bunch of companies who will lie to you if you pay them to, but synthesized data covers many use cases for de-identification and for homomorphic encryption.
Unfortunately I’ve come to the conclusion that upvotes on Stackoverflow aren’t correlated with question content (or I’m not skilled enough to be able to differentiate between “good” and “bad” questions). Check out the linked write up for more detailed info.
I think the original comment is being sarcastic and suggesting that the actual humans that close discussions and mark them as "off topic" don't understand the question and perform these actions at random. This is a sentiment shared by many who don't "live" in those types of forums.
Could you not also train a classifier that correlates question content with mod decisions? Questions like "what is the best X?" that are obviously subjective, for example.
Maybe even some kind of crazy generative model that learns to post questions that aren't closed by the AI moderator!
"i've been asked to use Json to call a webservice. I don't modify a JSON object at all. However, when calling JSON returned by the Json object, it fails because the object life isn't array!"
Oh, great, now I know what my clueless questions look like to a knowledgeable person! Example:
>"I need to create an image from a imported wav file (for a user - friendly format find enough header for the cookie). I looked for a solution, but that didn't work either."
I presume this is due to tokenization or something, but there's a lot of extra whitespace in the code samples that make them look very unrealistic:
def _ _ init__(self, default):
" " "
See if the default value for the field on a view is
timespan.
" " "
< select >
< option > value < / option >
< option > value < / option >
< / select >
And indentation is also missing completely. Maybe you need to use another NN to guess which language the fake code is in and autoformat it accordingly!
It is, the tokenizer isn't reversible (and it adds spaces all over the place).
But I should be able to handle a lot of these in my regex that converts the output back into a more human-readable format (in the raw output there's a space before every punctuation mark, so I already remove those extraneous spaces before periods, commas, etc).
I just haven't gotten around to adding in any heuristics specifically for code but adding a bit more post-processing is on my to-do list.
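To give a sense of it, the cleanup looks roughly like this (an illustrative sketch in Python; the real version is a pile of regexes in the site's JS):

    import re

    # Illustrative detokenizer (not the site's actual code): collapse the extra
    # spaces the tokenizer inserts and re-attach split-off contractions.
    def detokenize(text):
        text = re.sub(r"\s+([.,;:!?)\]])", r"\1", text)                 # "word ." -> "word."
        text = re.sub(r"([(\[])\s+", r"\1", text)                       # "( word" -> "(word"
        text = re.sub(r"\s+(n't|'s|'re|'ll|'ve|'d|'m)\b", r"\1", text)  # "would n't" -> "wouldn't"
        return text

    print(detokenize("It would n't work , would it ?"))  # It wouldn't work, would it?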
Congratulations, you have simulated a million monkeys at typewriters with a million monkeys at typewriters. Has anyone really been far even as decided to use even go want to do look more like?
Reminds me of those online chatbots I used to torture 10 or 15 years ago. One I started asking for personal information about its creator. It was remarkably evasive, constantly attempting to switch the subject.
Every good invention can be terrifying if it falls into the hands of bad guys (nuclear technology, for example).
It's true for AI as well. I'm sure bad guys are already training similar AI agents by feeding them only fake news, conspiracy theories, etc., and it's easy to build AI agents since there's so much open-source material about AI online.
Think of things like election meddling: propagating truly fake news that caters to what people simply want to be true. Humans are weak against confirmation bias; ten minutes on Facebook will show you that for sure.
I think it would still qualify as a valid "Does Not Exist" website if the generated questions were run through an auto-formatter for the code. The main issue I see is extra spaces everywhere, often in a syntax-breaking way, and missing spaces where formatting needs them (not that all SO questions get those right).
Yeah this is a shortcoming of the tokenizer. It splits things up in ways that are not 1:1 mappable back to their source unfortunately.
I did a bit of post-processing to get it formatted a bit better (re-combining the “would“ and “n’t” tokens and changing html tags to markdown for example) but there’s still room for improvement.
Spacing specifically is different based on the context. Outside of code blocks you want a space after a period; inside you probably don't. But since the tokenizer has one in both places there's no opportunity for the neural net to learn this (it can't see any difference). And my naive formatter doesn't know the difference either. (If you're curious you can find it in the JS file.)
I updated my regexes to clean up some of the tokenizer noise last night, so much of the formatting in the code snippets should look a bit more natural now.
The way the language model is trained is by rewarding it for correctly predicting the next word in a sequence.
The output of the model is a predicted probability distribution over the next word and a "state"; the next iteration takes the state output of the previous iteration and generates another word and state (and this process repeats many times).
Since there’s a probabilistic dimension, what may have happened in this case is that it happened to repeat once by chance and the model had learned that if something repeats 2x it’s likely that it will repeat a third, fourth, and fifth time.
Basically it’s just trying to game the loss function which rewarded it for predicting the next word in the sequence correctly.
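In code, the generation loop looks roughly like this (a toy PyTorch sketch with made-up sizes, not the actual model):

    import torch

    # Toy sketch of the sampling loop: the model maps (previous word, state)
    # to a distribution over the next word plus a new state, repeatedly.
    vocab, emb, hid = 1000, 16, 32
    embed = torch.nn.Embedding(vocab, emb)
    rnn = torch.nn.RNN(input_size=emb, hidden_size=hid)
    head = torch.nn.Linear(hid, vocab)       # maps state to logits over the vocabulary

    token = torch.tensor([0])                # start token id
    state = torch.zeros(1, 1, hid)           # initial hidden state
    for _ in range(50):
        out, state = rnn(embed(token).view(1, 1, -1), state)
        probs = torch.softmax(head(out[0, 0]), dim=0)
        token = torch.multinomial(probs, 1)  # sample the next word id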
Thanks for the explanation. Your description superficially reminds me of a Markov chain (https://en.wikipedia.org/wiki/Markov_chain). Is this related or is it totally different?
I haven't read the paper the work is based on, but if the RNN outputs a probability distribution for the next letter/word then it forms a Markov chain (since the next step depends only on the current state and not on earlier states)!
RNNs are just fancy parametric functions that take a (state, input)-pair and return a new (state', output)-pair.
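That signature is easy to make concrete (a vanilla cell in numpy, with random weights purely for illustration):

    import numpy as np

    # A single vanilla RNN step, written to make the
    # (state, input) -> (state', output) signature explicit.
    W_h, W_x, b = np.random.randn(32, 32), np.random.randn(32, 16), np.zeros(32)

    def rnn_step(state, x):
        new_state = np.tanh(W_h @ state + W_x @ x + b)
        return new_state, new_state  # for a vanilla cell the output is the state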
This looks absolutely amazing! I would be very curious to know how you went about conceptualizing the project and the AI beneath. Do you have a blogpost on it or planning to write one?
You are a very well trained neural net... The concept is based on actual neurons in our brain. Can't tell if you're serious or trolling though lol.
I protest getting -4 points. They used github and stackoverflow. Wrote a function to connect the two based on tags and then randomly generate a question off of that. It's lame. Do something useful or cool.
Full writeup available here: https://stackroboflow.com/about/index.html