Notes to myself on R stuff. Maybe someone else will get some use out of them.
So I bumped into this problem this week:
I was doing some data mungery on JSON that I’d gotten out of our Solr engine. Keep in mind that Solr will return JSON, XML or CSV. But CSV is just ugly unless you need something to drop directly into a data frame or Excel, and XML wouldn’t work because what I needed was listed as an attribute rather than as a result.
<data "The data that Clint needs here"> ,8 </data>
Like that, see? Using the XML package in R and xmlTreeParse wasn’t getting me anywhere, because the only thing I could search on was the result: ",8" in this case.
Eventually I hit on something from StackOverflow that used something like this:
library(RCurl)
lines2bParsed <- readLines(textConnection(getURL("http://oursolrserver.company.com?querygoeshere")))
That strings together the Curl request (getURL, from the RCurl package), piped to a text connection, piped to a readLines.
So text connections in R are freaking cool. They let you read data from a character vector as if it were a file. How awesome is that? Then you can just readLines it and you get actual text that you can do something with, instead of some weird XML/JSON/HTML tree object. (Note: I tried readLines("URL of our Solr server with the parameterized query here") and it didn’t like it, even a little. It really needed the Curl request for the authentication.)
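Here’s a minimal sketch of the text connection trick with no server involved. The fakeResponse string is made up; it just stands in for the single long string that a Curl request hands back.

```r
# Made-up stand-in for the one big string a Curl request would return
fakeResponse <- "{\n  \"id\": 1,\n  \"count\": \",8\"\n}"

# Wrap the character vector in a text connection and read it like a file
con <- textConnection(fakeResponse)
responseLines <- readLines(con)
close(con)

# One element per line of text, ready for regexes
print(responseLines)
```

The embedded newlines are what split the string into separate elements, which is why the result is a plain character vector rather than a parsed tree.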
I digress.
So after I got all of that, I had to munge and clean it. Basically, get it from JSON to clean data of the type I needed, in the format that my next step needed.
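Roughly the kind of cleanup I mean, done one regex at a time so a failure is easy to pin down. This is only a sketch: rawLines is made-up data, and the three patterns are illustrative, not the ones I actually used.

```r
# Made-up lines standing in for the raw readLines output
rawLines <- c("  \"title\": \"Some Doc\",  ", "  \"count\": \",8\",  ")

# Step 1: trim leading and trailing whitespace
step1 <- gsub("^[[:space:]]+|[[:space:]]+$", "", rawLines)

# Step 2: drop the trailing comma that separates JSON fields
step2 <- gsub(",$", "", step1)

# Step 3: strip the double quotes
step3 <- gsub("\"", "", step2)

print(step3)
```

Keeping each step in its own object means you can print step1 or step2 and see exactly which substitution went wrong.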
A few regexes later and I’ve got it cleaned. (It might be totally uncool, but I really like to do my regexes in increments; if something breaks, I can isolate the fault.) But I only need the lines from the JSON string that have the stuff I need. Luckily, that stuff kind of stands out, so again from StackOverflow I found out how to subset based on a pattern:
cleanedData <- lines2bParsed[grep("your regex goes here", lines2bParsed)]
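To see what that subset is doing, here’s a small sketch on made-up data (parsedLines and the "count" pattern are just for illustration). grep returns the positions of the matching elements, and those positions index back into the vector.

```r
# Made-up lines standing in for the cleaned readLines output
parsedLines <- c("{", "\"title\": \"Some Doc\"", "\"count\": \",8\"", "}")

# grep() returns the positions of the elements that match
hits <- grep("count", parsedLines)
keptByIndex <- parsedLines[hits]

# value = TRUE returns the matching elements themselves
keptByValue <- grep("count", parsedLines, value = TRUE)

# grepl() returns a logical vector, which also works as a subset
keptByLogical <- parsedLines[grepl("count", parsedLines)]

print(keptByIndex)
```

All three give the same result here; which one reads best is mostly a matter of taste.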
But there was a problem. Something in my regex was making the whole thing puke. On the off chance, I tried something I didn’t think would work at all, but it did, and that’s the whole point of this post: you can use an object in R as your regex in a grep. It works in other functions that take a pattern too, like sub and gsub. Check this out:
horribleComplicatedAwfulHeironymousBoschRegEx <- "your reg ex goes here"
Now when you do your subset, you can just drop the name of your object in place of the actual pattern to be grepped. Separate the regex itself as an object and it becomes that much simpler to deal with. Unhairball that stuff.
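Put together, it looks something like this. It’s a sketch with made-up data: parsedLines and countPattern are names I invented for the example, and the pattern itself is only illustrative.

```r
# Made-up lines standing in for the cleaned readLines output
parsedLines <- c("\"title\": \"Some Doc\"", "\"count\": \",8\"", "\"count\": \",12\"")

# The regex lives in its own object (illustrative pattern:
# a count field whose value is a comma followed by digits)
countPattern <- "^\"count\": \",[0-9]+\"$"

# Drop the object in where the pattern string would normally go
cleanedData <- parsedLines[grep(countPattern, parsedLines)]

# The same object can be reused anywhere a pattern is expected
isCountLine <- grepl(countPattern, parsedLines)

print(cleanedData)
```

With the pattern in one place, you can print it, test it on a single line, and tweak it without touching the subset call.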
Simplify, simplify, simplify.