Merry Christmas (link to query text)
A few notes on this query:
- It is written for MySQL. I don't know how well it would adapt to other programs.
- It is written for use with the Baseball-Databank. bbdatabank is the name of my database; if you have named yours something different, change occurrences of "bbdatabank.*" accordingly.
- This query creates a table that lists each defensive position played by a player and indicates whether the position is that player's primary position. The line "CREATE TABLE bbdatabank.prim_pos AS" (line 10) is currently commented so that the query will not create the table but display the results instead. You can run it as is to check the output, or you can un-comment that line (delete the "#" in front of it) to create the table in the bbdatabank database.
- I classified both corner OF spots as one position (for example, Ty Cobb's primary position is listed as CF, while Hank Aaron's is corner OF). I do this because if a player has 100 games each at LF and RF, and 105 G at CF, I consider him primarily a corner OF and not a center fielder, but you can change this with some minor tweaks to the code if you want.
- For years prior to and including 1955, OF games are broken down by position in a separate table of the Baseball-Databank db, so to get games split out between LF/RF/CF, you need to use two different tables to get defensive games. However, 1954 and 1955 have OF games split out by individual position in the main fielding table as well, so you need to exclude those years from one table or the other to avoid double counting. I chose to just use the fieldingOF table in its entirety and cut 1954-55 from the main fielding table, but you can switch that if you want. There are some minor discrepancies between the two tables, but it shouldn't make a huge difference (at least not from any of the entries I have looked at).
- I apologize if the query looks much longer and nastier than it needs to. It could probably be written more concisely, especially if I used joins, but I try to avoid using joins on large tables like this if I can help it, because using unions and variables to do the same thing runs much, much faster on my computer. Speaking of which, this query actually has multiple statements that need to be run (the various statements at the beginning that set the initial variable values, and then the one long query), so it would be useful to run this as a script. That is not a problem in the new MySQL Workbench or the command line, but if you are still using the old Query Browser, that might not work. If you are using the old Query Browser, all you have to do is run each of those SET statements on its own before running the main query (all in the same tab). I haven't tested the query without running those statements first to check whether the results are still accurate, so I guess it might work anyway, but I strongly recommend using them.
The method for determining a player's primary position has two steps. First, each position is categorized as follows:
-P: Pithcer
-C: Catcher
-off_IF: 1B, DH (I know DH is not an infield position; I just wanted to classify these two positions together, and this is what I called it. You can change it if it bothers you)
-def_IF: 2B, 3B, SS
-OF: cOF, CF (remember that I combine LF and RF into cOF)
I total a player's games in each category and used that to determine a player's primary category. Then, I looked at the number of games at each position in each category and determine the primary position within each category. The primary position in the primary category is the player's overall primary position. For example, let's look at Willie Bloomquist:
DH- 21 G; 1B-21 G; 2B-106 G; 3B- 121 G; SS-150 G; CF-186 G; cOF-37 G
That gives us category totals of:
P: 0
C: 0
off_IF: 58
def_IF: 377
OF: 295
Here, we see that Willie's primary category is def_IF. That means his primary position has to come from this category. Of those positions (2B, 3B, SS), he has the most games as SS, so that is his primary position, even though he has more games at cOF than at SS.
The fields generated for this table are:
- playerID
- prim_pos: 0 if the position is not the player's primary position, 1 if it is
- pos: defensive position
- cat: the category the position is classified in, as described above
- G: defensive games
- tot: combined defensive games at all positions
- cat_tot: combined defensive games at all positions in that category
- maxP: 0 if the position is not the primary position for that category (Pos with max games in that category only), 1 if it is
- prim_cat: 0 if the category is not the primary category, 1 if it is
- percPos: the percentage of total defensive games that are at that position (G/tot)
- percCat: the percentage of total defensive games that are in that category (cat_tot/tot)
- percP_C: the percentage of defensive games in that category that are at that position (G/cat_tot)
Hacking the Hack:
Splitting up cOF into LF and RF: Perhaps you don't like that I combine LF and RF into one position. Splitting them back up shouldn't be too much trouble. Where pos is defined as "if(pos='LF','cOF',if(pos='RF','cOF',pos))", (such as in line 108, but there are other places that will need to be changed as well), change that to just "pos" (no quotes). For example:
(SELECT playerID, if(pos='LF','cOF',if(pos='RF','cOF',pos)) as pos
,sum(if(yearID<=1955 and pos regexp 'F$',0,G)) as G
,if(pos='2b' or pos='3b' or pos='ss','def_IF'
,if(pos='1b' or pos='DH','off_IF'
,if(pos='P','P' ,if(pos regexp 'F$','OF','C')))) as cat
from bbdatabank.fielding
where pos!='OF'
group by playerID, if(pos='LF','cOF',if(pos='RF','cOF',pos))
would become
(SELECT playerID, pos
,sum(if(yearID<=1955 and pos regexp 'F$',0,G)) as G
,if(pos='2b' or pos='3b' or pos='ss','def_IF'
,if(pos='1b' or pos='DH','off_IF'
,if(pos='P','P' ,if(pos regexp 'F$','OF','C')))) as cat
from bbdatabank.fielding
where pos!='OF'
group by playerID, pos
I think there are 4 instances where that has to be changed (the above two changes at lines 108 and 116, plus two more at 137/145).
Also, you will need to change the query of the fieldingOF table to split cOF into LF and RF:
UNION ALL
SELECT playerID, 'cOF' as pos, sum(Glf+Grf) as G, 'OF' as cat
from bbdatabank.fieldingof
group by playerID
should become:
UNION ALL
SELECT playerID, 'LF' as pos, sum(Glf) as G, 'OF' as cat
from bbdatabank.fieldingof
group by playerID
UNION ALL
SELECT playerID, 'RF' as pos, sum(Grf) as G, 'OF' as cat
from bbdatabank.fieldingof
group by playerID
The above change will need to be done in two places.
These changes will make it so the 100/105/100 split for OF games at LF/CF/RF selects CF as the primary position. It would also be possible to keep the corner OF designation but still specify LF or RF as a primary overall position. You would want to create a second type of category just for corner OF. You could do this by copying and pasting the appropriate part of the code and then editing the appropriate lines, but this is going to require knowing which part of the code adds the new columns for "cat" and "pos". This would probably be more straightforward if the code used joins, but it doesn't (sorry). I would actually prefer to do it this way, but I was tired of writing this query and it is not that important to me to split up LF and RF, so I didn't get to it. If I do it sometime later, I'll update the linked query text.
Changing the groupings of position categories: Don't like how I chose to group the positions? Not a problem. The section you need to change is:
if(pos='2b' or pos='3b' or pos='ss','def_IF'
,if(pos='1b' or pos='DH','off_IF'
,if(pos='P','P'
,if(pos regexp 'F$','OF','C')))) as cat
I believe this appears at lines 45, 59, 79, 110, and 139 (assuming you haven't added or removed lines editing other things, of course). For each occurrence, edit that statement to fit your liking (just make sure you edit each occurrence to the same thing). For example, you can move 1B into one IF group with 2B, 3B, and SS and leave DH as its own separate group, or group the positions into corner positions and up the middle positions. If you want to add more groups, just add a new line to the if statement just like the others, and add a close-parenthesis at the end next to the others. If you want to delete a category, make sure to also delete a close-parenthesis. The catcher category right now is just left as whatever is leftover after the other positions are assigned, but you can add a separate line for the catcher class if you want. For example, you might do this:
if(pos='2b' or pos='ss' or pos='c','middle_IF'
,if(pos='1b' or pos='3b','corner_IF'
,if(pos regexp 'F$','OF'
,if(pos='P','P'
,if(pos='DH','DH',''))))) as cat
Note that when I used an explicit if statement for each individual category (i.e. no just assigning the leftovers to a category, like I did with catchers above), I still have to end with ",''" before the close-parentheses. You need something there between the last comma and the series of close-parentheses for the query to run.
IMPORTANT: the fielding table has entries for both individual outfield positions and for combined OF totals. If you don't do something about this, OF games will be double counted, so the following line appears near each of the category definition occurrences:
where pos!='LF' and pos!='RF' and pos!='CF'
This is because I am classifying all OF positions in the same category, so I just use the OF entry and ignore the LF, CF, and RF entries. If you want to split those up (say, for example, put CF in an up the middle category with 2B and SS, and LF/RF in a corner category with 1B and 3B), then you'll have to change this to "where pos!='OF'". Bear in mind that this will not return any OF games before 1954, and you'll have to use the fieldingOF table as well, which would be sort of a pain.
Clean up the code by eliminating the need to use two separate OF tables for pre- and post-1955: Say you don't want to deal with having to combine two tables to get your OF games. You can create a new table before running this query that lists games played at each position by combining the fielding and fieldingOF tables. Then, instead of referencing those two tables in this query, just reference your new one and you won't ever have to worry about complicating your queries referencing additional tables every time you want OF games split up by position.
Utilize the percentages: You can use the percPos, percCat, and percP_C fields to further classify positions. You can create multiple primary positions or categories, classify primary positions/categories as strong or weak primaries, identify players with no true primary position, etc.
Add more types of categories: As indicated above, you could add a new type of category for corner OF, and then if cOF is the primary position, look in that category to choose between LF and RF. Additionally, you could break out DH into its own type of category, and first determine whether a player is primarily a fielder or a DH, and then if he is a fielder, look at the fielding categories to determine primary fielding category and then primary position.
Those are just some possibilities for what you could change. If you have other changes you want to make, go right ahead. Whether you want to use this as-is or just as a base to develop your own query from, as long as you find it useful (or I guess even if you don't), do whatever you want with it.
Link in case you missed it above:
https://docs.google.com/leaf?id=0B6kgiU_VMWjUYjVlNGI4YWItNjFjNy00N2EyLThjOTktMTgyNTE3Yzk5Y2U2&hl=en
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.